Science-NLP

A project that demonstrates NLP techniques, such as named entity recognition and topic modeling, and data engineering flows, such as converting PDF documents to .txt files.

I store my personal collection of publications (mostly clustered around space science and related fields) in the `pdf_files` folder. The `process_and_store.py` script converts these PDF files to `.txt` files, stores them in the `txt_files` folder, and then processes them using the `Processor` object defined in `Processing.py`. Finally, it collects all the resulting data and stores it as a JSON file (`documents.json`).
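The flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `process_and_store.py`: the function name `run_pipeline` and the `pdf_to_text` and `process_text` callables are hypothetical stand-ins for the repo's PDF converter and its `Processor` object, and the shape of the JSON output is an assumption.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def run_pipeline(
    pdf_dir: Path,
    txt_dir: Path,
    json_path: Path,
    pdf_to_text: Callable[[Path], str],   # hypothetical: stands in for the PDF converter
    process_text: Callable[[str], dict],  # hypothetical: stands in for the Processor object
) -> Dict[str, dict]:
    """Convert each PDF to .txt, process the text, and store all results as JSON."""
    txt_dir.mkdir(parents=True, exist_ok=True)
    documents: Dict[str, dict] = {}

    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Step 1: extract the text and save it next to the other .txt files.
        text = pdf_to_text(pdf_path)
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")

        # Step 2: run the NLP processing and keep the result, keyed by file name.
        documents[pdf_path.stem] = process_text(text)

    # Step 3: collect everything into a single JSON file.
    json_path.write_text(json.dumps(documents, indent=2), encoding="utf-8")
    return documents
```

Passing the converter and processor in as arguments keeps the sketch independent of any particular PDF or NLP library, so either step can be swapped out without touching the loop.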

To use this repo for your own collection of PDF documents:

First, clone the repo:

git clone https://github.com/mkirby1995/Science-NLP

Then run `process_and_store.py`:

python process_and_store.py

The `remove_files.py` script removes the `.txt` files in the `txt_files` folder and the `documents.json` file. If line 14 of the script is uncommented, it also removes the `.pdf` files in the `pdf_files` folder.

python remove_files.py
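For reference, the cleanup behaviour described above could look something like the sketch below. It is an assumption-laden illustration rather than the real `remove_files.py`: the function `clean_outputs` and its `remove_pdfs` flag are hypothetical, with the flag standing in for uncommenting line 14 of the actual script.

```python
from pathlib import Path
from typing import List


def clean_outputs(root: Path, remove_pdfs: bool = False) -> List[Path]:
    """Delete generated .txt files and documents.json; optionally the PDFs too."""
    # Generated text files are always removed.
    targets = list((root / "txt_files").glob("*.txt"))

    # The collected JSON output is removed if it exists.
    json_file = root / "documents.json"
    if json_file.exists():
        targets.append(json_file)

    # Source PDFs are only removed when explicitly requested (hypothetical flag).
    if remove_pdfs:
        targets.extend((root / "pdf_files").glob("*.pdf"))

    for path in targets:
        path.unlink()
    return targets
```

Keeping PDF removal behind an explicit opt-in mirrors the original's commented-out line: the source documents cannot be regenerated, while the `.txt` files and JSON can.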

