PDFScraping

This notebook is part exploration, part tutorial resource for public employees and others working with civic data who need to access, organize, or analyze information or metadata trapped in PDF documents.

"This one's for you..." If you need to digitize records or extract information from scanned forms, this is for you. If you have ever had to manually key a document or report into an Excel spreadsheet, or had to ask an intern to do it, just to get at the information inside, this is for you. If you do not have a paid service for Optical Character Recognition (OCR) on scanned documents, this is for you. If you are working with information that you or your department is not comfortable exposing to one of the many free but limited online services, which require you to upload your documents into some data-ether for conversion and cap the number of documents you can convert, this is for you.

Potential Problem Statements or Use Cases

Your department, division or organization:

- wants to make data on a certain government service publicly available, but the data exists only as PDF.
- needs to answer a question about the frequency, standard deviation, or distribution of something that is captured in a PDF, such as the average dollar amount of a change order submitted by some department.
- pays rental fees for space occupied by boxes of paper that could be stored more efficiently if scanned, digitized, and stored in a database organized by some feature of the document data or structure.
- participates in investigations of events or incidents and needs to process a significant number of scanned PDF documents to prioritize them based on some element of the document metadata or the presence of a particular element or string in a scanned document.
- somehow... got their hands on like 2,156 resumes (scanned and paper) and wants to email each of these people about the launch of a new government fellowship program, potentially increasing the competitiveness of the applicant pool.
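
Several of the use cases above share the same second step: once text has been pulled out of a PDF (by OCR or a text extractor), pattern-match the fields you care about and summarize them. The following is a minimal sketch of that step only, using the Python standard library. It assumes the text has already been extracted; the sample string, the function names, and the simplified regular expressions are illustrative and are not taken from the notebooks in this repo.

```python
import re
import statistics

# Simplified patterns for illustration; real documents will need tuning.
DOLLAR_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_dollar_amounts(text):
    """Return every dollar amount found in the text as a float."""
    return [float(raw.replace(",", "")) for raw in DOLLAR_RE.findall(text)]


def summarize_amounts(amounts):
    """Return count, mean, and sample standard deviation of the amounts."""
    summary = {"count": len(amounts), "mean": None, "stdev": None}
    if amounts:
        summary["mean"] = statistics.mean(amounts)
    if len(amounts) >= 2:
        # statistics.stdev needs at least two data points.
        summary["stdev"] = statistics.stdev(amounts)
    return summary


def extract_emails(text):
    """Return the unique email addresses in the text, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


# Hypothetical text, standing in for the output of a PDF extraction step.
sample_text = (
    "Change order 1: $1,250.00 approved. Change order 2: $750.00 pending. "
    "Contact: Jane.Doe@example.gov or jane.doe@example.gov."
)

if __name__ == "__main__":
    amounts = extract_dollar_amounts(sample_text)
    print(summarize_amounts(amounts))
    print(extract_emails(sample_text))
```

Keeping extraction (PDF to text) separate from parsing (text to fields) means the same parsing functions can be reused no matter which extraction tool produced the text.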

Fact: Manual entry sucks and should be abolished!

Getting Started

Get the repo, clone it, and explore it.

Installing

The requirements.txt file lists the dependencies needed to run this program on your machine. It also includes the packages required to convert the Jupyter notebook code into a slideshow, though these are not necessary for, or related to, any of the PDF work. Create a virtual environment, then run this command in your terminal: `pip install -r requirements.txt`

This assumes that you have created a project folder, changed into that directory, created a virtual environment there, activated it, and are running the above command in that location.

For example:
- $ mkdir yourprojectfoldername
- $ cd yourprojectfoldername
- $ virtualenv yourvenvname
- $ source yourvenvname/bin/activate
- $ pip install -r requirements.txt
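
After running the steps above, it can help to confirm that the virtual environment is active and that the packages named in requirements.txt were actually installed. The following is a small sketch using only the Python standard library (Python 3.8 or later). The helper names are hypothetical, the requirement parsing is deliberately simplified (it handles plain `name==version` style lines, not every pip feature), and it assumes requirements.txt sits in the current directory.

```python
import re
import sys
from importlib import metadata
from pathlib import Path


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def requirement_names(text):
    """Extract bare package names from requirements-style text (simplified)."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that have no installed distribution."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    print("virtual environment active:", in_virtualenv())
    if req_file.exists():
        names = requirement_names(req_file.read_text())
        print("not installed:", missing_packages(names) or "none")
    else:
        print("requirements.txt not found in", Path.cwd())
```

Run it with the environment activated; an empty "not installed" result suggests the install step completed.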

Presentation Slides

Google Slides for Python Meet Up | Center of Excellence Talk

TODO:

- create initial project readme with problem statement
- update additional presentation

Authors

Babila Lima, Business Process Improvement Office

Acknowledgements

Nothing happens in a vacuum--except carpet cleaning, and even that is debatable. A special thank you goes to the folks who were banging their heads on this problem and lifted them long enough to ask the magic question: is there a way to automate the boring stuff?

Special shoutout to Evan Cook for asking the original question that sparked this mini-project: 'How do I get the data from page 4 of this 72-page document?'

