PDF decompressing

The aim is to be able to detect the paragraphs and headings in a PDF. This allows for a much cleaner input for pdf mining.

Why important?

Reading a PDF as a human is easy. An average PDF file logically represents the information visually. However, a PDF file is stored without structure. This becomes appearant when copy pasting a PDF file to a wordproccesor: headings are not always recognized, lines are broken and footers are also copied. This mess also translates to reading a PDF through most packages for PDF mining. This means that automation of text mining can produce weird text output form a PDF file, which has an impact on the accuracy of text analytics.

A PDF file can contain multiple types of information. Some example of information types that can be found in a PDF are:

Images
Tables
Graphs
Headers
Footers
Paragraphs
Lists

With this project I primarily focus on converting a PDF file to readable text with correct headings. The aim is to have a package that can generally translate a PDF to paragraphs, headings and a title. This means that the code needs to understand when a paragraph continues at the next page, removes standard text at the top and bottom of the text and understand the hierarchy of headings.

How constructed?

Just human logic, no fancy machine/deep learning at this moment. Maybe I can inspire someone or be of service by producing a test sample for a model.

How use?

Install required packagaes, import file to project and call function pdf_to_dict('./location/to/file.pdf'). Output for now, a list of dictionaries.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.gitignore		.gitignore
Coalitieakkoord-2022-2026-1.pdf		Coalitieakkoord-2022-2026-1.pdf
Coalitieakkoord-Samen-werken-aan-Delft.pdf		Coalitieakkoord-Samen-werken-aan-Delft.pdf
Example_pdf.pdf		Example_pdf.pdf
LICENSE.txt		LICENSE.txt
README.md		README.md
Raadsprogramma 2022-2026 Aan de slag voor Aalsmeer.pdf		Raadsprogramma 2022-2026 Aan de slag voor Aalsmeer.pdf
main.py		main.py
output.json		output.json
requirements.txt		requirements.txt
test.ipynb		test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF decompressing

Why important?

How constructed?

How use?

About

Releases

Packages

Languages

License

kcvanderlinden/PDF-decompressing

Folders and files

Latest commit

History

Repository files navigation

PDF decompressing

Why important?

How constructed?

How use?

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages