Skip to content

This plugin is used in a wordpress backend to extract text from pdf files and convert it into blogs.

Notifications You must be signed in to change notification settings

augustin-wien/PDF-Parser

Repository files navigation

PDF-Parser

A software to parse print-ready newspaper PDF's.
After extracting articles of each page, each article is published via the Wordpress API.

Installation guide

LocalWP

Install
First, for a local development setup you need to install LocalWP.

Create website
Second, you need to create a local wordpress website to be able to publish posts via the Parser.
Here is a detailed How-To-Video for more information on LocalWP.
Important: Remember your username and URL since these credentials are needed for the parser to communicate.

Generate Application Password
Third, in your Wordpress Backend navigate to Users > Profile. Scroll down to the “Application Passwords” heading.
Add a new Application Password and copy your new password to a safe place like your Password Manager.
Screenshot Detailed instructions can be read here and watched here

Create several categories
In your wordpress backend, click on the tab 'Posts' to the below submenu 'Categories'.
Now create the following categories step by step without the quotation marks:
"editorial","augustiner:in","einsicht","das wahre leben","cover","cover","cover","cover","tun & lassen","tun & lassen","tun & lassen","tun & lassen","vorstadt","vorstadt","lokalmatador:in nº","vorstadt","art.ist.in","art.ist.in","art.ist.in","art.ist.in","dichter innenteil","dichter innenteil","dichter innenteil","dichter innenteil","augustinchen","augustinchen"

PDF-Parser

Clone
Clone this repository on your local machine.

Install required packages
We assume you have Python installed and use a virtual environment like venv.
In your virtual environment, run:

pip install -r requirements.txt

Create .env file
In the main directory copy .env.example into .env
Then change all the credentials to your specific needs as such:

  • WORDPRESS_URL: The URL of your Wordpress API like http:https://localhost:10014/wp-json/wp/v2/. This is the URL, you should have remembered during the local Wordpress site creation.
  • WP_API_USER: Your Wordpress username, you were supposed to remember during the local Wordpress site creation and you log in with.
  • WP_API_KEY: Your Application password you saved somewhere secret after generating it.

Start the app
Next, run in your terminal:

uvicorn main:app --reload

Visit the app
In your browser visit localhost:8000

Upload your PDF
In your browser upload your PDF file and check the results on your local Wordpress site.
GUI of PDF-Parser

Development

VSCode extensions

For development we use VSCode. Further, to keep the code style consistent we use the following VSCode extensions:

Please install all of these extensions.
In the next step please open your VSCode user settings i.e. your settings.json via CTRL + SHIFT + P and click on Preferences: Open User Settings (JSON).
In this file please add the following code below:

  // Python specific settings
  "[python]": {
    "editor.formatOnType": true,
    // Set the python formatter to black
    "editor.defaultFormatter": "ms-python.black-formatter"
  },
  // Python linter settings
  "pylint.args": ["--max-line-length=120"],
  "flake8.args": ["--max-line-length=120"]

Update requirements.txt

In case you add a new package for this project, first install pipreqs with:

pip install pipreqs

Being in the project directory and to overwrite the current requirement.txt run

pipreqs ./ --force

Important sidenote
Pipreqs uses the package fitz but for our CI to run adequate, we need to specify PyMuPDF.
So please undo the change in this line afterwards.

Finally, check the rest of the file via git to make sure everything is fine and push it.

Linting locally

To run everything as it happens in the Github Actions and fix it afterwards locally, do the following.

Pylint To run pylint, use this command:

pylint $(git ls-files '*.py') --max-line-length=120

If pylint is not installed, run:

pip install pylint

Flake8 To run flake8, use this command:

flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

If flake8 is not installed, run:

pip install flake8

Black To run black formatter, use this command:

black --check .

If black is not installed, run:

pip install black

Testing locally

To run everything as it happens in the Github Actions and fix it afterwards locally, do the following.

Pytest To run pytest, use this command:

pytest -v

If pytest is not installed, run:

pip install pytest

About

This plugin is used in a wordpress backend to extract text from pdf files and convert it into blogs.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published