harmonydata/pdf-questionnaire-extraction

🌐 harmonydata.ac.uk

Harmony PDF extraction

Data for PDF Kaggle competition

See our competition on Kaggle at: https://www.kaggle.com/competitions/harmony-pdf-and-word-questionnaires-extract

Read about our Kaggle competition on our blog.

Entering the Kaggle competition

Requirements: Python 3.10 or greater

  1. Create an account on Kaggle.

  2. Install Kaggle on your computer:

pip install kaggle

  3. On the Kaggle website, download your kaggle.json file and put it in your home folder under .kaggle/kaggle.json.

  4. Download and unzip the competition data:

kaggle competitions download -c harmony-pdf-and-word-questionnaires-extract-v2
unzip harmony-pdf-and-word-questionnaires-extract-v2.zip

  5. Run create_sample_submission.py in the folder containing your data to create your train and test predictions.

To generate predictions for the training data and write to train_predictions.csv:

python create_sample_submission.py train

To evaluate the train predictions:

python evaluate_train_results.py

  6. To modify the prediction logic or inject your own model, you can edit the function dummy_extract_questions.

  7. To generate predictions for the test data and write to submission.csv:

python create_sample_submission.py test

  8. Submit your CSV file to Kaggle:

kaggle competitions submit -c harmony-pdf-and-word-questionnaires-extract-v2 -f submission.csv -m "Message"
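
As a starting point for replacing the logic in dummy_extract_questions, a simple rule-based extractor might look like the sketch below. The function name extract_questions_heuristic, its signature (plain text in, list of strings out) and the regular expression are illustrative assumptions. They are not taken from create_sample_submission.py, so check what that script actually passes in and expects back before adapting it.

```python
import re

# Hypothetical pattern: an item number such as "1." or "12)" at the start of a line.
_ITEM_NUMBER = re.compile(r"^\s*\d{1,3}[.)]\s+")


def extract_questions_heuristic(text: str) -> list[str]:
    """Return the lines of `text` that look like questionnaire items.

    A line is kept if it starts with an item number or ends with a question mark.
    This is an illustrative heuristic, not the logic used by Harmony.
    """
    questions = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        numbered = _ITEM_NUMBER.match(line) is not None
        if numbered or line.endswith("?"):
            # Drop the leading item number so only the question text remains.
            questions.append(_ITEM_NUMBER.sub("", line))
    return questions
```

A heuristic like this is only a baseline to compare against. Headings, page numbers and response options that happen to be numbered will be picked up as false positives.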

Testing the existing models

Go into the notebooks folder and run each model script in that folder, for example:

python model_0x_baseline_extract_everything.py

Then, to evaluate a model, run the following, where 0x is the model number (for example 01 or 02):

python evaluate.py 0x

Here are the scores of model 01 and model 02, for comparison:

Model 01 (baseline, just extracting text)
  Mean precision = 0.11, mean recall = 0.28
  Precision over all instances = 0.05, recall over all instances = 0.30

Model 02 (current Harmony 0.5.0)
  Mean precision = 0.52, mean recall = 0.53
  Precision over all instances = 0.37, recall over all instances = 0.44
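
The two pairs of figures can differ because they average in different ways. The sketch below shows one plausible reading, which is an assumption rather than a description of evaluate.py: the "mean" scores average per-document precision and recall, while the "over all instances" scores pool the counts from every document before dividing. The function names and the use of exact string matching are also assumptions made for illustration.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision and recall for one document, using exact string matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def summarise(documents: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    """Compute per-document means and pooled scores.

    Each element of `documents` is (predicted questions, gold questions).
    """
    per_doc = [precision_recall(pred, gold) for pred, gold in documents]
    mean_precision = sum(p for p, _ in per_doc) / len(per_doc)
    mean_recall = sum(r for _, r in per_doc) / len(per_doc)

    # Pooled scores: add up the counts across all documents, then divide once.
    true_positives = sum(len(pred & gold) for pred, gold in documents)
    n_predicted = sum(len(pred) for pred, _ in documents)
    n_gold = sum(len(gold) for _, gold in documents)

    return {
        "mean_precision": mean_precision,
        "mean_recall": mean_recall,
        "pooled_precision": true_positives / n_predicted if n_predicted else 0.0,
        "pooled_recall": true_positives / n_gold if n_gold else 0.0,
    }
```

Under this reading, a long document with many spurious predictions lowers the pooled precision much more than the mean precision, since every document counts equally in the mean.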

How PDFs are extracted

Harmony relies on two libraries to extract questionnaire items from PDFs:

  1. Apache Tika - to get plain text
  2. PDF Table Extractor Node.js library by Ronny Wang - to get tabular data
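
To illustrate the plain-text half of this pipeline, the sketch below sends a PDF to a locally running Apache Tika server and reads back the extracted text. It assumes the server is listening on Tika's default address, http://localhost:9998, and uses only the Python standard library. The function names are hypothetical, and the repository's own preprocessing scripts may call Tika differently.

```python
import urllib.request

# Assumption: Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def pdf_to_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Send the PDF at `pdf_path` to Tika and return the extracted text."""
    with open(pdf_path, "rb") as pdf_file:
        request = build_tika_request(pdf_file.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling pdf_to_text on a PDF file path returns one string for the whole document. Table structure is lost in this output, which is why tabular data is extracted separately.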

This repo contains the training data and training scripts.

The withheld test annotations are in this private repo: https://github.com/harmonydata/pdf-questionnaire-extraction-test-annotations

Preprocessing all the PDFs

Some raw PDFs have been provided.

  1. Install all the requirements: pip install -r requirements.txt
  2. Download and start Apache Tika in a command line: java -jar tika-server-standard-2.8.0.jar
  3. In folder notebooks, run python preprocess_pdf_to_text.py
  4. In folder notebooks, run python preprocess_pdf_to_tables.py

This will populate data/preprocessed_text and data/preprocessed_tables, which can be used to train the model.

😃💁 Who worked on Harmony?

Harmony is a collaboration project between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony is funded by Wellcome as part of the Wellcome Data Prize in Mental Health.

The core team at Harmony is made up of:

📜 License

MIT License. Copyright (c) 2023 Ulster University (https://www.ulster.ac.uk)

📜 How do I cite Harmony?

McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffman, M., Wood, T.A., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2023)
