Harmony PDF extraction

Data for PDF Kaggle competition

See our competition on Kaggle at: https://www.kaggle.com/competitions/harmony-pdf-and-word-questionnaires-extract

Read about our Kaggle competition on our blog.

Entering the Kaggle competition

Requirements: Python 3.10 or greater

Create an account on Kaggle.
Install the Kaggle command-line tool on your computer:

pip install kaggle

On the Kaggle website, download your kaggle.json API token and save it in your home folder as .kaggle/kaggle.json.
Download and unzip the competition data:

kaggle competitions download -c harmony-pdf-and-word-questionnaires-extract-v2
unzip harmony-pdf-and-word-questionnaires-extract-v2.zip

Run create_sample_submission.py in the folder containing your data to create your train and test predictions:

To generate predictions for the training data and write to train_predictions.csv:

python create_sample_submission.py train

To evaluate the train predictions:

python evaluate_train_results.py

To modify the prediction logic or inject your own model, you can edit the function dummy_extract_questions.
To generate predictions for the test data and write to submission.csv:

python create_sample_submission.py test

Submit your CSV file to Kaggle:

kaggle competitions submit -c harmony-pdf-and-word-questionnaires-extract-v2 -f submission.csv -m "Message"
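As a starting point for replacing the logic in dummy_extract_questions, the sketch below treats any line that ends in a question mark as a questionnaire item. The function name extract_questions_by_punctuation and its signature (plain text in, list of strings out) are assumptions made for illustration; check create_sample_submission.py for the input and output format that dummy_extract_questions actually uses.

```python
import re


# Hypothetical replacement logic. The real dummy_extract_questions may
# take and return different types, so adapt the signature as needed.
def extract_questions_by_punctuation(text: str) -> list[str]:
    """Return the lines of `text` that look like questions (end with '?')."""
    questions = []
    for line in text.splitlines():
        # Drop leading numbering such as "1." or "2)" and surrounding spaces.
        cleaned = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions


sample = (
    "Instructions: tick one box.\n"
    "1. Do you feel tired?\n"
    "2) How often do you worry?\n"
    "Thank you."
)
print(extract_questions_by_punctuation(sample))
```

A rule this simple will miss items phrased as statements (common in Likert-style questionnaires), so it is only a baseline to compare a real model against.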

Testing the existing models

Go into the notebooks folder and run each model script in that folder. The scripts are named by model number, written here as 0x:

python model_0x_baseline_extract_everything.py

Then, to evaluate a model, pass its number to evaluate.py:

python evaluate.py 0x

Here are the scores of model 01 and model 02, for comparison:

Model 01 (baseline, just extracting text)
Mean precision = 0.11, mean recall = 0.28
Precision over all instances = 0.05, recall over all instances = 0.30

Model 02 (current Harmony 0.5.0)
Mean precision = 0.52, mean recall = 0.53
Precision over all instances = 0.37, recall over all instances = 0.44
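The two score lines reported for each model aggregate differently. The sketch below is one plausible reading, assuming that the "mean" figures average per-document precision and recall, while the "over all instances" figures pool every predicted and annotated question across documents before dividing. The matching rule in evaluate.py is not shown here, so exact string matching between predicted and annotated questions is also an assumption.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision and recall for one document, using exact string matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def aggregate(documents: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    """Compute per-document means and pooled scores over (predicted, gold) pairs."""
    per_doc = [precision_recall(pred, gold) for pred, gold in documents]
    mean_precision = sum(p for p, _ in per_doc) / len(per_doc)
    mean_recall = sum(r for _, r in per_doc) / len(per_doc)

    # Pool counts across all documents, then divide once.
    true_positives = sum(len(pred & gold) for pred, gold in documents)
    total_predicted = sum(len(pred) for pred, _ in documents)
    total_gold = sum(len(gold) for _, gold in documents)
    return {
        "mean_precision": mean_precision,
        "mean_recall": mean_recall,
        "pooled_precision": true_positives / total_predicted if total_predicted else 0.0,
        "pooled_recall": true_positives / total_gold if total_gold else 0.0,
    }


# Two toy documents: (predicted questions, annotated questions).
docs = [
    ({"a", "b"}, {"a"}),
    ({"c", "d", "e", "f"}, {"c", "x"}),
]
scores = aggregate(docs)
print(scores)
```

Under this reading, long documents with many spurious predictions pull the pooled precision down more than the per-document mean, which would be consistent with the pooled precision being the lower figure for both models.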

How PDFs are extracted

Harmony relies on two libraries to extract questionnaire items from PDFs:

Apache Tika - to get plain text
PDF Table Extractor Node.js library by Ronny Wang - to get tabular data

This repo contains the training data and training scripts.

The withheld test annotations are in this private repo: https://github.com/harmonydata/pdf-questionnaire-extraction-test-annotations

Preprocessing all the PDFs

Some raw PDFs have been provided.

Install all the requirements: pip install -r requirements.txt
Download Apache Tika and start the server from a command line: java -jar tika-server-standard-2.8.0.jar
In the notebooks folder, run python preprocess_pdf_to_text.py
In the notebooks folder, run python preprocess_pdf_to_tables.py

This will populate data/preprocessed_text and data/preprocessed_tables, which can be used to train the model.
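To illustrate the text-extraction step, here is a minimal sketch of sending a PDF to a running Tika server over HTTP using only the Python standard library. It assumes the server started above is listening at its default address, http://localhost:9998, and uses the server's /tika endpoint. The helper names build_tika_request and pdf_to_text are hypothetical, and preprocess_pdf_to_text.py may well use a different client library.

```python
import urllib.request

# Assumption: the Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def pdf_to_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Send the PDF at `pdf_path` to Tika and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example usage (requires a running Tika server and a real file path):
# text = pdf_to_text("path/to/questionnaire.pdf")
```

Separating request construction from sending keeps the network call in one place, so the request can be inspected without a server running.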

‎😃💁 Who worked on Harmony?

Harmony is a collaboration between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony is funded by Wellcome as part of the Wellcome Data Prize in Mental Health.

The core team at Harmony is made up of:

Dr Bettina Moltrecht, PhD (UCL)
Dr Eoin McElroy (University of Ulster)
Dr George Ploubidis (UCL)
Dr Mauricio Scopel Hoffmann (Universidade Federal de Santa Maria, Brazil)
Thomas Wood (Fast Data Science)

📜 License

📜 How do I cite Harmony?

McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffmann, M., Wood, T.A., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2023)
