This repository contains various methods for summarizing scientific articles. It is still at an experimental stage, so do not expect everything to work reliably.
Reads the PDF using pypdf and performs minimal sanitization:
- removes pdf annotations
- removes URLs and e-mails
- removes hyphen characters (-)
- ignores text after the References section
You can export the content to a .txt file using the export class method.
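The sanitization steps above can be sketched with a few regular expressions. This is a minimal illustration, not the repository's actual code; the function name `sanitize` is made up for the example:

```python
import re

def sanitize(text: str) -> str:
    """Illustrative sanitization mirroring the steps listed above."""
    # Ignore everything from the References section onward.
    text = re.split(r"\bReferences\b", text, maxsplit=1)[0]
    # Remove URLs and e-mail addresses.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"\S+@\S+\.\S+", "", text)
    # Remove hyphens (e.g. left over from line-break hyphenation).
    text = text.replace("-", "")
    return text

cleaned = sanitize(
    "Deep learn-ing works; see https://example.org or mail me@uni.edu\n"
    "References\n[1] Some paper"
)
```

After this pass, `cleaned` contains only the de-hyphenated body text, with the URL, e-mail, and reference list stripped out.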
Its base class is PDFToTextConverter. I implemented three options to summarize the text:
First, I tokenized the text and used frequency analysis to find the most important sentences in the document. Then I applied sshleifer/distilbart-cnn-12-6, the default model for summarization tasks in the transformers library, to those target sentences (after resizing the chunks to fit the model's input limit). Because many words were incorrectly merged together, I used wordninja, which probabilistically splits concatenated words, to make final corrections to the document. To speed the process up, I used concurrent execution wherever I could.
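The frequency-analysis step can be sketched as follows. This is a simplified illustration, not the repository's code: each sentence is scored by the summed corpus frequency of its words, and the top-scoring sentences are kept in their original order. The function name `top_sentences` is invented for the example:

```python
import re
from collections import Counter

def top_sentences(text: str, k: int = 2) -> list[str]:
    """Score sentences by summed word frequency; return the top k
    in their original document order (a simplified sketch)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Rank sentence indices by their total word-frequency score.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]

selected = top_sentences(
    "Transformers summarize text. Transformers are neural networks. "
    "The weather is nice.",
    k=2,
)
```

The selected sentences would then be chunked and passed to the summarization model, e.g. via `pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")` from the transformers library.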
I chose BigBird (google/bigbird-pegasus-large-arxiv), available via Hugging Face. Note: it runs very slowly.
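BigBird accepts much longer inputs than typical summarizers, but very long papers may still need to be split. A hedged sketch of one way to chunk a document before summarization (word count is used as a rough proxy for the model's token limit; a real implementation would count tokens with the model's own tokenizer):

```python
def chunk_text(text: str, max_words: int = 3000) -> list[str]:
    """Split text into chunks of at most max_words words each,
    breaking only on word boundaries. Word count is a crude stand-in
    for the tokenizer's token count."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text("word " * 10, max_words=4)
```

Each chunk could then be summarized independently, e.g. with `pipeline("summarization", model="google/bigbird-pegasus-large-arxiv")`, and the partial summaries concatenated.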
There is an existing implementation of text summarization in this repository, so I simply integrated that solution.