This repository contains various methods for summarizing scientific articles. It is still at an experimental stage, so do not expect everything to work reliably.
Reads the PDF using pypdf and performs minimal sanitization:
- removes pdf annotations
- removes URLs and e-mails
- removes hyphen characters (-)
- ignores text after the References section
You can export the content to a .txt file using the export class method.
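The sanitization steps above can be sketched with a few regular expressions. This is a minimal illustration, not the repository's actual code; the function name `sanitize` is made up for the example:

```python
import re

def sanitize(text: str) -> str:
    """Illustrative sanitization mirroring the steps listed above."""
    # Ignore everything from the References section onward.
    text = re.split(r"\bReferences\b", text, maxsplit=1)[0]
    # Remove URLs and e-mail addresses.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"\S+@\S+\.\S+", "", text)
    # Remove hyphens (e.g. left over from line-break hyphenation).
    text = text.replace("-", "")
    return text

cleaned = sanitize(
    "Deep learn-ing works; see https://example.org or mail me@uni.edu\n"
    "References\n[1] Some paper"
)
```

After this pass, `cleaned` contains only the de-hyphenated body text, with the URL, e-mail, and reference list stripped out.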
Its base class is PDFToTextConverter. I implemented three options to summarize the text:
First, I tokenized the text and used frequency analysis to find the most important sentences in the document. Then I applied sshleifer/distilbart-cnn-12-6, the default model for summarization tasks in the transformers library, to those target sentences (after resizing the chunks to fit the model's input limit). Because many words were incorrectly merged together, I used wordninja, which probabilistically splits concatenated words, to make final corrections to the document. To speed the process up, I used concurrent execution wherever I could.
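The frequency-analysis step can be sketched as follows. This is a simplified illustration, not the repository's code: each sentence is scored by the summed corpus frequency of its words, and the top-scoring sentences are kept in their original order. The function name `top_sentences` is invented for the example:

```python
import re
from collections import Counter

def top_sentences(text: str, k: int = 2) -> list[str]:
    """Score sentences by summed word frequency; return the top k
    in their original document order (a simplified sketch)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Rank sentence indices by their total word-frequency score.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]

selected = top_sentences(
    "Transformers summarize text. Transformers are neural networks. "
    "The weather is nice.",
    k=2,
)
```

The selected sentences would then be chunked and passed to the summarization model, e.g. via `pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")` from the transformers library.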
I chose BigBird (google/bigbird-pegasus-large-arxiv), available via Hugging Face. Note: it runs very slowly.
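BigBird accepts much longer inputs than typical summarizers, but very long papers may still need to be split. A hedged sketch of one way to chunk a document before summarization (word count is used as a rough proxy for the model's token limit; a real implementation would count tokens with the model's own tokenizer):

```python
def chunk_text(text: str, max_words: int = 3000) -> list[str]:
    """Split text into chunks of at most max_words words each,
    breaking only on word boundaries. Word count is a crude stand-in
    for the tokenizer's token count."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text("word " * 10, max_words=4)
```

Each chunk could then be summarized independently, e.g. with `pipeline("summarization", model="google/bigbird-pegasus-large-arxiv")`, and the partial summaries concatenated.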
There is an existing implementation of text summarization in this repository, so I simply integrated that solution.