readme and sanitization
Mirtia committed Apr 19, 2023
1 parent f2344d0 commit 9038163
Showing 4 changed files with 46 additions and 12 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -2,10 +2,30 @@

This repository contains various methods to perform summarization of scientific articles. It is still at an experimental stage, so don't expect it to work perfectly.

## PDFToTextConverter

Reads the PDF using [**pypdf**](https://github.com/py-pdf/pypdf) and performs minimal sanitization:

- removes **PDF** annotations
- removes **URLs** and **e-mails**
- removes the **-** (hyphen) character
- ignores text after the **References** section

You can export the content to a .txt file using the **export** class method.
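
For reference, a minimal usage sketch (the file paths are just examples, following the `data/input` / `data/output` layout used in `test.sh`; the class and its methods are the ones described above):

```python
from converter import PDFToTextConverter

# Convert a PDF to sanitized plain text and write it to a .txt file.
pdf = PDFToTextConverter("data/input/paper.pdf")  # hypothetical input file
pdf.export("data/output/paper.txt")
```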

## PDFSummarizer

Its base class is **PDFToTextConverter**. I explored three options for summarizing text:

- ### NLTK + sshleifer/distilbart-cnn-12-6

First, I tokenized the text and used **frequency analysis** to find the most important sentences in the document. Then, I applied [**sshleifer/distilbart-cnn-12-6**](https://huggingface.co/sshleifer/distilbart-cnn-12-6), the default model for summarization tasks in the **transformers** library, to the target sentences (after resizing the chunks to fit the model). Because many words were incorrectly merged together, I used [**wordninja**](https://github.com/keredson/wordninja), which probabilistically splits concatenated words using **NLP**, to make final corrections in the document. To speed up the process, I used **concurrent** features as much as I could.
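
A rough sketch of this pipeline, not the exact implementation in the `nltk_summarizer` module: the chunk resizing and concurrency are simplified to a naive truncation, and `top_n` and the generation lengths are illustrative parameters.

```python
import nltk
import wordninja
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)


def summarize(text: str, top_n: int = 20) -> str:
    # Frequency analysis over content words.
    stop = set(nltk.corpus.stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = FreqDist(words)

    # Rank sentences by the frequency of the words they contain.
    sentences = sent_tokenize(text)
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in word_tokenize(s)),
                    reverse=True)
    top = set(ranked[:top_n])
    selected = " ".join(s for s in sentences if s in top)

    # Summarize the selected sentences with distilbart
    # (naive truncation stands in for proper chunk resizing).
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    summary = summarizer(selected[:3000], max_length=200, min_length=50,
                         do_sample=False)[0]["summary_text"]

    # Split words that the PDF extraction merged together.
    return " ".join(" ".join(wordninja.split(w)) for w in summary.split())
```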

- ### Big Bird Pegasus

I chose BigBird, [**google/bigbird-pegasus-large-arxiv**](https://huggingface.co/google/bigbird-pegasus-large-arxiv), available via Hugging Face.
Note: it runs very slowly.
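
A minimal sketch of loading this checkpoint with **transformers**; the generation parameters below are illustrative and not taken from this repository.

```python
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
model = BigBirdPegasusForConditionalGeneration.from_pretrained(
    "google/bigbird-pegasus-large-arxiv")


def summarize(text: str) -> str:
    # BigBird's sparse attention handles much longer inputs (up to 4096 tokens)
    # than a standard full-attention summarizer.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```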

- ### sumy

There is an existing implementation of text summarization in this [**repository**](https://github.com/miso-belica/sumy), so I simply integrated their solution.
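
For reference, a small example along the lines of sumy's documented usage; the LSA summarizer and the sentence count are just one possible configuration, not necessarily the one used here.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer


def summarize(text: str, sentence_count: int = 10) -> str:
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    # Pick the most informative sentences according to LSA.
    return " ".join(str(sentence)
                    for sentence in summarizer(parser.document, sentence_count))
```
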
16 changes: 13 additions & 3 deletions src/converter.py
@@ -11,6 +11,8 @@ class PDFToTextConverter:
filename (str): The path to the .pdf file.
text (str): The content of the .pdf file.
"""
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"
url_pattern = r"http[s]?:https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

def __init__(self, filename: str) -> None:
self.filename = self._validate_file(filename)
@@ -30,9 +32,17 @@ def _read_file(self, filename: str) -> str:
reader = pypdf.PdfReader(f)
writer = pypdf.PdfWriter(clone_from=reader)
writer.remove_annotations(subtypes=None)

return " ".join(page.extract_text().replace("-", "")
for page in writer.pages)
return self._remove_noise(" ".join(page.extract_text()
for page in writer.pages))

def _remove_noise(self, text: str) -> str:
# Cut off everything after the References/Bibliography section, if present.
index = text.lower().rfind("references")
if index == -1:
index = text.lower().rfind("bibliography")
if index != -1:
text = text[:index]
# Strip e-mails and URLs, then drop hyphens introduced by line breaks.
text = re.sub(self.url_pattern, "", re.sub(self.email_pattern, "", text))
return text.replace("-", "")

def export(self, filename: str) -> None:
with open(filename, mode="w", encoding="utf-8") as f:
9 changes: 6 additions & 3 deletions src/main.py
@@ -1,5 +1,6 @@
import argparse
import nltk_summarizer
import converter
import transformers_summarizer
import sumy_summarizer

@@ -25,9 +26,11 @@ def main():
args = parser.parse_args()

if args.mode == "nltk":
summarizer_text = converter.PDFToTextConverter(args.file)
summarizer_text.export(args.output)
# summarizer_NTLK = nltk_summarizer.PDFSummarizer(args.file)
# summarizer_NTLK.summarize()
# summarizer_NTLK.export(args.output)
elif args.mode == "pegasus":
summarizer_pegasus = transformers_summarizer.PDFSummarizer(args.file)
summarizer_pegasus.summarize()
5 changes: 3 additions & 2 deletions test.sh
@@ -5,6 +5,7 @@ mkdir -p data/output
echo -e "Testing summarization methods ...\n"
echo -e "\nSummarization nltk:\n"
time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_nltk.txt -m nltk
# time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_pegasus.txt -m pegasus
echo -e "\nSummarization sumy:\n"
time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_summy.txt -m sumy
# echo -e "\nSummarization pegasus:\n"
# time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_pegasus.txt -m pegasus
