readme and sanitization
Mirtia committed Apr 19, 2023
1 parent f2344d0 commit 9038163
Showing 4 changed files with 46 additions and 12 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -2,10 +2,30 @@

This repository contains various methods to perform summarization of scientific articles. It is still at an experimental stage, so don't expect it to work perfectly.

## PDFToTextConverter

Reads the PDF using [**pypdf**](https://github.com/py-pdf/pypdf) and performs minimal sanitization:

- removes **PDF** annotations
- removes **URLs** and **e-mails**
- removes the **-** (hyphen) character
- ignores text after the **References** section

You can export the content to a .txt file using the **export** class method.
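
For reference, a minimal usage sketch (the file paths are just examples, following the `data/input` / `data/output` layout used in `test.sh`; the class and its methods are the ones described above):

```python
from converter import PDFToTextConverter

# Convert a PDF to sanitized plain text and write it to a .txt file.
pdf = PDFToTextConverter("data/input/paper.pdf")  # hypothetical input file
pdf.export("data/output/paper.txt")
```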

## PDFSummarizer

Its base class is **PDFToTextConverter**. I explored three options for summarizing text:

- ### NLTK + sshleifer/distilbart-cnn-12-6

First, I tokenized the text and used **frequency analysis** to find the most important sentences in the document. Then, I applied [**sshleifer/distilbart-cnn-12-6**](https://huggingface.co/sshleifer/distilbart-cnn-12-6), the default model for summarization tasks in the **transformers** library, to the target sentences (after resizing the chunks to fit the model). Because many words were incorrectly merged together, I used [**wordninja**](https://github.com/keredson/wordninja), which probabilistically splits concatenated words using **NLP**, to make final corrections in the document. To speed up the process, I used **concurrent** features as much as I could.
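
A rough sketch of this pipeline, not the exact implementation in the `nltk_summarizer` module: the chunk resizing and concurrency are simplified to a naive truncation, and `top_n` and the generation lengths are illustrative parameters.

```python
import nltk
import wordninja
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)


def summarize(text: str, top_n: int = 20) -> str:
    # Frequency analysis over content words.
    stop = set(nltk.corpus.stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = FreqDist(words)

    # Rank sentences by the frequency of the words they contain.
    sentences = sent_tokenize(text)
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in word_tokenize(s)),
                    reverse=True)
    top = set(ranked[:top_n])
    selected = " ".join(s for s in sentences if s in top)

    # Summarize the selected sentences with distilbart
    # (naive truncation stands in for proper chunk resizing).
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    summary = summarizer(selected[:3000], max_length=200, min_length=50,
                         do_sample=False)[0]["summary_text"]

    # Split words that the PDF extraction merged together.
    return " ".join(" ".join(wordninja.split(w)) for w in summary.split())
```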

- ### Big Bird Pegasus

I chose BigBird, [**google/bigbird-pegasus-large-arxiv**](https://huggingface.co/google/bigbird-pegasus-large-arxiv), available via Hugging Face.
Note: it runs very slowly.
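
A minimal sketch of loading this checkpoint with **transformers**; the generation parameters below are illustrative and not taken from this repository.

```python
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
model = BigBirdPegasusForConditionalGeneration.from_pretrained(
    "google/bigbird-pegasus-large-arxiv")


def summarize(text: str) -> str:
    # BigBird's sparse attention handles much longer inputs (up to 4096 tokens)
    # than a standard full-attention summarizer.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```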

- ### sumy

There is an existing implementation of text summarization in this [**repository**](https://github.com/miso-belica/sumy), so I simply integrated their solution.
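
For reference, a small example along the lines of sumy's documented usage; the LSA summarizer and the sentence count are just one possible configuration, not necessarily the one used here.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer


def summarize(text: str, sentence_count: int = 10) -> str:
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    # Pick the most informative sentences according to LSA.
    return " ".join(str(sentence)
                    for sentence in summarizer(parser.document, sentence_count))
```
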
16 changes: 13 additions & 3 deletions src/converter.py
@@ -11,6 +11,8 @@ class PDFToTextConverter:
filename (str): The path to the .pdf file.
text (str): The content of the .pdf file.
"""
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"
url_pattern = r"http[s]?:https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

def __init__(self, filename: str) -> None:
self.filename = self._validate_file(filename)
@@ -30,9 +32,17 @@ def _read_file(self, filename: str) -> str:
reader = pypdf.PdfReader(f)
writer = pypdf.PdfWriter(clone_from=reader)
writer.remove_annotations(subtypes=None)

return " ".join(page.extract_text().replace("-", "")
for page in writer.pages)
return self._remove_noise(" ".join(page.extract_text()
for page in writer.pages))

def _remove_noise(self, text: str) -> str:
# Cut off everything after the References/Bibliography section, if present.
index = text.lower().rfind("references")
if index == -1:
index = text.lower().rfind("bibliography")
if index != -1:
text = text[:index]
# Strip e-mails and URLs, then drop hyphens introduced by line breaks.
text = re.sub(self.url_pattern, "", re.sub(self.email_pattern, "", text))
return text.replace("-", "")

def export(self, filename: str) -> None:
with open(filename, mode="w", encoding="utf-8") as f:
9 changes: 6 additions & 3 deletions src/main.py
@@ -1,5 +1,6 @@
import argparse
import nltk_summarizer
import converter
import transformers_summarizer
import sumy_summarizer

@@ -25,9 +26,11 @@ def main():
args = parser.parse_args()

if args.mode == "nltk":
summarizer_text = converter.PDFToTextConverter(args.file)
summarizer_text.export(args.output)
# summarizer_NTLK = nltk_summarizer.PDFSummarizer(args.file)
# summarizer_NTLK.summarize()
# summarizer_NTLK.export(args.output)
elif args.mode == "pegasus":
summarizer_pegasus = transformers_summarizer.PDFSummarizer(args.file)
summarizer_pegasus.summarize()
5 changes: 3 additions & 2 deletions test.sh
@@ -5,6 +5,7 @@ mkdir -p data/output
echo -e "Testing summarization methods ...\n"
echo -e "\nSummarization nltk:\n"
time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_nltk.txt -m nltk
# time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_pegasus.txt -m pegasus
echo -e "\nSummarization sumy:\n"
time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_summy.txt -m sumy
# echo -e "\nSummarization pegasus:\n"
# time python src/main.py -f data/input/${title}.pdf -o data/output/${title}_pegasus.txt -m pegasus
