linkrot scans PDFs for links written in plaintext, checks whether each one is active or returns an error code, and generates a report of its findings. It also extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.
Check out our sister project, Rotting Research, for a web app implementation of this project.
- Extract references and metadata from a given PDF.
- Detect PDF, URL, arXiv and DOI references.
- Archive valid links using the Internet Archive's Wayback Machine (using the -a flag).
- Check for valid SSL certificates.
- Find broken hyperlinks (using the -c flag).
- Output as text or JSON (using the -j flag).
- Extract the PDF text (using the --text flag).
- Use as command-line tool or Python package.
- Work with local and online PDFs.
Grab a copy of the code with pip:
pip install linkrot
linkrot can be used to extract info from a PDF in two ways:
- Command line/Terminal tool
linkrot
- Python library
import linkrot
linkrot [pdf-file-or-url]
Run linkrot -h to see the help output:
linkrot -h
usage:
linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
Extract metadata and references from a PDF, and optionally download all referenced PDFs.
pdf (Filename or URL of a PDF file)
-h, --help (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)
-c, --check-links (Check for broken links)
-j, --json (Output infos as JSON (instead of plain text))
-v, --verbose (Print all references (instead of only PDFs))
-t, --text (Only extract text (no metadata or references))
-a, --archive (Archive active links)
-o OUTPUT_FILE, --output-file OUTPUT_FILE (Output to specified file instead of console)
--version (Show program's version number and exit)
For testing purposes, you can find PDF samples in this [shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).
linkrot https://example.com/example.pdf -t
linkrot https://example.com/example.pdf -t -o pdf-text.txt
linkrot https://example.com/example.pdf -c
Import the library:
import linkrot
Create an instance of the linkrot class like so:
pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class
The following functions can now be used to extract specific data from the PDF:
Arguments: None
Usage:
metadata = pdf.get_metadata() #pdf is the instance of the linkrot class
Return type: Dictionary <class 'dict'>
Information Provided: All metadata associated with the PDF, including hidden metadata, such as creation date, creator and title.
Arguments: None
Usage:
text = pdf.get_text() #pdf is the instance of the linkrot class
Return type: String <class 'str'>
Information Provided: The entire content of the PDF in string form.
Arguments:
  reftype: The type of reference that is needed
    values: 'pdf', 'url', 'doi', 'arxiv'.
    default: Provides all reference types.
  sort: Whether references should be sorted or not
    values: True or False.
    default: Is not sorted.
Usage:
references_list = pdf.get_references() #pdf is the instance of the linkrot class
Return type: Set <class 'set'> of <linkrot.backends.Reference object>
A linkrot.backends.Reference object has three member variables:
- ref: the actual URL, PDF, DOI or arXiv reference
- reftype: type of reference
- page: page on which it was referenced
Information Provided: All references with their corresponding type and page number.
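The set returned above is unordered, so it often helps to filter and order it before printing. The following is a minimal sketch, not linkrot's own code: the `Reference` dataclass is a hypothetical stand-in that models only the three documented attributes (`ref`, `reftype`, `page`), and `select_references` is an illustrative helper name.

```python
from dataclasses import dataclass


# Hypothetical stand-in for linkrot.backends.Reference, so the sketch runs
# without a PDF; only the three documented attributes are modelled.
@dataclass(frozen=True)
class Reference:
    ref: str
    reftype: str
    page: int


def select_references(references, reftype=None):
    """Return references of one type (or all types), ordered by page, then ref."""
    chosen = [r for r in references if reftype is None or r.reftype == reftype]
    return sorted(chosen, key=lambda r: (r.page, r.ref))


refs = {
    Reference("https://example.com/paper.pdf", "pdf", 3),
    Reference("https://example.com", "url", 1),
    Reference("10.1000/182", "doi", 2),
}

# Print every reference in page order.
for r in select_references(refs):
    print(f"page {r.page}: [{r.reftype}] {r.ref}")
```

With a real PDF, the same helper could be applied to the result of pdf.get_references(), assuming the returned objects expose the attributes listed above.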
Arguments:
  reftype: The type of reference that is needed
    values: 'pdf', 'url', 'doi', 'arxiv'.
    default: Provides all reference types.
  sort: Whether references should be sorted or not
    values: True or False.
    default: Is not sorted.
Usage:
references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class
Return type: Dictionary <class 'dict'> with keys 'pdf', 'url', 'doi', 'arxiv', each mapping to a list <class 'list'> of refs of that type.
Information Provided: All references in their corresponding type list.
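To illustrate the shape of that dictionary, here is a small sketch that groups reference objects by type and serializes the result as JSON. It is an illustration only: `Reference` is a hypothetical namedtuple stand-in, `group_by_type` is an invented helper, and pre-filling all four keys with empty lists is an assumption rather than confirmed library behaviour.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for linkrot.backends.Reference (assumption for this sketch).
Reference = namedtuple("Reference", ["ref", "reftype", "page"])


def group_by_type(references):
    """Group reference strings into one sorted list per reference type."""
    # Assumption: every documented key is present, even when its list is empty.
    grouped = {"pdf": [], "url": [], "doi": [], "arxiv": []}
    for reference in references:
        grouped.setdefault(reference.reftype, []).append(reference.ref)
    for refs_of_type in grouped.values():
        refs_of_type.sort()
    return grouped


sample = {
    Reference("https://example.com/b.pdf", "pdf", 2),
    Reference("https://example.com/a.pdf", "pdf", 1),
    Reference("https://example.com", "url", 1),
}

grouped = group_by_type(sample)
as_json = json.dumps(grouped, indent=2)
print(as_json)
```

A dictionary of plain lists like this one is directly JSON-serializable, which is convenient when the result needs to be stored or passed to another tool.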
Arguments:
target_dir: The path of the directory to which the reference PDFs should be downloaded
Usage:
pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class
Return type: None
Information Provided: Downloads all the reference PDFs to the specified directory.
Import:
from linkrot.downloader import sanitize_url, get_status_code, check_refs
Arguments:
url: The url to be sanitized.
Usage:
new_url = sanitize_url(old_url)
Return type: String <class 'str'>
Information Provided: The URL, prefixed with 'https://' if it was not already prefixed, and guaranteed to be in UTF-8 format.
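The behaviour described for sanitize_url can be approximated in a few lines. This is a sketch of the description above, not the library's implementation: the name `sanitize_url_sketch`, the `scheme` parameter, and the choice to silently drop characters that cannot be encoded are all assumptions.

```python
def sanitize_url_sketch(url, scheme="https://"):
    """Approximate the documented behaviour: add a scheme, keep only valid UTF-8."""
    # Add the prefix only when the URL does not already carry one.
    if not url.startswith(("http://", "https://")):
        url = scheme + url
    # Round-trip through UTF-8, dropping anything that cannot be encoded
    # (assumption: invalid characters are ignored rather than reported).
    return url.encode("utf-8", "ignore").decode("utf-8")


print(sanitize_url_sketch("example.com/paper.pdf"))
print(sanitize_url_sketch("http://example.com"))
```

URLs that already start with a scheme pass through unchanged, so calling the function twice gives the same result as calling it once.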
Arguments:
url: The url to be checked for its status.
Usage:
status_code = get_status_code(url)
Return type: String <class 'str'>
Information Provided: The status code of the URL, which shows whether the URL is active or broken.
Arguments:
refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading
Usage:
check_refs(pdf.get_references()) #pdf is the instance of the linkrot class
Return type: None
Information Provided: Prints the references with their status codes and a summary of all broken and active links in the terminal.
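The max_threads argument suggests that links are checked concurrently. The sketch below shows one common way to do that with a thread pool. It is not check_refs itself: `check_all`, `summarize` and `fake_status` are hypothetical names, the status function is injected so the example runs without network access, and treating only "200" as active is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, get_status, max_threads=5):
    """Run get_status over all URLs concurrently and return {url: status}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so statuses line up with urls.
        statuses = list(pool.map(get_status, urls))
    return dict(zip(urls, statuses))


def summarize(results):
    """Split results into active and broken URLs (assumption: only "200" is active)."""
    active = sorted(url for url, status in results.items() if status == "200")
    broken = sorted(url for url, status in results.items() if status != "200")
    return active, broken


def fake_status(url):
    """Offline stand-in for a real status check such as get_status_code."""
    return "404" if "missing" in url else "200"


results = check_all(
    ["https://example.com", "https://example.com/missing"], fake_status
)
active, broken = summarize(results)
print(f"{len(active)} active, {len(broken)} broken")
```

Passing the status function as a parameter keeps the concurrency logic testable offline; in real use, get_status_code could be supplied in place of `fake_status`.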
Import:
from linkrot.extractor import extract_urls, extract_doi, extract_arxiv
Get the PDF text:
text = pdf.get_text() #pdf is the instance of the linkrot class
Arguments:
text: String of text to extract urls from
Usage:
urls = extract_urls(text)
Return type: Set <class 'set'> of URLs
Information Provided: All URLs in the text
Arguments:
text: String of text to extract arXivs from
Usage:
arxiv = extract_arxiv(text)
Return type: Set <class 'set'> of arXiv identifiers
Information Provided: All arXiv identifiers in the text
Arguments:
text: String of text to extract DOIs from
Usage:
doi = extract_doi(text)
Return type: Set <class 'set'> of DOIs
Information Provided: All DOIs in the text
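To give a feel for what these three extractors do, the following sketch pulls URLs, DOIs and arXiv identifiers out of a string with regular expressions. The patterns are deliberately simplified assumptions and are not the ones linkrot uses; the `_sketch` function names are hypothetical.

```python
import re

# Simplified patterns (assumptions for illustration, not linkrot's own).
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")
ARXIV_PATTERN = re.compile(r"arXiv:\s?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def _strip_trailing_punctuation(match):
    """Drop sentence punctuation that the greedy patterns pick up at the end."""
    return match.rstrip(".,;:")


def extract_urls_sketch(text):
    return {_strip_trailing_punctuation(m) for m in URL_PATTERN.findall(text)}


def extract_doi_sketch(text):
    return {_strip_trailing_punctuation(m) for m in DOI_PATTERN.findall(text)}


def extract_arxiv_sketch(text):
    # findall returns only the captured identifier, without the "arXiv:" prefix.
    return set(ARXIV_PATTERN.findall(text))


text = "See https://example.com/paper.pdf, doi:10.1000/182 and arXiv:1706.03762v5."
print(extract_urls_sketch(text))
print(extract_doi_sketch(text))
print(extract_arxiv_sketch(text))
```

Each function returns a set, mirroring the documented return types, so duplicates in the text collapse to a single entry.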
To view our code of conduct please visit our Code of Conduct page.
This program is licensed under the GPLv3 License.