DocScraper | Web Documentation Scraper and PDF Generator

Introduction

This Elixir-based CLI application scrapes an entire technical documentation website and exports the content into a single PDF file. It's designed to help developers and technical writers collect and compile documentation from web-based sources into a portable format for offline use or archival purposes.

Features

  • Crawls and scrapes an entire website starting from a given URL (see the sketch after this list)
  • Extracts text content from web pages
  • Generates a single PDF file containing all scraped content
  • Provides a simple command-line interface
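
The crawl-and-extract loop can be pictured as follows. This is a minimal sketch, assuming HTTPoison for HTTP requests and Floki for HTML parsing; the repository's actual module and function names may differ.

    # Minimal crawl sketch: breadth-first traversal that stays on the
    # starting host and collects the text of each page.
    defmodule CrawlSketch do
      def crawl(start_url) do
        host = URI.parse(start_url).host
        do_crawl([start_url], MapSet.new(), host, [])
      end

      defp do_crawl([], _visited, _host, pages), do: Enum.reverse(pages)

      defp do_crawl([url | rest], visited, host, pages) do
        if MapSet.member?(visited, url) do
          do_crawl(rest, visited, host, pages)
        else
          {:ok, %HTTPoison.Response{body: body}} = HTTPoison.get(url)
          {:ok, doc} = Floki.parse_document(body)

          # Resolve relative links and keep only same-host URLs.
          links =
            doc
            |> Floki.find("a")
            |> Floki.attribute("href")
            |> Enum.map(&to_string(URI.merge(url, &1)))
            |> Enum.filter(&(URI.parse(&1).host == host))

          do_crawl(rest ++ links, MapSet.put(visited, url), host, [{url, Floki.text(doc)} | pages])
        end
      end
    end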

Prerequisites

Before you begin, ensure you have the following installed:

  • Elixir (version 1.11 or later)
  • Erlang (version 22 or later)
  • wkhtmltopdf (for PDF generation)
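
wkhtmltopdf is invoked as an external program, so it must be on your PATH. A minimal sketch of how an Elixir app might shell out to it (this helper is illustrative, not the repository's actual code):

    defmodule PdfSketch do
      # Render an HTML file to PDF, failing early if wkhtmltopdf
      # is not on the PATH.
      def html_to_pdf(html_path, pdf_path) do
        System.find_executable("wkhtmltopdf") ||
          raise "wkhtmltopdf not found in PATH"

        {_output, 0} = System.cmd("wkhtmltopdf", [html_path, pdf_path], stderr_to_stdout: true)
        :ok
      end
    end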

Setup

  1. Clone this repository:

     git clone https://github.com/drekunia/doc_scraper.git
     cd doc_scraper

  2. Install dependencies:

     mix deps.get

  3. Compile the project:

     mix compile

  4. Build the escript:

     mix escript.build
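
Step 4 works because the project declares an escript entry point in mix.exs. A sketch of that configuration follows; the main module name is an assumption:

    # mix.exs (sketch)
    def project do
      [
        app: :doc_scraper,
        version: "0.1.0",
        elixir: "~> 1.11",
        escript: [main_module: DocScraper.CLI],  # hypothetical module name
        deps: deps()
      ]
    end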

Usage

Run the scraper using the following command:

    ./doc_scraper [URL] [OUTPUT_FILE]

For example:

    ./doc_scraper https://docs.example.com/ example_docs.pdf

If no arguments are provided, the scraper defaults to "https://example.com" and writes the output to "output.pdf" in the current directory.
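
The default-argument behavior could be implemented as in this sketch of the escript entry point (module and function names are assumptions):

    defmodule DocScraper.CLISketch do
      def main(args) do
        # Fall back to the documented defaults when arguments are omitted.
        {url, output} =
          case args do
            [url, output | _] -> {url, output}
            [url] -> {url, "output.pdf"}
            [] -> {"https://example.com", "output.pdf"}
          end

        IO.puts("Scraping #{url} into #{output}")
        # DocScraper.run(url, output)  # hypothetical call into the scraper
      end
    end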

Potential Improvements

  1. Implement rate limiting to be more respectful to the sites being scraped (a sketch follows this list).
  2. Add support for robots.txt to ensure ethical scraping.
  3. Improve error handling and logging.
  4. Add support for JavaScript-rendered content using a headless browser.
  5. Implement a progress bar or more detailed progress reporting.
  6. Add options for selective scraping (e.g., specific sections of a website).
  7. Improve PDF formatting and add a table of contents.
  8. Add support for authentication to scrape protected content.
  9. Fetch pages concurrently (Elixir processes rather than OS threads) for faster scraping of large websites.
  10. Add unit and integration tests.
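
For item 1, rate limiting can be as simple as sleeping between requests. A sketch, assuming HTTPoison and an arbitrary 500 ms delay:

    defmodule PoliteFetch do
      # Pause before each request so the target site is not hammered.
      def get(url, delay_ms \\ 500) do
        Process.sleep(delay_ms)
        HTTPoison.get(url)
      end
    end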

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

Please ensure you have permission to scrape and reproduce content from websites. This tool is intended for personal use or use with websites you own or have explicit permission to scrape. Always review and comply with a website's terms of service and robots.txt file.

Troubleshooting

If you encounter issues with PDF generation:

  1. Ensure wkhtmltopdf is properly installed and accessible in your system's PATH.
  2. Check that you have write permissions in the directory where you're trying to save the PDF.
  3. If you're still having trouble, try generating an HTML file first, then converting it to PDF manually with wkhtmltopdf (see the example below).
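
For step 3, the manual conversion is a single command (file names here are placeholders):

    wkhtmltopdf scraped_docs.html scraped_docs.pdf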

Contact

If you have any questions or feedback, please open an issue in this repository.
