This Elixir-based CLI application scrapes an entire technical documentation website and exports the content into a single PDF file. It's designed to help developers and technical writers collect and compile documentation from web-based sources into a portable format for offline use or archival purposes.
- Crawls and scrapes an entire website starting from a given URL
- Extracts text content from web pages
- Generates a single PDF file containing all scraped content
- Provides a simple command-line interface
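At a high level, the crawl described above is a breadth-first traversal over links, visiting each URL once. The sketch below illustrates that idea only; the module name and the caller-supplied `fetch_links` function are assumptions, not the project's actual API:

```elixir
defmodule CrawlSketch do
  # Breadth-first crawl: visit each unseen URL once, in discovery order.
  # `fetch_links` is a caller-supplied function that returns the links
  # found on a page (in the real tool this would do an HTTP request).
  def crawl(start_url, fetch_links) do
    do_crawl([start_url], MapSet.new(), fetch_links, [])
  end

  defp do_crawl([], _seen, _fetch, acc), do: Enum.reverse(acc)

  defp do_crawl([url | rest], seen, fetch, acc) do
    if MapSet.member?(seen, url) do
      # Already visited: skip to avoid loops between pages.
      do_crawl(rest, seen, fetch, acc)
    else
      links = fetch.(url)
      do_crawl(rest ++ links, MapSet.put(seen, url), fetch, [url | acc])
    end
  end
end
```

Keeping a `MapSet` of visited URLs is what prevents infinite loops when documentation pages link back to each other.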
Before you begin, ensure you have the following installed:
- Elixir (version 1.11 or later)
- Erlang (version 22 or later)
- wkhtmltopdf (for PDF generation)
- Clone this repository:
git clone https://github.com/drekunia/doc_scraper.git
cd doc_scraper
- Install dependencies:
mix deps.get
- Compile the project:
mix compile
- Build the escript:
mix escript.build
Run the scraper using the following command:
./doc_scraper [URL] [OUTPUT_FILE]
For example:
./doc_scraper https://docs.example.com/ example_docs.pdf
If no arguments are provided, it will default to scraping "https://example.com" and output to "output.pdf" in the current directory.
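The defaulting behavior described above can be expressed with simple pattern matching on the argument list in the escript's `main/1` entry point. This is an illustrative sketch, not the project's actual code; `DocScraper.CLI` and `parse_args/1` are assumed names:

```elixir
defmodule DocScraper.CLI do
  @default_url "https://example.com"
  @default_output "output.pdf"

  # Escript entry point: resolve arguments, then hand off to the scraper.
  def main(args) do
    {url, output} = parse_args(args)
    IO.puts("Scraping #{url} -> #{output}")
    # The actual scraping and PDF generation would start here.
  end

  # Pattern-match the argument list, filling in defaults as needed.
  def parse_args([url, output]), do: {url, output}
  def parse_args([url]), do: {url, @default_output}
  def parse_args([]), do: {@default_url, @default_output}
end
```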
- Implement rate limiting to be more respectful to the scraped websites.
- Add support for robots.txt to ensure ethical scraping.
- Improve error handling and logging.
- Add support for JavaScript-rendered content using a headless browser.
- Implement a progress bar or more detailed progress reporting.
- Add options for selective scraping (e.g., specific sections of a website).
- Improve PDF formatting and add a table of contents.
- Add support for authentication to scrape protected content.
- Implement concurrent requests for faster scraping of large websites.
- Add unit and integration tests.
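Some of the items above are straightforward to prototype. For example, rate limiting can start as a fixed pause between requests; the module name, delay value, and caller-supplied fetch function below are assumptions for illustration:

```elixir
defmodule RateLimiter do
  @delay_ms 250  # assumed per-request delay; tune for the target site

  # Fetch each URL in sequence, sleeping between requests so the
  # scraper does not hammer the server.
  def fetch_all(urls, fetch_fun) do
    Enum.map(urls, fn url ->
      result = fetch_fun.(url)
      Process.sleep(@delay_ms)
      result
    end)
  end
end
```

A production version might instead use a token-bucket process or `Task.async_stream/3` with a bounded `max_concurrency`, but a fixed delay is enough to be noticeably gentler on small sites.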
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Please ensure you have permission to scrape and reproduce content from websites. This tool is intended for personal use or for use with websites you own or have explicit permission to scrape. Always review and comply with a website's terms of service and robots.txt file.
If you encounter issues with PDF generation:
- Ensure wkhtmltopdf is properly installed and accessible in your system's PATH.
- Check that you have write permissions in the directory where you're trying to save the PDF.
- If you're still having trouble, try generating an HTML file first, then manually convert it to PDF using wkhtmltopdf.
If you have any questions or feedback, please open an issue in this repository.