Extract markdown and visuals from PDFs, URLs, slides, videos, and more, ready for multimodal LLMs. ⚡

thepi.pe is an API that can scrape multimodal data via `thepipe.scrape` from a wide range of sources. It is built to interface with LLMs such as GPT-4o, and works out-of-the-box with any LLM or vector database. thepi.pe can be used right away with a hosted GPU cloud, or it can be self-hosted.

Features 🌟

  • Extract markdown, tables, and images from any document or webpage
  • Extract complex structured data from any document or webpage
  • Works out-of-the-box with all LLMs and RAG frameworks
  • AI-native filetype detection, layout analysis, and structured data extraction
  • Multimodal scraping for video, audio, and image sources

Get started in 5 minutes 🚀

thepi.pe can read a wide range of filetypes and web sources, so it requires a few dependencies. It also requires a powerful machine (16 GB+ VRAM for optimal PDF and video response times) for its AI extraction features. For these reasons, we host a REST API that works out-of-the-box at thepi.pe.

Hosted API (Python)

⚠️ Warning: The docs and functionality in this repo differ significantly from the current working version on pip. To use a working version, please refer to the API docs rather than these docs.

Install the Python client and set your API keys (the `setx` commands are for Windows; use `export` on macOS/Linux):

```bash
pip install thepipe-api
setx THEPIPE_API_KEY your_api_key
setx OPENAI_API_KEY your_openai_key
```

```python
import thepipe
from thepipe.scraper import scrape_file
from openai import OpenAI

# scrape markdown, tables, visuals
chunks = scrape_file(filepath="paper.pdf")

# call LLM with clean, comprehensive data
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=thepipe.chunks_to_messages(chunks),
)
```

Local Installation (Python)

For a local installation, you can use the following command:

```bash
pip install thepipe-api[local]
```

And append `local=True` to your API calls:

```python
from thepipe.scraper import scrape_url

chunks = scrape_url(url="https://example.com", local=True)
```

You can also use The Pipe from the command line (quoting the regex prevents shell expansion):

```bash
thepipe path/to/folder --include_regex ".*\.tsx"
```

Supported File Types 📚

| Source | Input types | Multimodal | Notes |
|--------|-------------|------------|-------|
| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI layout analysis |
| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. `ai_extraction` available for AI layout analysis |
| Word Document | `.docx` | ✔️ | Extracts text, tables, and images |
| PowerPoint | `.pptx` | ✔️ | Extracts text and images from slides |
| Video | `.mp4`, `.mov`, `.wmv` | ✔️ | Uses Whisper for transcription and extracts frames |
| Audio | `.mp3`, `.wav` | ✔️ | Uses Whisper for transcription |
| Jupyter Notebook | `.ipynb` | ✔️ | Extracts markdown, code, outputs, and images |
| Spreadsheet | `.csv`, `.xls`, `.xlsx` | | Converts each row to JSON format, including the row index for each |
| Plaintext | `.txt`, `.md`, `.rtf`, etc. | | Simple text extraction |
| Image | `.jpg`, `.jpeg`, `.png` | ✔️ | Uses pytesseract for OCR in text-only mode |
| ZIP File | `.zip` | ✔️ | Extracts and processes contained files |
| Directory | any `path/to/folder` | ✔️ | Recursively processes all files in the directory |
| YouTube Video | YouTube video URLs starting with `https://youtube.com` or `https://www.youtube.com` | ✔️ | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your pytube installation to send a valid user agent header (see this issue) |
| Tweet | URLs starting with `https://twitter.com` or `https://x.com` | ✔️ | Uses unofficial API; may break unexpectedly |
| GitHub Repository | GitHub repo URLs starting with `https://github.com` or `https://www.github.com` | ✔️ | Requires the `GITHUB_TOKEN` environment variable |
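
The webpage and PDF rows above both mention `ai_extraction`. A minimal sketch of passing that flag (the keyword name comes from the table above; whether your installed version accepts it exactly as shown is an assumption):

```python
from thepipe.scraper import scrape_file, scrape_url

# AI layout analysis on a PDF (ai_extraction per the table above)
pdf_chunks = scrape_file(filepath="paper.pdf", ai_extraction=True)

# the same flag applies to webpages
web_chunks = scrape_url(url="https://example.com", ai_extraction=True)
```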

How it works 🛠️

thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with language models or vision transformers. The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format compatible with any LLM or multimodal model using `thepipe.chunks_to_messages`, which gives the following format:

```json
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
```

You can feed these messages directly into the model, or alternatively you can use `thepipe_api.chunk_by_document`, `thepipe_api.chunk_by_page`, `thepipe_api.chunk_by_section`, or `thepipe_api.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to a LlamaIndex Document/ImageDocument with `.to_llamaindex`.
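
For example, a sketch of chunking scraped content before embedding it into a vector database (assuming the chunking helper is importable from `thepipe.chunker` as shown; the module path may differ in your installed version):

```python
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_page  # module path is an assumption

chunks = scrape_file(filepath="paper.pdf")

# re-chunk page by page before embedding into a vector database
page_chunks = chunk_by_page(chunks)

# convert each chunk for use with LlamaIndex (exact return shape may vary)
docs = [chunk.to_llamaindex() for chunk in page_chunks]
```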

⚠️ It is important to be mindful of your model's token limit. GPT-4o does not work well with too many images in the prompt (see discussion here). To remedy this issue, either use an LLM with a larger context window, extract large documents with `text_only=True`, or embed the chunks into a vector database.
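
For instance, a text-only scrape (using the `text_only` option mentioned above) avoids filling the prompt with images:

```python
from thepipe.scraper import scrape_file

# skip image extraction to stay within the model's token limit
chunks = scrape_file(filepath="large_report.pdf", text_only=True)
```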

Sponsors

Book us with Cal.com

Thank you to Cal.com for sponsoring this project. Contact [email protected] for sponsorship information.