Turn any webpage into structured data using LLMs

mishushakov/llm-scraper

LLM Scraper

LLM Scraper is a TypeScript library that allows you to extract structured data from any webpage using LLMs.

Important

Code-generation is now supported in LLM Scraper.

Tip

Under the hood, LLM Scraper uses function calling to convert pages into structured data.

Features

  • Supports local (Ollama, GGUF), OpenAI, and Vercel AI SDK providers
  • Schemas defined with Zod
  • Full type-safety with TypeScript
  • Based on Playwright framework
  • Streaming objects
  • NEW Code-generation
  • Supports 4 formatting modes:
    • html for loading raw HTML
    • markdown for loading markdown
    • text for loading extracted text (using Readability.js)
    • image for loading a screenshot (multi-modal only)
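The four modes above map naturally onto a string union. The sketch below is only an illustration: `pickFormat` and `ModelCapabilities` are hypothetical names, not part of LLM Scraper. It falls back from `image` to `text` when the model is not multi-modal, since `image` is listed as multi-modal only.

```typescript
// The four formatting modes listed above, as a string union.
type Format = 'html' | 'markdown' | 'text' | 'image'

// Hypothetical description of a model, used only for this sketch.
interface ModelCapabilities {
  multiModal: boolean
}

// Hypothetical helper: returns the preferred format, falling back to
// 'text' when 'image' is requested for a model that cannot read images.
function pickFormat(preferred: Format, model: ModelCapabilities): Format {
  if (preferred === 'image' && !model.multiModal) {
    return 'text'
  }
  return preferred
}

const format = pickFormat('image', { multiModal: false })
console.log(format)
```

The chosen value could then be passed as the `format` option of `scraper.run`, as in the example further down.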

Make sure to give it a star!

Getting started

  1. Install the required dependencies from npm:

    npm i zod playwright llm-scraper
    
  2. Initialize your LLM:

    OpenAI

    npm i @ai-sdk/openai
    
    import { openai } from '@ai-sdk/openai'
    
    const llm = openai.chat('gpt-4o')

    Groq

    npm i @ai-sdk/openai
    
    import { createOpenAI } from '@ai-sdk/openai'
    const groq = createOpenAI({
      baseURL: 'https://api.groq.com/openai/v1',
      apiKey: process.env.GROQ_API_KEY,
    })
    
    const llm = groq('llama3-8b-8192')

    Ollama

    npm i ollama-ai-provider
    
    import { ollama } from 'ollama-ai-provider'
    
    const llm = ollama('llama3')

    GGUF

    import { LlamaModel } from 'node-llama-cpp'
    
    const llm = new LlamaModel({ modelPath: 'model.gguf' })

  3. Create a new scraper instance, passing in the LLM:

    import LLMScraper from 'llm-scraper'
    
    const scraper = new LLMScraper(llm)

Example

In this example, we extract the top stories from Hacker News:

import { chromium } from 'playwright'
import { z } from 'zod'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize LLM provider
const llm = openai.chat('gpt-4o')

// Create a new LLMScraper
const scraper = new LLMScraper(llm)

// Open new page
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'html',
})

// Show the result from LLM
console.log(data.top)

await page.close()
await browser.close()

Streaming

Replace the run call with stream to receive a stream of partial objects (Vercel AI SDK only).

// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema)

// Stream the result from LLM
for await (const data of stream) {
  console.log(data.top)
}
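Because the stream yields partial objects, it is safest to assume that any field may still be missing on a given iteration. The sketch below shows one way to guard against that. It is an assumption-laden illustration: `fakeStream` is a hypothetical stand-in for the stream returned by `scraper.stream`, and `PartialResult` and `collectTitles` are names invented for this example.

```typescript
// Shape of one story, matching the `top` array in the example schema.
interface Story {
  title: string
  points: number
  by: string
  commentsURL: string
}

// A partial result: any field may still be missing while streaming.
interface PartialResult {
  top?: Partial<Story>[]
}

// Hypothetical stand-in for the stream returned by `scraper.stream`:
// yields progressively more complete objects.
async function* fakeStream(): AsyncGenerator<PartialResult> {
  yield {}
  yield { top: [{ title: 'First story' }] }
  yield { top: [{ title: 'First story', points: 120, by: 'alice' }] }
}

// Record the titles available in each partial object, skipping
// fields that have not arrived yet.
async function collectTitles(
  stream: AsyncIterable<PartialResult>
): Promise<string[][]> {
  const snapshots: string[][] = []
  for await (const partial of stream) {
    const titles = (partial.top ?? [])
      .map((story) => story.title)
      .filter((title): title is string => title !== undefined)
    snapshots.push(titles)
  }
  return snapshots
}

collectTitles(fakeStream()).then((snapshots) => console.log(snapshots))
```

The same guards (`?? []` and the `undefined` filter) would apply when iterating over the real stream.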

Code-generation

Using the generate function, you can generate a reusable Playwright script that scrapes the page contents according to a schema.

// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
const result = await page.evaluate(code)
const data = schema.parse(result)

// Show the parsed result
console.log(data.top)
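Since the generated code is reusable, one option is to generate it once and keep it for later runs on the same page. The following is a minimal sketch of that idea, not something LLM Scraper provides: `CodeCache`, `Generate`, and the fake generator are hypothetical, and in practice the generator would wrap `scraper.generate(page, schema)`.

```typescript
// Signature of a function that produces scraping code for a page.
// Hypothetical stand-in for a wrapper around `scraper.generate`.
type Generate = () => Promise<{ code: string }>

// Hypothetical cache: generates code once per key and reuses it afterwards.
class CodeCache {
  private readonly entries = new Map<string, string>()

  async get(key: string, generate: Generate): Promise<string> {
    const cached = this.entries.get(key)
    if (cached !== undefined) {
      return cached
    }
    const { code } = await generate()
    this.entries.set(key, code)
    return code
  }
}

// Demo with a fake generator that counts how often it is called.
async function demo(): Promise<number> {
  let calls = 0
  const fakeGenerate: Generate = async () => {
    calls += 1
    return { code: '(() => ({ top: [] }))()' }
  }

  const cache = new CodeCache()
  await cache.get('news.ycombinator.com', fakeGenerate)
  await cache.get('news.ycombinator.com', fakeGenerate)
  return calls // number of times the generator actually ran
}

demo().then((calls) => console.log(calls))
```

The cached string can be passed to `page.evaluate` and validated with `schema.parse`, as in the snippet above. A page whose layout changes would need its cache entry regenerated.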

Contributing

LLM Scraper is an open-source project, and we welcome contributions from the community. If you run into a bug or want to suggest an improvement, please feel free to open an issue or pull request.