🛰️ Scrape It Now!

A website to scrape? There's a simple way.

Features

Shared:

Decoupled architecture with Azure Queue Storage
Executable as a CLI with a standalone binary
Idenpotent operations that can be run in parallel
Scraped content is stored in Azure Blob Storage

Scraper:

Indexer:

AI Search index is created automatically
Chunk markdown while keeping the content coherent
Embed chunks with OpenAI embeddings
Indexed content is semantically searchable with Azure AI Search

Installation

From binary

Download the latest release from the releases page. Binaries are available for Linux, macOS and Windows.

For configuring the CLI (including authentication to the backend services), use environment variables, a .env file or command line options.

From sources

Application must be run with Python 3.12 or later. If this version is not installed, an easy way to install it is pyenv.

# Download the source code
git clone https://github.com/clemlesne/scrape-it-now.git
# Move to the directory
cd scrape-it-now
# Run install scripts
make install dev
# Run the CLI
scrape-it-now --help

How to use

Scrape a website

Run a job

Basic usage:

export AZURE_STORAGE_CONNECTION_STRING=xxx
scrape-it-now scrape run https://nytimes.com

Example output:

❯ Starting scraping job 7yz91ma
Queued 71/71 links for referrer https://www.google.com/search (1)
3 workers started
Browser chromium launched
Processing new messages
...
Queued 15/28 links for referrer https://www.nytimes.com/2024/08/15/business/economy/kamala-harris-inflation-price-gouging.html (2)
Scraped https://www.nytimes.com/2024/08/15/business/economy/kamala-harris-inflation-price-gouging.html (2)

Most frequent options are:

`Options`	Description	`Environment variable`
`--azure-storage-connection-string` `-ascs`	Azure Storage connection string	`AZURE_STORAGE_CONNECTION_STRING`
`--job-name` `-jn`	Job name	`JOB_NAME`

For documentation on all available options, run:

scrape-it-now scrape run --help

Show job status

Basic usage:

export AZURE_STORAGE_CONNECTION_STRING=xxx
scrape-it-now scrape status [job_name]

Example output:

❯ {"created_at":"2024-08-16T15:33:06.602922Z","last_updated":"2024-08-16T16:17:51.571136Z","network_used_mb":5.650620460510254,"processed":1263,"queued":3120}

Most frequent options are:

`Options`	Description	`Environment variable`
`--azure-storage-connection-string` `-ascs`	Azure Storage connection string	`AZURE_STORAGE_CONNECTION_STRING`

For documentation on all available options, run:

scrape-it-now scrape status --help

Index a scraped website

Run a job

Basic usage:

export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
export AZURE_STORAGE_CONNECTION_STRING=xxx
scrape-it-now index run [job_name]

Example output:

❯ Starting indexing job 7yz91ma
5 workers started
Processing new messages
...
434b227 chunked into 6 parts
434b227 is indexed
f001b3e chunked into 86 parts
f001b3e is already indexed

Most frequent options are:

`Options`	Description	`Environment variable`
`--azure-openai-api-key` `-aoak`	Azure OpenAI API key	`AZURE_OPENAI_API_KEY`
`--azure-openai-embedding-deployment-name` `-aoedn`	Azure OpenAI embedding deployment name	`AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME`
`--azure-openai-embedding-dimensions` `-aoed`	Azure OpenAI embedding dimensions	`AZURE_OPENAI_EMBEDDING_DIMENSIONS`
`--azure-openai-embedding-model-name` `-aoemn`	Azure OpenAI embedding model name	`AZURE_OPENAI_EMBEDDING_MODEL_NAME`
`--azure-openai-endpoint` `-aoe`	Azure OpenAI endpoint	`AZURE_OPENAI_ENDPOINT`
`--azure-search-api-key` `-asak`	Azure Search API key	`AZURE_SEARCH_API_KEY`
`--azure-search-endpoint` `-ase`	Azure Search endpoint	`AZURE_SEARCH_ENDPOINT`
`--azure-storage-connection-string` `-ascs`	Azure Storage connection string	`AZURE_STORAGE_CONNECTION_STRING`

For documentation on all available options, run:

scrape-it-now index run --help

Architecture

Scrape

graph LR
  cli["CLI"]
  web["Website"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
    to_scrape["To scrape"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      job["job"]
      scraped["scraped"]
      state["state"]
    end
  end

  cli -- 1. Pull message --> to_scrape
  cli -- 2. Get cache --> scraped
  cli -- 3. Browse --> web
  cli -- 4. Update cache --> scraped
  cli -- 5. Push state --> state
  cli -- 6. Add message --> to_scrape
  cli -- 7. Add message --> to_chunk
  cli -- 8. Update state --> job

Index

graph LR
  ai_search["Azure AI Search"]
  cli["CLI"]
  embeddings["Azure OpenAI Embeddings"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      scraped["scraped"]
    end
  end

  cli -- 1. Pull message --> to_chunk
  cli -- 2. Get cache --> scraped
  cli -- 3. Chunk --> cli
  cli -- 4. Embed --> embeddings
  cli -- 5. Push to search --> ai_search

Advanced usage

Source environment variables

To configure easily the CLI, source environment variables from a .env file. For example, for the --azure-storage-connection-string option:

AZURE_STORAGE_CONNECTION_STRING=xxx

For arguments that accept multiple values, use a space-separated list. For example, for the --whitelist option:

WHITELIST=learn\.microsoft\.com,^/(?!en-us).*,^/[^/]+/answers/,^/[^/]+/previous-versions/ go\.microsoft\.com,.*

Broswer binary installation

Browser binaries are automatically downloaded or updated at each run. Browser is Chromium and it is not configurable (feel free to open an issue if you need another browser), it weights around 450MB.

The cache directoty depends on the operating system:

~/.config/scrape-it-now (Unix)
~/Library/Application Support/scrape-it-now (macOS)
C:\Users\<user>\AppData\Roaming\scrape-it-now (Windows)

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.devcontainer		.devcontainer
.github		.github
app		app
cicd		cicd
resources		resources
.editorconfig		.editorconfig
.env.example		.env.example
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
.syft.yaml		.syft.yaml
.version.config		.version.config
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛰️ Scrape It Now!

Features

Installation

From binary

From sources

How to use

Scrape a website

Run a job

Show job status

Index a scraped website

Run a job

Architecture

Scrape

Index

Advanced usage

Source environment variables

Broswer binary installation

About

Releases 5

Packages

Languages

License

clemlesne/scrape-it-now

Folders and files

Latest commit

History

Repository files navigation

🛰️ Scrape It Now!

Features

Installation

From binary

From sources

How to use

Scrape a website

Run a job

Show job status

Index a scraped website

Run a job

Architecture

Scrape

Index

Advanced usage

Source environment variables

Broswer binary installation

About

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Languages

Packages