
Elastic Open Web Crawler

This repository contains code for the Elastic Open Web Crawler. Open Crawler enables users to easily ingest web content into Elasticsearch.

Important

The Open Crawler is currently in tech-preview. Tech-preview features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

Open Crawler v0.1 is confirmed to be compatible with Elasticsearch v8.13.0 and above.

User workflow

Indexing web content with the Open Crawler requires:

  1. Running an instance of Elasticsearch (on-prem, cloud, or serverless)
  2. Cloning the Open Crawler repository (see Setup)
  3. Creating a crawler configuration file (see Configuring crawlers)
  4. Using the CLI to begin a crawl job (see CLI commands)

Execution logic

Open Crawler runs crawl jobs on demand, based on config files in the config directory. Each URL endpoint found during a crawl results in one document being indexed into Elasticsearch.

Open Crawler performs crawl jobs in a multithreaded environment, where each thread visits one URL endpoint at a time. The crawl results are added to a pool of results, which are indexed into Elasticsearch using the _bulk API once the pool reaches a configurable threshold.
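
For illustration, a bulk request follows Elasticsearch's NDJSON format: an action line followed by the document source for each crawl result. The index name and document fields below are placeholders, not the crawler's actual schema (see DOCUMENT_SCHEMA.md for the real field mappings).

POST /my-crawler-index/_bulk
{ "index": { "_id": "<document-id>" } }
{ "title": "Example page", "url": "https://example.com/page", "body": "..." }
{ "index": { "_id": "<document-id>" } }
{ "title": "Another page", "url": "https://example.com/other", "body": "..." }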

Setup

Prerequisites

A running instance of Elasticsearch is required to index documents into. If you don't have this set up yet, you can sign up for an Elastic Cloud free trial or check out the quickstart guide for Elasticsearch.
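
If you want a quick check that your deployment is reachable before crawling, you can request the cluster's root endpoint. The URL and credentials below are placeholders for your own deployment:

curl -u elastic:<password> http://localhost:9200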

Connecting to Elasticsearch

Open Crawler uses the _bulk API to index crawl results into Elasticsearch. To facilitate this connection, Open Crawler needs either an API key or a username/password configured for the Elasticsearch instance. If using an API key, ensure that the key has read and write permissions for the index configured in output_index.

Creating an API key

Here is an example of creating an API key for Open Crawler. The request returns a JSON response with an `encoded` field; the value of `encoded` is what Open Crawler uses in its configuration.
POST /_security/api_key
{
  "name": "my-api-key",
  "role_descriptors": { 
    "my-crawler-role": {
      "cluster": ["all"],
      "indices": [
        {
          "names": ["my-crawler-index-name"],
          "privileges": ["all"]
        }
      ]
    }
  },
  "metadata": {
    "application": "my-crawler"
  }
}
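
The `encoded` value from the response can then be used in the crawler's Elasticsearch settings. The snippet below is a rough sketch of those settings; see CONFIG.md for the authoritative key names and the full set of options:

output_sink: elasticsearch
output_index: my-crawler-index-name
elasticsearch:
  host: http://localhost
  port: 9200
  api_key: <encoded value from the API key response>
  # Alternatively, basic authentication can be configured instead of an API key:
  # username: <username>
  # password: <password>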

Running Open Crawler from Docker

Open Crawler has a Dockerfile that can be built and run locally.

  1. Clone the repository: git clone https://github.com/elastic/crawler.git
  2. Build the image: docker build -t crawler-image .
  3. Run the container: docker run -i -d --name crawler crawler-image
    • -i allows the container to stay alive so CLI commands can be executed inside it
    • -d allows the container to run "detached" so you don't have to dedicate a terminal window to it
  4. Confirm that CLI commands are working: docker exec -it crawler bin/crawler version
    • Execute other CLI commands from outside the container by prefixing them with docker exec -it crawler
  5. Create a config file for your crawler and start a crawl (see the example below, and Configuring crawlers for next steps).
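
As a sketch of what those next steps can look like (the config filename and in-container path are assumptions; check the repository documentation for the exact locations), you can copy a config file into the running container and start a crawl through the CLI:

docker cp config/my-crawler.yml crawler:/app/config/my-crawler.yml
docker exec -it crawler bin/crawler crawl config/my-crawler.yml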

Running Open Crawler from source

Tip

We recommend running from source only if you are actively developing Open Crawler.

Instructions for running from source

ℹ️ Open Crawler uses both JRuby and Java. We recommend using version managers for both; when developing Open Crawler we use rbenv and jenv. Instructions for setting up these version managers are available in the rbenv and jenv documentation.
  1. Clone the repository: git clone https://github.com/elastic/crawler.git

  2. Go to the root of the Open Crawler directory and check that the expected Java and Ruby versions are being used:

    # should output the same version as `.ruby-version`
    $ ruby --version
    
    # should output the same version as `.java-version`
    $ java --version
  3. If the versions seem correct, you can install dependencies:

    $ make install

    You can also set the env variable CRAWLER_MANAGE_ENV to have the install script automatically check that rbenv and jenv are installed and that the correct versions are active. Doing this requires that you use both rbenv and jenv in your local setup.

    $ CRAWLER_MANAGE_ENV=true make install
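
Once the install completes, you can confirm the CLI is working by running the version command from the repository root, just as in the Docker setup:

    $ bin/crawler version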

Configuring Crawlers

See CONFIG.md for in-depth details on Open Crawler configuration files.
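
As a minimal sketch of what a crawler config file can contain (the key names here are illustrative; CONFIG.md is the authoritative reference), a config pairs one or more domains to crawl with the Elasticsearch output settings shown earlier in Connecting to Elasticsearch:

domains:
  - url: https://www.example.com
    seed_urls:
      - https://www.example.com/blog
output_sink: elasticsearch
output_index: my-crawler-index-name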

Crawler Document Schema and Mappings

See DOCUMENT_SCHEMA.md for information regarding the Elasticsearch document schema and mappings.

CLI Commands

Open Crawler does not have a graphical user interface. All interactions with Open Crawler take place through the CLI. When given a command, Open Crawler will run until the process is finished. Open Crawler is not kept alive in any way between commands.
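
For example, a typical interaction is to start a crawl against a config file and let the process run to completion. The crawl subcommand and config path below are a sketch; CLI.md has the full command reference:

bin/crawler crawl config/my-crawler.yml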

See CLI.md for a full list of CLI commands available for Crawler.