How to Update the Search Index

This guide explains how to manage the search index for the Couchbase documentation site: where the index lives and how to update it. The procedure described here is the one used by the CI job defined by the Jenkinsfile in this directory. This document is useful for understanding how the CI job works, or for performing the update manually, if necessary.

Overview

The search index for the documentation is hosted by Algolia. The index, named prod_docs_couchbase, is stored in the Couchbase Algolia account. The index is populated by the docsearch scraper (aka crawler).

The sections below document the prerequisites for running the docsearch scraper and how to run the docsearch scraper to update the index.

Prerequisites

  • git (to clone the docsearch-scraper repository)

  • pipenv (to manage a local Python installation and packages)

  • Chrome/Chromium and chromedriver (or Docker)
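
Before continuing, you can confirm that the prerequisites are available on your PATH. A minimal check (chromedriver and Docker are alternatives, so a "missing" result for one of them may be fine):

```shell
# Report any prerequisite binary that is not on the PATH.
# chromedriver is only required if you don't plan to use the Docker image,
# and docker is only required if you do.
for cmd in git pipenv chromedriver docker; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done
```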

Setup

To begin, clone the https://github.com/algolia/docsearch-scraper repository using git.

$ git clone https://github.com/algolia/docsearch-scraper &&
  cd "`basename $_`"

Next, create a .env file in the cloned repository to define the application ID (APPLICATION_ID) and write API key (API_KEY). To protect the API key, only its final four characters are shown here.

APPLICATION_ID=NI1G57N08Q
API_KEY=****************************67dd
Important
The API key used in this file is different from the one used for searching. In the Algolia dashboard, it’s labeled as the Write API Key.

The next step is to set up the Python environment and install the required packages.

$ pipenv install && pipenv shell

If you don’t plan to use the Docker image, you’ll need to install both Chrome (or Chromium) and chromedriver. Run the following command to verify that chromedriver is installed:

$ chromedriver --version

Finally, you’ll need the docsearch configuration file. This configuration file is located in the playbook repository for the Couchbase documentation. Download the file from https://github.com/couchbase/docs-site/raw/master/docsearch/docsearch-config.json and save it to the cloned repository.
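
For example, assuming curl is available (wget -O docsearch-config.json <url> works equally well):

```shell
# Save the docsearch config from the playbook repository into the clone;
# -L follows GitHub's redirect to the raw file
curl -sSLo docsearch-config.json \
  https://github.com/couchbase/docs-site/raw/master/docsearch/docsearch-config.json
```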

You’re now ready to run the scraper.

Usage

There are three ways to run the scraper:

  • docsearch run (uses local packages and chromedriver)

  • docsearch docker:run (uses local packages and provided Docker image)

  • docker run (uses provided Docker image)

Warning
Rebuilding the index takes about 30 minutes because it has to visit every page in the site.

docsearch run

To update the index, pass the config file to the docsearch run command:

$ ./docsearch run docsearch-config.json

If that succeeds, skip to [Activate Index].

If that command fails, you may need to run it in the provided Docker container.

docsearch docker:run

First, make sure you have Docker running on your machine and that you can list images.

$ docker images

Then, run the docsearch command again, but use the Docker container instead:

$ ./docsearch docker:run docsearch-config.json

The search index is now updated.
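
If you want to spot-check the result, you can query the index directly using Algolia's REST search API. This is a sketch under assumptions: SEARCH_API_KEY is a placeholder for a search-only key (not the write key from .env), which you would look up in the Algolia dashboard.

```shell
# Query the prod_docs_couchbase index for a sample term.
# SEARCH_API_KEY is a hypothetical placeholder; use a search-only key,
# never the write key.
APPLICATION_ID=NI1G57N08Q
SEARCH_API_KEY=your-search-only-api-key
curl -s "https://${APPLICATION_ID}-dsn.algolia.net/1/indexes/prod_docs_couchbase/query" \
  -H "X-Algolia-Application-Id: ${APPLICATION_ID}" \
  -H "X-Algolia-API-Key: ${SEARCH_API_KEY}" \
  -d '{"query":"create index"}'
```

A non-empty hits array in the JSON response indicates the index is populated.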

docker run

Using Docker, it’s possible to bypass the use of pipenv by invoking docker run directly. First, create a script named scrape with the following contents:

scrape
#!/usr/bin/env bash

source .env

docker run \
  -e APPLICATION_ID="$APPLICATION_ID" \
  -e API_KEY="$API_KEY" \
  -e CONFIG="$(cat "${1:-docsearch-config.json}")" \
  -t --rm algolia/docsearch-scraper \
  /root/run

Then, make it executable:

$ chmod 755 scrape

Finally, run it, passing the configuration file as the first argument:

$ ./scrape docsearch-config.json

The search index is now updated.