# Image Search Service Overview

This component builds an efficient search index based on [LOPQ](../../../lopq)
and exposes it through a REST API to enable searching for similar images at scale.

We detail below:
* the parameters of the environment file
* how the search index is trained and used
* how to start and check the status of the search component with `docker-compose`

## Environment File

Multiple example environment files are listed in this folder, the most recent ones being
[.env_merged_sb](.env_merged_sb) and [.env_merged_dlib](.env_merged_dlib), which should be
good starting points for your own configuration.
The environment file parameters are passed to the docker container, which generates a JSON
configuration file that is then loaded by the different Python classes to adapt their parameters
to the provided configuration.
See the `command` line in [docker-compose.yml](docker-compose.yml).

We detail here the main parameters that should or can be modified in an environment file.

### Paths

You should set the variable `repo_path` to the output of the command
`git rev-parse --show-toplevel`, i.e. the absolute path of the root of your clone of the repository.

You should NOT edit the `indocker_repo_path` variable; changing it would require rebuilding
the docker image (see the folder [DockerBuild](../../DockerBuild)).
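
For instance, a minimal sketch of setting this variable, assuming your environment file is
named `.env` and you run the command from inside your clone of the repository:

```
# Append the repository root path to the .env file used by docker-compose
echo "repo_path=$(git rev-parse --show-toplevel)" >> .env
```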

### HBase
You should edit the HBase-related parameters in your environment file to make sure that:

- The HBase host `hbase_host` is correct.
- The names of the HBase tables `table_sha1infos` and `table_updateinfos` are correct.
Note that these tables will be created by the pipeline if they do not exist yet.

The other HBase parameters `batch_update_size`, `extr_column_family`, `image_info_column_family`,
`image_buffer_column_family` and `image_buffer_column_name` can be kept as is but should be
consistent with the values used in the processing component.
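
As an illustration only, the HBase block of an environment file could look like the sketch below.
The host and table names are placeholder values, and the exact variable names (and possible prefixes)
should be taken from one of the example environment files such as [.env_merged_sb](.env_merged_sb):

```
# Placeholder values: use your own HBase host and table names
hbase_host=hbase.example.org
table_sha1infos=my_images_sha1_infos
table_updateinfos=my_images_updates
batch_update_size=100
```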

### Searcher

Search parameters:
- `search_conf_name`: name of the search configuration; it determines the name under which the index
and the codes files are saved.
- `nb_train`: number of training samples for the indexing model
- `nb_min_train`: minimum number of training samples for the indexing model
- `nb_train_pca`: number of training samples for the PCA
- `nb_min_train_pca`: minimum number of training samples for the PCA

LOPQ parameters, see [LOPQ](../../../lopq) for more details:
- `model_type`: currently `lopq_pca` is the only indexing model fully supported
- `lopq_pcadims`: number of dimensions of the PCA
- `lopq_V`: number of clusters per coarse quantizer (a bigger value induces a lower quantization error)
- `lopq_M`: total number of fine codes (a bigger value induces a lower quantization error)
- `lopq_subq`: number of clusters to train per subquantizer (256 is a good value)
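
As a rough sketch only, the searcher block of an environment file could look like the following.
All numeric values here are illustrative, not recommendations, and the exact variable names should
be checked against the example environment files:

```
# Illustrative values only: tune to your dataset size and feature dimensionality
search_conf_name=my_search_conf
model_type=lopq_pca
nb_train_pca=100000
nb_min_train_pca=50000
nb_train=200000
nb_min_train=100000
lopq_pcadims=256
lopq_V=256
lopq_M=8
lopq_subq=256
```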

### Storer
A storer is used to save the trained index model and the computed codes.

If you use the `S3` storer, you should edit the AWS-related parameters:

- You should edit the parameters `aws_profile` and `aws_bucket_name` in the environment file so they
correspond to the actual profile and bucket you want to use.
- You should copy the [sample AWS credentials file](../../../conf/aws_credentials/credentials.sample) to
`conf/aws_credentials/credentials` and edit it to contain the information corresponding
to the target profile `aws_profile`.
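
For reference, an AWS credentials file uses the standard INI profile format. The profile name and
keys below are placeholders; the profile name must match the `aws_profile` value of your environment
file (see the [sample credentials file](../../../conf/aws_credentials/credentials.sample) for the
expected layout):

```
# conf/aws_credentials/credentials -- placeholder values
[my_profile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```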

### API

The API parameters configure the URL at which the service can be accessed, the gunicorn settings and
the expected input type:
* `port_host`: port the service should use
* `endpoint`: name of the service; it changes the API URL
* `gunicorn_workers`: number of workers for gunicorn
* `gunicorn_timeout`: startup timeout of gunicorn
* `search_input`: query type, `face` or `image`
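
A sketch of these parameters in an environment file; the port and endpoint follow the defaults
mentioned in the API section below, while the gunicorn values and `search_input` are illustrative
(check the example environment files for the exact variable names):

```
# Illustrative API settings
port_host=80
endpoint=cuimgsearch
gunicorn_workers=4
gunicorn_timeout=600
search_input=image
```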

### Others

The extraction type variable `extr_type` defines which features are indexed and should be extracted
from the query; currently this should be either `sbpycaffeimg` or `dlibface`.

There is a general `verbose` parameter that is propagated to most of the Python classes and
controls the level of logging. Increase this value for debugging, reduce it to lower the amount
of output.

The input type variable `input_type` specifies where the images were pushed from: it should be set
to `local` if they come from a local source, and to something else, e.g. `kafka` or `kinesis`,
otherwise. In this component it is mostly used to properly display images when calling the
`view_similar_byX` endpoints of the API, such as `view_similar_byURL`.

TODO: Check the additional parameter `file_input` and interaction with `input_type`.
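
A sketch of these remaining parameters, with illustrative values only (`file_input` is left out,
see the TODO above):

```
# Illustrative values for the extraction, logging and input settings
extr_type=sbpycaffeimg
verbose=3
input_type=local
```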

## Image search service

### Search Index
The service will first train a search index if one corresponding to the settings of the configuration file is not available
in the storer, which could be either `local` or `S3`.

The service will look for updates listed in the HBase table `table_updateinfos` and accumulate
the corresponding image features; once `nb_train_pca` samples are gathered it will first train a PCA model.
Then it will accumulate `nb_train` PCA-projected features to train the search index.
These features are stored locally in an LMDB database to potentially allow easy retraining with
different settings.
If not enough features are yet available in this database, the training process will wait for
more features to be computed by the processing pipeline.
If you have trained your index model and no longer plan to train another index, you should delete
the training features LMDB database.

Once the model is trained it is saved to the storer.
An `S3` storer will use the AWS profile `aws_profile` and the bucket `bucket_name`.
A `local` storer will save it in the folder `base_path`.

### Indexing Updates

Then all the updates listed in the HBase table `table_updateinfos` are indexed, i.e. the
LOPQ code of each image feature is computed and stored in a local LMDB database,
and also pushed to the storer as a backup. If you are streaming data into the processing pipeline,
you should also query the update endpoints listed below from time to time.

### API

Once all images are indexed the REST API is started.
The name of the API service is set by the parameter `endpoint` and it listens on the
port `port_host` defined in the environment file.
Thus, the API would be accessible at `localhost:80/cuimgsearch` if `endpoint` is set
to `cuimgsearch` and `port_host` to 80.

More details on how to use the API are provided in the [www](../../../www) folder, and you can
check the code of the API in the [API](../../../cufacesearch/cufacesearch/api) folder.


## Docker-compose

The docker-compose file [docker-compose.yml](docker-compose.yml) in this directory should be used
to start an image search component based on the configuration defined in the `.env` file.

Copy (and edit) one of the example environment files `.env_test_sb`, `.env_release_sb`
or `.env_merged_sb` to `.env`.
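
For example, a minimal sketch assuming you start from `.env_merged_sb`:

```
# Run from this folder (the one containing docker-compose.yml)
cp .env_merged_sb .env
# then edit .env to match your HBase, storer and API settings
```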

### Start the docker container
To start the docker container, just run:

`(sudo) docker-compose up`

Once everything is running you can detach the docker-compose process by pressing `Ctrl + z`
(followed by `bg` to keep it running in the background), or start it detached directly with
`(sudo) docker-compose up -d`.
You can check the docker logs with the command:

`(sudo) docker logs -f [docker-container-name]`

### Check the docker container
The `docker-container-name` should be `search_img_search_1` and you can check it with `(sudo) docker ps`.

### Update the index

To check the status of the API and keep the index up-to-date you need to call the endpoints:
* `status`: returns a JSON with information about the uptime and the number of samples indexed
* `check_new_updates`: checks for updates starting from the last update indexed
* `check_all_updates`: checks for updates from the beginning of time
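
For instance, with the default `endpoint` and `port_host` values used above, and assuming the
endpoints are exposed directly under the API URL, these calls could look like:

```
# Assuming endpoint=cuimgsearch and port_host=80
curl http://localhost:80/cuimgsearch/status
curl http://localhost:80/cuimgsearch/check_new_updates
```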
