Referencing Wikipedia articles and implementing a semantic search engine over the articles' content as a service on AWS infrastructure. Resources: Elasticsearch domain, S3, ECS, Lambda

Semantic search service over Wikipedia articles using AWS infrastructure


This work shows how to create a semantic search engine over a set of Wikipedia pages and deploy it as a service on AWS.

The infrastructure is defined as code (Infrastructure as Code, IaC) with AWS CDK.

Here are the descriptions of some directories and files in the repo:

  • src/ gets random pages from Wikipedia, enriches them with metadata, and uploads them to an S3 bucket. It can also be run with
    python src/ upload-random-pages -n <NUMBER_OF_RANDOM_PAGES_TO_UPLOAD>
  • lambda_indexer/ defines a Lambda function that is attached to create events in the S3 bucket. It sends the page content to the embedding service and indexes the document and its embedding in the Elasticsearch cluster
  • universal-sentence-encoder defines the Docker image that is pushed to ECR and then deployed on ECS. It provides a service that returns the embedding of a given text. It can be used as follows:
     curl -XPOST -d '{"instances": ["text 1 to query", "text 2 to query"]}' https://<EMBEDDER_IP>:8501/v1/models/USE_3:predict | jq
  • src/es handles the Elasticsearch cluster where the pages are indexed. It contains modules to create the index, index a document, and search the index using k-NN similarity
  • src/api is a Sanic server listening on port 8000 with a /search endpoint. It takes a text query, sends it to the embedding service, and queries the Elasticsearch index with the returned embedding. Finally, it returns the results. To use it, run src/api/search_server, then execute the following command, which requests the 3 most similar Wikipedia pages to the text "beautiful painting":
    curl -XGET -d '{"query": "beautiful painting"}' 'localhost:8000/search?n=3' | jq
    This server is also deployed as an AWS service, which can be reached with
    curl -XGET -d '{"query": "beautiful painting"}' <API_SERVICE_IP>:8000/search\?n=3 | jq
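
The query path described for src/es and src/api can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's code: the vector field name "embedding", the helper names, and the injected post_json function are hypothetical. The query body follows the syntax of the k-NN plugin available on AWS Elasticsearch domains, and the dimension of 512 assumes a Universal Sentence Encoder model, whose embeddings have that size.

```python
from typing import Callable, Dict, List

# Path of the predict endpoint used in the curl example above.
EMBED_PATH = "/v1/models/USE_3:predict"


def build_index_body(dimension: int = 512, field: str = "embedding") -> Dict:
    """Index settings and mapping for a k-NN vector field (field name is an assumption)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {field: {"type": "knn_vector", "dimension": dimension}}},
    }


def build_embed_payload(texts: List[str]) -> Dict:
    """Request body expected by the embedding service."""
    return {"instances": texts}


def build_knn_query(vector: List[float], n: int, field: str = "embedding") -> Dict:
    """k-NN query asking for the n documents closest to `vector`."""
    return {"size": n, "query": {"knn": {field: {"vector": vector, "k": n}}}}


def search(text: str, n: int, post_json: Callable[[str, Dict], Dict],
           embedder_url: str, es_url: str, index: str = "semwiki") -> List[Dict]:
    """Embed `text`, then return the sources of the n nearest documents.

    `post_json(url, body)` is a hypothetical injected helper that POSTs JSON
    and returns the decoded JSON response, so the flow can be tested offline.
    """
    embedded = post_json(embedder_url + EMBED_PATH, build_embed_payload([text]))
    vector = embedded["predictions"][0]
    response = post_json(f"{es_url}/{index}/_search", build_knn_query(vector, n))
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Injecting post_json keeps the sketch independent of any particular HTTP client; in a real server it would wrap whichever client the service already uses.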

Initial setup

The setup is not production grade. However, the Makefile is short and easy to read.

Start by creating the virtual environment with

make env-create
source .venv/bin/activate

How to deploy stacks

  • bootstrap environment
    make cdk-bootstrap-environment
  • create docker registries
    make docker-create-ecr-embedder
    make docker-create-ecr-api
  • deploy Elasticsearch stack
    make deploy-es
  • in src/config/config.ini update es_url with the Elasticsearch endpoint, which can be obtained with make echo-elastic-search-endpoint
  • create the index in the cluster
    make create-es-index
  • build and push embedder image
    make docker-embedder-image && make docker-embedder-push
  • deploy embedding stack
    make deploy-embedder
  • check the embedding service
    • Embedder IP can be obtained with make echo-embedder-ip
    curl -XPOST -d '{"instances": ["toto", "tata"]}' https://<EMBEDDER_IP>:8501/v1/models/USE_3:predict | jq
  • in src/config/config.ini update public_ip with the embedding service's public IP
    • this is a workaround to pass the embedding service address to the indexer Lambda and the API service. A more robust solution would be to put a load balancer with an Elastic IP in front of the embedder service, but this would increase the cost, and the purpose of this repo is only to showcase the semantic search solution.
  • package indexing lambda
    make lambda-indexer-package
  • deploy WikiReferencing stack (S3 bucket + Lambda function + S3 Notification)
    make deploy-referencing
  • check the referencing stack by sending a batch of Wikipedia pages. You should find JSON files added to the S3 bucket and the corresponding pages indexed in the Elasticsearch index semwiki
    python src/ upload-random-pages -n 5
  • list documents in Elasticsearch index
    • Elasticsearch endpoint can be obtained with make echo-elastic-search-endpoint
    curl -XGET -u 'semwiki:SemWiki21!' -H 'Content-Type: application/json' \
      -d '{"_source": "title", "query": {"match_all": {}}}' https://<ES_ENDPOINT>/semwiki/_search | jq
  • build and push API image
    make docker-api-image && make docker-api-push
  • deploy API service
    make deploy-api
  • test API service
    • Search API IP can be obtained with make echo-api-ip
    curl -XGET -d '{"query": "entertainment"}' https://<API_SERVICE_IP>:8000/search\?n\=3 | \
        jq '.[] | {"title": .title, "url": .url}'
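
The indexing step of the WikiReferencing stack (S3 create event, then embedding, then indexing) can be sketched as follows. This is a minimal sketch, not the code in lambda_indexer/: the function names and the "embedding" field are hypothetical, and only the parts that need no network access are shown. The event layout is the standard S3 notification format, in which object keys arrive URL-encoded.

```python
from typing import Dict, Iterator, List, Tuple
from urllib.parse import unquote_plus


def iter_created_objects(event: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for every object-created record in an S3 event."""
    for record in event.get("Records", []):
        # S3 create events are named "ObjectCreated:Put", "ObjectCreated:Post", ...
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3 = record["s3"]
        # Keys are URL-encoded in notifications (spaces arrive as "+").
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def build_es_document(page: Dict, embedding: List[float]) -> Dict:
    """Attach the embedding to the page metadata (field name is an assumption)."""
    return {**page, "embedding": embedding}
```

A handler built on these helpers would download each object, send its content to the embedding service, and index the resulting document in semwiki.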


The Elasticsearch service and the embedding service get a new IP every time their containers are re-instantiated, which makes them difficult to reach. A load balancer with a fixed public IP would solve this problem, but since the objective of this project is only to show the main idea behind a semantic search engine, we avoid the extra cost of these additional resources.

Consequently, for now, when the embedder becomes unreachable from the indexing Lambda or the search service, their respective code and Docker image have to be updated with the new embedder IP. This happens whenever the embedder service is killed and re-created.
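
To illustrate where the embedder IP enters the code, here is a minimal sketch of reading es_url and public_ip from a config.ini-style file. The section name "services", the sample values, and the function name are assumptions made for illustration; the scheme and port defaults mirror the curl examples above.

```python
import configparser
from typing import Dict

# Hypothetical file content; only the keys es_url and public_ip come from this README.
SAMPLE_CONFIG = """
[services]
es_url = https://search-example.es.amazonaws.com
public_ip = 203.0.113.10
"""


def load_service_urls(config_text: str, section: str = "services",
                      scheme: str = "https", port: int = 8501) -> Dict[str, str]:
    """Return the Elasticsearch URL and the embedder predict URL from the config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    ip = parser.get(section, "public_ip")
    return {
        "es_url": parser.get(section, "es_url").rstrip("/"),
        "embedder_url": f"{scheme}://{ip}:{port}/v1/models/USE_3:predict",
    }


urls = load_service_urls(SAMPLE_CONFIG)
```

With this layout, a new embedder IP only requires editing public_ip before repackaging the Lambda and rebuilding the API image.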

ElasticSearch cURL requests (saved here to be used for debugging)

GET /_cat/indices?v=true&s=index&pretty

GET /_cat/indices/semwiki?v=true&s=index&pretty

GET /_cat/indices/semwiki?format=json

DELETE /semwiki

GET /semwiki

GET /semwiki/_settings

GET /semwiki/_mappings

GET /semwiki/_stats

GET /semwiki/_doc/17793022

DELETE /semwiki/_doc/00000000?routing=shard-1&pretty

GET /semwiki/_doc/53747466?pretty

GET /semwiki/_search
{
  "size": 10,
  "_source": [...],
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": "1518707649"
          }
        }
      ]
    }
  }
}

GET /semwiki/_search
{
  "size": 5,
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "random_score": {}
        }
      ]
    }
  }
}

GET /semwiki/_search
{
  "_source": [...],
  "query": {
    "match_all": {}
  }
}

GET /semwiki/_search
{
  "size": 10,
  "stored_fields": [...],
  "_source": [...],
  "query": {
    "match_all": {}
  }
}

POST /semwiki/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

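The requests above are written in console style (method, path, optional JSON body). As a minimal sketch, one of them could be turned into an authenticated HTTP request from Python using only the standard library. The function name, the endpoint, and the credentials below are placeholders, and the request is only built, not sent.

```python
import base64
import json
from typing import Dict, Optional
from urllib.request import Request


def build_es_request(endpoint: str, method: str, path: str, user: str, password: str,
                     body: Optional[Dict] = None) -> Request:
    """Build an HTTPS request with basic auth for a console-style request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Content-Type": "application/json"}
    data = json.dumps(body).encode() if body is not None else None
    return Request(f"https://{endpoint}{path}", data=data, headers=headers, method=method)


# Example: the "delete every document" request, against a placeholder endpoint.
request = build_es_request("search-example.es.amazonaws.com", "POST",
                           "/semwiki/_delete_by_query", "user", "password",
                           {"query": {"match_all": {}}})
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
```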

