feat: added support for elasticsearch as a datasource #402

pc9 · 2023-08-04T14:57:35Z

Description

Adds support for using Elasticsearch as the vector database in CustomApp.

to-dos:

Need to figure out how to mock the Elasticsearch client so that both positive and negative test cases can be tested.

Type of change

New feature (non-breaking change which adds functionality)
Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Test Script (please provide)

import os
from embedchain import CustomApp
from embedchain.config import CustomAppConfig, ElasticsearchDBConfig
from embedchain.models import Providers, EmbeddingFunctions, VectorDatabases

os.environ["OPENAI_API_KEY"] = 'OPENAI_API_KEY'

es_config = ElasticsearchDBConfig(
	# elasticsearch url or list of nodes url with different hosts and ports.
	es_url='https://localhost:9200',
	# pass named parameters supported by Python Elasticsearch client
	basic_auth=("username", "password")
)
config = CustomAppConfig(
	embedding_fn=EmbeddingFunctions.OPENAI, 
	provider=Providers.OPENAI, 
	db_type=VectorDatabases.ELASTICSEARCH, 
	es_config=es_config,
)
es_app = CustomApp(config)

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Maintainer Checklist

closes feature request: add support for elasticsearch as a datasource #386
Made sure Checks passed

cachho · 2023-08-04T18:23:09Z

thanks for the PR, will review tomorrow.

Since I'm just going through dependency hell, we have to consider whether we want to require elasticsearch as a package dependency or leave it to the user. #392 #335

cachho

Hey, thanks for your PR. I got quite a few things to say and comment about it.

I know it can sound harsh and ungrateful, but in a review I try to get to the point and not sugar coat it. You definitely did a lot of great work here. Lots of this stuff isn't your fault and is due to the way we implemented the database, which I have criticized myself for (#389).

Please let me know what you think.

I guess the biggest issues are:

I see this as part of CustomApp. OpenSourceApp is ruled out since Elasticsearch is not open source. More importantly, App and OpenSourceApp are supposed to be opinionated and try not to overload the user with choices. @taranjeet please confirm.
Please try to use inheritance to implement the database methods in their respective classes. On the base class you can raise a NotImplementedError for the methods. I know this also requires you to do quite a bit of work on embedchain, but this is the most maintainable way to deal with this going forward. Thanks for that.
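The inheritance approach requested here might look roughly like the following. This is a minimal sketch, not embedchain's actual code: the method names (`add`, `get`, `reset`) are borrowed from the review comments further down, and `InMemoryDB` is a hypothetical stand-in so the example runs without a database.

```python
class BaseVectorDB:
    """Base class: declares the interface every vector database must implement."""

    def add(self, documents, ids):
        raise NotImplementedError

    def get(self, ids):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError


class InMemoryDB(BaseVectorDB):
    """Hypothetical subclass used only for illustration."""

    def __init__(self):
        self._store = {}

    def add(self, documents, ids):
        for doc_id, doc in zip(ids, documents):
            self._store[doc_id] = doc

    def get(self, ids):
        # Return only the ids that are already stored.
        return [doc_id for doc_id in ids if doc_id in self._store]

    def reset(self):
        self._store.clear()
```

With this layout, a database class that forgets to implement a method fails loudly at call time instead of silently falling back to another backend's behavior.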

cachho · 2023-08-06T08:35:07Z

docs/advanced/datasource.mdx

+from embedchain import App
+from embedchain.config import AppConfig
+
+os.environ["ES_ENDPOINT"] = "elasticsearch_endpoint"


Here, the question is whether we want to abstract this, take the variables as arguments in something like AppConfig or DbConfig and then declare them as environment variables.

I also agree with @cachho here. There should be a base config class for db, and then for each vector database, we should have specific classes like

ElasticsearchDBConfig

ChromaDBConfig

The user can get all the variables and put them in a config class like

es_config = ElasticsearchDBConfig(endpoint=os.environ[""], ...)

Host and port are database settings that are already part of AppConfig. I know I said it myself, but now I think we should just make this an option of AppConfig; otherwise the user has to import too much boilerplate.

moved config to ElasticsearchDBConfig
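The config hierarchy discussed in this thread could be sketched as below. Everything here is an assumption for illustration: `BaseVectorDbConfig` and `collection_name` are hypothetical names, and the `endpoint` parameter follows the suggestion in the comment above rather than the signature that was finally merged.

```python
import os


class BaseVectorDbConfig:
    """Hypothetical base class for settings shared by all vector databases."""

    def __init__(self, collection_name="embedchain_store"):
        self.collection_name = collection_name


class ChromaDBConfig(BaseVectorDbConfig):
    def __init__(self, host=None, port=None, **kwargs):
        super().__init__(**kwargs)
        self.host = host
        self.port = port


class ElasticsearchDBConfig(BaseVectorDbConfig):
    def __init__(self, endpoint=None, **kwargs):
        super().__init__(**kwargs)
        # Fall back to the environment variable when no endpoint is passed.
        self.endpoint = endpoint or os.environ.get("ES_ENDPOINT")
        if not self.endpoint:
            raise ValueError("An Elasticsearch endpoint is required")
```

The user then reads their own environment variables and passes them in explicitly, so the library itself never has to declare environment variables on the user's behalf.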

cachho · 2023-08-06T08:37:52Z

docs/advanced/datasource.mdx

+os.environ["ES_API_KEY"] = "api_key" # Optional
+
+
+es_app_config = AppConfig(db_type='es')


AppConfig is opinionated. https://docs.embedchain.ai/advanced/app_types#app
Unless we say elasticsearch is the best database out there, AppConfig does not need this configuration option, it's better suited for CustomAppConfig.

agree with @cachho here.

App is opinionated. So we will also choose Chroma database as the default option there.

This can go in CustomAppConfig.

cachho · 2023-08-06T08:38:48Z

docs/advanced/datasource.mdx

+- `Elasticsearch` as vector database can be used by setting `db_type='es'` in `AppConfig`.
+- `ES_ENDPOINT` is mandatory to connect to `Elasticsearch`. 
+- `ES_API_KEY_ID` and `ES_API_KEY` can be configured for authentication and connecting to `Elasticsearch`. 
+- An index with name `embedchain_store_1536` is created if not present.


1536 stands for what?

1536 is the dimension of the OpenAI embedding model.
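A plausible reason for putting the dimension in the index name is that an Elasticsearch `dense_vector` field has a fixed number of dimensions, so embeddings of different sizes need separate indices. A minimal sketch of deriving the name, where the helper name `index_name` is hypothetical:

```python
def index_name(vector_dim, prefix="embedchain_store"):
    """Build an index name that encodes the embedding dimension."""
    return f"{prefix}_{vector_dim}"


print(index_name(1536))
```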

cachho · 2023-08-06T08:39:32Z

docs/advanced/datasource.mdx

+- An index with name `embedchain_store_1536` is created if not present.
+
+### OpenSourceApp
+Similarly for Open source app set `db_type='es'`


Open source app is also opinionated. More specifically, it's supposed to be open source. So CustomApp is the only place to go for this.

yes, agree with cachho here.

cachho · 2023-08-06T08:42:24Z

docs/advanced/datasource.mdx

Overall, I think we have to talk about the name of this section. Why not call it what it is, Vector Database? And then the sections are not clear to me. It should probably be

<h2>Vector Database</h2>
<h3>ChromaDb</h3>
<h3>Elasticsearch</h3>

But saying "Chromadb" is used as default and then jumping to an Elasticsearch example might be confusing.

agree, have added Vector Database and Elasticsearch heading, I was unsure what to add under ChromaDb so I have skipped it.

cachho · 2023-08-06T09:10:57Z

embedchain/vectordb/elasticsearch_db.py

+ """
+ Elasticsearch as vector database
+ :param embedding_fn: Function to generate embedding vectors.
+ :param config: Optional. elastic search client


I guess I'm not deep enough into elastic search to know why you don't call client client if that's what it is.

removed client named as config param

cachho · 2023-08-06T09:11:27Z

embedchain/vectordb/elasticsearch_db.py

+ """
+ Elasticsearch as vector database
+ :param embedding_fn: Function to generate embedding vectors.
+ :param config: Optional. elastic search client


It's not optional, right?

removed client named as config param

cachho · 2023-08-06T09:12:47Z

pyproject.toml

@@ -90,6 +90,7 @@ youtube-transcript-api = "^0.6.1"
 beautifulsoup4 = "^4.12.2"
 pypdf = "^3.11.0"
 pytube = "^15.0.0"
+elasticsearch = "^8.9.0"


As always, the question is: do we want to make all dependencies mandatory? I guess you don't have to answer this question, and what you do here is fine.

agree here.
we should have elasticsearch like this

pip install embedchain[elasticsearch]

please refer to #407 setup.py and pyproject.toml to see how this is done if you are unsure.

update: using elasticsearch(8.9.0) as optional dependency, can be installed by pip install embedchain[elasticsearch]

cachho · 2023-08-06T09:12:55Z

setup.py

@@ -36,6 +36,7 @@
 "pydantic==1.10.8",
 "replicate==0.9.0",
 "duckduckgo-search==3.8.4",
+ "elasticsearch>=8.0.0",


why is this a lower version?

yes, i think latest version is 8.9.0.
Any specific reason?

update: now using elasticsearch(8.9.0) as optional dependency as @cachho suggested.

cachho · 2023-08-06T09:22:49Z

tests/vectordb/test_elasticsearch_db.py

+from embedchain.vectordb.elasticsearch_db import EsDB
+
+
+class TestEsDB(unittest.TestCase):


I know this isn't a helpful comment, but maybe more positive tests wouldn't hurt. Like testing add, get, reset, and not just testing illegal methods.

yes we should have both positive and negative tests

Need help here, need to figure out how to mock elasticsearch client to successfully test both positive and negative test cases.

@pc9 : can you open a new issue for this?

sure will do that.
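One way to approach the mocking question is to pass the client into the database class and substitute a `unittest.mock.MagicMock` in tests. The sketch below assumes a hypothetical, heavily simplified `ElasticsearchDB` wrapper; the class in the PR is different. The client calls it relies on (`indices.exists`, `indices.create`, `count`) are intercepted by the mock here, so no server is needed.

```python
from unittest.mock import MagicMock


class ElasticsearchDB:
    """Hypothetical minimal wrapper; the real class in the PR differs."""

    def __init__(self, client, index="embedchain_store_1536"):
        self.client = client
        self.index = index
        # Create the index only if it does not exist yet.
        if not self.client.indices.exists(index=self.index):
            self.client.indices.create(index=self.index)

    def count(self):
        return self.client.count(index=self.index)["count"]


# Positive case: the mock pretends the index is missing and holds 3 docs.
client = MagicMock()
client.indices.exists.return_value = False
client.count.return_value = {"count": 3}

db = ElasticsearchDB(client)
client.indices.create.assert_called_once_with(index="embedchain_store_1536")
assert db.count() == 3

# Negative case: the client raises, and the error propagates to the caller.
failing = MagicMock()
failing.indices.exists.side_effect = ConnectionError("node unreachable")
try:
    ElasticsearchDB(failing)
except ConnectionError as exc:
    error_message = str(exc)
```

If the class builds its own client instead of receiving one, `unittest.mock.patch` can replace the client class at the module path where it is imported.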

taranjeet · 2023-08-09T01:18:04Z

docs/advanced/datasource.mdx

+os.environ["ES_API_KEY"] = "api_key" # Optional
+
+
+es_app_config = AppConfig(db_type='es')


agree with @cachho here.

taranjeet · 2023-08-09T01:46:18Z

docs/advanced/datasource.mdx

+
+## Vector Database
+
+We support `Chromadb` and `Elasticsearch` as two type of vector database. 


We support Chroma and Elasticsearch as two vector databases.
Chroma is used as a default database.

taranjeet · 2023-08-09T01:49:38Z

docs/advanced/datasource.mdx

+from embedchain import App
+from embedchain.config import AppConfig
+
+os.environ["ES_ENDPOINT"] = "elasticsearch_endpoint"


I also agree with @cachho here. There should be a base config class for db, and then for each vector database, we should have specific classes like

ElasticsearchDBConfig

ChromaDBConfig

The user can get all the variables and put them in a config class like

es_config = ElasticsearchDBConfig(endpoint=os.environ[""], ...)

taranjeet · 2023-08-09T01:50:14Z

docs/advanced/datasource.mdx

+os.environ["ES_API_KEY"] = "api_key" # Optional
+
+
+es_app_config = AppConfig(db_type='es')


App is opinionated. So we will also choose Chroma database as the default option there.

This can go in CustomAppConfig.

taranjeet · 2023-08-09T01:53:48Z

embedchain/vectordb/elasticsearch_db.py

+from embedchain.vectordb.base_vector_db import BaseVectorDB
+
+
+class EsDB(BaseVectorDB):


ElasticsearchDB

taranjeet · 2023-08-09T02:10:34Z

embedchain/config/apps/BaseAppConfig.py

@@ -17,29 +28,43 @@ def __init__(self, log_level=None, embedding_fn=None, db=None, host=None, port=N
 :param id: Optional. ID of the app. Document metadata will have this id.
 :param host: Optional. Hostname for the database server.
 :param port: Optional. Port for the database server.
+ :param db_type: Optional. db type to use. Currently [chroma, es] are supported.
+ :param vector_dim: Vector dimension generated by embedding fn


i think we can skip this part. right now vector_dim is not used with chroma as it computes the dimension itself. but in future we may need it.

taranjeet · 2023-08-09T02:13:57Z

embedchain/config/apps/BaseAppConfig.py

 if embedding_fn is None:
 raise ValueError("ChromaDb cannot be instantiated without an embedding function")
+
+ if db_type == "es":
+ from embedchain.vectordb.elasticsearch_db import EsDB


also, i initialize my elasticsearch as

ES_EXTRA_PARAMS = {
    "http_auth": (ES_USER, ES_PASSWORD),
    "ca_certs": ES_CA_CERTS,
}
ES_CONNECTION = Elasticsearch(ES_URL, **ES_EXTRA_PARAMS)

We should have a generic support so that a user can initialize ES in whatever way suits them.
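Generic support of this kind can be had by storing arbitrary keyword arguments on the config and forwarding them unchanged to the client constructor. This is a sketch only: the `es_url` name mirrors the test script in the PR description, while `build_client` and the `client_cls` parameter are hypothetical and exist so the example runs without the `elasticsearch` package installed.

```python
class ElasticsearchDBConfig:
    """Hypothetical config that stores the URL plus any client keyword arguments."""

    def __init__(self, es_url, **es_extra_params):
        self.es_url = es_url
        self.es_extra_params = es_extra_params


def build_client(config, client_cls):
    """Create a client by forwarding the stored arguments unchanged."""
    return client_cls(config.es_url, **config.es_extra_params)
```

In real use `client_cls` would be `elasticsearch.Elasticsearch`, and whichever authentication arguments that client accepts (for example `basic_auth` or `ca_certs`) pass straight through without the config class having to know about them.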

taranjeet · 2023-08-09T02:14:22Z

embedchain/config/apps/BaseAppConfig.py

 from embedchain.vectordb.chroma_db import ChromaDB

- return ChromaDB(embedding_fn=embedding_fn, host=host, port=port)
+ return VectorDb(ChromaDB(embedding_fn=embedding_fn, host=host, port=port))


in VectorDb, b should be capital.
VectorDB

As suggested, VectorDb class is removed.

taranjeet · 2023-08-09T02:14:52Z

embedchain/config/apps/OpenSourceAppConfig.py

@@ -26,6 +27,8 @@ def __init__(self, log_level=None, host=None, port=None, id=None, model=None):
 host=host,
 port=port,
 id=id,
+ db_type=db_type,
+ vector_dim=384, # vector length created by embedding fn


refer from a variable rather than the value.

update: referring from VectorDimensions under models
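Referring to the dimension through an enum rather than a literal could look like this. The member names are illustrative assumptions; only the class name `VectorDimensions` and the values 1536 and 384 come from this thread.

```python
from enum import Enum


class VectorDimensions(Enum):
    # Member names are illustrative; the values come from this thread.
    OPENAI = 1536
    OPEN_SOURCE = 384


# Use the named member instead of repeating the literal 384.
vector_dim = VectorDimensions.OPEN_SOURCE.value
```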

taranjeet · 2023-08-09T02:30:25Z

embedchain/embedchain.py

- existing_docs = self.collection.get(
- ids=ids,
- where=where, # optional filter
+ existing_ids = self.db.get(


I didn't get how this part works.
Also, we should not skip where. where can have more variables.

Also we should not skip where. where can have more variables.

This is not implemented yet (#395). Would be too much to ask of him probably.

Update: not skipping where, but currently only using app_id if present in where as filter while fetching from elasticsearch.
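The behavior described in this update, filtering by ids and by `app_id` only when it is present in `where`, can be sketched as a query builder. The function name `build_query` and the field path `metadata.app_id` are assumptions about how the documents are stored, not taken from the PR.

```python
def build_query(ids, where=None):
    """Build an Elasticsearch bool query for the given ids and optional app_id."""
    must = [{"ids": {"values": list(ids)}}]
    app_id = (where or {}).get("app_id")
    if app_id is not None:
        # Field name is an assumption about how metadata is stored.
        must.append({"term": {"metadata.app_id": app_id}})
    return {"bool": {"must": must}}
```

Other keys in `where` are ignored in this sketch, which matches the limitation stated above. Depending on the index mapping, a `term` query may need to target a keyword field for exact matching.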


- Remove additional vector db class and add functions in base and inherited db classes
- Move vector dimensions and db type in enum classes under models
- Support elasticsearch as db type in CustomApp and do not alter App and OpenSourceApp
- Add elasticsearch as an optional dependency

taranjeet · 2023-08-11T03:50:39Z

This is a good implementation, @pc9. Great work overall.

@cachho : thanks for the review.

merging this PR.

@pc9 : can you open a new issue for adding tests for mocking elasticsearch?

cachho suggested changes Aug 6, 2023

taranjeet reviewed Aug 9, 2023

pc9 marked this pull request as draft August 9, 2023 03:34

pc9 added 3 commits August 9, 2023 18:21

feat: added support for elasticsearch as a datasource

0168301

fix(es-datasource): created different index with fixed vector dim for…

ab67c9d

… different app

pc9 force-pushed the feature/es-db-support branch from 0faa2b0 to ef0219e Compare August 10, 2023 03:06

fix: Using ElasticsearchDBConfig as es db config, updated documentation

45e071d

pc9 marked this pull request as ready for review August 10, 2023 17:49

pc9 requested review from cachho and taranjeet August 10, 2023 17:49

taranjeet approved these changes Aug 11, 2023

taranjeet merged commit 0179141 into mem0ai:main Aug 11, 2023
3 checks passed

pc9 deleted the feature/es-db-support branch August 11, 2023 04:08
