Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added support for elasticsearch as a datasource #402

Merged
merged 4 commits into from
Aug 11, 2023

Conversation

pc9
Copy link
Contributor

@pc9 pc9 commented Aug 4, 2023

Description

Adding support to use elasticsearch as vector database in CustomApp.

to-dos:

  • need to figure out how to mock elasticsearch client to successfully test both positive and negative test cases.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Test Script (please provide)
import os
from embedchain import CustomApp
from embedchain.config import CustomAppConfig, ElasticsearchDBConfig
from embedchain.models import Providers, EmbeddingFunctions, VectorDatabases

os.environ["OPENAI_API_KEY"] = 'OPENAI_API_KEY'

es_config = ElasticsearchDBConfig(
	# elasticsearch url or list of nodes url with different hosts and ports.
	es_url='https://localhost:9200',
	# pass named parameters supported by Python Elasticsearch client
	basic_auth=("username", "password")
)
config = CustomAppConfig(
	embedding_fn=EmbeddingFunctions.OPENAI, 
	provider=Providers.OPENAI, 
	db_type=VectorDatabases.ELASTICSEARCH, 
	es_config=es_config,
)
es_app = CustomApp(config)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings
  • New and existing unit tests pass locally with my changes

Maintainer Checklist

@cachho
Copy link
Contributor

cachho commented Aug 4, 2023

thanks for the PR, will review tomorrow.

Since I'm just going through dependency hell, we have to consider whether we want to require elasticsearch as a package dependency or leave it to the user. #392 #335

Copy link
Contributor

@cachho cachho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for your PR. I got quite a few things to say and comment about it.

I know it can sound harsh and ungrateful, but in a review I try to get to the point and not sugar coat it. You definitely did a lot of great work here. Lots of this stuff isn't your fault and is due to the way we implemented the database, which I have criticized myself for (#389).

Please let me know what you think.

I guess the biggest issues are:

  • I see this as part of CustomApp. OpenSourceApp is ruled out since it's not open source. And more importantly, App and OpenSourceApp are supposed to be opinionated, and trying to not overload the user with choices. @taranjeet please confirm.
  • Please try to use inheritance to implement the database method in their respective class. On the base class you can raise a NotImplemented error for the methods. I know this also requires you to do quite a bit of work on embedchain. But this is the most maintainable way to deal with this going forward. Thanks for that.

from embedchain import App
from embedchain.config import AppConfig

os.environ["ES_ENDPOINT"] = "elasticsearch_endpoint"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the question is whether we want to abstract this, take the variables as arguments in something like AppConfig or DbConfig and then declare them as environment variables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree with @cachho here. There should be a base config class for db, and then for each vector database, we should have specific classes like

  • ElasticsearchDBConfig
  • ChromaDBConfig

The user can get all the variables and put them in a config class like

es_config = ElasticsearchDBConfig(endpoint=os.environ[""], ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Host and Port are database settings that are already part of the AppConfig, I know I said it myself, but now I think we should just make this an option of AppConfig. Otherwise the user has to import too much boilerplate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved config to ElasticsearchDBConfig

os.environ["ES_API_KEY"] = "api_key" # Optional


es_app_config = AppConfig(db_type='es')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AppConfig is opinionated. https://docs.embedchain.ai/advanced/app_types#app
Unless we say elasticsearch is the best database out there, AppConfig does not need this configuration option, it's better suited for CustomAppConfig.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @cachho here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

App is opinionated. So we will also choose Chroma database as the default option there.

This can go in CustomAppConfig.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

- `Elasticsearch` as vector database can be used by setting `db_type='es'` in `AppConfig`.
- `ES_ENDPOINT` is mandatory to connect to `Elasticsearch`.
- `ES_API_KEY_ID` and `ES_API_KEY` can be configured for authentication and connecting to `Elasticsearch`.
- An index with name `embedchain_store_1536` is created if not present.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1536 stands for what?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1536 is dimension of OpenAI embedding model.

- An index with name `embedchain_store_1536` is created if not present.

### OpenSourceApp
Similarly for Open source app set `db_type='es'`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open source app is also opinionated. More specifically, it's supposed to be open source. So CustomApp is the only place to go for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agree with cachho here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think we have to talk about the name of this section. Why not call it what it is, Vector Database? And then the sections are not clear to me. it should probably be

<h2>Vector Database</h2>
<h3>ChromaDb</h3>
<h3>Elasticsearch</h3>

but saying "Chromadb" is used as default. and then jumping to an Elasticsearch example might be confusing.

Copy link
Contributor Author

@pc9 pc9 Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, have added Vector Database and Elasticsearch heading, I was unsure what to add under ChromaDb so I have skipped it.

"""
Elasticsearch as vector database
:param embedding_fn: Function to generate embedding vectors.
:param config: Optional. elastic search client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm not deep enough into elastic search to know why you don't call client client if that's what it is.

Copy link
Contributor Author

@pc9 pc9 Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed client named as config param

"""
Elasticsearch as vector database
:param embedding_fn: Function to generate embedding vectors.
:param config: Optional. elastic search client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not optional, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed client named as config param

pyproject.toml Outdated
@@ -90,6 +90,7 @@ youtube-transcript-api = "^0.6.1"
beautifulsoup4 = "^4.12.2"
pypdf = "^3.11.0"
pytube = "^15.0.0"
elasticsearch = "^8.9.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As always, the question is: do we want to make all dependencies mandatory. I guess, you don't have to answer this question and what you do here is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree here.
we should have elasticsearch like this

pip install embedchain[elasticsearch]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please refer to #407 setup.py and pyproject.toml to see how this is done if you are unsure.

Copy link
Contributor Author

@pc9 pc9 Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: using elasticsearch(8.9.0) as optional dependency, can be installed by pip install embedchain[elasticsearch]

setup.py Outdated
@@ -36,6 +36,7 @@
"pydantic==1.10.8",
"replicate==0.9.0",
"duckduckgo-search==3.8.4",
"elasticsearch>=8.0.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a lower version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i think latest version is 8.9.0.
Any specific reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: now using elasticsearch(8.9.0) as optional dependency as @cachho suggested.

from embedchain.vectordb.elasticsearch_db import EsDB


class TestEsDB(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't a helpful comment, but maybe more positive tests wouldn't hurt. Like testing add, get, reset, and not just testing illegal methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should have both positive and negative tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need help here, need to figure out how to mock elasticsearch client to successfully test both positive and negative test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pc9 : can you open a new issue for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure will do that.

os.environ["ES_API_KEY"] = "api_key" # Optional


es_app_config = AppConfig(db_type='es')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @cachho here.


## Vector Database

We support `Chromadb` and `Elasticsearch` as two type of vector database.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We support Chroma and Elasticsearch as two vector databases.
Chroma is used as a default database.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

from embedchain import App
from embedchain.config import AppConfig

os.environ["ES_ENDPOINT"] = "elasticsearch_endpoint"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree with @cachho here. There should be a base config class for db, and then for each vector database, we should have specific classes like

  • ElasticsearchDBConfig
  • ChromaDBConfig

The user can get all the variables and put them in a config class like

es_config = ElasticsearchDBConfig(endpoint=os.environ[""], ...)

os.environ["ES_API_KEY"] = "api_key" # Optional


es_app_config = AppConfig(db_type='es')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

App is opinionated. So we will also choose Chroma database as the default option there.

This can go in CustomAppConfig.

from embedchain.vectordb.base_vector_db import BaseVectorDB


class EsDB(BaseVectorDB):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ElasticsearchDB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -17,29 +28,43 @@ def __init__(self, log_level=None, embedding_fn=None, db=None, host=None, port=N
:param id: Optional. ID of the app. Document metadata will have this id.
:param host: Optional. Hostname for the database server.
:param port: Optional. Port for the database server.
:param db_type: Optional. db type to use. Currently [chroma, es] are supported.
:param vector_dim: Vector dimension generated by embedding fn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can skip this part. right now vector_dim is not used with chroma as it computes the dimension itself. but in future we may need it.

if embedding_fn is None:
raise ValueError("ChromaDb cannot be instantiated without an embedding function")

if db_type == "es":
from embedchain.vectordb.elasticsearch_db import EsDB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, i initialize my elasticsearch as

ES_EXTRA_PARAMS = {
    "http_auth": (ES_USER, ES_PASSWORD),
    "ca_certs": ES_CA_CERTS,
}

ES_CONNECTION = Elasticsearch(ES_URL, **ES_EXTRA_PARAMS)

We should have a generic support so that a user can initialize ES in whatever way suits them.

from embedchain.vectordb.chroma_db import ChromaDB

return ChromaDB(embedding_fn=embedding_fn, host=host, port=port)
return VectorDb(ChromaDB(embedding_fn=embedding_fn, host=host, port=port))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in VectorDb, b should be capital.
VectorDB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested, VecotrDb class is removed.

@@ -26,6 +27,8 @@ def __init__(self, log_level=None, host=None, port=None, id=None, model=None):
host=host,
port=port,
id=id,
db_type=db_type,
vector_dim=384, # vector length created by embedding fn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refer from a variable rather than the value.

Copy link
Contributor Author

@pc9 pc9 Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: referring from VectorDimensions under models

existing_docs = self.collection.get(
ids=ids,
where=where, # optional filter
existing_ids = self.db.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt get how this part is working.
Also we should not skip where. where can have more variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we should not skip where. where can have more variables.

This is not implemented yet (#395). Would be too much to ask of him probably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: not skipping where, but currently only using app_id if present in where as filter while fetching from elasticsearch.

@pc9 pc9 marked this pull request as draft August 9, 2023 03:34
pc9 added 3 commits August 9, 2023 18:21
 - Remove additional vector db class and add functions in base and inherited db classes
 - Move vector dimensions and db type in enum classes under models
 - Support elasticsearch as db type in CustomApp and do not alter App and OpenSourceApp
 - Add elasticsearch as an optional dependency
@pc9 pc9 marked this pull request as ready for review August 10, 2023 17:49
@pc9 pc9 requested review from cachho and taranjeet August 10, 2023 17:49
@taranjeet
Copy link
Member

this is good implementation @pc9 . Great work overall.

@cachho : thanks for the review.

merging this PR.

@pc9 : can you open a new issue for adding tests for mocking elasticsearch?

@taranjeet taranjeet merged commit 0179141 into mem0ai:main Aug 11, 2023
3 checks passed
@pc9 pc9 deleted the feature/es-db-support branch August 11, 2023 04:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: add support for elasticsearch as a datasource
3 participants