-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added support for elasticsearch as a datasource #402
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thanks for your PR. I got quite a few things to say and comment about it.
I know it can sound harsh and ungrateful, but in a review I try to get to the point and not sugar coat it. You definitely did a lot of great work here. Lots of this stuff isn't your fault and is due to the way we implemented the database, which I have criticized myself for (#389).
Please let me know what you think.
I guess the biggest issues are:
- I see this as part of
CustomApp
.OpenSourceApp
is ruled out since it's not open source. And more importantly,App
andOpenSourceApp
are supposed to be opinionated, and trying to not overload the user with choices. @taranjeet please confirm. - Please try to use inheritance to implement the database method in their respective class. On the base class you can raise a
NotImplemented
error for the methods. I know this also requires you to do quite a bit of work on embedchain. But this is the most maintainable way to deal with this going forward. Thanks for that.
docs/advanced/datasource.mdx
Outdated
from embedchain import App | ||
from embedchain.config import AppConfig | ||
|
||
os.environ["ES_ENDPOINT"] = "elasticsearch_endpoint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the question is whether we want to abstract this, take the variables as arguments in something like AppConfig
or DbConfig
and then declare them as environment variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree with @cachho here. There should be a base config class for db, and then for each vector database, we should have specific classes like
- ElasticsearchDBConfig
- ChromaDBConfig
The user can get all the variables and put them in a config class like
es_config = ElasticsearchDBConfig(endpoint=os.environ[""], ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Host and Port are database settings that are already part of the AppConfig
, I know I said it myself, but now I think we should just make this an option of AppConfig
. Otherwise the user has to import too much boilerplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved config to ElasticsearchDBConfig
docs/advanced/datasource.mdx
Outdated
os.environ["ES_API_KEY"] = "api_key" # Optional | ||
|
||
|
||
es_app_config = AppConfig(db_type='es') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AppConfig is opinionated. https://docs.embedchain.ai/advanced/app_types#app
Unless we say elasticsearch is the best database out there, AppConfig
does not need this configuration option, it's better suited for CustomAppConfig
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @cachho here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
App is opinionated. So we will also choose Chroma database as the default option there.
This can go in CustomAppConfig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
docs/advanced/datasource.mdx
Outdated
- `Elasticsearch` as vector database can be used by setting `db_type='es'` in `AppConfig`. | ||
- `ES_ENDPOINT` is mandatory to connect to `Elasticsearch`. | ||
- `ES_API_KEY_ID` and `ES_API_KEY` can be configured for authentication and connecting to `Elasticsearch`. | ||
- An index with name `embedchain_store_1536` is created if not present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1536 stands for what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1536 is dimension of OpenAI embedding model.
docs/advanced/datasource.mdx
Outdated
- An index with name `embedchain_store_1536` is created if not present. | ||
|
||
### OpenSourceApp | ||
Similarly for Open source app set `db_type='es'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open source app is also opinionated. More specifically, it's supposed to be open source. So CustomApp
is the only place to go for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, agree with cachho here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
docs/advanced/datasource.mdx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think we have to talk about the name of this section. Why not call it what it is, Vector Database
? And then the sections are not clear to me. it should probably be
<h2>Vector Database</h2>
<h3>ChromaDb</h3>
<h3>Elasticsearch</h3>
but saying "Chromadb" is used as default.
and then jumping to an Elasticsearch example might be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, have added Vector Database
and Elasticsearch
heading, I was unsure what to add under ChromaDb
so I have skipped it.
""" | ||
Elasticsearch as vector database | ||
:param embedding_fn: Function to generate embedding vectors. | ||
:param config: Optional. elastic search client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm not deep enough into elastic search to know why you don't call client
client
if that's what it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed client named as config param
""" | ||
Elasticsearch as vector database | ||
:param embedding_fn: Function to generate embedding vectors. | ||
:param config: Optional. elastic search client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not optional, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed client named as config param
pyproject.toml
Outdated
@@ -90,6 +90,7 @@ youtube-transcript-api = "^0.6.1" | |||
beautifulsoup4 = "^4.12.2" | |||
pypdf = "^3.11.0" | |||
pytube = "^15.0.0" | |||
elasticsearch = "^8.9.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As always, the question is: do we want to make all dependencies mandatory. I guess, you don't have to answer this question and what you do here is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree here.
we should have elasticsearch like this
pip install embedchain[elasticsearch]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please refer to #407 setup.py
and pyproject.toml
to see how this is done if you are unsure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: using elasticsearch(8.9.0) as optional dependency, can be installed by pip install embedchain[elasticsearch]
setup.py
Outdated
@@ -36,6 +36,7 @@ | |||
"pydantic==1.10.8", | |||
"replicate==0.9.0", | |||
"duckduckgo-search==3.8.4", | |||
"elasticsearch>=8.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a lower version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i think latest version is 8.9.0.
Any specific reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from embedchain.vectordb.elasticsearch_db import EsDB | ||
|
||
|
||
class TestEsDB(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this isn't a helpful comment, but maybe more positive tests wouldn't hurt. Like testing add
, get
, reset
, and not just testing illegal methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we should have both positive and negative tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need help here, need to figure out how to mock elasticsearch client to successfully test both positive and negative test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pc9 : can you open a new issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure will do that.
docs/advanced/datasource.mdx
Outdated
os.environ["ES_API_KEY"] = "api_key" # Optional | ||
|
||
|
||
es_app_config = AppConfig(db_type='es') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @cachho here.
docs/advanced/datasource.mdx
Outdated
|
||
## Vector Database | ||
|
||
We support `Chromadb` and `Elasticsearch` as two type of vector database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We support Chroma and Elasticsearch as two vector databases.
Chroma is used as a default database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
docs/advanced/datasource.mdx
Outdated
from embedchain import App | ||
from embedchain.config import AppConfig | ||
|
||
os.environ["ES_ENDPOINT"] = "elasticsearch_endpoint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree with @cachho here. There should be a base config class for db, and then for each vector database, we should have specific classes like
- ElasticsearchDBConfig
- ChromaDBConfig
The user can get all the variables and put them in a config class like
es_config = ElasticsearchDBConfig(endpoint=os.environ[""], ...)
docs/advanced/datasource.mdx
Outdated
os.environ["ES_API_KEY"] = "api_key" # Optional | ||
|
||
|
||
es_app_config = AppConfig(db_type='es') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
App is opinionated. So we will also choose Chroma database as the default option there.
This can go in CustomAppConfig.
from embedchain.vectordb.base_vector_db import BaseVectorDB | ||
|
||
|
||
class EsDB(BaseVectorDB): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ElasticsearchDB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -17,29 +28,43 @@ def __init__(self, log_level=None, embedding_fn=None, db=None, host=None, port=N | |||
:param id: Optional. ID of the app. Document metadata will have this id. | |||
:param host: Optional. Hostname for the database server. | |||
:param port: Optional. Port for the database server. | |||
:param db_type: Optional. db type to use. Currently [chroma, es] are supported. | |||
:param vector_dim: Vector dimension generated by embedding fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can skip this part. right now vector_dim is not used with chroma as it computes the dimension itself. but in future we may need it.
if embedding_fn is None: | ||
raise ValueError("ChromaDb cannot be instantiated without an embedding function") | ||
|
||
if db_type == "es": | ||
from embedchain.vectordb.elasticsearch_db import EsDB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, i initialize my elasticsearch as
ES_EXTRA_PARAMS = {
"http_auth": (ES_USER, ES_PASSWORD),
"ca_certs": ES_CA_CERTS,
}
ES_CONNECTION = Elasticsearch(ES_URL, **ES_EXTRA_PARAMS)
We should have a generic support so that a user can initialize ES in whatever way suits them.
from embedchain.vectordb.chroma_db import ChromaDB | ||
|
||
return ChromaDB(embedding_fn=embedding_fn, host=host, port=port) | ||
return VectorDb(ChromaDB(embedding_fn=embedding_fn, host=host, port=port)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in VectorDb
, b
should be capital.
VectorDB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested, VecotrDb class is removed.
@@ -26,6 +27,8 @@ def __init__(self, log_level=None, host=None, port=None, id=None, model=None): | |||
host=host, | |||
port=port, | |||
id=id, | |||
db_type=db_type, | |||
vector_dim=384, # vector length created by embedding fn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refer from a variable rather than the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: referring from VectorDimensions under models
existing_docs = self.collection.get( | ||
ids=ids, | ||
where=where, # optional filter | ||
existing_ids = self.db.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didnt get how this part is working.
Also we should not skip where. where
can have more variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we should not skip where. where can have more variables.
This is not implemented yet (#395). Would be too much to ask of him probably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: not skipping where, but currently only using app_id if present in where as filter while fetching from elasticsearch.
- Remove additional vector db class and add functions in base and inherited db classes - Move vector dimensions and db type in enum classes under models - Support elasticsearch as db type in CustomApp and do not alter App and OpenSourceApp - Add elasticsearch as an optional dependency
0faa2b0
to
ef0219e
Compare
Description
Adding support to use elasticsearch as vector database in CustomApp.
to-dos:
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Checklist:
Maintainer Checklist