Pylint (import related warnings) and REST API improvements (deepset-ai#2326)

* remove duplicate imports

* fix ungrouped-imports

* Fix wrong-import-position

* Fix unused-import

* pyproject.toml

* Working on wrong-import-order

* Solve wrong-import-order

* fix Pool import

* Move open_search_index_to_document_store and elasticsearch_index_to_document_store into elasticsearch.py

* remove Converter from modeling

* Fix mypy issues on adaptive_model.py

* create es_converter.py

* remove converter import

* change import path in tests

* Restructure REST API to not rely on global vars from search.py and improve tests

* Fix openapi generator

* Move variable initialization

* Change type of FilterRequest.filters

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
ZanSara and github-actions[bot] committed Apr 12, 2022
1 parent 75dcfd3 commit 96a538b
Showing 98 changed files with 1,291 additions and 1,227 deletions.
11 changes: 8 additions & 3 deletions .github/utils/generate_openapi_specs.py

@@ -4,6 +4,10 @@
 import sys
 import shutil

+sys.path.append(".")
+from rest_api.utils import get_openapi_specs, get_app, get_pipelines  # pylint: disable=wrong-import-position
+from haystack import __version__  # pylint: disable=wrong-import-position
+
 REST_PATH = Path("./rest_api").absolute()
 PIPELINE_PATH = str(REST_PATH / "pipeline" / "pipeline_empty.haystack-pipeline.yml")
 APP_PATH = str(REST_PATH / "application.py")
@@ -13,8 +17,9 @@

 print(f"Loading OpenAPI specs from {APP_PATH} with pipeline at {PIPELINE_PATH}")

-sys.path.append(".")
-from rest_api.application import get_openapi_specs, haystack_version
+# To initialize the app and the pipelines
+get_app()
+get_pipelines()

 # Generate the openapi specs
 specs = get_openapi_specs()
@@ -29,4 +34,4 @@
 os.remove(specs_file)

 # Add versioned copy
-shutil.copy(DOCS_PATH / "openapi.json", DOCS_PATH / f"openapi-{haystack_version}.json")
+shutil.copy(DOCS_PATH / "openapi.json", DOCS_PATH / f"openapi-{__version__}.json")
2 changes: 1 addition & 1 deletion Dockerfile

@@ -46,4 +46,4 @@ EXPOSE 8000
 ENV HAYSTACK_DOCKER_CONTAINER="HAYSTACK_CPU_CONTAINER"

 # cmd for running the API
-CMD ["gunicorn", "rest_api.application:app", "-b", "0.0.0.0", "-k", "uvicorn.workers.UvicornWorker", "--workers", "1", "--timeout", "180"]
\ No newline at end of file
+CMD ["gunicorn", "rest_api.application:app", "-b", "0.0.0.0", "-k", "uvicorn.workers.UvicornWorker", "--workers", "1", "--timeout", "180"]
2 changes: 1 addition & 1 deletion docker-compose-gpu.yml

@@ -21,7 +21,7 @@ services:
       - 8000:8000
     restart: on-failure
     environment:
-      # See rest_api/pipelines.yaml for configurations of Search & Indexing Pipeline.
+      # See rest_api/pipeline/pipelines.haystack-pipeline.yml for configurations of Search & Indexing Pipeline.
       - DOCUMENTSTORE_PARAMS_HOST=elasticsearch
       - PIPELINE_YAML_PATH=/home/user/rest_api/pipeline/pipelines_dpr.haystack-pipeline.yml
       - CONCURRENT_REQUEST_PER_WORKER
2 changes: 1 addition & 1 deletion docker-compose.yml

@@ -12,7 +12,7 @@ services:
       - 8000:8000
     restart: on-failure
    environment:
-      # See rest_api/pipelines.yaml for configurations of Search & Indexing Pipeline.
+      # See rest_api/pipeline/pipelines.haystack-pipeline.yml for configurations of Search & Indexing Pipeline.
       - DOCUMENTSTORE_PARAMS_HOST=elasticsearch
       - PIPELINE_YAML_PATH=/home/user/rest_api/pipeline/pipelines.haystack-pipeline.yml
       - CONCURRENT_REQUEST_PER_WORKER
112 changes: 0 additions & 112 deletions docs/_src/api/api/document_store.md

@@ -4752,115 +4752,3 @@ and UTC as default time zone.
This method cannot be part of WeaviateDocumentStore, as this would result in a circular import between weaviate.py
and filter_utils.py.

<a id="utils.open_search_index_to_document_store"></a>

#### open\_search\_index\_to\_document\_store

```python
def open_search_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, id_hash_keys: Optional[List[str]] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "admin", password: str = "admin", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "https", ca_certs: Optional[str] = None, verify_certs: bool = False, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
```

This function provides brownfield support of existing OpenSearch indexes by converting each of the records in
the provided index to haystack `Document` objects and writing them to the specified `DocumentStore`. It can be
run on a regular basis in order to add new records of the OpenSearch index to the `DocumentStore`.

**Arguments**:

- `document_store`: The haystack `DocumentStore` to write the converted `Document` objects to.
- `original_index_name`: OpenSearch index containing the records to be converted.
- `original_content_field`: OpenSearch field containing the text to be put in the `content` field of the
resulting haystack `Document` objects.
- `original_name_field`: Optional OpenSearch field containing the title of the Document.
- `included_metadata_fields`: List of OpenSearch fields that shall be stored in the `meta` field of the
resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`,
all the fields found in the OpenSearch records will be kept as metadata. You can specify only one of the
`included_metadata_fields` and `excluded_metadata_fields` parameters.
- `excluded_metadata_fields`: List of OpenSearch fields that shall be excluded from the `meta` field of the
resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`,
all the fields found in the OpenSearch records will be kept as metadata. You can specify only one of the
`included_metadata_fields` and `excluded_metadata_fields` parameters.
- `store_original_ids`: Whether to store the ID a record had in the original OpenSearch index at the
`"_original_es_id"` metadata field of the resulting haystack `Document` objects. This should be set to `True`
if you want to continuously update the `DocumentStore` with new records inside your OpenSearch index. If this
parameter was set to `False` on the first call of `open_search_index_to_document_store`,
all the indexed Documents in the `DocumentStore` will be overwritten in the second call.
- `index`: Name of index in `document_store` to use to store the resulting haystack `Document` objects.
- `preprocessor`: Optional PreProcessor that will be applied on the content field of the original OpenSearch
record.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `["content", "meta"]` to this field.
In this case the id will be generated by using both the content and the defined metadata.
- `batch_size`: Number of records to process at once.
- `host`: URL(s) of OpenSearch nodes.
- `port`: Port(s) of OpenSearch nodes.
- `username`: Username (standard authentication via http_auth).
- `password`: Password (standard authentication via http_auth).
- `api_key_id`: ID of the API key (alternative authentication mode to the above http_auth).
- `api_key`: Secret value of the API key (alternative authentication mode to the above http_auth).
- `aws4auth`: Authentication for usage with AWS OpenSearch
(can be generated with the requests-aws4auth package).
- `scheme`: `"https"` or `"http"`, protocol used to connect to your OpenSearch instance.
- `ca_certs`: Root certificates for SSL: it is a path to the certificate authority (CA) certs on disk.
You can use the certifi package with `certifi.where()` to find where the CA certs file is located on your machine.
- `verify_certs`: Whether to be strict about ca certificates.
- `timeout`: Number of seconds after which an OpenSearch request times out.
- `use_system_proxy`: Whether to use system proxy.
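
The argument semantics above (content field, optional name field, included/excluded metadata filters, and `store_original_ids`) can be illustrated with a simplified, self-contained sketch. `convert_record` below is a hypothetical helper for illustration only, not Haystack's actual implementation:

```python
from typing import Dict, List, Optional

def convert_record(
    record: Dict,
    content_field: str,
    name_field: Optional[str] = None,
    included_metadata_fields: Optional[List[str]] = None,
    excluded_metadata_fields: Optional[List[str]] = None,
    store_original_ids: bool = True,
) -> Dict:
    """Turn one OpenSearch/Elasticsearch hit into a Document-like dict (simplified sketch)."""
    if included_metadata_fields and excluded_metadata_fields:
        raise ValueError("Specify only one of included_metadata_fields and excluded_metadata_fields")
    source = record["_source"]
    # Every field except the content field is a metadata candidate
    meta = {k: v for k, v in source.items() if k != content_field}
    if included_metadata_fields is not None:
        meta = {k: v for k, v in meta.items() if k in included_metadata_fields}
    elif excluded_metadata_fields is not None:
        meta = {k: v for k, v in meta.items() if k not in excluded_metadata_fields}
    if name_field and name_field in source:
        meta["name"] = source[name_field]
    if store_original_ids:
        # Keeping the original id lets repeated runs update existing Documents instead of duplicating them
        meta["_original_es_id"] = record["_id"]
    return {"content": source[content_field], "meta": meta}

hit = {"_id": "42", "_source": {"text": "hello", "title": "greeting", "lang": "en"}}
doc = convert_record(hit, content_field="text", name_field="title",
                     excluded_metadata_fields=["lang"])
# doc["content"] is "hello"; doc["meta"] keeps "title", drops "lang",
# and records the original id under "_original_es_id"
```

The real function additionally scrolls through the index in batches of `batch_size` and applies the optional `preprocessor` to each record's content before writing.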

<a id="utils.elasticsearch_index_to_document_store"></a>

#### elasticsearch\_index\_to\_document\_store

```python
def elasticsearch_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, id_hash_keys: Optional[List[str]] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "", password: str = "", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "http", ca_certs: Optional[str] = None, verify_certs: bool = True, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
```

This function provides brownfield support of existing Elasticsearch indexes by converting each of the records in
the provided index to haystack `Document` objects and writing them to the specified `DocumentStore`. It can be
run on a regular basis in order to add new records of the Elasticsearch index to the `DocumentStore`.

**Arguments**:

- `document_store`: The haystack `DocumentStore` to write the converted `Document` objects to.
- `original_index_name`: Elasticsearch index containing the records to be converted.
- `original_content_field`: Elasticsearch field containing the text to be put in the `content` field of the
resulting haystack `Document` objects.
- `original_name_field`: Optional Elasticsearch field containing the title of the Document.
- `included_metadata_fields`: List of Elasticsearch fields that shall be stored in the `meta` field of the
resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`,
all the fields found in the Elasticsearch records will be kept as metadata. You can specify only one of the
`included_metadata_fields` and `excluded_metadata_fields` parameters.
- `excluded_metadata_fields`: List of Elasticsearch fields that shall be excluded from the `meta` field of the
resulting haystack `Document` objects. If `included_metadata_fields` and `excluded_metadata_fields` are `None`,
all the fields found in the Elasticsearch records will be kept as metadata. You can specify only one of the
`included_metadata_fields` and `excluded_metadata_fields` parameters.
- `store_original_ids`: Whether to store the ID a record had in the original Elasticsearch index at the
`"_original_es_id"` metadata field of the resulting haystack `Document` objects. This should be set to `True`
if you want to continuously update the `DocumentStore` with new records inside your Elasticsearch index. If this
parameter was set to `False` on the first call of `elasticsearch_index_to_document_store`,
all the indexed Documents in the `DocumentStore` will be overwritten in the second call.
- `index`: Name of index in `document_store` to use to store the resulting haystack `Document` objects.
- `preprocessor`: Optional PreProcessor that will be applied on the content field of the original Elasticsearch
record.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `["content", "meta"]` to this field.
In this case the id will be generated by using both the content and the defined metadata.
- `batch_size`: Number of records to process at once.
- `host`: URL(s) of Elasticsearch nodes.
- `port`: Port(s) of Elasticsearch nodes.
- `username`: Username (standard authentication via http_auth).
- `password`: Password (standard authentication via http_auth).
- `api_key_id`: ID of the API key (alternative authentication mode to the above http_auth).
- `api_key`: Secret value of the API key (alternative authentication mode to the above http_auth).
- `aws4auth`: Authentication for usage with AWS Elasticsearch
(can be generated with the requests-aws4auth package).
- `scheme`: `"https"` or `"http"`, protocol used to connect to your Elasticsearch instance.
- `ca_certs`: Root certificates for SSL: it is a path to the certificate authority (CA) certs on disk.
You can use the certifi package with `certifi.where()` to find where the CA certs file is located on your machine.
- `verify_certs`: Whether to be strict about ca certificates.
- `timeout`: Number of seconds after which an Elasticsearch request times out.
- `use_system_proxy`: Whether to use system proxy.
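
The `id_hash_keys` behaviour described above, deriving the document id from the content plus selected metadata so that identical texts with different metadata are not collapsed into one Document, can be sketched as follows. `make_id` is a hypothetical helper for illustration, not the hashing scheme Haystack actually uses:

```python
import hashlib
import json
from typing import Dict, List, Optional

def make_id(content: str, meta: Dict, id_hash_keys: Optional[List[str]] = None) -> str:
    """Derive a deterministic document id from the chosen attributes (simplified sketch)."""
    keys = id_hash_keys or ["content"]
    parts = []
    for key in keys:
        if key == "content":
            parts.append(content)
        elif key == "meta":
            # Serialize metadata deterministically so equal dicts hash equally
            parts.append(json.dumps(meta, sort_keys=True))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# Same text, different metadata:
a = make_id("hello", {"source": "es-index-1"})
b = make_id("hello", {"source": "es-index-2"})
assert a == b  # content-only hashing treats these as duplicates

a2 = make_id("hello", {"source": "es-index-1"}, id_hash_keys=["content", "meta"])
b2 = make_id("hello", {"source": "es-index-2"}, id_hash_keys=["content", "meta"])
assert a2 != b2  # including "meta" keeps them distinct
```

Passing `["content", "meta"]` therefore trades automatic deduplication of equal texts for the ability to keep per-index variants apart.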
