-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Generate JSON schema when missing #3533
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the speed improvement on my machine too 🤩 Crazy that we have to do this, anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creative solution to a problem we couldn't really fix otherwise. 👍
In the tests I just saw that many are failing with a FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/haystack/json-schemas/haystack-pipeline-main.schema.json'
but there seems to be not other reason for failing tests. 👍
c363dc7
to
bdf5f7e
Compare
35c516e
to
0bd7ffe
Compare
@@ -150,6 +150,8 @@ saved_models | |||
*_build | |||
rest_api/file-upload/* | |||
**/feedback_squad_direct.json | |||
haystack/json-schemas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like keeping all the ignore rules at the root, the .gitkeep
file will make sure the json-schemas
folder is always checked out by git.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest version of my code creates the folder if missing so we don't even need the .gitkeep
file anymore
@@ -176,7 +176,7 @@ def create_schema_for_node_class(node_class: Type[BaseComponent]) -> Tuple[Dict[ | |||
|
|||
node_name = getattr(node_class, "__name__") | |||
|
|||
logger.info("Creating schema for '%s'", node_name) | |||
logger.debug("Creating schema for '%s'", node_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this verbosity level is more appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the schema related changes look good to me! 👍 One thing I wonder about: there was a test/samples/squad/small.jsonl added. I don't think that this addition was intended as we already have a small.json. So after removing that file again I would say this PR is ready to be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to be merged 👍
* removed unused script * print info logs when generating openapi schema * create json schema only when needed * fix tests * Remove leftover Co-authored-by: ZanSara <[email protected]>
* Fix docstrings for DocumentStores * Fix docstrings for AnswerGenerator * Fix docstrings for Connector * Fix docstrings for DocumentClassifier * Fix docstrings for LabelGenerator * Fix docstrings for QueryClassifier * Fix docstrings for Ranker * Fix docstrings for Retriever and Summarizer * Fix docstrings for Translator * Fix docstrings for Pipelines * Fix docstrings for Primitives * Fix Python code block spacing * Add line break before code block * Fix code blocks * fix: discard metadata fields if not set in Weaviate (#3578) * fix weaviate bug in returning embeddings and setting empty meta fields * review comment * Update unstable version and openapi schema (#3584) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * fix: Flatten `DocumentClassifier` output in `SQLDocumentStore`; remove `_sql_session_rollback` hack in tests (#3273) * first draft * fix * fix * move test to test_sql * test: add test to check id_hash_keys is not ignored (#3577) * refactor: Generate JSON schema when missing (#3533) * removed unused script * print info logs when generating openapi schema * create json schema only when needed * fix tests * Remove leftover Co-authored-by: ZanSara <[email protected]> * move milvus tests to their own module (#3596) * feat: store metadata using JSON in SQLDocumentStore (#3547) * add warnings * make the field cachable * review comment * Pin faiss-cpu as 1.7.3 seems to have problems (#3603) * Update Haystack imports (#3599) * Update Python version (#3602) * fix: `ParsrConverter` fails on pages without text (#3605) * try to fix bug * remove print * leftover * refactor: update Squad data (#3513) * refractor the to_squad data class * fix the validation label * refractor the to_squad data class * fix the validation label * add the test for the to_label object function * fix the tests for to_label_objects * move all the test related to squad data to one file * remove unused imports * revert tiny_augmented.json Co-authored-by: ZanSara <[email protected]> * Url fixes (#3592) * add 2 example scripts * fixing faq script * fixing some urls * removing example scripts * black reformatting * add labeler to the repo (#3609) * convert eval metrics to python float (#3612) * feat: add support for `BM25Retriever` in `InMemoryDocumentStore` (#3561) * very first draft * implement query and query_batch * add more bm25 parameters * add rank_bm25 dependency * fix mypy * remove tokenizer callable parameter * remove unused import * only json serializable attributes * try to fix: pylint too-many-public-methods / R0904 * bm25 attribute always present * convert errors into warnings to make the tutorial 1 work * add docstrings; tests * try to make tests run * better docstrings; revert not running tests * some suggestions from review * rename elasticsearch retriever as bm25 in tests; try to test memory_bm25 * exclude tests with filters * change elasticsearch to bm25 retriever in test_summarizer * add tests * try to improve tests * better type hint * adapt test_table_text_retriever_embedding * handle non-textual docs * query only textual documents * Incorporate Reviewer feedback * refactor: replace `torch.no_grad` with `torch.inference_mode` (where possible) (#3601) * try to replace torch.no_grad * revert erroneous change * revert other module breaking * revert training/base * Fix docstrings for DocumentStores * Fix docstrings for AnswerGenerator * Fix docstrings for Connector * Fix docstrings for DocumentClassifier * Fix docstrings for LabelGenerator * Fix docstrings for QueryClassifier * Fix docstrings for Ranker * Fix docstrings for Retriever and Summarizer * Fix docstrings for Translator * Fix docstrings for Pipelines * Fix docstrings for Primitives * Fix Python code block spacing * Add line break before code block * Fix code blocks * Incorporate Reviewer feedback Co-authored-by: Massimiliano Pippi <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <[email protected]> Co-authored-by: Julian Risch <[email protected]> Co-authored-by: ZanSara <[email protected]> Co-authored-by: Espoir Murhabazi <[email protected]> Co-authored-by: Tuana Celik <[email protected]> Co-authored-by: tstadel <[email protected]>
Related Issues
Proposed Changes:
This is a bit hardcore, but I couldn't find a gentler method. The idea is to let thesocket
module fail systematically while we generate the schema. The speed-up is around 1 second on my laptop.We discussed a better alternative that makes the schema generation performance less critical. With this PR, we simply generate the schema whenever it's needed but there isn't one (e.g. after a fresh install).
How did you test it?
Tested locally and checked out the diff with the schema generated from
main
Notes for the reviewer
Leaving comments in the code
Checklist