refactor: Generate JSON schema when missing #3533

masci · 2022-11-04T15:49:34Z

Related Issues

originated from speed up YAML schema generation #3494

Proposed Changes:

~~This is a bit hardcore, but I couldn't find a gentler method. The idea is to let the socket module fail systematically while we generate the schema. The speed-up is around 1 second on my laptop.~~
We discussed a better alternative that makes the schema generation performance less critical. With this PR, we simply generate the schema whenever it's needed but there isn't one (e.g. after a fresh install).

How did you test it?

Tested locally and checked out the diff with the schema generated from main

Notes for the reviewer

Leaving comments in the code

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

ZanSara

I see the speed improvement on my machine too 🤩 Crazy that we have to do this, anyway

julian-risch

Creative solution to a problem we couldn't really fix otherwise. 👍
In the tests I just saw that many are failing with a FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/haystack/json-schemas/haystack-pipeline-main.schema.json' but there seems to be not other reason for failing tests. 👍

masci · 2022-11-16T10:33:44Z

.gitignore

@@ -150,6 +150,8 @@ saved_models
 *_build
 rest_api/file-upload/*
 **/feedback_squad_direct.json
+haystack/json-schemas


I like keeping all the ignore rules at the root, the .gitkeep file will make sure the json-schemas folder is always checked out by git.

The latest version of my code creates the folder if missing so we don't even need the .gitkeep file anymore

masci · 2022-11-16T10:34:16Z

haystack/nodes/_json_schema.py

@@ -176,7 +176,7 @@ def create_schema_for_node_class(node_class: Type[BaseComponent]) -> Tuple[Dict[

 node_name = getattr(node_class, "__name__")

- logger.info("Creating schema for '%s'", node_name)
+ logger.debug("Creating schema for '%s'", node_name)


I think this verbosity level is more appropriate

julian-risch

All the schema related changes look good to me! 👍 One thing I wonder about: there was a test/samples/squad/small.jsonl added. I don't think that this addition was intended as we already have a small.json. So after removing that file again I would say this PR is ready to be merged.

test/samples/squad/small.jsonl

julian-risch

Ready to be merged 👍

* removed unused script * print info logs when generating openapi schema * create json schema only when needed * fix tests * Remove leftover Co-authored-by: ZanSara <[email protected]>

* Fix docstrings for DocumentStores * Fix docstrings for AnswerGenerator * Fix docstrings for Connector * Fix docstrings for DocumentClassifier * Fix docstrings for LabelGenerator * Fix docstrings for QueryClassifier * Fix docstrings for Ranker * Fix docstrings for Retriever and Summarizer * Fix docstrings for Translator * Fix docstrings for Pipelines * Fix docstrings for Primitives * Fix Python code block spacing * Add line break before code block * Fix code blocks * fix: discard metadata fields if not set in Weaviate (#3578) * fix weaviate bug in returning embeddings and setting empty meta fields * review comment * Update unstable version and openapi schema (#3584) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * fix: Flatten `DocumentClassifier` output in `SQLDocumentStore`; remove `_sql_session_rollback` hack in tests (#3273) * first draft * fix * fix * move test to test_sql * test: add test to check id_hash_keys is not ignored (#3577) * refactor: Generate JSON schema when missing (#3533) * removed unused script * print info logs when generating openapi schema * create json schema only when needed * fix tests * Remove leftover Co-authored-by: ZanSara <[email protected]> * move milvus tests to their own module (#3596) * feat: store metadata using JSON in SQLDocumentStore (#3547) * add warnings * make the field cachable * review comment * Pin faiss-cpu as 1.7.3 seems to have problems (#3603) * Update Haystack imports (#3599) * Update Python version (#3602) * fix: `ParsrConverter` fails on pages without text (#3605) * try to fix bug * remove print * leftover * refactor: update Squad data (#3513) * refractor the to_squad data class * fix the validation label * refractor the to_squad data class * fix the validation label * add the test for the to_label object function * fix the tests for to_label_objects * move all the test related to squad data to one file * remove unused imports * revert tiny_augmented.json Co-authored-by: ZanSara <[email protected]> * Url fixes (#3592) * add 2 example scripts * fixing faq script * fixing some urls * removing example scripts * black reformatting * add labeler to the repo (#3609) * convert eval metrics to python float (#3612) * feat: add support for `BM25Retriever` in `InMemoryDocumentStore` (#3561) * very first draft * implement query and query_batch * add more bm25 parameters * add rank_bm25 dependency * fix mypy * remove tokenizer callable parameter * remove unused import * only json serializable attributes * try to fix: pylint too-many-public-methods / R0904 * bm25 attribute always present * convert errors into warnings to make the tutorial 1 work * add docstrings; tests * try to make tests run * better docstrings; revert not running tests * some suggestions from review * rename elasticsearch retriever as bm25 in tests; try to test memory_bm25 * exclude tests with filters * change elasticsearch to bm25 retriever in test_summarizer * add tests * try to improve tests * better type hint * adapt test_table_text_retriever_embedding * handle non-textual docs * query only textual documents * Incorporate Reviewer feedback * refactor: replace `torch.no_grad` with `torch.inference_mode` (where possible) (#3601) * try to replace torch.no_grad * revert erroneous change * revert other module breaking * revert training/base * Fix docstrings for DocumentStores * Fix docstrings for AnswerGenerator * Fix docstrings for Connector * Fix docstrings for DocumentClassifier * Fix docstrings for LabelGenerator * Fix docstrings for QueryClassifier * Fix docstrings for Ranker * Fix docstrings for Retriever and Summarizer * Fix docstrings for Translator * Fix docstrings for Pipelines * Fix docstrings for Primitives * Fix Python code block spacing * Add line break before code block * Fix code blocks * Incorporate Reviewer feedback Co-authored-by: Massimiliano Pippi <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <[email protected]> Co-authored-by: Julian Risch <[email protected]> Co-authored-by: ZanSara <[email protected]> Co-authored-by: Espoir Murhabazi <[email protected]> Co-authored-by: Tuana Celik <[email protected]> Co-authored-by: tstadel <[email protected]>

masci requested a review from a team as a code owner November 4, 2022 15:49

masci requested review from julian-risch and removed request for a team November 4, 2022 15:49

masci added the topic:installation label Nov 4, 2022

ZanSara approved these changes Nov 4, 2022

View reviewed changes

masci force-pushed the massi/json-gen branch from c79e0b9 to ac3c621 Compare November 7, 2022 13:47

julian-risch approved these changes Nov 9, 2022

View reviewed changes

masci force-pushed the massi/json-gen branch 2 times, most recently from c363dc7 to bdf5f7e Compare November 15, 2022 11:18

masci and others added 16 commits November 16, 2022 08:37

removed unused script

08d2c55

mock pinecone during schema generation

b245e38

fix mocking

ff8747b

disable the whole socket module

779d060

mypy

fba7314

add entrypoint

64c56b0

try

48f22d1

try

e651964

revert

f2898d2

try

4ddf465

revert

7ea01f8

try

1a9be72

skip folder

8b60b27

try

53091fb

try

00d2b2b

try

0bd7ffe

masci force-pushed the massi/json-gen branch from 35c516e to 0bd7ffe Compare November 16, 2022 07:37

masci added 4 commits November 16, 2022 09:03

try

b97d494

try

08db994

revert

48a6703

skip gen script directory

8ceeb2a

masci added 5 commits November 16, 2022 10:20

remove generator script

8abd445

revert

6bea7d8

keep empty folder

3612c02

print info logs when generating openapi schema

9bc6113

create json schema only when needed

5aee191

masci changed the title ~~refactor: Prevent 3rd party libraries from accessing the network while generating the schema~~ refactor: Generate JSON schema when missing Nov 16, 2022

masci commented Nov 16, 2022

View reviewed changes

masci requested review from julian-risch and ZanSara November 16, 2022 10:34

masci added 10 commits November 16, 2022 12:27

pass the file path explicitly

bafe45c

try

95e0eee

3.7 compat

fd387b6

bug

e818db5

return on exception

f0a0454

re-raise

12ab5c2

arf

e212655

try

12fc261

not needed

13b72b4

fix tests

5cab5a6

julian-risch requested changes Nov 17, 2022

View reviewed changes

test/samples/squad/small.jsonl Outdated Show resolved Hide resolved

Remove leftover

4b3bd71

julian-risch approved these changes Nov 17, 2022

View reviewed changes

masci merged commit 6cd0e33 into main Nov 17, 2022

masci deleted the massi/json-gen branch November 17, 2022 10:09

This was referenced Nov 17, 2022

speed up YAML schema generation #3494

Closed

pip install -U farm-haystack does not trigger schema generation #3505

Closed

ZanSara removed their request for review November 21, 2022 09:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Generate JSON schema when missing #3533

refactor: Generate JSON schema when missing #3533

masci commented Nov 4, 2022 •

edited

Loading

ZanSara left a comment

julian-risch left a comment

masci Nov 16, 2022 •

edited

Loading

masci Nov 16, 2022

masci Nov 16, 2022

julian-risch left a comment

julian-risch left a comment

refactor: Generate JSON schema when missing #3533

refactor: Generate JSON schema when missing #3533

Conversation

masci commented Nov 4, 2022 • edited Loading

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

ZanSara left a comment

Choose a reason for hiding this comment

julian-risch left a comment

Choose a reason for hiding this comment

masci Nov 16, 2022 • edited Loading

Choose a reason for hiding this comment

masci Nov 16, 2022

Choose a reason for hiding this comment

masci Nov 16, 2022

Choose a reason for hiding this comment

julian-risch left a comment

Choose a reason for hiding this comment

julian-risch left a comment

Choose a reason for hiding this comment

masci commented Nov 4, 2022 •

edited

Loading

masci Nov 16, 2022 •

edited

Loading