
[Enhancement] Use dynamic batch size for simdjson to parse multiple json document #53056

Open
wants to merge 8 commits into base: main

Conversation

Contributor

@srlch srlch commented Nov 20, 2024

Why I'm doing:

In the current implementation, JsonDocumentStreamParser uses simdjson::ondemand::parser::iterate_many to
parse multiple JSON documents. This API requires the caller to pass the maximum size of a single JSON document
in the file (call it max_json_length_in_file) so that a memory chunk can be allocated for the parsing process.
The problem is that the caller passes the whole file size instead of max_json_length_in_file and allocates a
huge memory chunk (which may be largely unused) of almost 5-6 times the file size. This is a huge memory amplification.

What I'm doing:

Introduce json_parse_many_batch_size to control the batch_size passed into simdjson::ondemand::parser::iterate_many.
If json_parse_many_batch_size > simdjson::dom::MINIMAL_BATCH_SIZE, use json_parse_many_batch_size as the batch size;
otherwise use simdjson::dom::DEFAULT_BATCH_SIZE.

However, simdjson does not guarantee the behavior when parsing fails because batch_size is too small. Any kind of
error/exception, and even an iterator failure (error == EMPTY), can happen in this case. To use a dynamic batch_size, we must
handle every possible error and retry with a larger batch_size until batch_size hits the limit (file size - last success position).
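
A minimal sketch of this retry loop, assuming simdjson exceptions are enabled (as in the existing catch of
simdjson::simdjson_error). The function name is illustrative, not the actual JsonDocumentStreamParser code, and
for brevity it restarts from the beginning of the buffer on failure rather than from the last success position:

#include <simdjson.h>
#include <algorithm>
#include <cstddef>

bool parse_with_dynamic_batch_size(simdjson::ondemand::parser& parser,
                                   const char* data, size_t len,
                                   size_t configured_batch_size) {
    // Pick the initial batch size the same way the PR does: fall back to the
    // default when the configured value is not above simdjson's minimum.
    size_t batch_size = (configured_batch_size > simdjson::dom::MINIMAL_BATCH_SIZE)
                                ? configured_batch_size
                                : simdjson::dom::DEFAULT_BATCH_SIZE;
    while (true) {
        try {
            simdjson::ondemand::document_stream stream =
                    parser.iterate_many(data, len, std::min(batch_size, len));
            for (simdjson::ondemand::document_reference doc : stream) {
                (void)doc; // real code would materialize rows here; a batch smaller
                           // than a document surfaces as a simdjson_error
            }
            return true; // every document fit into the current batch_size
        } catch (simdjson::simdjson_error&) {
            if (batch_size >= len) {
                return false; // batch already covers the whole buffer; give up
            }
            batch_size = std::min(batch_size * 2, len); // grow and retry
        }
    }
}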

Fixes #issue
https://github.com/StarRocks/StarRocksTest/issues/8636

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@srlch srlch requested a review from a team as a code owner November 20, 2024 09:19
@mergify mergify bot assigned srlch Nov 20, 2024
return Status::OK();
}
return Status::EndOfFile("all documents of the stream are iterated");
return _get_current_impl(row);
} catch (simdjson::simdjson_error& e) {
std::string err_msg;
if (e.error() == simdjson::CAPACITY) {
The most risky bug in this code is:
If _batch_size is initially larger than len, the assignment to _doc_stream can result in iterate_many() being called with an invalid length, causing undefined behavior or a crash.

You can modify the code like this:

-        _doc_stream = _parser->iterate_many(data, len, len);
+        _doc_stream = _parser->iterate_many(data, len, std::min(_batch_size, len));

@srlch srlch changed the title [WIP] [Enhancement] Use dynamic batch size for simdjson to parse multiple json document [Enhancement] Use dynamic batch size for simdjson to parse multiple json document Nov 21, 2024
@srlch srlch requested a review from a team as a code owner November 21, 2024 02:54
if (_doc_stream_itr != _doc_stream.end()) {
simdjson::ondemand::document_reference doc = *_doc_stream_itr;
// simdjson version 3.9.4 and JsonFunctions::to_json_string may crash when json is invalid.
// https://github.com/StarRocks/StarRocksTest/issues/8327
Contributor

This link is not accessible publicly; remove it.

Contributor Author

updated

// _last_begin_offsets represents the begin offset of the last successfully parsed object in _doc_stream
size_t _last_begin_offsets = 0;
// _batch_size used in batch-mode parsing
size_t _batch_size = (config::json_parse_many_batch_size > 0) ?
Contributor

Shall put this non-trivial initialization in the constructor.

Contributor Author

updated

throw;
}

_batch_size = std::min(_batch_size * 2, _len);
Contributor

This batch_size can be changed again and again. We should forbid the parser from being used to parse multiple docs, since the _batch_size determined by the previous document may not be suitable for the next doc.

Or you need a way to reset the batch_size before parsing the next doc, as in the sketch below.
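
One possible way to do the reset (a hedged sketch; the helper name is made up, and config::json_parse_many_batch_size is the option introduced by this PR):

#include <simdjson.h>
#include <cstddef>

// Hypothetical helper, not the actual patch: recompute the starting batch size from the
// configured value before each parse() call, instead of reusing a _batch_size that was
// grown while parsing the previous buffer.
static size_t initial_batch_size(size_t configured) {
    return (configured > simdjson::dom::MINIMAL_BATCH_SIZE)
                   ? configured
                   : simdjson::dom::DEFAULT_BATCH_SIZE;
}

// At the top of JsonDocumentStreamParser::parse():
//     _batch_size = initial_batch_size(config::json_parse_many_batch_size);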

Contributor Author

updated

Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
kevincai previously approved these changes Nov 25, 2024
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 42 / 45 (93.33%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/exec/json_parser.cpp 42 45 93.33% [69, 72, 106]

Contributor

@wyb wyb left a comment

Currently, all JSON content in the job is cached in a large buffer, which takes up a lot of memory.
Can it be split into small buffers so that iterate_many performs streaming parsing?

Status JsonDocumentStreamParser::parse(char* data, size_t len, size_t allocated) noexcept {
try {
_data = data;
_len = len;

_doc_stream = _parser->iterate_many(data, len, len);
_doc_stream = _parser->iterate_many(data, len,
config::enable_dynamic_batch_size_for_json_parse_many ? _batch_size : _len);
Contributor

_len may be smaller than _batch_size.

3 participants