-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Use dynamic batch size for simdjson to parse multiple json document #53056
base: main
Are you sure you want to change the base?
Conversation
return Status::OK(); | ||
} | ||
return Status::EndOfFile("all documents of the stream are iterated"); | ||
return _get_current_impl(row); | ||
} catch (simdjson::simdjson_error& e) { | ||
std::string err_msg; | ||
if (e.error() == simdjson::CAPACITY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
If _batch_size
is initially larger than len
, the assignment to _doc_stream
can result in iterate_many()
being called with an invalid length, causing undefined behavior or a crash.
You can modify the code like this:
- _doc_stream = _parser->iterate_many(data, len, len);
+ _doc_stream = _parser->iterate_many(data, len, std::min(_batch_size, len));
bf7cb20
to
b403633
Compare
be/src/exec/json_parser.cpp
Outdated
if (_doc_stream_itr != _doc_stream.end()) { | ||
simdjson::ondemand::document_reference doc = *_doc_stream_itr; | ||
// simdjson version 3.9.4 and JsonFunctions::to_json_string may crash when json is invalid. | ||
// https://github.com/StarRocks/StarRocksTest/issues/8327 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this link is not accessible from public, remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
be/src/exec/json_parser.h
Outdated
// _last_begin_offsets represent begin offet of last success object in _doc_stream | ||
size_t _last_begin_offsets = 0; | ||
// _batch_size using in batch mode parsing | ||
size_t _batch_size = (config::json_parse_many_batch_size > 0) ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall put this non-trivial initialization in constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
be/src/exec/json_parser.cpp
Outdated
throw; | ||
} | ||
|
||
_batch_size = std::min(_batch_size * 2, _len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this batch_size can be changed again and again, should forbid the parser from being used to parse multiple docs, the _batch_size determined by the previous document may not be suitable for next doc.
or you have the way to reset the batch_size before parsing the next doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
…son document Signed-off-by: srlch <[email protected]>
be49e7d
to
5b42aaf
Compare
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
Signed-off-by: srlch <[email protected]>
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 42 / 45 (93.33%) file detail
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently, all json content in the job is cached in a large buffer, which takes up a lot of memory.
can it be split into small buffers and let iterate_many perform streaming parsing?
Status JsonDocumentStreamParser::parse(char* data, size_t len, size_t allocated) noexcept { | ||
try { | ||
_data = data; | ||
_len = len; | ||
|
||
_doc_stream = _parser->iterate_many(data, len, len); | ||
_doc_stream = _parser->iterate_many(data, len, | ||
config::enable_dynamic_batch_size_for_json_parse_many ? _batch_size : _len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_len
may be smaller than _batch_size
.
Why I'm doing:
In current implementation, JsonDocumentStreamParser use
simdjson::ondemand::parser::iterate_many
toparse multiple JSON document. This API need caller pass the max size of JSON document in the file, says
max_json_lenght_in_file
in a given file to allocate the a memory chunk to finish the parsing process.But the problem is that, the caller pass the file size instead of
max_json_lenght_in_file
and allocatehuge memory chunk (which may not be used) almost 5~6 time of the file size. This is a huge memory amplification
What I'm doing:
Introduce
json_parse_many_batch_size
to control the batch_size passed intosimdjson::ondemand::parser::iterate_many
.If
json_parse_many_batch_size > simdjson::dom::MINIMAL_BATCH_SIZE
, usejson_parse_many_batch_size
as batch size,otherwise use
simdjson::dom::DEFAULT_BATCH_SIZE
.But simdjson does not guarantee the behavior if the parsing failed because of the small size of batch_size. Any kind of
error/expcetion and even iterator failure (error == EMPTY) can happen in this case. To use to dynamic batch_size, we should
handle all possible error and retry using larger batch_size until the batch_size hit the limit(file size - last success postition)
Fixes #issue
https://github.com/StarRocks/StarRocksTest/issues/8636
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: