How to solve large json file? #927

Closed
napasa opened this issue Jan 21, 2018 · 12 comments
Labels
solution: proposed fix (a fix for the issue has been proposed and waits for confirmation)

Comments

@napasa

napasa commented Jan 21, 2018

My JSON file contains an array of approximately ten million elements.
ucrtbased.dll!0f5160d0() Unknown
[Frames below may be incorrect and/or missing, no symbols loaded for ucrtbased.dll]
[External Code]

WenKuInfoProcess.exe!std::_Allocate(unsigned int _Count, unsigned int _Sz, bool _Try_aligned_allocation) Line 87 C++
WenKuInfoProcess.exe!std::allocatorstd::_Container_proxy::allocate(unsigned int _Count) Line 828 C++
WenKuInfoProcess.exe!std::_Wrap_alloc<std::allocatorstd::_Container_proxy >::allocate(unsigned int _Count) Line 1078 C++
WenKuInfoProcess.exe!std::_String_alloc<std::_String_base_types<char,std::allocator > >::_Alloc_proxy() Line 1776 C++
WenKuInfoProcess.exe!std::_String_alloc<std::_String_base_types<char,std::allocator > >::_String_alloc<std::_String_base_types<char,std::allocator > ><std::_Wrap_alloc<std::allocator >,void>(std::_Wrap_alloc<std::allocator > && _Al) Line 1731 C++
WenKuInfoProcess.exe!std::basic_string<char,std::char_traits,std::allocator >::basic_string<char,std::char_traits,std::allocator >(std::basic_string<char,std::char_traits,std::allocator > && _Right) Line 2055 C++
WenKuInfoProcess.exe!std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> ><std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer>,void,0>(std::basic_string<char,std::char_traits,std::allocator > && _Val1, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> && _Val2) Line 188 C++
WenKuInfoProcess.exe!std::allocator<std::_Tree_node<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,void *> >::construct<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >(std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> > * _Ptr, std::basic_string<char,std::char_traits,std::allocator > && <_Args_0>, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> && <_Args_1>) Line 840 C++
WenKuInfoProcess.exe!std::allocator_traits<std::allocator<std::_Tree_node<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,void *> > >::construct<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >(std::allocator<std::_Tree_node<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,void *> > & _Al, std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> > * _Ptr, std::basic_string<char,std::char_traits,std::allocator > && <_Args_0>, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> && <_Args_1>) Line 960 C++
WenKuInfoProcess.exe!std::_Wrap_alloc<std::allocator<std::_Tree_node<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,void *> > >::construct<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >,std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >(std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> > * _Ptr, std::basic_string<char,std::char_traits,std::allocator > && <_Args_0>, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> && <_Args_1>) Line 1096 C++
WenKuInfoProcess.exe!std::_Tree_comp_alloc<std::_Tmap_traits<std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer>,std::less<std::basic_string<char,std::char_traits,std::allocator > >,std::allocator<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> > >,0> >::_Buynode<std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >(std::basic_string<char,std::char_traits,std::allocator > && <_Val_0>, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> && <_Val_1>) Line 902 C++
WenKuInfoProcess.exe!std::_Tree<std::_Tmap_traits<std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer>,std::less<std::basic_string<char,std::char_traits,std::allocator > >,std::allocator<std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> > >,0> >::emplace<std::basic_string<char,std::char_traits,std::allocator >,nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >(std::basic_string<char,std::char_traits,std::allocator > && <_Val_0>, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> && <_Val_1>) Line 1084 C++
WenKuInfoProcess.exe!nlohmann::detail::parser<nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >::parse_internal(bool keep, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> & result) Line 3083 C++
WenKuInfoProcess.exe!nlohmann::detail::parser<nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >::parse_internal(bool keep, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> & result) Line 3150 C++
WenKuInfoProcess.exe!nlohmann::detail::parser<nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >::parse_internal(bool keep, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> & result) Line 3076 C++
WenKuInfoProcess.exe!nlohmann::detail::parser<nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> >::parse(const bool strict, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> & result) Line 2941 C++
WenKuInfoProcess.exe!nlohmann::operator>>(std::basic_istream<char,std::char_traits > & i, nlohmann::basic_json<std::map,std::vector,std::basic_string<char,std::char_traits,std::allocator >,bool,__int64,unsigned __int64,double,std::allocator,nlohmann::adl_serializer> & j) Line 13150 C++
WenKuInfoProcess.exe!main(int argc, char * * argv) Line 22 C++
[External Code]

@nlohmann
Owner

Does this error message occur during parsing?

@nlohmann added the "state: needs more info" label on Jan 21, 2018
@napasa
Author

napasa commented Jan 21, 2018

Yes. The code is simple: it reads the file with std::fstream and passes the data to a json object.

@nlohmann
Owner

Right now, we only support DOM-like parsing to memory. There is an experimental SAX-like approach, but it has not been merged yet. With such an approach, you can parse and process the input without converting each element to a JSON value and storing it.

Would it be possible to share your input so we can test whether the new approach would help?

@nlohmann removed the "state: needs more info" label on Jan 21, 2018
@napasa
Author

napasa commented Jan 21, 2018

OK, I will upload the data to Google Drive later and paste the link here. In the end, I chose sed and grep to parse my data.

@nlohmann
Owner

Thanks!

@napasa
Author

napasa commented Jan 21, 2018

@nlohmann
Owner

Thanks. What did you want to do with the file after parsing? I am asking because it is possible to define a callback function to be called during parsing, see https://nlohmann.github.io/json/classnlohmann_1_1basic__json_ab4f78c5f9fd25172eeec84482e03f5b7.html#ab4f78c5f9fd25172eeec84482e03f5b7

With such a function, you can process the input during parsing without the need of actually storing all input data.

@nlohmann
Owner

Example:

#include <cstddef>
#include <fstream>
#include <iostream>
#include "json.hpp"

using json = nlohmann::json;

int main()
{
    std::size_t count = 0;

    // Parser callback: invoked for every parse event. Returning false
    // discards the value that was just parsed instead of storing it.
    auto x = [&count](int /*depth*/, json::parse_event_t event, json& /*parsed*/) {
        if (event == json::parse_event_t::object_end)
        {
            ++count;
            return false; // do not store the object value
        }
        return true; // keep everything else
    };

    std::ifstream f("wk_file_list.json");
    json::parse(f, x);

    std::cerr << "file has " << count << " elements\n";
}

Output: file has 1992511 elements.

On my machine, this takes less than 3 MB of memory.

@JosephP91

JosephP91 commented Feb 18, 2018

Hi @nlohmann,
I think I have the same problem. I have to process a very big JSON file and store its content in a MongoDB database. The file is about 85 MB on disk. Can I use that function to avoid memory problems?

@nlohmann
Owner

The code above is just an example of how to cope with a single array of objects. You may need to adjust it to your needs. Please also have a look at the documentation.
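
One way to adapt the idea to a file that is a single large array of records (for instance, to insert them into a database one at a time) is to split the array into its elements while reading and hand each element to a handler. The sketch below is a hand-rolled, standard-library-only illustration and not part of nlohmann/json: `for_each_array_element` and `example` are hypothetical names, and the code assumes well-formed input whose top-level value is an array.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of nlohmann/json): calls `on_element` with the
// raw text of each direct child of a top-level JSON array, one at a time.
// Only the current element is buffered. Assumes well-formed input; nothing is
// validated. Whitespace outside of strings is dropped.
std::size_t for_each_array_element(
    std::istream& in,
    const std::function<void(const std::string&)>& on_element)
{
    std::size_t count = 0;
    std::string element;    // text of the element currently being read
    int depth = 0;          // nesting depth of [] and {}
    bool in_string = false; // inside a "..." literal
    bool escaped = false;   // previous character was a backslash in a string

    // Hand the buffered element to the handler and reset the buffer.
    auto flush = [&] {
        if (!element.empty()) { on_element(element); ++count; element.clear(); }
    };

    char c;
    while (in.get(c))
    {
        if (in_string)
        {
            element += c;
            if (escaped)        escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"')  in_string = false;
            continue;
        }
        if (std::isspace(static_cast<unsigned char>(c))) continue;

        if (c == '[' || c == '{')
        {
            if (depth++ > 0) element += c; // skip the top-level '['
        }
        else if (c == ']' || c == '}')
        {
            if (--depth > 0) element += c;
            else flush();                  // top-level ']' ends the last element
        }
        else if (c == ',' && depth == 1)
        {
            flush();                       // separator between top-level elements
        }
        else
        {
            if (c == '"') in_string = true;
            element += c;
        }
    }
    return count;
}

// Usage sketch on a small in-memory input. A real handler would process each
// element and discard it instead of collecting everything in a vector.
void example()
{
    std::istringstream input(R"([{"id": 1}, {"id": 2, "tags": ["a", "b"]}, "x,]"])");
    std::vector<std::string> seen;
    const std::size_t n = for_each_array_element(
        input, [&seen](const std::string& text) { seen.push_back(text); });
    assert(n == 3);
    assert(seen[1] == R"({"id":2,"tags":["a","b"]})");
}
```

Each element's text could then be parsed on its own, for example with `json::parse`, so that only one record is held in memory at a time. The parser callback shown earlier in this thread stays closer to the library and is likely the better starting point when it fits the use case.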

@stale

stale bot commented Mar 20, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale added the "state: stale" label on Mar 20, 2018
@stale closed this as completed on Mar 27, 2018
@nlohmann
Owner

nlohmann commented Apr 3, 2018

A small update on this (as it was mentioned in #971):

  • With the SAX parser, the syntax check takes 3 MB of RAM and about 7 seconds to complete.
  • The new DOM parser takes 12 seconds and 2.7 GB of RAM to read the complete file. In comparison, jq takes 25 seconds and roughly the same amount of memory.

@nlohmann added the "solution: proposed fix" label and removed the "state: stale" label on Dec 22, 2018