JSON Convertor : Pipeline Component #7784

srini047 · 2024-06-01T06:27:39Z

Is your feature request related to a problem? Please describe.
Currently we have a .txt-to-Document converter, among others, as well as the Unstructured converter. But most of the data we deal with is in JSON form.

Describe the solution you'd like
So a .json-to-Document converter would be a big win when consuming API data in pipelines.

Describe alternatives you've considered
The Unstructured file converter is present, but a dedicated JSON converter makes more sense and adds more value.

julian-risch · 2024-06-09T04:54:48Z

@srini047 Thank you for your suggestion. If you like, feel free to open a pull request. Our contributing guidelines are here.

arminnajafi · 2024-06-24T21:17:24Z

@julian-risch
I'd like to start contributing to Haystack. Do you think this would be a good starter issue? If so, please feel free to assign it to me.

Thanks,

CarlosFerLo · 2024-06-30T11:47:09Z

@arminnajafi I have been contributing for a month or so, and they do not normally assign external people to issues, at least as far as I have seen. If you want to do this one, feel free to open a PR; they will review it when you are ready.

kanenorman · 2024-07-12T05:23:27Z

I'm proposing a JSONToDocument converter for Haystack 2.0, inspired by LangChain's JSONLoader. This would allow powerful parsing of JSON files, similar to LangChain's implementation which is built on jq.

Example JSON schema (prize.json):

{
    "prizes": [
        {
            "year": "string",
            "category": "string",
            "laureates": [
                {
                    "id": "string",
                    "firstname": "string",
                    "surname": "string",
                    "motivation": "string",
                    "share": "string"
                }
            ]
        }
    ]
}

Proposed implementation:

from haystack.components.converters import JSONToDocument

converter = JSONToDocument(
    jq_schema=".prizes[].laureates[]?",
    content_key="motivation",
    additional_meta_fields=["firstname", "surname", "share"],
)
docs = converter.run(sources=["./prize.json"])
print(docs["documents"][0])

Expected output:

Document(id=db72dfbe9, content: '"for the discovery and synthesis of quantum dots"', meta: {'file_path': './prize.json', 'firstname': 'Moungi', 'surname': 'Bawendi', 'share': '3'})
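
To make the proposal above concrete, here is a minimal sketch of the same idea using only the Python standard library instead of jq. It is not the Haystack or LangChain API: `SimpleDocument`, `iter_path`, and `json_to_documents` are hypothetical names, the key list `["prizes", "laureates"]` only approximates the jq filter `.prizes[].laureates[]?`, and the sample data is illustrative, modeled on the expected output above.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Haystack Document (not the real class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def iter_path(node: Any, keys: List[str]) -> Iterator[Any]:
    """Yield every value reached by following `keys`, fanning out over lists.

    Missing keys are skipped silently, similar to the `?` in the jq filter.
    """
    if isinstance(node, list):
        for item in node:
            yield from iter_path(item, keys)
    elif not keys:
        yield node
    elif isinstance(node, dict) and keys[0] in node:
        yield from iter_path(node[keys[0]], keys[1:])


def json_to_documents(
    raw: str, keys: List[str], content_key: str, meta_fields: List[str]
) -> List[SimpleDocument]:
    """Build one document per matched record that contains `content_key`."""
    docs = []
    for record in iter_path(json.loads(raw), keys):
        if not isinstance(record, dict) or content_key not in record:
            continue
        # Copy only the requested metadata fields that are actually present.
        meta = {name: record[name] for name in meta_fields if name in record}
        docs.append(SimpleDocument(content=str(record[content_key]), meta=meta))
    return docs


# Illustrative sample data; the second prize has no "laureates" key.
SAMPLE = json.dumps({
    "prizes": [
        {
            "year": "2023",
            "category": "chemistry",
            "laureates": [
                {
                    "id": "1",
                    "firstname": "Moungi",
                    "surname": "Bawendi",
                    "motivation": "for the discovery and synthesis of quantum dots",
                    "share": "3",
                }
            ],
        },
        {"year": "1900", "category": "none"},
    ]
})

docs = json_to_documents(
    SAMPLE, ["prizes", "laureates"], "motivation", ["firstname", "surname", "share"]
)
print(docs[0])
```

A real component would likely delegate path selection to jq, as proposed; the point here is only that `content_key` and `additional_meta_fields` reduce to a per-record dictionary lookup once the records have been selected.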

tradicio · 2024-07-25T10:13:09Z

Based on @kanenorman's suggestion, I put together a basic implementation of a JSONToDocument component in #8079.

In this first implementation, I have not yet included the content_key and additional_meta_fields parameters. I have not figured out a simple way to implement the metadata logic for arbitrary JSON structures.

Let me know how this component can be improved to include this logic.

kanenorman · 2024-07-26T05:56:54Z

@tradicio - Thank you. I'm working on incorporating the jq logic. Are you planning on leaving your PR up as final or converting to draft?

tradicio · 2024-07-26T07:53:39Z

> @tradicio - Thank you. I'm working on incorporating the jq logic. Are you planning on leaving your PR up as final or converting to draft?

I'm not sure how much I'll be able to work on the PR in the next few weeks. If you think you can incorporate the jq logic, I'm more than happy to convert the PR to a draft.

s-a · 2024-07-31T20:44:29Z

This component would be a game changer. I have been trying hard to find a solution for old-but-gold data tables :) like CSV etc. @tradicio, could you point me in the right direction to find more information about this topic?
For this kind of thing, could I then convert csv2json, xls2json (+worksheets)?

tradicio · 2024-08-02T14:00:56Z

> This component would be a game changer. I have been trying hard to find a solution for old-but-gold data tables :) like CSV etc. @tradicio, could you point me in the right direction to find more information about this topic? For this kind of thing, could I then convert csv2json, xls2json (+worksheets)?

For more information on how jq works, I recommend you start with the official documentation. As for the structure of the component, I was inspired by the JSONLoader component in LangChain, as suggested by @kanenorman.

With respect to your second question, I think this JSONToDocument component could also be a first step toward working with tabular data, but I would still keep any future components separate (such as the one I suggested in #8036). CSV and XLSX files are often used to collect data with more specific structures than JSON files. It seems to me that they are used in different contexts and for different purposes, so they need different processor components.
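
For the csv2json route asked about above, a minimal sketch with only the Python standard library might look like the following. The function name `csv_to_records`, the wrapper key `"rows"`, and the sample CSV are assumptions made for illustration; nothing here is part of Haystack.

```python
import csv
import io
import json
from typing import Dict, List


def csv_to_records(csv_text: str) -> List[Dict[str, str]]:
    """Turn CSV text with a header row into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Illustrative input; all CSV values are read as strings.
CSV_TEXT = "firstname,surname,share\nMoungi,Bawendi,3\n"

records = csv_to_records(CSV_TEXT)
# Wrap the rows in an object so a JSON converter can select them by key.
json_text = json.dumps({"rows": records}, indent=2)
print(json_text)
```

Note that this flattening loses nothing for simple tables, but type information and multi-worksheet structure (as in XLSX) would need extra handling, which is one reason to keep dedicated tabular components separate.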

julian-risch added the Contributions wanted! Looking for external contributions label Jun 9, 2024

CuriousLearner2 mentioned this issue Jul 4, 2024

Add JSON converter and tests #7974

Closed

tradicio mentioned this issue Jul 25, 2024

feat: Added JSONToDocument component in converter components #8079

Closed

s-a mentioned this issue Jul 31, 2024

Build a CSVToDocument Component #8036

Closed

julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Sep 18, 2024

julian-risch assigned silvanocerza Sep 23, 2024

silvanocerza mentioned this issue Sep 24, 2024

feat: Add JSONConverter Component #8397

Merged

silvanocerza closed this as completed in #8397 Sep 25, 2024
