JSON Convertor : Pipeline Component #7784

Closed
srini047 opened this issue Jun 1, 2024 · 9 comments · Fixed by #8397
Labels
Contributions wanted! Looking for external contributions P2 Medium priority, add to the next sprint if no P1 available

Comments

@srini047
Contributor

srini047 commented Jun 1, 2024

Is your feature request related to a problem? Please describe.
Currently we have a .txt-to-Document converter, among others, as well as the Unstructured converter. But most of the data we deal with is in JSON form.

Describe the solution you'd like
So a .json-to-Document converter would be a big win when consuming API data in pipelines.

Describe alternatives you've considered
The Unstructured file converter exists, but a dedicated JSON converter would make more sense and add more value.

@julian-risch julian-risch added the Contributions wanted! Looking for external contributions label Jun 9, 2024
@julian-risch
Member

@srini047 Thank you for your suggestion. If you like, feel free to open a pull request. Our contributing guidelines are here.

@arminnajafi

@julian-risch
I would like to start contributing to Haystack. Do you think this would be a good starter issue? If so, please feel free to assign it to me.

Thanks,

@CarlosFerLo
Contributor

@arminnajafi I have been contributing for a month or so, and they do not normally assign external people to issues, at least as far as I have seen. If you want to do this one, feel free to open a PR. They will review it when you are ready.

@kanenorman

I'm proposing a JSONToDocument converter for Haystack 2.0, inspired by LangChain's JSONLoader. This would allow powerful parsing of JSON files, similar to LangChain's implementation which is built on jq.

Example JSON schema (prize.json):

{
    "prizes": [
        {
            "year": "string",
            "category": "string",
            "laureates": [
                {
                    "id": "string",
                    "firstname": "string",
                    "surname": "string",
                    "motivation": "string",
                    "share": "string"
                }
            ]
        }
    ]
}

Proposed implementation:

from haystack.components.converters import JSONToDocument

converter = JSONToDocument(
    jq_schema=".prizes[].laureates[]?",
    content_key="motivation",
    additional_meta_fields=["firstname", "surname", "share"],
)
docs = converter.run(sources=["./prize.json"])
print(docs["documents"][0])

Expected output:

Document(id=db72dfbe9, content: '"for the discovery and synthesis of quantum dots"', meta: {'file_path': './prize.json', 'firstname': 'Moungi', 'surname': 'Bawendi', 'share': '3'})
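
To make the proposal a bit more concrete, here is a rough sketch of what the three parameters could do to an already-parsed payload. It is not Haystack code and it does not use jq: the function `extract_laureate_docs` is hypothetical and hard-codes the traversal that `.prizes[].laureates[]?` describes (every prize, then every laureate, silently skipping prizes with nothing to iterate). Plain dicts stand in for `Document` objects, and the second prize in the sample is an illustrative placeholder, not real data.

```python
from typing import Any

# Sample payload following the prize.json schema above. The laureate values
# come from the expected output; the second entry is a made-up placeholder
# with no "laureates" key, the case the trailing "?" is meant to tolerate.
PAYLOAD = {
    "prizes": [
        {
            "year": "2023",
            "category": "chemistry",
            "laureates": [
                {
                    "firstname": "Moungi",
                    "surname": "Bawendi",
                    "motivation": '"for the discovery and synthesis of quantum dots"',
                    "share": "3",
                }
            ],
        },
        {"year": "0000", "category": "placeholder"},
    ]
}


def extract_laureate_docs(
    data: dict[str, Any],
    content_key: str,
    additional_meta_fields: list[str],
    file_path: str,
) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the proposed converter logic (no jq involved)."""
    docs = []
    for prize in data.get("prizes", []):
        # Mirrors `.laureates[]?`: a prize without laureates yields nothing.
        for laureate in prize.get("laureates") or []:
            if content_key not in laureate:
                continue  # nothing to use as document content
            meta = {"file_path": file_path}
            for field in additional_meta_fields:
                if field in laureate:
                    meta[field] = laureate[field]
            docs.append({"content": laureate[content_key], "meta": meta})
    return docs


docs = extract_laureate_docs(
    PAYLOAD, "motivation", ["firstname", "surname", "share"], "./prize.json"
)
print(docs[0])
```

A real implementation would presumably delegate the traversal to jq so that `jq_schema` can be any expression; the part worth agreeing on first is how each extracted object is split into content and metadata.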

@tradicio

Based on @kanenorman's suggestion, I put together a basic implementation of a JSONToDocument component in #8079.

In this first implementation, I have not yet included the content_key and additional_meta_fields parameters. The reason is that I have not figured out a simple way to implement the metadata logic for arbitrary JSON structures.

Let me know how this component can be improved to include this logic.
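
One possible direction for the metadata question, offered only as a sketch: apply `content_key` and the metadata fields when the extracted value is an object that actually has them, and otherwise fall back to serializing the whole value as content. The helper name `split_content_and_meta` and its parameters are hypothetical and are not part of #8079 or of Haystack.

```python
import json
from typing import Any, Optional


def split_content_and_meta(
    item: Any,
    content_key: Optional[str] = None,
    meta_fields: Optional[list[str]] = None,
) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper: turn one extracted JSON value into (content, meta)."""
    if isinstance(item, dict) and content_key is not None and content_key in item:
        value = item[content_key]
        # Strings are used as-is; anything else is serialized so content is always text.
        content = value if isinstance(value, str) else json.dumps(value)
        # Only copy the requested fields that are present on this object.
        meta = {key: item[key] for key in (meta_fields or []) if key in item}
        return content, meta
    # Fallback for lists, scalars, or objects without the content key:
    # use the whole value as content and attach no extra metadata.
    if isinstance(item, str):
        return item, {}
    return json.dumps(item), {}


print(split_content_and_meta({"motivation": "m", "share": "3"}, "motivation", ["share", "missing"]))
print(split_content_and_meta([1, 2]))
```

The fallback keeps the component usable on any JSON shape, at the cost of producing documents with no extra metadata when the keys do not match.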

@kanenorman

@tradicio - Thank you. I'm working on incorporating the jq logic. Are you planning on leaving your PR up as final or converting to draft?

@tradicio

> @tradicio - Thank you. I'm working on incorporating the jq logic. Are you planning on leaving your PR up as final or converting to draft?

I'm not sure how much I'll be able to work on the PR in the next few weeks. If you think you can incorporate the jq logic, I'm more than happy to convert the PR to a draft.

@s-a

s-a commented Jul 31, 2024

This component would be a game changer. I have been trying hard to find a solution for old-but-gold data tables :) like CSV etc. @tradicio, could you point me in the right direction to find more information about this topic?
For this kind of thing, could I then convert csv2json, xls2json (+worksheets)?

@tradicio

tradicio commented Aug 2, 2024

> This component would be a game changer. I have been trying hard to find a solution for old-but-gold data tables :) like CSV etc. @tradicio, could you point me in the right direction to find more information about this topic? For this kind of thing, could I then convert csv2json, xls2json (+worksheets)?

For more information on how jq works, I recommend starting with the official documentation. As for the structure of the component, I was inspired by the JSONLoader component in LangChain, as suggested by @kanenorman.

With respect to your second question, I think this JSONToDocument component could also be a first step toward working with tabular data, but I would still keep any future components separate (such as the one I suggested in #8036). CSV and XLSX files usually hold data with a more specific structure than JSON files. They seem to be used in different contexts and for different purposes, so they need different processing components.
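
As for the csv2json route in the question, the conversion itself is mechanically simple. Here is a minimal sketch using only the Python standard library; the helper name `csv_to_json_records` and the sample row are illustrative and not part of any Haystack component.

```python
import csv
import io
import json


def csv_to_json_records(csv_text: str) -> str:
    """Hypothetical helper: one JSON object per CSV row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)


# Illustrative input: a header row followed by one data row.
sample = "firstname,surname,share\nMoungi,Bawendi,3\n"
records = json.loads(csv_to_json_records(sample))
print(records)
```

Note that every value comes out as a string, since CSV carries no type information. That loss of structure is one reason a dedicated tabular component may still be the better fit.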

@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Sep 18, 2024