[C4GT] Document Uploader ( Text chunking into paragraphs + content retrieval ) #78

ChakshuGautam · 2023-05-15T06:02:45Z

Project Details

AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. Like Huggingface, it abstracts away the details of how a model works and exposes a clean API that you can orchestrate at the BFF (backend-for-frontend) level.

Features to be implemented

The idea is to implement a document uploader API that is async and returns the embeddings for chunks of that document. It should save the data for a short period until the user asks for the download. This data can then be uploaded by the user wherever they have a search engine. The current problem statement doesn't cover this.

How it works

Extract the text from the PDF file. Split the extracted text into chunks based on cosine distance. For each chunk, create vector embeddings using an Instructor model.

Create APIs to upload the following document types

PDF
Audio (transcription)
Video (transcription)

Behavior of Upload API

It takes a pdf file and uploads it to our database.
API returns a document id in response. For future calls, this document id should be used. Each document id maps to an index containing embeddings.
If you are indexing multiple documents, then pass document ids accordingly.
Taken from here

File Status API

This API is used to check the status of file upload.
It returns status and document id.
Possible values for status are yet_to_start, in_progress, completed, and failed
If the embeddings for a document are successfully created and indexed, then completed is returned.
Taken from here
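
The upload and file status behaviour described above could be sketched as follows. This is only a minimal in-memory illustration, not the project's implementation: `DocumentStore`, `upload`, `process`, `file_status`, and the injected `embed_fn` are hypothetical names, and a real service would use a database and run the processing step asynchronously. Only the four status values come from the description above.

```python
import uuid
from enum import Enum


class Status(str, Enum):
    # The four status values listed in the File Status API description.
    YET_TO_START = "yet_to_start"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


class DocumentStore:
    """Hypothetical in-memory stand-in for the database behind both APIs."""

    def __init__(self):
        self._docs = {}

    def upload(self, filename, content):
        """Register a file and return its document id (upload API)."""
        doc_id = str(uuid.uuid4())
        self._docs[doc_id] = {
            "filename": filename,
            "content": content,
            "status": Status.YET_TO_START,
            "embeddings": None,
        }
        return doc_id

    def process(self, doc_id, embed_fn):
        """Run the injected embedding step and record whether it succeeded."""
        doc = self._docs[doc_id]
        doc["status"] = Status.IN_PROGRESS
        try:
            doc["embeddings"] = embed_fn(doc["content"])
            doc["status"] = Status.COMPLETED
        except Exception:
            doc["status"] = Status.FAILED

    def file_status(self, doc_id):
        """Return the status and document id (file status API)."""
        return {"document_id": doc_id, "status": self._docs[doc_id]["status"].value}
```

A caller would keep the id returned by `upload` and poll `file_status` with it until the status is `completed` or `failed`.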

Chunking

To be done based on cosine distance between docs
Threshold should be configurable by the API params
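
One way to read the two points above is sketched below, assuming the text has already been split into sentences and that each sentence has an embedding. A new chunk starts whenever the cosine distance between consecutive sentence embeddings exceeds the threshold. The function names and the default threshold of 0.5 are assumptions made for illustration; the issue does not fix either.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Treat a zero vector as maximally dissimilar (an assumption).
        return 1.0
    return 1.0 - dot / norm


def chunk_by_distance(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; split where the distance exceeds threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("sentences and embeddings must have the same length")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])  # sharp change: start a new chunk
        else:
            chunks[-1].append(sentence)  # similar enough: extend current chunk
    return [" ".join(chunk) for chunk in chunks]
```

Exposing `threshold` as a function argument is what would let an API parameter control it: a lower value produces more, smaller chunks.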

Sample pdfs:

https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw

OpenAI Embedding Alternatives

Evaluate and compare different models
https://huggingface.co/hkunlp/instructor-xl

Learning Path

Complexity

Medium

Skills Required

Python, Knowledge of HuggingFace Transformers, NLP.

Name of Mentors:

@GautamR-Samagra

Project size

8 Weeks

Product Set Up

See the setup here

Acceptance Criteria

Unit Test Cases
e2e Test Cases
OpenAPI Spec/Postman Collection
Dockerfile for this module

Milestone

Every document type supported is a milestone.

Reference

C4GT

This issue is nominated for Code for GovTech (C4GT) 2023 edition.
C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/

The scope of this ticket has now expanded to make it the 'content processing' part of 'FAQ bot'.
The FAQ bot allows a user to provide content input in the form of CSVs, free text, PDFs, audio, or video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text or speech on related content, and the bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.

This ticket covers the content processing part of the bot. It includes the following tasks in its scope:


ajitg25 · 2023-05-18T13:53:04Z

As I understand it, the problem statement is to take the transcriptions and store their embeddings in the database. I would like to contribute to this issue. Please assign it to me!

chandra-pro · 2023-05-19T09:53:28Z

I have good knowledge of NLP and I also understand your problem, so I would like to contribute to this issue. Could you please assign it to me?

Dhruv88 · 2023-05-19T11:38:18Z

I have worked on a similar problem statement earlier. We were given paragraphs on several topics, a question on a specific topic was asked, and we had to retrieve the answer to that query from the given paragraphs. Our solution was to convert the paragraphs into embeddings using a Hugging Face transformer model. The embeddings were indexed using the FAISS indexing library. Then, for a question, we computed its embedding and retrieved the closest paragraph embeddings from the index using cosine similarity. Here is the link to the code notebook for reference: click here

We used retrieval followed by question answering to solve the problem.
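
The retrieval step described in this comment can be illustrated with a small sketch. Note the substitution: instead of a FAISS index it uses a brute-force scan over all paragraph embeddings, which gives the same ranking on small inputs but does not scale the way an index does. The function names and the idea of passing precomputed embeddings are assumptions made for illustration, not the notebook's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, paragraph_embeddings, paragraphs, k=1):
    """Return the k (score, paragraph) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), paragraph)
        for embedding, paragraph in zip(paragraph_embeddings, paragraphs)
    ]
    # Highest similarity first; a linear scan stands in for the FAISS index.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The retrieved paragraphs would then be handed to a question-answering model, as in the retrieval-then-QA setup mentioned above.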

Thus, I think I can work on converting the above code into a proper API as required by the project.

ajitg25 · 2023-05-19T12:45:00Z

I have built a FastAPI service that uploads the PDF file and extracts the text, as described in the "git with basic implementation". I implemented the requirements of "Behavior of Upload API". Please review it.

GautamR-Samagra · 2023-05-29T11:05:37Z

The approach initially suggested was to create a window over the embeddings and check for any sharp changes in them.

However, we are no longer sure that a change in the similarity score is a good enough signal: a single paragraph may contain information about a variety of things, and this approach would then split it into different chunks.

I think what will be required is:
A benchmark model using GPT that takes all the text in a page and creates chunks out of it.
Explore a topic-modelling-style approach to the problem that does some kind of hierarchical clustering of a page into identified topics.

Some sample PDFs are provided here. A simple test can be done on a page and we can see if the text extracted is getting chunked into the same paragraphs as in the pdf.

ajitg25 · 2023-05-31T02:52:39Z

Okay sir. Currently I am dividing the page into chunks and then computing the embeddings. So what I need to do now is first divide the content of the pages by topic and then compute the embeddings. Have I understood correctly, sir?

I will explore the PDFs you have attached.

GautamR-Samagra · 2023-06-01T03:11:11Z

> Okay sir. Currently I am dividing the page into chunks and then computing the embeddings. So what I need to do now is first divide the content of the pages by topic and then compute the embeddings. Have I understood correctly, sir?
>
> I will explore the PDFs you have attached.

That is correct.

Potential flow for solving this could be :

Pick a pdf - a good example seems to be https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw from the pdfs provided in the folders.
Decide on an evaluation metric. For the above PDF, the text has neat paragraphs that can be considered chunks. Create a test set with the PDF parsed according to the headings to get the paragraphs.
Use various methods to extract text from the PDF and chunk it into paragraphs such that each covers a similar topic.
Measure the accuracy of the chunking for various chunking methods by comparing your chunks vs the pdf paragraphs.
Store the chunks in a CSV/DB.
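
The comparison step in this flow needs a concrete metric, which the thread leaves open. One possible choice, shown here purely as an assumption, is an F1 score over chunk boundaries: both the predicted chunks and the reference paragraphs are expressed as lists of sentences over the same sentence sequence, and a boundary counts as correct when it falls at the same sentence index in both. The function names are hypothetical.

```python
def boundaries(chunks):
    """Sentence indices at which a new chunk starts (the start at 0 is excluded)."""
    positions, index = set(), 0
    for chunk in chunks[:-1]:
        index += len(chunk)
        positions.add(index)
    return positions


def boundary_f1(predicted, reference):
    """F1 of predicted chunk boundaries against the reference paragraph boundaries."""
    pred, ref = boundaries(predicted), boundaries(reference)
    if not pred and not ref:
        return 1.0  # both treat the text as a single chunk
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Running `boundary_f1` once per chunking method against the same reference paragraphs would give a single number per method to compare.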

Next steps:

Generate tags for each chunk that can be searched for various user questions/prompts.
Embed the tags using any vector embeddings
Integrate the setup within a vector DB

ajitg25 · 2023-06-03T17:06:00Z

Ok sir

Codecreatermunesh · 2023-06-07T16:22:57Z

I have been working on this project since 20 May. I looked at many projects, but I have finally understood everything related to this problem statement. I am submitting only this proposal. I have good knowledge of NLP and have been learning about HuggingFace Transformers for the last week. I am interested in this project. I have been doing machine learning for the last year, so I have good knowledge of the Python language.

Sanchariii · 2023-06-08T16:17:36Z

I have worked on a more or less similar project before, using a Hugging Face transformer model. I would like to contribute to this project.

notinrange · 2023-06-10T05:49:35Z

Hello @GautamR-Samagra Sir, I would like to contribute to the development of the document uploader API within the AI Toolchain, to help streamline document processing, embedding generation, and indexing for enhanced machine learning workflows.

GautamR-Samagra changed the title ~~[C4GT] Document Uploader~~ [C4GT] Document Uploader ( Text chunking into paragraphs + content retrieval ) Jun 25, 2023

GautamR-Samagra assigned H4R5H1T-007 Jun 25, 2023

GautamR-Samagra mentioned this issue Jun 25, 2023

Generic HF Interface Use Case #114

Closed

GautamR-Samagra added the C4GT label Jun 25, 2023

GautamR-Samagra closed this as completed Feb 1, 2024
