Skip to content

sravan-s/document_search

Repository files navigation

Please donot use this for commercial purposes This is a study project to search through new indian law document https://www.mha.gov.in/sites/default/files/250883_english_01042024.pdf

Before Make sure you have docker and rust in your system. docker compose up qdrant

Step #1 (This part can be automated) Feed the law document to database

note This PDF is unstrcutured and quite hard to parse so the following step is idiosyncratic Prepare data

Split it into sections You need to do some manual work, because data in PDF is unstrctured Or maybe you can use one of the paid LLMs to parse the PDF into structured data

Step #2 Keep structured data in text_files ./data/formatted

  • See format.example to see the example
  • Ideally seperate by chapter, but it doesnt matter Convert structured data to embedding using sentence transformer(see law_search repo) Save embedding into vector DB(see law_search repo)

Step #3 Use webserver to query from vecterDb(see web_server)

  • Webserver is rocket

We use Qdrant as vectorDB Postgres to save structured data Candle-rs + GritLM/GritLM-8x7B (if possible) for sentence transformer Used fastembed + BGELargeENV15

About

To search through some large law document

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published