Ask HN: Local RAG with private knowledge base
51 points by zephodb 13 hours ago | 15 comments
Looking for a free, local, open source RAG solution for running a reference library with thousands of technical PDFs and Word docs. Tried Ollama + Open WebUI and Ollama + AnythingLLM with open source models such as Llama 3.2, etc. As expected, the more documents we feed it, the lower the accuracy. Doing it for a bunch of senior citizens who still love geeking out.





I'm very interested in checking something like this out for getting to grips with a local codebase of unfamiliar SQL and assorted scripts and reports.

I have a few tabs open that I haven't had a chance to try:

https://github.com/Mintplex-Labs/anything-llm

https://github.com/Bin-Huang/chatbox

https://github.com/saeedezzati/superpower-chatgpt


I've made wdoc just for that: https://github.com/thiswillbeyourgithub/WDoc

I am a medical student with thousands of PDFs, various Anki databases, video conferences, audio recordings, markdown notes, etc. It can query across all of them and return extremely high quality output, with sources for each original document.

It's still in alpha though, and there's only 0.5 user besides me that I know of, so there are bugs that have yet to be found!


You can use BerryDB for this use case at scale. BerryDB is a JSON-native database that can ingest PDFs, images, etc., and it has a built-in semantic layer (for labeling), so you can build your knowledge database with entities and relationships. This grounds your knowledge with entities, and accuracy scales well with a large number of documents.

It provides APIs to extract paragraphs or tables from your PDFs in bulk. You can also separately do bulk labeling (say, classification, NER, and other labeling types). Once you have a knowledge database, it creates four indexes on top of your JSON data layer: a DB index for metadata search, a full-text search index, an annotation index, and a vector index, so you can perform any search operation, including hybrid search.

Because your data layer is JSON, you have the flexibility to add new snippets of knowledge or new labels and improve accuracy over time.

https://berrydb.io
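For intuition on what a hybrid search combining those indexes does, here is a generic reciprocal rank fusion (RRF) sketch for merging a keyword ranking with a vector ranking. This is an illustration only, not BerryDB's actual API; the document IDs and the constant k=60 are arbitrary.

```python
# Generic reciprocal rank fusion: each ranking contributes 1/(k + rank)
# to a document's score; documents ranked high by either list win.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc1", "doc3"]           # from full-text search
vector_hits = ["doc1", "doc9", "doc3"]    # from vector search
print(rrf([keyword_hits, vector_hits]))   # ['doc1', 'doc3', 'doc9']
```

A document that appears near the top of both rankings (doc1 here) outranks one that appears in only a single ranking, which is the basic value of hybrid search over either index alone.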


I’ve got this RAG repo working entirely locally (Ollama/Postgres), but it doesn’t do RAG on documents the way you want.

https://github.com/Azure-Samples/rag-postgres-openai-python

I’d like to make that version when I have the time, probably just using LlamaIndex for the ingestion.

My tips for getting SLMs working well for RAG: http://blog.pamelafox.org/2024/08/making-ollama-compatible-r...


> expected the more documents we feed the lower the accuracy

Not surprising!

The LLM itself is the least important bit as long as it’s serviceable.

Depending on your goal you need to have a specific RAG strategy.

How are you breaking up the documents? Are the documents consistently formatted to make breaking them up uniform? Do you need to do some preprocessing to make them uniform?

When you retrieve documents how many do you stuff into your prompt as context?

Do you stuff the same top N chunks from a single prompt, or do you have a tailored prompt chain retrieving different resources based on the prompt and desired output?
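The chunking question above is often where accuracy is won or lost. A minimal sketch of fixed-size chunking with overlap, assuming arbitrary size/overlap values (real pipelines usually split on headings or sentences instead):

```python
# Fixed-size character chunking with overlap, so context that straddles
# a chunk boundary still appears whole in at least one chunk.
def chunk(text, size=500, overlap=100):
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

doc = "x" * 1200
pieces = chunk(doc)
print(len(pieces))  # 3 chunks: 0-500, 400-900, 800-1200
```

The overlap is the tunable part: too little and answers get cut in half at boundaries, too much and the retrieved context is mostly duplicates.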


> How are you breaking up the documents? Are the documents consistently formatted to make breaking them up uniform? Do you need to do some preprocessing to make them uniform?

> When you retrieve documents how many do you stuff into your prompt as context?

> Do you stuff the same top N chunks from a single prompt or do you have a tailored prompt chain retrieving different resources based on the prompt and desired output?

Wouldn't these questions be answered by the RAG solution the OP is asking for?



Yes! We can definitely help with this. Khoj lets you chat with your documents, indexing your private knowledge base for local RAG with any open source (or foundation) model.

You can make it as 'fancy' as you want, and use speech-to-text, image generation, web scraping, custom agents.

Let me know if you run into any issues; I'd love to get this set up for senior citizens! You can reach me at saba at khoj.dev.


I would look at articles on building an open source RAG pipeline. Generation (the model) is the last in a series of important steps; you have options to choose from (retrieval, storage, etc.) at each component step. Those decisions will affect the accuracy you mention.

LangChain and LlamaIndex had good resources on building such a pipeline, last I checked.
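To make the "series of steps" concrete, here is a toy retrieval stage using stdlib-only bag-of-words cosine similarity. It is a stand-in for what those frameworks wrap; real systems use learned embeddings, and the sample chunks are made up.

```python
# Toy retrieval: score chunks against a query with term-frequency
# cosine similarity, then return the top N as prompt context.
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_n=2):
    qv = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)
    return ranked[:top_n]

chunks = [
    "PostgreSQL supports pgvector for similarity search",
    "Ollama runs open source models locally",
    "PDF parsing is a common preprocessing step",
]
print(retrieve("run models locally with ollama", chunks, top_n=1))
```

Swapping the `vectorize` step for a real embedding model is exactly the kind of component-level decision the comment above is talking about, and it changes accuracy far more than swapping the generation model.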


Fascinating there doesn't seem to be a consensus "just use this" answer here.

My sentiments exactly, and given how widespread a need RAG is, I'm extremely surprised that we don't have something solid and clearly a leader in the space yet. We don't even seem to have two or three! It's "pick one of these million side-projects".

txtai was brought up in a discussion yesterday. I saved it to look at later. But you might find it useful. https://github.com/neuml/txtai

Here is that thread. https://news.ycombinator.com/item?id=41981907


I am working on a quick hack/prototype. Right now it only returns search results from a vector database.

https://github.com/breakpointninja/semantic_search_cli





