Show HN: Open-source real time data framework for LLM applications (getindexify.ai)
92 points by diptanu 3 months ago | 6 comments
Hey HN, I am the founder of Tensorlake. Prototyping LLM applications has become a lot easier, but building decision-making LLM applications that work on constantly updating data is still very challenging in production settings. The systems engineering problems we have seen people face are:

1. Reliably processing ingested content in real time when the application is sensitive to the freshness of information.

2. Being able to bring in any kind of model and run different parts of the pipeline on GPUs and CPUs.

3. Fault tolerance against ingestion spikes and compute infrastructure failures.

4. Scaling compute, reads, and writes as data volume grows.

We built and open-sourced Indexify (https://github.com/tensorlakeai/indexify) to provide a compute engine and data framework for LLM applications that operate in dynamic environments, where data is updated frequently or new data is constantly created.

Developers describe a declarative extraction graph with stages that extract or transform unstructured data. Data passes from one stage to another and finally ends up at sinks like vector databases, blob stores, or structured datastores like Postgres.

Examples:

1. A video understanding graph could be: Ingestion -> Audio Extraction -> Transcription -> NER and Embedding, with a second path: Ingestion -> Key Frame Extraction -> Object and Scene Description (https://github.com/tensorlakeai/indexify/blob/main/docs/docs...)

2. Structured extraction and search on PDFs: PDF -> Markdown -> Chunking -> Embedding, NER (https://github.com/tensorlakeai/indexify/blob/main/docs/docs...)
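For a concrete sense of the developer experience, here is a rough sketch of defining the PDF graph with the Python library. The extractor names, YAML fields, and client methods below are illustrative assumptions based on the description above, not a verbatim copy of the documented API; see the repo docs for the exact syntax.

    from indexify import IndexifyClient, ExtractionGraph

    client = IndexifyClient()

    # Illustrative graph spec: PDF -> Markdown -> chunks -> embeddings.
    graph_spec = """
    name: 'pdf-search'
    extraction_policies:
      - extractor: 'tensorlake/pdf-extractor'     # PDF -> Markdown
        name: 'md'
      - extractor: 'tensorlake/chunk-extractor'   # Markdown -> chunks
        name: 'chunks'
        content_source: 'md'
      - extractor: 'tensorlake/minilm-l6'         # chunks -> embeddings
        name: 'embeddings'
        content_source: 'chunks'
    """
    client.create_extraction_graph(ExtractionGraph.from_yaml(graph_spec))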

Application Layer - Indexify works as a retriever in the LLM application stack, so you can use it easily with your existing applications. Call the retriever API over HTTP to get extracted data out of Indexify; that's pretty much all the integration you need to search or retrieve data.
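A sketch of that integration from Python (the route and payload shape here are assumptions for illustration; the real endpoints are in the API docs):

    import requests

    # Hypothetical search call against a local Indexify server.
    resp = requests.post(
        "http://localhost:8900/namespaces/default/indexes/embeddings/search",
        json={"query": "what did the caller ask about pricing?", "k": 5},
    )
    resp.raise_for_status()
    for result in resp.json()["results"]:
        print(result["text"], result.get("metadata"))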

You can compose extractors and chain them together to build complex real-time data pipelines that work with any unstructured data.

Since this is HN, I have the liberty to talk about some technical details :)

How is it real time? We built a replicated state machine with Raft that can process tens of thousands of ingestion events per second. The storage and network layers are optimized so the scheduler can create tasks in under 2 milliseconds. The scheduler's architecture is very similar to that of Google's Borg and HashiCorp's Nomad, and it can be extended to parallel scheduling on multiple machines with a centralized sequencer, as Nomad does.
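The core idea, sketched below in Python as the general replicated-state-machine pattern (this is not Indexify's actual code, which is Rust, and the real system commits entries through Raft before applying them): every replica applies the same committed log entries in the same order, so all nodes derive identical task state deterministically.

    from dataclasses import dataclass, field

    @dataclass
    class SchedulerState:
        applied_index: int = 0
        tasks: list = field(default_factory=list)

        def apply(self, entry: dict) -> None:
            # Deterministic transition: the same committed entry produces
            # the same task on every replica.
            if entry["type"] == "content_ingested":
                self.tasks.append({
                    "content_id": entry["content_id"],
                    "stage": entry["first_stage"],
                })
            self.applied_index += 1

    state = SchedulerState()
    state.apply({"type": "content_ingested",
                 "content_id": "c1", "first_stage": "audio_extraction"})
    print(state.tasks)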

Storage Systems: Since the focus is unstructured data, we wanted to support storing and extracting from large files and to scale horizontally as data volume grows. Indexify uses blob stores under the hood to store unstructured data. If a graph creates embeddings, they are automatically stored in vector stores, and structured data is stored in structured stores like Postgres. Under the hood we have Rust traits between the ingestion server and the data stores, so we can easily implement support for other vector stores.
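That trait boundary, sketched here in Python for brevity (the method names are illustrative; the real interface is a Rust trait in the repo):

    from abc import ABC, abstractmethod

    class VectorStore(ABC):
        """Illustrative store interface between ingestion and storage."""

        @abstractmethod
        def upsert(self, index: str, embeddings: list, metadata: list) -> None:
            ...

        @abstractmethod
        def search(self, index: str, query_embedding: list, k: int) -> list:
            ...

    # Supporting a new vector database means implementing this interface;
    # the ingestion server stays unchanged.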

Sync Vector and Structured Store - Indexify also syncs structured data with the vector store if it detects the presence of both in a graph. This enables pre-filtering to narrow down the search space for better results.

APIs - Indexify exposes semantic search APIs over the vector store, and read-only SQL queries over semi-structured data. We can automatically figure out the schema of the structured data and expose a SQL interface on top. Behind the scenes we parse the SQL and have a layer that scans and reads the databases to slice and dice the rows, so BI tools should work out of the box on extracted data. We have Python and TypeScript libraries to make it easy to build new applications or integrate into existing ones.
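For example (reusing the client from the earlier sketch; the method name and the table/column names are assumptions for illustration, since the schema is inferred from whatever your graph extracts):

    # Hypothetical SQL query over extracted structured data.
    rows = client.sql_query(
        "SELECT object_name, frame_ts FROM video_frames "
        "WHERE object_name = 'car';"
    )
    for row in rows:
        print(row)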

Thoughts? Would love to hear if you think this would be useful to what you are building!




I'm curious about how Indexify handles fault tolerance during ingestion spikes and compute infrastructure failures. Could you please provide some examples?


Sorry, just seeing this! There are a few aspects to how it handles ingestion spikes:

1. The ingestion API writes to blob stores, which are horizontally scalable. Only when an ingest finishes do we write the metadata to the replicated state machine.

2. The replicated state machine sustains around 100k IOPS on most commodity machines, and it can scale vertically.

3. The extractors can autoscale based on the number of tasks in the system.

4. The ingestion server cluster can also autoscale based on the IOPS it is handling.

Hope this answers the question :)


Looks good at first sight!

Is this essentially a RAG solution? Or is this focused more on ease of use and being able to quickly use all kinds of different data types?


Thanks! It's pretty general purpose. It has a retrieval API for RAG use cases. It can also be used for building agents, which might only care about data from certain sources and get invoked whenever that data changes. Some of our users use it just for data extraction from PDFs.

The patterns for embedding and structured extraction across different data types don't change much between use cases if the underlying API and storage subsystem are flexible.


Why Rust? What is the main benefit ?


Rust is really good for building solid services; Indexify has never crashed on us in unexpected ways. The other option was to build this in C++, but the state of package management in C++ is really bad, so I chose Rust.




