We have LLMs with context windows of hundreds of thousands of tokens, and prompt caching that makes using them affordable. Why don’t we just stuff the whole codebase into the context window?
This paper suggests that 200-800 is the ideal chunk size; go much above that and the model starts getting confused/distracted. https://arxiv.org/pdf/2406.14497
The truth is, we started there. But for any reasonably sized, complex codebase this just isn't going to work: the context window isn't sufficient, and it also becomes harder for the LLM to reason over arbitrary parts of a very long context.
For the time being, indexing and retrieving a good collection of 10-20 code chunks is more effective/performant in practice.
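Roughly what that looks like, as a toy sketch rather than our actual pipeline (the bag-of-tokens "embedding" is just a stand-in for a real embedding model, and `embed`/`retrieve` are made-up names):

```python
# Toy sketch: index code chunks and retrieve the top-k most relevant ones for a query.
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Placeholder "embedding": identifier/token counts instead of a learned vector.
    return Counter(re.findall(r"[A-Za-z_]\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 15) -> list[str]:
    # Score every pre-chunked piece of code against the query and keep the top k (10-20 in practice).
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Usage: top_chunks = retrieve("where is the auth token refreshed?", all_chunks, k=15)
```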
Not an expert, but OP is right, and this is a generally known issue with large windows and RAG. Small chunks are usually best, and how you chunk matters too. OP - what's the optimal way to parse/chunk code snippets?
We're actually using an improvement over this exact blogpost. We started from there, but weren't happy that some of the chunks were really small (and would undeservedly get surfaced to the top), so we added some extra logic to merge small sibling chunks.
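In case it helps, here's roughly the shape of that merging step. This is a hedged sketch, not our actual code: `Chunk`, `merge_small_siblings`, and the size thresholds are illustrative. The idea is that after splitting a file into sibling chunks (e.g. consecutive top-level AST nodes), you greedily fold a chunk into its previous sibling when either one is under a minimum size, capped so merged chunks don't grow too large.

```python
# Sketch of merging undersized sibling chunks so tiny fragments don't dominate retrieval.
from dataclasses import dataclass

MIN_CHARS = 200   # assumed floor for a useful chunk; tune to taste
MAX_CHARS = 1500  # assumed cap so merged chunks stay retrievable


@dataclass
class Chunk:
    start_line: int
    end_line: int
    text: str


def merge_small_siblings(chunks: list[Chunk],
                         min_chars: int = MIN_CHARS,
                         max_chars: int = MAX_CHARS) -> list[Chunk]:
    merged: list[Chunk] = []
    for chunk in chunks:
        small = merged and (len(merged[-1].text) < min_chars or len(chunk.text) < min_chars)
        fits = merged and len(merged[-1].text) + len(chunk.text) <= max_chars
        if small and fits:
            # Fold this sibling into the previous chunk, extending its line range.
            prev = merged[-1]
            merged[-1] = Chunk(prev.start_line, chunk.end_line, prev.text + "\n" + chunk.text)
        else:
            merged.append(chunk)
    return merged
```

The greedy left-to-right pass keeps siblings in source order, which is what you want for code: a tiny import block or one-line helper gets attached to its neighbor instead of becoming its own (overly specific) search hit.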