We have LLMs with context windows of hundreds of thousands of tokens, and prompt caching that makes using them affordable. Why don’t we just stuff the whole codebase into the context window?
This paper suggests that 200-800 is the ideal chunk size; go much above that and the model starts getting confused/distracted. https://arxiv.org/pdf/2406.14497
The truth is, we started there. But for any reasonably sized, complex codebase this just isn't going to work: the context window isn't sufficient, and it also becomes harder for the LLM to reason over arbitrary parts of a very long context.
For the time being, indexing and retrieving a good collection of 10-20 code chunks is more effective/performant in practice.
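Roughly what that looks like, as a toy sketch rather than our actual pipeline (the bag-of-tokens "embedding" is just a stand-in for a real embedding model, and `embed`/`retrieve` are made-up names):

```python
# Toy sketch: index code chunks and retrieve the top-k most relevant ones for a query.
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Placeholder "embedding": identifier/token counts instead of a learned vector.
    return Counter(re.findall(r"[A-Za-z_]\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 15) -> list[str]:
    # Score every pre-chunked piece of code against the query and keep the top k (10-20 in practice).
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Usage: top_chunks = retrieve("where is the auth token refreshed?", all_chunks, k=15)
```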
Not an expert, but OP is right, and this is a generally known issue with large windows and RAG. Small chunks are usually best, and how you chunk matters too. OP - what's the optimal way to parse/chunk code snippets?
We're actually using an improvement over this exact blogpost. We started from there, but weren't happy that some of the chunks were really small (and would undeservedly get surfaced to the top), so we added some extra logic to merge small sibling chunks.
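In case it helps, here's roughly the shape of that merging step. This is a hedged sketch, not our actual code: `Chunk`, `merge_small_siblings`, and the size thresholds are illustrative. The idea is that after splitting a file into sibling chunks (e.g. consecutive top-level AST nodes), you greedily fold a chunk into its previous sibling when either one is under a minimum size, capped so merged chunks don't grow too large.

```python
# Sketch of merging undersized sibling chunks so tiny fragments don't dominate retrieval.
from dataclasses import dataclass

MIN_CHARS = 200   # assumed floor for a useful chunk; tune to taste
MAX_CHARS = 1500  # assumed cap so merged chunks stay retrievable


@dataclass
class Chunk:
    start_line: int
    end_line: int
    text: str


def merge_small_siblings(chunks: list[Chunk],
                         min_chars: int = MIN_CHARS,
                         max_chars: int = MAX_CHARS) -> list[Chunk]:
    merged: list[Chunk] = []
    for chunk in chunks:
        small = merged and (len(merged[-1].text) < min_chars or len(chunk.text) < min_chars)
        fits = merged and len(merged[-1].text) + len(chunk.text) <= max_chars
        if small and fits:
            # Fold this sibling into the previous chunk, extending its line range.
            prev = merged[-1]
            merged[-1] = Chunk(prev.start_line, chunk.end_line, prev.text + "\n" + chunk.text)
        else:
            merged.append(chunk)
    return merged
```

The greedy left-to-right pass keeps siblings in source order, which is what you want for code: a tiny import block or one-line helper gets attached to its neighbor instead of becoming its own (overly specific) search hit.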