
Add tasks for performance on long context lengths #1748

Open
nairbv opened this issue Apr 25, 2024 · 1 comment
Labels: feature request (A feature that isn't implemented yet.)

Comments

nairbv (Contributor) commented Apr 25, 2024

There are a couple of papers with benchmarks for very long context lengths that don't seem to be available in lm-evaluation-harness. It would be great to have one of these, or something similar, for measuring a model's ability to extract information from long context windows, which is important for RAG.

haileyschoelkopf (Contributor) commented:
Needle-in-a-haystack might also be a nice-to-have, though I think more difficult / "natural" long-context evals should be prioritized.
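
For anyone sketching out what such a task would measure: a bare-bones needle-in-a-haystack probe fits in a few lines. The snippet below is a minimal sketch, not the lm-evaluation-harness task API; the `generate` callable, the filler text, and the passcode needle are all assumptions for illustration.

```python
# Illustrative needle-in-a-haystack probe. Not the lm-evaluation-harness API:
# `generate` is an assumed callable mapping a prompt string to the model's
# completion, and the passcode needle is made up.

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passcode is 7142. "
QUESTION = "\n\nWhat is the secret passcode?"

def build_haystack(num_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) in filler."""
    sentences = [FILLER] * num_sentences
    sentences.insert(int(depth * num_sentences), NEEDLE)
    return "".join(sentences)

def needle_score(generate, num_sentences: int = 2000,
                 depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Fraction of insertion depths at which the model retrieves the passcode."""
    hits = sum("7142" in generate(build_haystack(num_sentences, d) + QUESTION)
               for d in depths)
    return hits / len(depths)
```

Sweeping `num_sentences` up to the model's context limit, and the depth grid more finely, gives the usual retrieval-by-depth heatmap; a "natural" long-context eval would replace the synthetic filler with real documents.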

haileyschoelkopf added the feature request label Apr 26, 2024