A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"

Python 248 32 Updated May 19, 2024

IBM / tempqa-wd

Temporal question answering dataset for Wikidata

12 4 Updated Dec 7, 2023

likenneth / dialogue_action_token

Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Python 4 Updated Jun 27, 2024

likenneth / q_probe

Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Jupyter Notebook 35 1 Updated Jun 10, 2024

nayeon7lee / FactualityPrompt

Python 74 1 Updated Nov 11, 2022

iiis-ai / IterativeQuestionComposing

Official implementation of DPFM @ ICLR 2024 paper "Augmenting Math Word Problems via Iterative Question Composing" (https://arxiv.org/abs/2401.09003)

Python 9 Updated Mar 4, 2024

hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)

Python 377 50 Updated Jun 19, 2024

Khan / khan-exercises

A (deprecated) framework for building exercises to work with Khan Academy.

HTML 1,610 864 Updated Oct 21, 2020

KwanWaiChung / MT-Eval

Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models"

Python 17 Updated Mar 1, 2024

mtbench101 / mt-bench-101

[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

27 Updated Jul 24, 2024

Joshuaclymer / GameBench

Python 8 1 Updated Jun 27, 2024

lm-sys / RouteLLM

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!

Python 2,313 157 Updated Jul 20, 2024

MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Python 5,840 724 Updated Jul 22, 2024

bytarnish / AGILE

Python 10 Updated Jun 4, 2024

RainJamesY / FuzzLLM

The open-source repository of FuzzLLM

Python 12 2 Updated May 4, 2024

kvcache-ai / Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

909 17 Updated Jul 10, 2024

FlagOpen / Infinity-Instruct

16 1 Updated Jun 14, 2024

lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

Jupyter Notebook 347 37 Updated Jul 23, 2024

princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world GitHub Issues?

Python 1,523 252 Updated Jul 22, 2024

NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

Python 452 48 Updated Jul 25, 2024

LiveBench / LiveBench

LiveBench: A Challenging, Contamination-Free LLM Benchmark

Python 148 12 Updated Jul 24, 2024

openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Python 11,267 762 Updated Jul 10, 2024

facebookresearch / CommAI-env

A platform for developing AI systems as described in A Roadmap towards Machine Intelligence - https://arxiv.org/abs/1511.08130

1,327 210 Updated Sep 16, 2020

brendenlake / SCAN

Simple language-driven navigation tasks for studying compositional learning

177 27 Updated Nov 5, 2020

gpt-engineer-org / gpt-engineer

Specify what you want it to build, the AI asks for clarification, and then builds it.

Python 51,473 6,695 Updated Jul 23, 2024

OpenDevin / OpenDevin

🐚 OpenDevin: Code Less, Make More

Python 28,866 3,340 Updated Jul 25, 2024

Xiong Jun Wu(熊君武) junwucs

Starred repositories

lmmlzn / Awesome-LLMs-Datasets

hkust-nlp / dart-math

google-gemini / gemma-cookbook

dvlab-research / Step-DPO

shmsw25 / FActScore