This package is based on the ideas presented in our paper, *tinyBenchmarks: evaluating LLMs with fewer examples*. Please cite us in the following way:
```bibtex
@article{abcde,
  title={tinyBenchmarks: evaluating LLMs with fewer examples},
  author={our names},
  journal={journal},
  pages={pages},
  year={year},
  publisher={publisher}
}
```
Please check our Hugging Face collection of tiny datasets, each one containing 100 examples. In that collection, you will find tiny versions of the following (a short loading sketch follows the list):
- From the Open LLM Leaderboard: TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, and MMLU;
- From [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval): AlpacaEval 2.0;
- From [HELM Lite](https://crfm.stanford.edu/helm/lite): to be added.
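As a minimal sketch, here is how one of the tiny datasets could be loaded with the `datasets` library. The dataset id `tinyBenchmarks/tinyMMLU` is an assumption for illustration; check the collection page for the exact ids:

```python
# Sketch: load one tiny dataset from the Hugging Face Hub.
# The repository id "tinyBenchmarks/tinyMMLU" is assumed here;
# see the collection for the exact dataset names.
from datasets import load_dataset

tiny_mmlu = load_dataset("tinyBenchmarks/tinyMMLU", split="test")
print(len(tiny_mmlu))  # each tiny dataset contains 100 examples
```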
You can install our package by running the following command in the terminal:

```bash
$ pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
```
In the code:

```python
import numpy as np
import tinyBenchmarks as tb

### Parameters
benchmark = 'lb'  # choose from possible benchmarks in
                  # ['lb', 'mmlu', 'alpaca', 'helm_lite', 'truthfulqa',
                  #  'gsm8k', 'winogrande', 'arc', 'hellaswag']

# Dummy data (unidimensional numpy array of 0/1 correctness scores).
# In this example, y has dimension 600 because we observe 100 examples
# from each of the six Open LLM Leaderboard scenarios.
y = np.random.binomial(1, .5, 600)

### Evaluation
tb.evaluate(y, benchmark)
```
The output contains, for each scenario, performance estimates from the three estimators ('irt', 'pirt', and 'gpirt'):

```python
{'harness_truthfulqa_mc_0': {'irt': 0.5483476132190942,
  'pirt': 0.5216756041366227,
  'gpirt': 0.5350116086778585},
 'gsm8k': {'irt': 0.5132676269901439,
  'pirt': 0.5328183759663551,
  'gpirt': 0.5230430014782494},
 'winogrande': {'irt': 0.4301499605367009,
  'pirt': 0.4792754277690377,
  'gpirt': 0.4547126941528693},
 'arc': {'irt': 0.5520477815699659,
  'pirt': 0.5066457168990404,
  'gpirt': 0.5293467492345032},
 'hellaswag': {'irt': 0.5338577972515436,
  'pirt': 0.5108037778592825,
  'gpirt': 0.5223307875554131},
 'mmlu': {'irt': 0.5377958382081949,
  'pirt': 0.5393624918280722,
  'gpirt': 0.5385791650181335}}
```
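If you want a single leaderboard-style number, one option (a sketch, assuming `tb.evaluate` returns the dictionary shown above) is to average one of the estimators across scenarios:

```python
# Sketch: collapse the per-scenario estimates into one score by
# averaging the 'gpirt' values. Assumes `results` is the dictionary
# returned by tb.evaluate, as displayed above.
results = tb.evaluate(y, benchmark)
lb_score = np.mean([scores['gpirt'] for scores in results.values()])
print(f"Estimated leaderboard average: {lb_score:.3f}")
```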