Probabilistic LLM evaluations. [CogSci2023; ACL2023]

benlipkin/probsem

ProbSem

Deprecation Notice

⚠️ This project is functional, but is no longer being actively maintained. I recommend using minicons for most LLM scoring needs. If you'd like to replicate any paper results using probsem, the paper branches are still supported.

Summary

This repository provides a framework to leverage large language models (LLMs) to assign context-conditional probability distributions over queried strings, with default support for all OpenAI engines and HuggingFace CausalLM models.

It is intended to be flexible across a wide range of research applications spanning linguistics, cognitive science, program synthesis, and NLP.

Here are a few examples:

  • Cloze Completion Task

    .. prompt, task instructions ..
    context:    The color of the Boston sky during January is
    query1:     blue  # P=0.4
    query2:     gray  # P=0.6
  • Multiple Choice QA

    .. prompt, task instructions ..
    context:    The girl pushed the boy.
    posttext:   Which of the following logically entails?
                A: The girl was pushed by the boy.
                B: The boy was pushed by the boy.
                C: The boy was pushed by the girl.
                D: The girl was pushed by the girl.
                The correct response is:
    query1:     A   # P=0.03
    query2:     B   # P=0.01
    query3:     C   # P=0.95
    query4:     D   # P=0.01
  • Semantic Parsing

    .. prompt, task instructions ..
    pretext:    ;; Player strengths were distributed ~N(50,20)
    context:    ;; X has nearly average strength.
    query1:     (λ (x) (= (abs (- (strength x) 50)) 0))   ;; P=0.1
    query2:     (λ (x) (< (abs (- (strength x) 50)) 10))  ;; P=0.9
  • Code completion

    .. prompt, task instructions ..
    context:    def reverse(lst:list):
    query1:       return lst[::-1]      # P=0.40
    query2:       return reversed(lst)  # P=0.30
    query3:       lst.reverse()         # P=0.20
    query4:       list.reverse(lst)     # P=0.10

In each of these examples, a user may define a flexible frame of reference using the concatenation of a prompt, context, and optional pretext and posttext, which wrap the context, to derive a probability distribution over possible completions defined as queries. The precise formulation of such evaluations can be explored further by viewing the examples in the inputs folder or checking out the BENCHMARKS.md walkthrough.
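As a rough illustration of the frame described above, the sketch below joins the pieces in the order prompt, pretext, context, posttext and turns per-query scores into a distribution. The names `build_frame`, `query_distribution`, and `toy_log_score` are hypothetical and not part of ProbSem, the newline separator is an assumption, and the scorer is a stand-in that returns hard-coded values from the cloze example rather than calling a model.

```python
import math
from typing import Callable, List, Sequence


def build_frame(prompt: str, context: str, pretext: str = "", posttext: str = "") -> str:
    """Join the non-empty pieces in the order prompt, pretext, context, posttext."""
    # Assumption: pieces are separated by newlines; the real separator may differ.
    pieces = [prompt, pretext, context, posttext]
    return "\n".join(piece for piece in pieces if piece)


def query_distribution(
    frame: str,
    queries: Sequence[str],
    log_score: Callable[[str, str], float],
) -> List[float]:
    """Softmax over per-query log scores, so the probabilities sum to 1."""
    scores = [log_score(frame, query) for query in queries]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [weight / total for weight in weights]


# Stand-in scorer: hard-coded log-probabilities taken from the cloze example.
# A real scorer would query an LLM for log P(query | frame).
TOY_LOGPROBS = {"blue": math.log(0.4), "gray": math.log(0.6)}


def toy_log_score(frame: str, query: str) -> float:
    return TOY_LOGPROBS[query]


frame = build_frame(
    prompt=".. prompt, task instructions ..",
    context="The color of the Boston sky during January is",
)
distribution = query_distribution(frame, ["blue", "gray"], toy_log_score)
```

Passing the scorer in as a function keeps the frame logic independent of any particular model backend.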

Version Note

The name of this repository, ProbSem, is a legacy reference to the use case for which it was originally developed: evaluations of probabilistic semantics and pragmatics. It was generalized into its current form after collaborators and colleagues expressed interest.

As such, the main branch is under development and evolving. To replicate specific papers, git checkout the corresponding paper branch and follow the instructions in the associated README.md.

Getting Started

Download the repo:

git clone --branch main --depth 1 git@github.com:benlipkin/probsem.git

Build environment:

Note: Multiple installation strategies are provided.

  • Anaconda, Make: automatically build and populate virtual environment (recommended).

    make env

    Test the installation via:

    make test
  • pip[strict]: install exact dependencies used during development into current environment.

    python -m pip install -r requirements.txt
  • pip[flexible]: install general dependencies with fewer version specifications at discretion of user.

    python -m pip install -e .

Setup API Key:

To use OpenAI models, an API key must be placed at ~/.openai_api_key
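One possible way to create the key file from Python is sketched below. The helpers `write_key_file` and `read_key_file` are hypothetical, and how ProbSem itself parses the file is not documented here, so storing the bare key on a single line is an assumption. The demo writes to a temporary directory; for real use the target would be `Path.home() / ".openai_api_key"`.

```python
import tempfile
from pathlib import Path


def write_key_file(key: str, path: Path) -> None:
    """Write the key on a single line and restrict access to the owner."""
    path.write_text(key.strip() + "\n")
    path.chmod(0o600)  # owner read/write only (limited effect on Windows)


def read_key_file(path: Path) -> str:
    """Return the stored key without surrounding whitespace."""
    return path.read_text().strip()


# Demo in a throwaway directory with a placeholder value, not a real key.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / ".openai_api_key"
    write_key_file("sk-example-not-a-real-key", demo_path)
    demo_key = read_key_file(demo_path)
```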

Run

The first step is to generate your benchmark. This includes, at minimum, a Prompt file and one TestSuite. See BENCHMARKS.md for more info on the structure of these files.

nano inputs/prompt.txt
nano inputs/prompt_testsuite.json

Once a prompt and test suite are defined, they can be evaluated at the command line. For a prompt named prompt and a test suite named testsuite, as created above, use the following syntax.

CLI

python -m probsem --prompt prompt --test testsuite

The prompt *.txt file and test suite *.json file must share the same prefix (prompt above) to be linked, and are assumed by default to exist in the inputs folder. This default, and others, can be overridden with the optional arguments listed below.
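The naming convention can be sketched as follows, inferred from the file names used above (inputs/prompt.txt and inputs/prompt_testsuite.json). The function `benchmark_paths` is hypothetical and only illustrates how the two arguments are assumed to map onto file names; it is not ProbSem's own lookup code.

```python
from pathlib import Path
from typing import Tuple


def benchmark_paths(prompt: str, test: str, input_dir: str = "inputs") -> Tuple[Path, Path]:
    """Return the assumed prompt and test suite paths for a benchmark."""
    root = Path(input_dir)
    prompt_file = root / f"{prompt}.txt"
    # The test suite file name carries the prompt name as its prefix.
    suite_file = root / f"{prompt}_{test}.json"
    return prompt_file, suite_file


prompt_file, suite_file = benchmark_paths("prompt", "testsuite")
```

Under this reading, a single prompt can be paired with several test suites, each stored as its own prompt-prefixed JSON file.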

Optional arguments (and other relevant internal details):

  • --input_dir [STR] {default: "inputs"} Update path to directory containing the benchmark files to be read in.
  • --output_dir [STR] {default: "outputs"} Update path to directory where output files should be saved. On each run, a CSV is saved with the resulting scores.
  • --model [STR] {default: "code-davinci-002"} Customize the model used for scoring. All OpenAI API engines and HuggingFace CausalLM models are currently supported. HF models run on GPU when one is available and fall back to CPU otherwise.
  • --norm [BOOL True] {default: False} This flag turns on normalization. By default, the returned scores are the sum of the context-conditional log-probabilities of the query tokens. When this flag is passed, these values are normalized by the number of tokens, which is specific to each tokenizer.
  • --temp [FLOAT >0] {default: 1.0} After the individual query-level scores are derived, a probability distribution over the batch of queries is calculated by passing the array of logit scores to a softmax function with temperature parameter $\alpha$. Specifying $\alpha<1.0$ decreases the entropy of the returned multinomial distribution and $\alpha>1.0$ increases it. Entropy can be thought of qualitatively as inverse to the peakiness of the distribution: it is maximized at the uniform distribution and equals $0$ when all probability mass is on a single value.
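The effect of --norm and --temp can be sketched in plain Python. This is an illustration under stated assumptions, not ProbSem's implementation: normalization is taken to mean dividing the summed log-probability by the token count, the temperature is applied by dividing scores by $\alpha$ before the softmax, and the token log-probabilities are made-up numbers.

```python
import math
from typing import List, Sequence


def sequence_score(token_logprobs: Sequence[float], norm: bool = False) -> float:
    """Sum of token log-probabilities, optionally divided by the token count."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if norm else total


def softmax(scores: Sequence[float], temp: float = 1.0) -> List[float]:
    """Softmax over scores divided by a positive temperature."""
    if temp <= 0:
        raise ValueError("temp must be > 0")
    scaled = [score / temp for score in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up per-token log-probabilities for a two-token and a three-token query.
token_logprobs = [[-1.2, -0.3], [-0.9, -0.8, -0.4]]

raw_scores = [sequence_score(t) for t in token_logprobs]
norm_scores = [sequence_score(t, norm=True) for t in token_logprobs]

raw_probs = softmax(raw_scores)    # the summed score favors the shorter query
norm_probs = softmax(norm_scores)  # the per-token score favors the longer query

sharp = softmax(raw_scores, temp=0.5)  # lower entropy than temp=1.0
flat = softmax(raw_scores, temp=2.0)   # higher entropy than temp=1.0
```

In this toy case the ranking of the two queries flips once scores are normalized, which is why the choice of --norm matters when queries differ in length.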

API

An API is also supported for integration with existing applications. To run the same default example from above, the following code will suffice. All optional parameters are available as well.

from probsem.probsem import ProbSem

probsem = ProbSem(
    prompt="prompt",
    test="testsuite",
)
results = probsem.run()

Issues/Contributing

If you find any particular aspects of this repository unclear, or if you encounter any errors, please open an issue. Comments on documentation, examples, and clarity are also appreciated. If you find an issue, and have ideas on how to address it, feel free to open a pull request. Community contributions are greatly appreciated.

Citation

@software{LipkinProbSem2023,
  author = {Lipkin, Benjamin},
  title = {ProbSem},
  year = {2023},
  url = {https://github.com/benlipkin/probsem},
  doi = {10.5281/zenodo.7603078}
}

License

License: MIT
