
OLMES Standard Compliance #948

Open
6 of 18 tasks
elronbandel opened this issue Jun 25, 2024 · 0 comments
OLMES includes the following elements, justified in detail above:

  • Use the test set when available, otherwise the validation set. Sample 1000 instances if there are more than 1500

  • Use specified, exact prompt format (Section 3.1)

  • Use fixed, curated 5-shot examples (Section 3.2)

    • balanced
    • verified
  • Evaluate with both MCF and CF, use the best result (Section 3.4)

    • add CF template (choices specified)
    • add MCF template (choices and numerals are not specified)
    • choose the best between CF and MCF at system level
  • Follow recommendations for all other evaluation details:

    • For MMLU: use macro average (over 57 tasks) rather than micro average (over 14042 instances), following AI@Meta (2024). This better represents the diversity of fields in the dataset, although in practice it does not generally make a big difference (see Figure 7).
    • When a model requires it, make sure to add the appropriate token at the start of the prompt (e.g., Gemma (Gemma Team et al., 2024)).
    • When using the “character” normalization for CF, include the leading space in the calculation of answer length.
    • Restrict all inputs (with completions) to 2048 tokens for consistency across models
    • Use the default model precision when evaluating (i.e., avoid options like load_in_8bit unless it produces identical results).
    • OLMES uses the standard approach of two newlines to separate each in-context example.
      • (set unitxt format for olmes?)
    • Other than the original instruction line for MMLU (Hendrycks et al., 2021), we do not add any extra instructions. This follows previous work finding that subject information in instructions makes little difference to model rankings (Alzahrani et al., 2024), and it reduces additional sources of variation in the prompt.
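The instance-sampling rule in the first bullet can be sketched as follows — a minimal illustration, assuming a fixed seed and `random.sample` for the subsampling (the function name and seed are hypothetical, not from OLMES):

```python
import random

def select_instances(instances, limit=1500, sample_size=1000, seed=1234):
    """If the eval split has more than `limit` instances, sample `sample_size`
    of them with a fixed seed; otherwise use the split whole."""
    instances = list(instances)
    if len(instances) <= limit:
        return instances
    rng = random.Random(seed)
    return rng.sample(instances, sample_size)

print(len(select_instances(range(14042))))  # 1000 (subsampled)
print(len(select_instances(range(1200))))   # 1200 (used whole)
```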
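"Choose the best between CF and MCF at the system level" (Section 3.4) could look like the sketch below — per-task accuracies under both formats for one model, taking the max. The task names and scores here are made up for illustration:

```python
def best_of_cf_mcf(cf_scores, mcf_scores):
    """cf_scores / mcf_scores: dict mapping task name -> accuracy under the
    cloze (CF) and multiple-choice (MCF) formats. For each task, report the
    better of the two results."""
    return {task: max(cf_scores[task], mcf_scores[task]) for task in cf_scores}

cf = {"arc_challenge": 0.48, "mmlu": 0.31}
mcf = {"arc_challenge": 0.55, "mmlu": 0.29}
print(best_of_cf_mcf(cf, mcf))  # {'arc_challenge': 0.55, 'mmlu': 0.31}
```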
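The macro- vs. micro-average distinction for MMLU can be made concrete with a small sketch (task names real, result lists invented): macro averaging weights each of the 57 tasks equally, while micro averaging weights each of the 14042 instances equally, so small subtasks count for more under macro.

```python
def macro_accuracy(per_task_results):
    """per_task_results: dict mapping task name -> list of 0/1 correctness
    values. Average the per-task accuracies, weighting each task equally."""
    per_task_acc = [sum(v) / len(v) for v in per_task_results.values()]
    return sum(per_task_acc) / len(per_task_acc)

def micro_accuracy(per_task_results):
    """Pool all instances, weighting each instance equally."""
    pooled = [r for v in per_task_results.values() for r in v]
    return sum(pooled) / len(pooled)

results = {
    "abstract_algebra": [1, 0, 1, 1],  # 0.75 on 4 instances
    "anatomy": [0, 0],                 # 0.00 on 2 instances
}
print(macro_accuracy(results))  # 0.375 — tasks weighted equally
print(micro_accuracy(results))  # 0.5   — instances weighted equally
```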
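The "character" normalization detail is easy to get wrong, so here is a hedged sketch: divide the summed log-probability of the completion by its length in characters, where the completion includes the leading space before the answer text (the function name is hypothetical):

```python
def char_normalized_logprob(answer_text, total_logprob):
    """Per-character normalization for CF scoring: the leading space that
    precedes the answer text counts toward the answer length."""
    completion = " " + answer_text
    return total_logprob / len(completion)

# A 5-character answer plus its leading space -> divide by 6, not 5.
print(char_normalized_logprob("Paris", -3.0))  # -0.5
```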
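The two-newline separator between in-context examples amounts to a simple join; a minimal sketch (the example texts are placeholders, not the curated OLMES shots):

```python
def build_prompt(shot_texts, test_text):
    """Separate each in-context example, and the test instance, with two
    newlines, as in the standard few-shot format."""
    return "\n\n".join(shot_texts + [test_text])

shots = ["Question: Q1\nAnswer: A1", "Question: Q2\nAnswer: A2"]
prompt = build_prompt(shots, "Question: Q3\nAnswer:")
print(prompt.count("\n\n"))  # 2 separators joining 3 blocks
```

If unitxt is used to implement this, it would correspond to configuring a format whose demo separator is `"\n\n"`, per the open sub-task above.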