MNLI task giving (very) different results than the HuggingFace task accuracy metric #320

Open
JunShern opened this issue May 8, 2022 · 0 comments
Labels
bug · good first issue · help wanted

JunShern commented May 8, 2022

Hello, thank you for working on the eval-harness!

I am trying out the harness in the simplest possible setting: training a GPT2 model on the MNLI dataset using the HuggingFace Trainer, and evaluating the task accuracy for MNLI.

I have subclassed HF's Trainer to include eval-harness evaluation in its evaluate method, so that I can get regular metrics during training.

from pathlib import Path

from transformers import Trainer
from lm_eval import evaluator


class CustomEvalTrainer(Trainer):
    ...

    def evaluate(
        self,
        eval_dataset=None,
        ignore_keys=None,
        metric_key_prefix="eval",
    ):
        # Run the default evaluation loop; this includes HF's own calculation of the MNLI accuracy metric
        results_dict = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)

        # Additionally run lm-evaluation-harness evaluations

        # The harness needs a saved HF transformers model as input, so save the current checkpoint
        # to disk and point the harness at it
        eval_model_path = Path(self.args.output_dir) / f"eval_harness_model-{self.state.global_step}"
        self.save_model(eval_model_path)

        eval_harness_args = {
            "model": "gpt2",
            "model_args": f"pretrained={str(eval_model_path)}",  # the checkpoint saved just above
            "tasks": ["mnli"],
            "num_fewshot": 0,
            "batch_size": 4,
            "device": "cuda:0",
            "no_cache": False,  # False means the harness may use its evaluation cache
            "limit": 100,  # set to 100 for quicker evaluation, but the problem persists even with limit=None
        }

        eval_output = evaluator.simple_evaluate(**eval_harness_args)

        # ... log the MNLI results from `eval_output` (see below)

        return results_dict
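
For completeness, this is roughly how I pull the harness number out of eval_output and log it. The "dev.mnli.acc" key is just my own naming, and the nested "results" layout is what simple_evaluate appears to return in lm-eval 0.2.0, so treat this as a sketch rather than gospel:

# Inside CustomEvalTrainer.evaluate, just before returning results_dict
mnli_acc = eval_output["results"]["mnli"]["acc"]
harness_metrics = {"dev.mnli.acc": mnli_acc}
self.log(harness_metrics)             # surfaces the metric alongside the Trainer's own eval logs
results_dict.update(harness_metrics)  # so callers see it in the returned metrics dict too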

And I use this CustomEvalTrainer inside this HuggingFace example script to run my training.

The default HF evaluate method also computes its own MNLI accuracy metric, so I end up with two MNLI accuracy numbers to compare.
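
For reference, the HF-side number comes from a standard compute_metrics hook, roughly like the sketch below. The names here are placeholders, and it assumes a classification-style head so that the eval predictions are class logits (which is how my script is set up):

import numpy as np
from datasets import load_metric  # transformers 4.18-era API; newer code uses the `evaluate` package

mnli_metric = load_metric("glue", "mnli")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Returns {"accuracy": ...}, which the Trainer logs with the "eval_" prefix as eval_accuracy
    return mnli_metric.compute(predictions=predictions, references=labels)

This function is passed to the Trainer via the compute_metrics argument.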

The results I obtained are below:

[Plot: eval_accuracy (HF) and dev.mnli.acc (eval-harness) over training steps]

  • eval_accuracy is the HF MNLI metric, which improves nicely by more than 4%.
  • dev.mnli.acc is the eval-harness MNLI metric, which takes a single jump of about +2% and then stays flat.

Ideally, both of these metrics should be (almost) the same, so I am wondering why my eval-harness metric behaves so strangely.

So my questions are:

  1. Can anyone spot anything wrong with the way I'm using the eval-harness? I have a fully reproducible version of this problem in this Colab.
  2. If not, is it possible that there is a bug in the MNLI task implementation? In particular, issue #248 ("Review all GLUE + SuperGLUE tasks") caught my eye; a sketch of the kind of prompt-level check I have in mind is below.
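
For whoever picks this up, the check I have in mind for question 2 is simply dumping a few MNLI prompts and targets exactly as the harness formats them. This is a sketch against lm-eval 0.2.0; I'm assuming get_task_dict and the doc_to_text/doc_to_target task API are the right entry points:

from lm_eval import tasks

# Build the same MNLI task object the harness uses internally
mnli_task = tasks.get_task_dict(["mnli"])["mnli"]

# Print the zero-shot prompt and gold target for a few validation examples
for i, doc in enumerate(mnli_task.validation_docs()):
    if i >= 3:
        break
    print("PROMPT:", repr(mnli_task.doc_to_text(doc)))
    print("TARGET:", repr(mnli_task.doc_to_target(doc)))
    print("---")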

Versions:

lm-eval==0.2.0
transformers==4.18.0

Thank you!

@StellaAthena added the bug and help wanted labels on Nov 21, 2022
@StellaAthena added the good first issue label on Apr 30, 2023