MNLI task giving (very) different results than the HuggingFace task accuracy metric #320

Open
JunShern opened this issue May 8, 2022 · 0 comments
Labels
bug · good first issue · help wanted

JunShern commented May 8, 2022

Hello, thank you for working on the eval-harness!

I am trying out the harness in the simplest possible setting: training a GPT2 model on the MNLI dataset using the HuggingFace Trainer, and evaluating the task accuracy for MNLI.

I have subclassed HF's Trainer to include eval-harness evaluation in its evaluate method, so that I can get regular metrics during training.

from pathlib import Path

from transformers import Trainer
from lm_eval import evaluator


class CustomEvalTrainer(Trainer):
    ...

    def evaluate(
        self,
        eval_dataset=None,
        ignore_keys=None,
        metric_key_prefix="eval",
    ):
        # Run the default evaluation loop; this includes HF's own calculation of the MNLI accuracy metric
        results_dict = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)

        # Additionally run lm-evaluation-harness evaluations

        # The harness needs a saved HF transformers model as input, so save the current checkpoint
        # to disk and point the harness at it
        eval_model_path = Path(self.args.output_dir) / f"eval_harness_model-{self.state.global_step}"
        self.save_model(eval_model_path)

        eval_harness_args = {
            "model": "gpt2",
            "model_args": f"pretrained={str(eval_model_path)}",  # the checkpoint saved just above
            "tasks": ["mnli"],
            "num_fewshot": 0,
            "batch_size": 4,
            "device": "cuda:0",
            "no_cache": False,  # False means the harness may use its evaluation cache
            "limit": 100,  # set to 100 for quicker evaluation, but the problem persists even with limit=None
        }

        eval_output = evaluator.simple_evaluate(**eval_harness_args)

        # ... log the MNLI results from `eval_output` (see below)

        return results_dict
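
For completeness, this is roughly how I pull the harness number out of eval_output and log it. The "dev.mnli.acc" key is just my own naming, and the nested "results" layout is what simple_evaluate appears to return in lm-eval 0.2.0, so treat this as a sketch rather than gospel:

# Inside CustomEvalTrainer.evaluate, just before returning results_dict
mnli_acc = eval_output["results"]["mnli"]["acc"]
harness_metrics = {"dev.mnli.acc": mnli_acc}
self.log(harness_metrics)             # surfaces the metric alongside the Trainer's own eval logs
results_dict.update(harness_metrics)  # so callers see it in the returned metrics dict too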

And I use this CustomEvalTrainer inside this HuggingFace example script to run my training.

The default HF evaluate method also computes its own MNLI accuracy metric, so I end up with two MNLI accuracy numbers to compare.
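
For reference, the HF-side number comes from a standard compute_metrics hook, roughly like the sketch below. The names here are placeholders, and it assumes a classification-style head so that the eval predictions are class logits (which is how my script is set up):

import numpy as np
from datasets import load_metric  # transformers 4.18-era API; newer code uses the `evaluate` package

mnli_metric = load_metric("glue", "mnli")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Returns {"accuracy": ...}, which the Trainer logs with the "eval_" prefix as eval_accuracy
    return mnli_metric.compute(predictions=predictions, references=labels)

This function is passed to the Trainer via the compute_metrics argument.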

The results I obtained are below:

[Plot: eval_accuracy (HF) and dev.mnli.acc (eval-harness) over training steps]

  • eval_accuracy is the HF MNLI metric, which improves nicely by more than 4%.
  • dev.mnli.acc is the eval-harness MNLI metric, which takes a single jump of about +2% and then stays flat.

Ideally, both of these metrics should be (almost) the same, so I am wondering why my eval-harness metric behaves so strangely.

So my questions are:

  1. Can anyone spot anything wrong with the way I'm using the eval-harness? I have a fully reproducible version of this problem in this Colab.
  2. If not, is it possible that there is a bug in the MNLI task implementation? In particular, issue #248 ("Review all GLUE + SuperGLUE tasks") caught my eye; a sketch of the kind of prompt-level check I have in mind is below.
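
For whoever picks this up, the check I have in mind for question 2 is simply dumping a few MNLI prompts and targets exactly as the harness formats them. This is a sketch against lm-eval 0.2.0; I'm assuming get_task_dict and the doc_to_text/doc_to_target task API are the right entry points:

from lm_eval import tasks

# Build the same MNLI task object the harness uses internally
mnli_task = tasks.get_task_dict(["mnli"])["mnli"]

# Print the zero-shot prompt and gold target for a few validation examples
for i, doc in enumerate(mnli_task.validation_docs()):
    if i >= 3:
        break
    print("PROMPT:", repr(mnli_task.doc_to_text(doc)))
    print("TARGET:", repr(mnli_task.doc_to_target(doc)))
    print("---")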

Versions:

lm-eval==0.2.0
transformers==4.18.0

Thank you!

@StellaAthena added the bug and help wanted labels on Nov 21, 2022
@StellaAthena added the good first issue label on Apr 30, 2023