Hello, thank you for working on the eval-harness!

I am trying out the harness in the simplest possible setting: training a GPT-2 model on the MNLI dataset with the HuggingFace Trainer, and evaluating the task accuracy on MNLI.
I have subclassed HF's `Trainer` to include eval-harness evaluation in its `evaluate` method, so that I can get regular metrics during training. I use this `CustomEvalTrainer` inside this HuggingFace example script to run my training (see the usage sketch after the code block):
```python
from pathlib import Path

from transformers import Trainer
from lm_eval import evaluator


class CustomEvalTrainer(Trainer):
    ...

    def evaluate(
        self,
        eval_dataset=None,
        ignore_keys=None,
        metric_key_prefix="eval",
    ):
        # Run the default LM loss evaluations; this includes HF's own
        # calculation of the MNLI accuracy metric.
        results_dict = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)

        # Additionally perform lm-evaluation-harness evaluations.
        # The eval harness requires a saved HF transformers model as input,
        # so we save to disk and pass that path in.
        eval_model_path = Path(self.args.output_dir) / f"eval_harness_model-{self.state.global_step}"
        self.save_model(eval_model_path)

        eval_harness_args = {
            "model": "gpt2",
            "model_args": f"pretrained={str(eval_model_path)}",  # the latest trained checkpoint
            "tasks": ["mnli"],
            "num_fewshot": 0,
            "batch_size": 4,
            "device": "cuda:0",
            "no_cache": False,
            "limit": 100,  # set to 100 for quicker evaluation, but the problem persists even when limit=None
        }
        eval_output = evaluator.simple_evaluate(**eval_harness_args)
        # ... log the MNLI results from `eval_output`
```
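The logging step elided above merges the harness metrics into the trainer's metrics dict. A minimal sketch of what it does (the `dev.mnli.*` key naming is my own convention; the `results` layout is what lm-eval 0.2.0 returns, as far as I can tell):

```python
        # Pull the MNLI metrics (e.g. "acc", "acc_stderr") out of the harness
        # output and log them alongside the HF metrics.
        mnli_results = eval_output["results"]["mnli"]
        for metric_name, value in mnli_results.items():
            results_dict[f"dev.mnli.{metric_name}"] = value
        self.log({f"dev.mnli.{k}": v for k, v in mnli_results.items()})
        return results_dict
```

And here is roughly how the subclass is dropped into the HuggingFace example script — essentially just swapping the `Trainer` class (a sketch; the surrounding objects all come from the example script itself):

```python
trainer = CustomEvalTrainer(
    model=model,
    args=training_args,               # includes a periodic evaluation strategy
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # HF's own MNLI accuracy
    tokenizer=tokenizer,
)
trainer.train()
```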
The default HF `evaluate` method also does its own computation of the MNLI accuracy, so I end up with two MNLI accuracy metrics.
The results I obtained are below:

`eval_accuracy` is the HF MNLI metric, which improves nicely by >4%; `dev.mnli.acc` is the eval-harness MNLI metric, which takes a single jump of +2% and then stays flat. Ideally, both of these metrics should be (almost) the same, so I am wondering why my eval-harness metric behaves so strangely.
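For reference, my understanding of how the harness scores MNLI: it frames each example as a multiple-choice problem and compares the loglikelihood of each verbalized label under the LM. The sketch below only illustrates that mechanism; the prompt and label strings here are illustrative, not copied from the task code:

```python
import torch
import torch.nn.functional as F

def mnli_choice_accuracy(model, tokenizer, premise, hypothesis, gold_idx, device="cuda:0"):
    # Illustrative prompt; the harness's actual template may differ.
    prompt = f"{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"
    choices = [" True", " Neither", " False"]  # entailment / neutral / contradiction
    ctx = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    scores = []
    for choice in choices:
        cont = tokenizer(choice, return_tensors="pt").input_ids.to(device)
        ids = torch.cat([ctx, cont], dim=-1)
        with torch.no_grad():
            logits = model(ids).logits
        # Loglikelihood of the continuation tokens only: logits at position i
        # predict token i + 1, so slice starting from the last context position.
        cont_logits = logits[0, ctx.shape[-1] - 1 : -1]
        logprobs = F.log_softmax(cont_logits, dim=-1)
        scores.append(logprobs.gather(-1, cont[0].unsqueeze(-1)).sum().item())
    # Accuracy for this example: did the highest-scoring choice match the gold label?
    return int(max(range(len(choices)), key=scores.__getitem__) == gold_idx)
```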
So my questions are:
1. Can anyone spot any weird issues with the way I'm using the eval-harness? I have a fully reproducible version of this problem in this Colab.
2. If not, is it possible that there is a bug in the MNLI task implementation? In particular, issue #248 (Review all GLUE + SuperGLUE tasks) caught my eye.
Versions:
- lm-eval==0.2.0
- transformers==4.18.0
Thank you!