
When Loss Decreases, Accuracy Increases...Right?

...at least that seems to be the conventional wisdom. Loss and accuracy can sometimes go out of (anti-)sync, as in the ResNet example below (noted by Guo et al.), but here we also illustrate an extreme case where validation loss and validation accuracy become almost perfectly correlated, as in the LeNet example below. In particular, this means that if one were to diagnose "overfitting" from the validation loss, one could be badly misled when accuracy is the actual goal.

When I went back to the basics, trained some LeNets, and saw these graphs, I thought there was a bug in my code. But nope, they are real. This repo hopes to serve as a lighthouse for others who are also confused by this.

Wait, Hol' Up

Guo et al. framed this as a miscalibration problem, where the neural network becomes unduly overconfident on inputs it classifies incorrectly. But I find this phenomenon stranger from a theoretical perspective, where the usual justification for training with cross entropy goes like this:

when we train using cross entropy loss, we should get good population loss by early stopping before the validation loss goes up. Because cross entropy is a good proxy for 0-1 loss, we should also expect good population accuracy from this procedure.

But here, training with cross entropy loss actually achieves good validation accuracy in spite of apparently overfitting the validation loss. So cross entropy as a training loss has more going for it than just being a good proxy for the 0-1 loss, possibly some implicit regularization effect?
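
To make the decoupling concrete, here is a toy numerical sketch (the logits are invented for illustration, not taken from this repo's experiments): the model classifies more validation examples correctly, yet its few remaining mistakes become so confident that the average cross entropy still rises.

```python
import torch
import torch.nn.functional as F

# Toy two-class "validation set" of 4 examples; all true labels are class 0.
labels = torch.zeros(4, dtype=torch.long)

logits_early = torch.tensor([[1.0, 0.0],   # correct, mildly confident
                             [1.0, 0.0],   # correct, mildly confident
                             [0.0, 0.5],   # wrong, mildly confident
                             [0.0, 0.5]])  # wrong, mildly confident

logits_late = torch.tensor([[3.0, 0.0],    # correct, confident
                            [3.0, 0.0],    # correct, confident
                            [3.0, 0.0],    # correct, confident
                            [0.0, 8.0]])   # wrong, *very* confident

for name, logits in [("early", logits_early), ("late", logits_late)]:
    ce = F.cross_entropy(logits, labels).item()
    acc = (logits.argmax(dim=1) == labels).float().mean().item()
    print(f"{name}: cross entropy = {ce:.3f}, accuracy = {acc:.2f}")

# early: cross entropy = 0.644, accuracy = 0.50
# late:  cross entropy = 2.037, accuracy = 0.75
```

The averaged log-loss is dominated by a handful of confidently wrong predictions, while accuracy only counts how many predictions are right.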

Weird Phenomenon But OK?

I don't think this observation is useful for settings where we have better validation metrics we can track, such as accuracy in classification tasks or BLEU score in machine translation. But this does provoke some thoughts in domains like language modeling where loss/perplexity is the primary metric --- when we stop training early based on validation perplexity, are we sure we are not stopping too early, if our goal is to learn language?
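
Concretely, which checkpoint counts as "best" depends on which metric you trust. A minimal sketch, with per-epoch numbers invented to mimic the LeNet curves above rather than measured:

```python
# Validation loss bottoms out early, while validation accuracy keeps improving.
val_loss = [1.10, 0.95, 0.90, 0.98, 1.15, 1.40, 1.70]
val_acc  = [0.55, 0.62, 0.66, 0.68, 0.70, 0.71, 0.72]

best_by_loss = min(range(len(val_loss)), key=val_loss.__getitem__)  # epoch 2
best_by_acc  = max(range(len(val_acc)),  key=val_acc.__getitem__)   # epoch 6

print(best_by_loss, best_by_acc)
# Early stopping on loss/perplexity would quit at epoch 2; if accuracy (or
# "learning language") is the real goal, that may be several epochs too early.
```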

Twitter Discussion

There was a discussion on Twitter regarding this phenomenon. I ran some more experiments in response to some of the comments there. These results suggest the situation is more nuanced than one might think.

  • Underparametrization. A few comments suggested that underparametrized networks most likely won't exhibit this misleading overfitting of the validation loss. I trained a smaller LeNet that gets <90% training accuracy and still see the phenomenon (a rough sketch of such a variant follows after this list). So, in language modeling with large datasets (and relatively underparametrized models), we still have reason to not trust perplexity too much.

    Notebook · Open In Colab

  • Normalization. Quite a few comments [1, 2, 3, 4] suggested that if the weights or logits are normalized in some way, then we wouldn't see this misleading overfitting of the validation loss. I put batchnorm on every layer and also a final layer norm (without affine parameters) on the logits, and still see the phenomenon, though here the validation loss doesn't blow up nearly as much as before (see the sketch after this list).

    Notebook · Open In Colab
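
For reference, here is a rough PyTorch sketch of the two variations above. The architecture details (channel counts, kernel sizes, where the normalization layers sit) are assumptions for illustration, not the exact models in the linked notebooks.

```python
import torch.nn as nn

class LeNetVariant(nn.Module):
    """LeNet-style CNN sketching the two experimental variations above.

    Assumed details, not the exact notebook architectures:
      * width < 1.0 shrinks every layer (the underparametrized run)
      * normalize=True puts batchnorm after every conv/linear layer and an
        affine-free layer norm on the logits (the normalization run)
    """
    def __init__(self, num_classes=10, in_channels=1, width=1.0, normalize=False):
        super().__init__()
        c1 = max(1, int(6 * width))
        c2 = max(1, int(16 * width))
        h = max(1, int(120 * width))
        norm2d = (lambda c: nn.BatchNorm2d(c)) if normalize else (lambda c: nn.Identity())
        norm1d = (lambda c: nn.BatchNorm1d(c)) if normalize else (lambda c: nn.Identity())
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, c1, 5), norm2d(c1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(c1, c2, 5), norm2d(c2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(h), norm1d(h), nn.ReLU(),
            nn.Linear(h, num_classes),
            # final layer norm on the logits, without affine parameters
            nn.LayerNorm(num_classes, elementwise_affine=False) if normalize else nn.Identity(),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# e.g. a narrow, underparametrized LeNet and a full-width normalized one
small_net = LeNetVariant(width=0.25)
normed_net = LeNetVariant(normalize=True)
```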

Some Related Links

Request for Contribution

File a pull request if

  • you have other clean examples of this (where loss and some quality metric increase simultaneously on the validation set), especially outside of image classification or with losses other than cross entropy
  • you have a nice, simple explanation of this phenomenon, written in markdown or a jupyter notebook
  • you have relevant literature you'd like to link here
