Conflict With "On the Efficacy of Knowledge Distillation" Results #150
AhmedHussKhalifa
started this conversation in
General
-
Hi @AhmedHussKhalifa, thank you for your interest in torchdistill and for the question! From your description, I think the choice of temperature and alpha matters in their setting and produced the different trend. Another possible factor is the number of GPUs (i.e., the effective batch size, and the linear scaling rule for the learning rate if training is distributed).
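To make the role of alpha and temperature concrete, here is a minimal pure-Python sketch of the standard (Hinton-style) knowledge distillation loss. The function names and the small logit vectors are illustrative, not taken from torchdistill or either paper; they just show how alpha weights the hard-label term against the softened teacher term, and how the temperature T flattens both distributions.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_label, alpha=0.5, T=1.0):
    # Hinton-style KD: weighted sum of cross-entropy with the hard label
    # and KL divergence between softened teacher/student distributions.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[true_label])
    q_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    kl = sum(qt * math.log(qt / qs) for qt, qs in zip(q_teacher, q_student))
    # The T**2 factor keeps the soft term's gradient magnitude
    # comparable across temperatures, as in the original KD paper.
    return (1 - alpha) * ce + alpha * (T ** 2) * kl

# With alpha=0.5 and T=1 (the torchdistill setting mentioned above),
# half the loss comes from the hard label and half from the teacher.
loss = kd_loss([2.0, 1.0, 0.0], [1.5, 1.2, 0.1], true_label=0,
               alpha=0.5, T=1.0)
```

Raising T (e.g., to 4) spreads probability mass onto non-target classes, so the student gets a stronger "dark knowledge" signal; combined with a different alpha, this can plausibly shift the final accuracy by a noticeable margin.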
-
Hey,
I want to thank you for this great work.
I went through your ImageNet model trained by KD. The resnet-18 trained with resnet-34 as the teacher reaches 71.34% accuracy, which is really impressive. I found the same experiment in the "On the Efficacy of Knowledge Distillation" paper with an accuracy of 69.21%, but with different hyperparameters, as mentioned below. To the best of my knowledge, you used a different alpha (0.5) and temperature (1).
Do you think this is the only reason for such a large difference?