
FID of Tiny-ImageNet or ImageNet 64x64 #4

Closed
Yeez-lee opened this issue Nov 21, 2023 · 5 comments
Comments

@Yeez-lee commented Nov 21, 2023

Hi,

Thanks for your code. I have a question about FID when the dataset is larger. If my dataset is Tiny-ImageNet or ImageNet 64x64, how many images should I generate to calculate FID? The exact number of images in Tiny-ImageNet or ImageNet 64x64 (larger than 50k)? And should I change the batch size (125) and the number of batches (400, since 125*400 = 50k) in sample.py?

BTW, I see that other codebases use total_training_steps instead of epochs. What is the relationship between the two?

@FutureXiang (Owner)

Hi,

Thank you for your interest.

  1. The standard way to report FID is to (1) generate 50k images and (2) compare them to the source dataset. For example, you may want to use 50k generated images and 1.28m ImageNet images to calculate FID on IN64x64. It is NOT necessary to keep the source set size (e.g., 50k CIFAR / 100k Tiny-IN / 1.28m IN) the same as the target set size (50k), because 50k is large enough.
    • However, for experimental purposes (e.g., checkpoint & hyper-parameter selection), you may want to monitor the FID using only 10k images, which is more efficient.
    • Please note that the FID calculation module used in this repo (i.e., pytorch-fid) may NOT be suitable for very large datasets, because it re-extracts features and statistics for the source set on every calculation. To do this more efficiently and elegantly, please check the EDM repo. (A minimal pytorch-fid invocation is sketched after this list.)
  2. I prefer using num_epochs because the total number of training images can be determined, given a specific dataset. Likewise, the EDM training code uses total_kimg to represent training duration. In contrast, total_training_steps is fairly meaningless as a measure of training budget, because different implementations may use different batch sizes.
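
For reference, a minimal FID run with pytorch-fid might look like the sketch below. The folder names are placeholders and the exact function signature may differ across pytorch-fid versions; note that this recomputes the source-set statistics on every call, which is exactly why caching them (as EDM does) is preferable for large datasets.

```python
# Hedged sketch: compute FID between 50k generated images and the full
# source dataset using pytorch-fid. Paths are illustrative placeholders.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

device = "cuda" if torch.cuda.is_available() else "cpu"

fid = calculate_fid_given_paths(
    ["./samples_50k",        # folder of 50k generated images
     "./imagenet64_train"],  # full source set (e.g., 1.28m images)
    batch_size=128,
    device=device,
    dims=2048,               # default InceptionV3 pool3 feature dimension
)
print(f"FID: {fid:.2f}")
```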

@Yeez-lee (Author)

Hi,
Thank you for your reply. I have two follow-up questions.
(1) Even for a larger dataset (100k Tiny-IN / 1.28m IN), comparing only 50k generated images against the source images (100k or 1.28m) is enough for FID. Is this correct?
(2) In your code, does num_epochs mean that in each epoch every sample in the dataset is seen once? For CIFAR-10, if I train for 1000 epochs, do I process 1000*50k training images in total (like total_kimg in EDM)? If total_kimg is smaller than the dataset size, does that mean not every sample is seen?

@FutureXiang (Owner)

(1) Yes.
(2) Yes. But typically, we have num_epochs >> 1 and total_kimg >> |dataset|. For example, DDPM trains 2048 epochs on CIFAR-10, while EDM trains 4000 epochs on CIFAR-10 and 1950 epochs on ImageNet-64.
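
To make the bookkeeping concrete, here is a small arithmetic sketch relating num_epochs, total_kimg, and total_training_steps (the batch size is an arbitrary example, not a value from this repo):

```python
# Rough conversion between epochs, EDM-style total_kimg, and step counts.
dataset_size = 50_000   # CIFAR-10
batch_size   = 128      # illustrative value only
num_epochs   = 2048     # e.g., DDPM on CIFAR-10

images_seen = num_epochs * dataset_size      # total training images
total_kimg  = images_seen // 1000            # EDM-style duration: 102_400
total_steps = images_seen // batch_size      # 800_000 steps at batch 128

print(total_kimg, total_steps)
```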

@Yeez-lee (Author) commented Dec 4, 2023

Thanks for your help. I notice that your work uses unconditional models (DDPM or EDM). What if these models are conditional ones (with CFG, as in your other repository)? DDAE (DiT-XL/2) is evaluated in an unconditional manner, but what about the results of conditional (DDPM or EDM) models?

@FutureXiang (Owner)

The CFG models (which are joint 10% uncond + 90% cond models) yield worse representations than pure unconditional models, despite achieving SOTA generative FIDs.

  • If we use y=null to retrieve features (i.e. in an unconditional manner), the features are still somewhat good, but the performance degrades (e.g. unconditional models reach ~90% K-NN acc on CIFAR-10, while CFG ones reach ~84%). I think (1) the insufficient 10% training of the unconditional version and (2) the joint parameterization limit the performance.
  • If we pass a non-null y in [1...C] (i.e., actual label embeddings) to retrieve features, those features are basically meaningless.
  • An alternative way to use conditional models for classification is to probe the correct label in a zero-shot manner arxiv1 arxiv2, similar to CLIP. However, this approach is somewhat inefficient because it relies on $T\times C$ forward passes to infer an answer (see the sketch after this list).
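
As a rough illustration of that last point, a zero-shot diffusion classifier in the spirit of those references can score each candidate label by its denoising error and pick the smallest one. The sketch below assumes a hypothetical conditional denoiser `model(x_t, t, y)` and uses a placeholder noise schedule, not the actual schedule from this repo:

```python
# Hedged sketch of zero-shot classification with a conditional diffusion
# model: lower average denoising error for a label => more likely class.
import torch

@torch.no_grad()
def zero_shot_classify(model, x0, num_classes, num_steps=50):
    device = x0.device
    losses = torch.zeros(num_classes, device=device)
    for t in torch.linspace(0.02, 0.98, num_steps, device=device):
        noise = torch.randn_like(x0)
        # Placeholder schedule: x_t = sqrt(1 - t) * x0 + sqrt(t) * noise
        x_t = (1 - t).sqrt() * x0 + t.sqrt() * noise
        for y in range(num_classes):
            y_batch = torch.full((x0.shape[0],), y, device=device)
            pred = model(x_t, t.expand(x0.shape[0]), y_batch)
            losses[y] += torch.mean((pred - noise) ** 2)
    # Requires num_steps * num_classes forward passes per input batch.
    return losses.argmin().item()
```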

@FutureXiang pinned this issue Dec 12, 2023