You can download our paper at https://www.techrxiv.org/articles/preprint/FreeV_Free_Lunch_in_MultiModal_Diffusion_U-ViT/24633840
You can apply the FreeV method by replacing the "libs/uvit_multi_post_ln.py" file in UniDiffuser. You can also read uvit_multi_post_ln.py directly to understand our improved method (it is as simple as FreeU and takes only a few lines of code to implement).
Load the image to create an instance: https://www.codewithgpu.com/i/GoldenFishes/FreeV/FreeV_In_UniDiffuser
```python
import torch

@torch.no_grad()
def Fourier_filter(x, threshold, scale):
    # FFT along the token dimension, shifting the zero frequency to the center
    x_freq = torch.fft.fftn(x, dim=-2)
    x_freq = torch.fft.fftshift(x_freq, dim=-2)
    B, L, C = x_freq.shape
    mask = torch.ones((B, L, C), device=x.device)
    # scale a band of width 2 * threshold around the center frequency
    center = L // 2
    mask[:, center - threshold:center + threshold, :] = scale
    x_freq = x_freq * mask
    # IFFT
    x_freq = torch.fft.ifftshift(x_freq, dim=-2)
    x_filtered = torch.fft.ifftn(x_freq, dim=-2).real
    return x_filtered
```
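As a quick sanity check, the band-scaling behaviour of the filter can be exercised on a random tensor. This is a minimal self-contained sketch (the function below mirrors the filter above but is device-agnostic; shapes follow the (B, L, C) layout used in U-ViT):

```python
import torch

def fourier_band_scale(x, threshold, scale):
    # FFT along the token dimension (dim=-2), shift DC to the center,
    # scale a band of width 2*threshold around the center, then invert.
    x_freq = torch.fft.fftshift(torch.fft.fft(x, dim=-2), dim=-2)
    B, L, C = x_freq.shape
    mask = torch.ones((B, L, C), dtype=x.dtype, device=x.device)
    center = L // 2
    mask[:, center - threshold:center + threshold, :] = scale
    x_freq = x_freq * mask
    x_freq = torch.fft.ifftshift(x_freq, dim=-2)
    return torch.fft.ifft(x_freq, dim=-2).real

x = torch.randn(2, 16, 8)
y = fourier_band_scale(x, threshold=1, scale=1.0)  # scale=1 is a no-op
z = fourier_band_scale(x, threshold=1, scale=0.5)  # damp the low-frequency band
print(torch.allclose(x, y, atol=1e-5))  # True: identity when scale == 1
print(y.shape == x.shape)               # True: shape is preserved
```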
```python
class Block_in_UViT(nn.Module):
    # Only the FreeV additions are marked; skip_linear, norm1/2/3, attn, mlp,
    # and drop_path come from the original Block in uvit_multi_post_ln.py.
    def __init__(self, Free_b=1, Free_s=1, Free_f=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.free_b = Free_b  # backbone-feature modulation factor
        self.free_s = Free_s  # skip-feature Fourier scale
        self.free_f = Free_f  # post-fusion modulation factor

    def forward(self, x, skip=None):
        if self.skip_linear is not None:
            # ----------------- FreeV code ---------------------
            x[:, :, 768:] = x[:, :, 768:] * self.free_b
            skip = Fourier_filter(skip, threshold=1, scale=self.free_s)
            # --------------------------------------------------
            x = self.skip_linear(torch.cat([x, skip], dim=-1))
        x = self.norm1(x)
        # ----------------- FreeV code ---------------------
        x[:, :, :768] = x[:, :, :768] * self.free_f
        # --------------------------------------------------
        x = x + self.drop_path(self.attn(x))
        x = self.norm2(x)
        x = x + self.drop_path(self.mlp(x))
        x = self.norm3(x)
        return x
```
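The two channel-wise modulation steps can be reproduced in isolation on random tensors. This is a toy sketch, not the actual Block: the 1536-channel width is an assumption (768 marks the split used in the snippet above), the Linear/LayerNorm layers stand in for the real Block internals, and the factor values are illustrative, not tuned:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, split = 1536, 768          # assumed embedding width; split from the snippet above
free_b, free_f = 1.1, 1.2       # example modulation factors, not tuned values

skip_linear = nn.Linear(2 * dim, dim)
norm1 = nn.LayerNorm(dim)

with torch.no_grad():
    x = torch.randn(2, 16, dim)      # (batch, tokens, channels)
    skip = torch.randn(2, 16, dim)
    # scale the upper channel half of the backbone features by free_b
    x[:, :, split:] = x[:, :, split:] * free_b
    # (the skip branch would be Fourier-filtered with scale=free_s here)
    x = skip_linear(torch.cat([x, skip], dim=-1))
    x = norm1(x)
    # scale the lower channel half of the fused features by free_f
    x[:, :, :split] = x[:, :, :split] * free_f

print(tuple(x.shape))  # (2, 16, 1536): fusion keeps the embedding width
```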
It is evident that, with the FreeV improvement, the model exhibits a smoother denoising process with more consistent information changes between adjacent denoising steps. In conventional models for non-generative tasks, cosine similarity scores between layers are read in terms of feature extraction: lower scores between consecutive layers indicate more information gain. In generative diffusion models, by contrast, a smoother generation process signifies a more unified trend and better performance, provided the final image quality holds up. Therefore, higher cosine similarity scores between adjacent denoising steps are considered better in generative diffusion models.
The cosine similarity scores in the later stages of denoising are consistently higher with FreeV than with the unmodified model. This observation explains why FreeV performs so well in denoising: by appropriately modulating the features from the skip connections and the backbone, FreeV leads to a more unified generation trend, at least for image generation tasks.
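The adjacent-step similarity discussed above can be measured with a few lines. This is a sketch under the assumption that you collect the intermediate states (one tensor per denoising step) during sampling; the toy trajectory below is random data, not real sampler output:

```python
import torch
import torch.nn.functional as F

def adjacent_step_similarity(latents):
    # Cosine similarity between each pair of consecutive denoising states.
    # latents: list of equally-shaped tensors, one per denoising step.
    sims = []
    for a, b in zip(latents[:-1], latents[1:]):
        sims.append(F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item())
    return sims

# toy trajectory: each step is a small perturbation of the previous state
torch.manual_seed(0)
steps = [torch.randn(4, 64)]
for _ in range(5):
    steps.append(steps[-1] + 0.1 * torch.randn(4, 64))

sims = adjacent_step_similarity(steps)
print(len(sims))  # 5: one score per pair of adjacent steps
```

A smoother trajectory shows up as scores closer to 1 between consecutive steps.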
We introduce a simple yet effective approach tailored to the U-ViT architecture, called 'FreeV.' Leveraging the unique feature-fusion mechanism of a visual Transformer equipped with skip connections, FreeV significantly enhances the denoising and generation capabilities of the U-ViT architecture, substantially improving the quality of generated outputs without any additional training or fine-tuning.
Initially, the backbone features are scaled by the modulation factor b before being fused with the skip features; in the code above, this corresponds to x[:, :, 768:] = x[:, :, 768:] * self.free_b.
The direct incorporation of features from shallow Transformer Blocks into deep layers through skip connections introduces a significant amount of high-frequency information, which leads us to a choice similar to FreeU's: the modulation factor s is applied to the skip features in the Fourier domain. In the code above, this corresponds to skip = Fourier_filter(skip, threshold=1, scale=self.free_s), which scales a narrow band around the center frequency by s.
The actual results validate the effectiveness of the FreeV approach proposed in this paper within the U-ViT architecture: augmentation by the modulation factors b, s, and f improves generation quality without retraining.
Our paper was rejected, so you can only download it at TechRxiv. The reviews are in this repo: ./FreeV_Reviews.md