Hi, I'm the author of the Muse Attention paper. Thank you for your reproduction. Since Muse places several modules (FFN, CNN, self-attention, and cross-attention) in a single residual block, the network becomes shallower (in terms of residual blocks) and wider. Reducing the dimension of each unit by 1/3, using a smaller initialization (e.g., switching from Xavier init to the default torch init), and increasing the number of residual blocks (while keeping the parameter count constant) should help improve performance.
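For reference, here is a minimal PyTorch sketch of what such a wide residual block might look like with narrowed per-unit dimensions. The class name, the down/up projections, and summing the branch outputs are my own assumptions for illustration; this is not the official Muse implementation.

```python
import torch
import torch.nn as nn


class MuseStyleBlock(nn.Module):
    """One wide residual block mixing self-attention, cross-attention, a
    depthwise conv, and an FFN, roughly following the comment above. How Muse
    actually combines the branches may differ; summing the narrowed branch
    outputs and projecting back up is an assumption for illustration only."""

    def __init__(self, d_model: int, n_heads: int = 4, unit_scale: float = 1 / 3):
        super().__init__()
        # "Reduce the dimension of each unit": every branch runs at a narrower
        # width so more of these blocks can be stacked at the same parameter budget.
        d_unit = max(n_heads, int(d_model * unit_scale) // n_heads * n_heads)

        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, d_unit)
        self.up = nn.Linear(d_unit, d_model)

        self.self_attn = nn.MultiheadAttention(d_unit, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_unit, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_unit, d_unit, kernel_size=3, padding=1, groups=d_unit)
        self.ffn = nn.Sequential(
            nn.Linear(d_unit, d_unit), nn.ReLU(), nn.Linear(d_unit, d_unit)
        )
        # "Smaller initialization": rely on torch's default Linear/Conv init
        # rather than re-initializing everything with Xavier, which tends to
        # produce larger initial weights at these widths.

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        h = self.down(self.norm(x))
        m = self.down(memory)                       # encoder memory for cross-attention
        sa, _ = self.self_attn(h, h, h)
        ca, _ = self.cross_attn(h, m, m)
        cv = self.conv(h.transpose(1, 2)).transpose(1, 2)
        ff = self.ffn(h)
        return x + self.up(sa + ca + cv + ff)       # single residual around all units


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)     # (batch, tgt_len, d_model)
    mem = torch.randn(2, 20, 512)   # encoder output
    block = MuseStyleBlock(d_model=512)
    print(block(x, mem).shape)      # torch.Size([2, 16, 512])
```

With the per-unit width cut roughly to a third, you can stack about three times as many of these blocks before matching the parameter count of the original configuration, which is the trade-off the comment suggests.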