
The projection head order needs to be relooked at #17

Open
Akshay1-6180 opened this issue Jan 30, 2024 · 2 comments

Comments

@Akshay1-6180

Going through these papers:

  1. https://arxiv.org/pdf/1603.05027.pdf
  2. https://arxiv.org/pdf/2302.06112.pdf
class ProjectionHead(nn.Module):
    def __init__(
        self,
        embedding_dim,
        projection_dim=CFG.projection_dim,
        dropout=CFG.dropout
    ):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)
    
    def forward(self, x):
        projected = self.projection(x)   # Linear: embedding_dim -> projection_dim
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected                # residual add
        x = self.layer_norm(x)           # post-norm: LayerNorm applied after the residual add
        return x

I feel the order should instead be this:

class ProjectionHead(nn.Module):
    def __init__(
        self,
        embedding_dim,
        projection_dim=CFG.projection_dim,
        dropout=CFG.dropout
    ):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)
    
    def forward(self, x):
        projected = self.projection(x)   # Linear: embedding_dim -> projection_dim
        x = self.layer_norm(projected)   # pre-norm: LayerNorm before the activation
        x = self.gelu(x)
        x = self.dropout(x)
        x = self.fc(x)
        x = x + projected                # residual add around the normalized branch

        return x
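
For anyone comparing the two orderings, here is a minimal, self-contained sketch (not from the repo; the CFG values and the PostNormHead/PreNormHead names are hypothetical stand-ins) that checks both heads accept the same input and return the same shape:

# Minimal sketch: compare the original (post-norm) and proposed (pre-norm) orderings.
# CFG here is a hypothetical stub, not the repository's actual config.
import torch
import torch.nn as nn

class CFG:
    projection_dim = 256
    dropout = 0.1

class PostNormHead(nn.Module):
    # Original ordering: Linear -> GELU -> Linear -> Dropout -> residual add -> LayerNorm
    def __init__(self, embedding_dim, projection_dim=CFG.projection_dim, dropout=CFG.dropout):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        return self.layer_norm(x + projected)

class PreNormHead(PostNormHead):
    # Proposed ordering: Linear -> LayerNorm -> GELU -> Dropout -> Linear -> residual add
    def forward(self, x):
        projected = self.projection(x)
        x = self.layer_norm(projected)
        x = self.gelu(x)
        x = self.dropout(x)
        x = self.fc(x)
        return x + projected

if __name__ == "__main__":
    x = torch.randn(4, 768)  # e.g. a batch of 4 encoder embeddings
    for head in (PostNormHead(768), PreNormHead(768)):
        print(type(head).__name__, head(x).shape)  # both print torch.Size([4, 256])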
@GewelsJI

Do you know why GELU is used here? @Akshay1-6180

@Akshay1-6180
Author

Akshay1-6180 commented Feb 20, 2024

Based on experiments, GELU was found to have a significantly smoother gradient transition; it isn't abrupt or sharp like ReLU. If you look at both functions you'll see the difference.
Moreover, look at the GPT-2 code: they use GELU, and many other models I have encountered also use it, so I went with it.

https://github.com/openai/gpt-2/blob/master/src/model.py
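
A quick way to see the smoothness difference is this minimal sketch (not from the repo), which evaluates both activations' gradients near zero with autograd:

# Minimal sketch: compare ReLU and GELU gradients around zero.
# ReLU's derivative is a hard step (0 for x <= 0, 1 for x > 0), while GELU's
# derivative changes smoothly and is even slightly negative for small negative x.
import torch
import torch.nn.functional as F

x = torch.linspace(-2.0, 2.0, 9, requires_grad=True)

relu_grad = torch.autograd.grad(F.relu(x).sum(), x)[0]
gelu_grad = torch.autograd.grad(F.gelu(x).sum(), x)[0]

for xi, rg, gg in zip(x.detach(), relu_grad, gelu_grad):
    print(f"x={xi:+.2f}  dReLU/dx={rg:.3f}  dGELU/dx={gg:.3f}")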
