
Some questions regarding evaluations on next keyword prediction #4

Open
zhongpeixiang opened this issue May 18, 2020 · 10 comments

@zhongpeixiang

Hi,

Thank you very much for sharing your work!

I have a few questions regarding the evaluation of keyword prediction. Apologies in advance if I have missed or misunderstood parts of your code, since I'm not familiar with TensorFlow.

  1. For a given history of keywords, there can be multiple target keywords for the next turn. Do you minimize the negative log-likelihood loss for every target keyword? Is the batch loss averaged over the batch size or over the number of target keywords in the batch?

  2. How did you compute the correlation metric? Greedy, average, or max embedding? Do you compute the correlation between only the top-1 keyword and the target keywords, or between the top-k keywords and the targets? Do you average across target keywords before or after computing correlations?

Any response will be appreciated.

Thanks,
Peixiang

@squareRoot3
Owner

Thanks for your interest,

  1. In this repository, we consider next keyword prediction as a binary classification of each candidate keyword and minimize the cross entropy loss over both positive and negative labels. The loss is averaged over all candidate keywords.
        # Turn each example's keyword ids into a multi-hot label vector over the keyword vocab
        kw_labels = tf.map_fn(lambda x: tf.sparse_to_dense(x, [self.kw_vocab.size], 1., 0., False),
                              keywords_ids, dtype=tf.float32, parallel_iterations=True)[:, 4:]
        # Binary cross entropy between the matching scores and the multi-hot labels,
        # averaged over every candidate keyword in the batch
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=kw_labels, logits=matching_score)
        loss = tf.reduce_mean(loss)

You can also minimize the negative log-likelihood loss of every target keyword after a softmax layer. In my experience the training results are similar.

  2. The correlation metric is computed as the maximum cosine similarity over word-embedding pairs between the top-k predicted keywords and all words in the target response (see the sketch below).
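
Roughly, it can be sketched like this in PyTorch (this is not the code in this repository; the embedding matrix, top-k ids, and response ids are placeholders):

import torch
import torch.nn.functional as F

def keyword_correlation(embedding, topk_kw_ids, response_ids):
    """Max cosine similarity between any (predicted keyword, response word) pair.
        embedding:    (vocab_size, embed_size) word embedding matrix
        topk_kw_ids:  (k,) ids of the top-k predicted keywords
        response_ids: (resp_len,) ids of the words in the target response
    """
    kw_vecs = F.normalize(embedding[topk_kw_ids], dim=-1)     # (k, embed_size)
    resp_vecs = F.normalize(embedding[response_ids], dim=-1)  # (resp_len, embed_size)
    sims = kw_vecs @ resp_vecs.t()                            # (k, resp_len) cosine similarities
    return sims.max().item()                                  # best matching pair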

@zhongpeixiang
Author

zhongpeixiang commented May 20, 2020

@squareRoot3 Thank you very much for the quick reply. I have two more questions regarding keyword prediction.

Q1

It seems that the test keywords are used as the keyword vocabulary during training. Is there a reason for this?

./config/data_config.py:

_keywords_path = 'tx_data/test/keywords_vocab.txt'

./model/neural.py:

self.kw_vocab = tx.data.Vocab(self.data_config._keywords_path)

Q2

I experimented with both the binary CE loss over every candidate keyword and the negative log-likelihood loss over every target keyword, and found that the former gives an R@1 of 0.015 while the latter gives an R@1 of 0.065. Why is the former loss not comparable with your results?

Here is the PyTorch code to compute the two losses:

import torch
import torch.nn.functional as F

def compute_BCE(logits, target):
    """
        logits: (batch, vocab_size)
        target: (batch, seq_len), we set seq_len=10 such that each utterance has a max of 10 target keywords, the rest are padded with 0
    """
    target_new = torch.zeros_like(logits) # (batch, vocab_size)
    target_new = target_new.scatter(1, target, 1.0)
    target_new[:,0] = 0 # assign pad token to 0
    loss = F.binary_cross_entropy_with_logits(logits, target_new)
    return loss

def compute_NLLLoss(logits, target):
    """
        logits: (batch, vocab_size)
        target: (batch, seq_len)
    """
    target_mask = target.ne(0).float() # (batch, seq_len), mask out paddings
    logits = F.log_softmax(logits, dim=-1)
    loss = -1 * (torch.gather(logits, dim=1, index=target) * target_mask).sum() # negative log-likelihood loss
    loss = loss/target_mask.sum()
    return loss
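
A quick shape check for the two functions above (the sizes here are made up):

# Hypothetical sizes: batch of 4, keyword vocab of 1000, at most 10 target keywords per example (0 = pad).
logits = torch.randn(4, 1000)
target = torch.randint(1, 1000, (4, 10))
print(compute_BCE(logits, target).item())      # scalar loss over all candidate keywords
print(compute_NLLLoss(logits, target).item())  # scalar loss over the target keywords only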

@squareRoot3
Owner

Q1: My thinking was that the test keyword vocab contains more frequent keywords and is relatively smaller, which can facilitate training. But using the train keyword vocab seems more reasonable. We have fixed this in the new repository: https://github.com/James-Yip/TGODC-DKRN.

Q2: It looks like the implementation of the two losses is correct, so I'm sorry that I have no idea what causes the gap. The BCE loss in our repository works normally.

@zhongpeixiang
Author

Sorry to bother you again. Another strange thing happened with the retrieval-neural model.

I trained a keyword prediction model and obtained around 0.08 test R@1.

I also trained a retrieval baseline (without keyword conditioning) and obtained around 0.51 test R@1.

However, when I train the retrieval-neural model to use predicted keywords to retrieve the next turn, the result is still around 0.51. It seems that using keywords does not improve model performance.
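
For clarity, a minimal sketch of how such an R@1 can be computed, assuming the ground-truth response sits at candidate index 0:

def recall_at_1(scores):
    """scores: (batch, num_candidates) matching scores; the ground truth is candidate 0."""
    return (scores.argmax(dim=-1) == 0).float().mean().item()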

My implementation of conditioning on keywords follows your code:

  1. Predict top 3 keywords for next turn based on keywords history and pretrained keyword predictor.
  2. Average the 3 keyword embeddings.
  3. Apply a linear transformation and get K.
  4. Encode contextual utterances and get C.
  5. Concatenate with contextual utterance representation and get [C;K].
  6. Encode candidate responses and get R.
  7. Use a separate GRU encoder to encode candidates for comparison with keywords, and get R_kw.
  8. Concatenate the two candidate representations and get [R;R_kw].
  9. Apply elementwise multiplication between [C;K] and [R;R_kw], followed by a linear transformation.
import torch
import torch.nn as nn

class SMN(nn.Module):
    def __init__(self, embed_size, vocab_size, hidden_size, n_layers, bidirectional, dropout=0):
        super(SMN, self).__init__()
        self.embed_size = embed_size
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.bidirectional = bidirectional
        self.dropout = dropout
        self.embedding = nn.Embedding(vocab_size, embed_size)

        self.utterance_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.context_encoder = nn.GRU(2*hidden_size, 2*hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=False)
        self.candidate_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.candidate_kw_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.kw_mlp = nn.Linear(embed_size, 2*hidden_size)
        self.match_MLP_kw = nn.Linear(4*hidden_size, 1)
        self.match_MLP = nn.Linear(2*hidden_size, 1)
    
    def init_embedding(self, embedding, fix_word_embedding):
        self.embedding.weight.data.copy_(embedding)
        if fix_word_embedding:
            self.embedding.weight.requires_grad = False
    
    def forward(self, context, candidate, keywords=None):
        """
            context: (batch_size, context_len, seq_len)
            candidate: (batch_size, num_candidates, seq_len)
            keywords: (batch_size, 3)
        """
        # print(context.shape, candidate.shape, keywords.shape)
        batch_size, context_len, seq_len = context.shape
        _, num_candidates, _ = candidate.shape
        context_seq_lengths = context.reshape(batch_size*context_len, -1).ne(0).long().sum(dim=-1) # (batch_size*context_len, )
        context_lengths = context_seq_lengths.reshape(batch_size, context_len).ne(0).long().sum(dim=-1) # (batch_size, )
        candidate_seq_lengths = candidate.reshape(batch_size*num_candidates, -1).ne(0).long().sum(dim=-1) # (batch_size*num_candidates, )
        
        # context encoding
        context_out = self.embedding(context) # (batch, context_len, seq_len, embed_size)
        context_out, _ = self.utterance_encoder(context_out.reshape(batch_size*context_len, seq_len, -1)) # (batch*context_len, seq_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size*context_len), (context_seq_lengths-1).clamp(min=0)] # (batch*context_len, 2*hidden_size)

        context_out, _ = self.context_encoder(context_out.reshape(batch_size, context_len, -1)) # (batch, context_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size), (context_lengths-1).clamp(min=0)] # (batch, 2*hidden_size)

        # keyword encoding
        if keywords is not None:
            kw_out = self.embedding(keywords) # (batch, 3, embed_size)
            kw_out = self.kw_mlp(kw_out.sum(dim=1)) # (batch, 2*hidden_size)
            context_out = torch.cat([context_out, kw_out], dim=-1) # (batch, 4*hidden_size)

        # candidate encoding
        candidate_emb = self.embedding(candidate) # (batch, num_candidates, seq_len, embed_size)
        candidate_out, _ = self.candidate_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1)) # (batch*num_candidates, seq_len, 2*hidden_size)
        candidate_out = candidate_out[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)] # (batch*num_candidates, 2*hidden_size)

        # candidate encoding to compare with keywords
        if keywords is not None:
            candidate_out_kw, _ = self.candidate_kw_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1)) # (batch*num_candidates, seq_len, 2*hidden_size)
            candidate_out_kw = candidate_out_kw[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)] # (batch*num_candidates, 2*hidden_size)
            candidate_out = torch.cat([candidate_out, candidate_out_kw], dim=-1) # (batch*num_candidates, 4*hidden_size)
            out = self.match_MLP_kw((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1) # (batch, num_candidates)
        else:
            out = self.match_MLP((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1) # (batch, num_candidates)
        return out
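
A quick smoke test of the module above, with made-up hyperparameters and random inputs, just to check shapes:

model = SMN(embed_size=200, vocab_size=20000, hidden_size=128, n_layers=1, bidirectional=True)
context = torch.randint(0, 20000, (2, 5, 30))     # (batch, context_len, seq_len)
candidate = torch.randint(0, 20000, (2, 20, 30))  # (batch, num_candidates, seq_len)
keywords = torch.randint(0, 20000, (2, 3))        # (batch, 3) top-3 predicted keywords
scores = model(context, candidate, keywords)
print(scores.shape)  # torch.Size([2, 20]) -- one matching score per candidate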

@zhongpeixiang
Author

It seems that the word embedding from the trained keyword predictor is reused in the retrieval model. I didn't implement that. I will fix this and let you know if it works.
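
Concretely, the sharing I have in mind is roughly the following sketch (kw_predictor stands for my trained keyword prediction model, which also holds an nn.Embedding named embedding):

retrieval_model = SMN(embed_size=200, vocab_size=20000, hidden_size=128, n_layers=1, bidirectional=True)
# Copy the word embedding learned by the keyword predictor into the retrieval model
# (optionally freezing it) before training the retrieval model.
retrieval_model.init_embedding(kw_predictor.embedding.weight.data, fix_word_embedding=True)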

@zhongpeixiang
Author

After reusing the word embedding from the trained keyword predictor, the retrieval-neural model achieves a test R@1 of 0.5235, which is still a bit below the reported 0.5395. Hmm...

@zhongpeixiang
Author

I suspect that one of the reasons is that we use different pretrained word embeddings. How was your pretrained word embedding obtained? GloVe trained on PersonaChat, or GloVe from one of the files here: https://nlp.stanford.edu/projects/glove/ ?

@squareRoot3
Owner

The embedding file is provided in the source data. It was obtained from https://nlp.stanford.edu/projects/glove/ (it seems to be glove.twitter.27B.zip).
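
For reference, loading such a file into an embedding matrix usually looks roughly like the sketch below (this is not the preprocessing code of this repository; the path and word2id mapping are placeholders):

import numpy as np
import torch

def load_glove(path, word2id, embed_size=200):
    """Build an embedding matrix from a GloVe text file, e.g. glove.twitter.27B.200d.txt."""
    # Words missing from the GloVe file keep a small random vector.
    matrix = np.random.uniform(-0.1, 0.1, (len(word2id), embed_size)).astype('float32')
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in word2id and len(vec) == embed_size:
                matrix[word2id[word]] = np.asarray(vec, dtype='float32')
    return torch.from_numpy(matrix)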

Sorry, I am busy with some deadlines and have no time to check your code. If you still have any questions about this repository, feel free to ask me.

@zhongpeixiang
Author

Any advice on why my model did not improve after incorporating keywords?

@zhongpeixiang
Author

Any updates?
