
Some questions regarding evaluations on next keyword prediction #4

Open
zhongpeixiang opened this issue May 18, 2020 · 10 comments

@zhongpeixiang

Hi,

Thank you very much for sharing your work!

I have a few questions regarding the evaluation of keyword prediction. Apologies in advance if I have missed or misunderstood parts of your code, since I'm not familiar with TensorFlow.

  1. For a given history of keywords, there can be multiple target keywords for the next turn. Do you minimize the negative log-likelihood loss for every target keyword? Is the batch loss averaged over the batch size or over the number of target keywords in the batch?

  2. How did you compute the correlation metric? Greedy, average, or max embedding? Do you compute the correlation between only the top-1 keyword and the target keywords, or between the top-k keywords and the targets? Do you average across target keywords before or after computing correlations?

Any response will be appreciated.

Thanks,
Peixiang

@squareRoot3
Owner

Thanks for your interest,

  1. In this repository, we consider next keyword prediction as a binary classification of each candidate keyword and minimize the cross entropy loss over both positive and negative labels. The loss is averaged over all candidate keywords.
        # Turn each example's keyword ids into a multi-hot label vector over the keyword vocab
        kw_labels = tf.map_fn(lambda x: tf.sparse_to_dense(x, [self.kw_vocab.size], 1., 0., False),
                              keywords_ids, dtype=tf.float32, parallel_iterations=True)[:, 4:]
        # Binary cross entropy between the matching scores and the multi-hot labels,
        # averaged over every candidate keyword in the batch
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=kw_labels, logits=matching_score)
        loss = tf.reduce_mean(loss)

You can also minimize the negative log-likelihood loss of every target keyword after a softmax layer. In my experience the training results are similar.

  2. The correlation metric is computed as the maximum cosine similarity over word-embedding pairs between the top-k predicted keywords and all words in the target response (see the sketch below).
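
Roughly, it can be sketched like this in PyTorch (this is not the code in this repository; the embedding matrix, top-k ids, and response ids are placeholders):

import torch
import torch.nn.functional as F

def keyword_correlation(embedding, topk_kw_ids, response_ids):
    """Max cosine similarity between any (predicted keyword, response word) pair.
        embedding:    (vocab_size, embed_size) word embedding matrix
        topk_kw_ids:  (k,) ids of the top-k predicted keywords
        response_ids: (resp_len,) ids of the words in the target response
    """
    kw_vecs = F.normalize(embedding[topk_kw_ids], dim=-1)     # (k, embed_size)
    resp_vecs = F.normalize(embedding[response_ids], dim=-1)  # (resp_len, embed_size)
    sims = kw_vecs @ resp_vecs.t()                            # (k, resp_len) cosine similarities
    return sims.max().item()                                  # best matching pair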

@zhongpeixiang
Author

zhongpeixiang commented May 20, 2020

@squareRoot3 Thank you very much for the quick reply. I have two more questions regarding keyword prediction.

Q1

It seems that the test keywords are used as the keyword vocabulary during training. Is there a reason for this?

./config/data_config.py:

_keywords_path = 'tx_data/test/keywords_vocab.txt'

./model/neural.py:

self.kw_vocab = tx.data.Vocab(self.data_config._keywords_path)

Q2

I experimented with both the binary CE loss over every candidate keyword and the negative log-likelihood loss over every target keyword, and found that the former gives an R@1 of 0.015 while the latter gives an R@1 of 0.065. Why is the former loss not comparable with your results?

Here is the PyTorch code to compute the two losses:

import torch
import torch.nn.functional as F

def compute_BCE(logits, target):
    """
        logits: (batch, vocab_size)
        target: (batch, seq_len), we set seq_len=10 such that each utterance has a max of 10 target keywords, the rest are padded with 0
    """
    target_new = torch.zeros_like(logits) # (batch, vocab_size)
    target_new = target_new.scatter(1, target, 1.0)
    target_new[:,0] = 0 # assign pad token to 0
    loss = F.binary_cross_entropy_with_logits(logits, target_new)
    return loss

def compute_NLLLoss(logits, target):
    """
        logits: (batch, vocab_size)
        target: (batch, seq_len)
    """
    target_mask = target.ne(0).float() # (batch, seq_len), mask out paddings
    logits = F.log_softmax(logits, dim=-1)
    loss = -1 * (torch.gather(logits, dim=1, index=target) * target_mask).sum() # negative log-likelihood loss
    loss = loss/target_mask.sum()
    return loss
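
A quick shape check for the two functions above (the sizes here are made up):

# Hypothetical sizes: batch of 4, keyword vocab of 1000, at most 10 target keywords per example (0 = pad).
logits = torch.randn(4, 1000)
target = torch.randint(1, 1000, (4, 10))
print(compute_BCE(logits, target).item())      # scalar loss over all candidate keywords
print(compute_NLLLoss(logits, target).item())  # scalar loss over the target keywords only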

@squareRoot3
Owner

Q1: My thinking was that the test keyword vocab contains more frequent keywords and is relatively smaller, which can facilitate training. But using the train keyword vocab seems more reasonable. We have fixed this in the new repository: https://github.com/James-Yip/TGODC-DKRN.

Q2: It looks like the implementation of the two losses is correct, so I'm sorry that I have no idea what causes the gap. The BCE loss in our repository works normally.

@zhongpeixiang
Author

Sorry to bother you again. Another strange thing happened with the retrieval-neural model.

I trained a keyword prediction model and obtained around 0.08 test R@1.

I also trained a retrieval baseline (without keyword conditioning) and obtained around 0.51 test R@1.

However, when I train the retrieval-neural model to use predicted keywords to retrieve the next turn, the result is still around 0.51. It seems that using keywords does not improve model performance.
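
For clarity, a minimal sketch of how such an R@1 can be computed, assuming the ground-truth response sits at candidate index 0:

def recall_at_1(scores):
    """scores: (batch, num_candidates) matching scores; the ground truth is candidate 0."""
    return (scores.argmax(dim=-1) == 0).float().mean().item()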

My implementation of conditioning on keywords follows your code:

  1. Predict top 3 keywords for next turn based on keywords history and pretrained keyword predictor.
  2. Average the 3 keyword embeddings.
  3. Apply a linear transformation and get K.
  4. Encode contextual utterances and get C.
  5. Concatenate with contextual utterance representation and get [C;K].
  6. Encode candidate responses and get R.
  7. Use a separate GRU encoder to encode candidates for comparison with keywords, and get R_kw.
  8. Concatenate the two candidate representations and get [R;R_kw].
  9. Apply elementwise multiplication between [C;K] and [R;R_kw], followed by a linear transformation.
import torch
import torch.nn as nn

class SMN(nn.Module):
    def __init__(self, embed_size, vocab_size, hidden_size, n_layers, bidirectional, dropout=0):
        super(SMN, self).__init__()
        self.embed_size = embed_size
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.bidirectional = bidirectional
        self.dropout = dropout
        self.embedding = nn.Embedding(vocab_size, embed_size)

        self.utterance_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.context_encoder = nn.GRU(2*hidden_size, 2*hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=False)
        self.candidate_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.candidate_kw_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.kw_mlp = nn.Linear(embed_size, 2*hidden_size)
        self.match_MLP_kw = nn.Linear(4*hidden_size, 1)
        self.match_MLP = nn.Linear(2*hidden_size, 1)
    
    def init_embedding(self, embedding, fix_word_embedding):
        self.embedding.weight.data.copy_(embedding)
        if fix_word_embedding:
            self.embedding.weight.requires_grad = False
    
    def forward(self, context, candidate, keywords=None):
        """
            context: (batch_size, context_len, seq_len)
            candidate: (batch_size, num_candidates, seq_len)
            keywords: (batch_size, 3)
        """
        # print(context.shape, candidate.shape, keywords.shape)
        batch_size, context_len, seq_len = context.shape
        _, num_candidates, _ = candidate.shape
        context_seq_lengths = context.reshape(batch_size*context_len, -1).ne(0).long().sum(dim=-1) # (batch_size*context_len, )
        context_lengths = context_seq_lengths.reshape(batch_size, context_len).ne(0).long().sum(dim=-1) # (batch_size, )
        candidate_seq_lengths = candidate.reshape(batch_size*num_candidates, -1).ne(0).long().sum(dim=-1) # (batch_size*num_candidates, )
        
        # context encoding
        context_out = self.embedding(context) # (batch, context_len, seq_len, embed_size)
        context_out, _ = self.utterance_encoder(context_out.reshape(batch_size*context_len, seq_len, -1)) # (batch*context_len, seq_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size*context_len), (context_seq_lengths-1).clamp(min=0)] # (batch*context_len, 2*hidden_size)

        context_out, _ = self.context_encoder(context_out.reshape(batch_size, context_len, -1)) # (batch, context_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size), (context_lengths-1).clamp(min=0)] # (batch, 2*hidden_size)

        # keyword encoding
        if keywords is not None:
            kw_out = self.embedding(keywords) # (batch, 3, embed_size)
            kw_out = self.kw_mlp(kw_out.sum(dim=1)) # (batch, 2*hidden_size)
            context_out = torch.cat([context_out, kw_out], dim=-1) # (batch, 4*hidden_size)

        # candidate encoding
        candidate_emb = self.embedding(candidate) # (batch, num_candidates, seq_len, embed_size)
        candidate_out, _ = self.candidate_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1)) # (batch*num_candidates, seq_len, 2*hidden_size)
        candidate_out = candidate_out[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)] # (batch*num_candidates, 2*hidden_size)

        # candidate encoding to compare with keywords
        if keywords is not None:
            candidate_out_kw, _ = self.candidate_kw_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1)) # (batch*num_candidates, seq_len, 2*hidden_size)
            candidate_out_kw = candidate_out_kw[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)] # (batch*num_candidates, 2*hidden_size)
            candidate_out = torch.cat([candidate_out, candidate_out_kw], dim=-1) # (batch*num_candidates, 4*hidden_size)
            out = self.match_MLP_kw((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1) # (batch, num_candidates)
        else:
            out = self.match_MLP((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1) # (batch, num_candidates)
        return out
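
A quick smoke test of the module above, with made-up hyperparameters and random inputs, just to check shapes:

model = SMN(embed_size=200, vocab_size=20000, hidden_size=128, n_layers=1, bidirectional=True)
context = torch.randint(0, 20000, (2, 5, 30))     # (batch, context_len, seq_len)
candidate = torch.randint(0, 20000, (2, 20, 30))  # (batch, num_candidates, seq_len)
keywords = torch.randint(0, 20000, (2, 3))        # (batch, 3) top-3 predicted keywords
scores = model(context, candidate, keywords)
print(scores.shape)  # torch.Size([2, 20]) -- one matching score per candidate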

@zhongpeixiang
Author

It seems that the word embedding from the trained keyword predictor is reused in the retrieval model. I didn't implement that. I will fix this and let you know if it works.
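
Concretely, the sharing I have in mind is roughly the following sketch (kw_predictor stands for my trained keyword prediction model, which also holds an nn.Embedding named embedding):

retrieval_model = SMN(embed_size=200, vocab_size=20000, hidden_size=128, n_layers=1, bidirectional=True)
# Copy the word embedding learned by the keyword predictor into the retrieval model
# (optionally freezing it) before training the retrieval model.
retrieval_model.init_embedding(kw_predictor.embedding.weight.data, fix_word_embedding=True)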

@zhongpeixiang
Author

After reusing the word embedding from the trained keyword predictor, the retrieval-neural model achieves a test R@1 of 0.5235, which is still a bit below the reported 0.5395. Hmm...

@zhongpeixiang
Author

I suspect that one of the reasons is that we use different pretrained word embeddings. How was your pretrained word embedding obtained? GloVe trained on PersonaChat, or GloVe from one of the files here: https://nlp.stanford.edu/projects/glove/ ?

@squareRoot3
Owner

The embedding file is provided in the source data. It was obtained from https://nlp.stanford.edu/projects/glove/ (it seems to be glove.twitter.27B.zip).
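
For reference, loading such a file into an embedding matrix usually looks roughly like the sketch below (this is not the preprocessing code of this repository; the path and word2id mapping are placeholders):

import numpy as np
import torch

def load_glove(path, word2id, embed_size=200):
    """Build an embedding matrix from a GloVe text file, e.g. glove.twitter.27B.200d.txt."""
    # Words missing from the GloVe file keep a small random vector.
    matrix = np.random.uniform(-0.1, 0.1, (len(word2id), embed_size)).astype('float32')
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in word2id and len(vec) == embed_size:
                matrix[word2id[word]] = np.asarray(vec, dtype='float32')
    return torch.from_numpy(matrix)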

Sorry, I am busy with some deadlines and have no time to check your code. If you still have any questions about this repository, feel free to ask me.

@zhongpeixiang
Author

Any advice on why my model did not improve after incorporating keywords?

@zhongpeixiang
Author

Any updates?
