
Questions about kNN query and query generator #48

Closed
joris-de opened this issue Apr 23, 2022 · 4 comments

Comments

@joris-de

Hi,
thank you for the great paper. I have some questions about the kNN query and the query generator:

  1. What exactly is the operation in the kNN query layer? In the paper you write "Given the query coordinates p_Q, we query the features of the nearest keys according to the key coordinates p_k." Could you describe this in more detail? Are you still computing attention weights on the nearest neighbors?
  2. In my understanding, the job of the query generator is to make sure that M missing point proxies are produced by the decoder. They are generated from the output of the encoder using a linear projection and a max-pooling operation. But then these coordinates are concatenated with the global features of the encoder. Why is this step needed? Is the query from the previous decoder self-attention not used at all? Or are the coordinates generated from the previous decoder queries?

Thank you for your help.

@yuxumin
Owner

yuxumin commented Apr 23, 2022

Hi, thanks for your interest in our paper.

  1. Each token in the Transformer encoder and decoder corresponds to a specific coordinate. (The coordinates for tokens in the encoder are the center points of the point proxies; the coordinates for tokens in the decoder are the predicted missing center points.) Given the coordinates of these tokens, we can perform kNN-based feature fusion alongside attention-based feature fusion. In detail, we perform a max-pooling operation over each neighborhood. (https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L162-L165)
  2. The queries for the decoder are the initialization of the missing point proxies, which are expected to contain geometric information about the local structures at the given coordinates. We first use the output features of the encoder to predict the coarse coordinates of the missing parts (as in https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L375). Then we initialize the missing proxies from the coarse coordinates with an MLP. To maintain semantic information, we concatenate the coordinates with the global feature before sending them into the MLP.
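A minimal sketch of this query-generation idea, assuming hypothetical names and dimensions for illustration (this is not the repository's exact code; see the linked `Transformer.py` for the real implementation):

```python
import torch
import torch.nn as nn

class QueryGenerator(nn.Module):
    """Sketch: build decoder queries (missing point proxies) from encoder output.

    All layer names and sizes here are assumptions for illustration.
    """
    def __init__(self, dim=384, num_queries=224):
        super().__init__()
        # predict M coarse missing-point coordinates from the global feature
        self.coord_head = nn.Sequential(
            nn.Linear(dim, 1024), nn.GELU(), nn.Linear(1024, num_queries * 3))
        # MLP that initializes proxies from (coarse coords, global feature)
        self.proxy_mlp = nn.Sequential(
            nn.Linear(3 + dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.num_queries = num_queries

    def forward(self, enc_feat):                        # enc_feat: (B, N, dim)
        B = enc_feat.shape[0]
        # global feature = max-pooling over all encoder output tokens
        global_feat = enc_feat.max(dim=1).values        # (B, dim)
        coarse = self.coord_head(global_feat).view(B, self.num_queries, 3)
        # concatenate each coarse coordinate with the global feature,
        # then run the MLP to get the initial decoder queries
        g = global_feat.unsqueeze(1).expand(-1, self.num_queries, -1)
        queries = self.proxy_mlp(torch.cat([coarse, g], dim=-1))  # (B, M, dim)
        return coarse, queries
```

Concatenating the global feature with each coordinate is what carries the semantic information into the per-proxy initialization; the coordinates alone would only encode geometry.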

> Is the query from the previous decoder self-attention not used at all? Or are the coordinates generated from the previous decoder queries?

Sorry, I'm not sure I understand what you are confused about. We only initialize the queries for the first decoder layer, and the attention results of the self-attention layer will be fused with the results from the kNN-based module (https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L166-L167). For details, please refer to the source code (https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L229-L391)
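The kNN query described above can be sketched as follows. This is a simplified illustration under assumed shapes (function name and arguments are hypothetical), not the repository's actual kNN implementation:

```python
import torch

def knn_feature_fusion(q_coords, k_coords, v, k=8):
    """Sketch of the kNN query: for each query coordinate p_Q, find the k
    nearest keys by their coordinates p_k, gather those keys' values V,
    and max-pool them into one feature per query token."""
    # q_coords: (B, M, 3), k_coords: (B, N, 3), v: (B, N, C)
    dist = torch.cdist(q_coords, k_coords)              # (B, M, N) pairwise distances
    idx = dist.topk(k, dim=-1, largest=False).indices   # (B, M, k) nearest-key indices
    B, M, _ = idx.shape
    batch = torch.arange(B, device=v.device).view(B, 1, 1)
    neighborhood = v[batch, idx]                        # (B, M, k, C) gathered values
    return neighborhood.max(dim=2).values               # (B, M, C) max-pooled feature
```

The result of this kNN branch would then be fused with the self-attention output, e.g. by concatenation followed by a linear layer, as the linked lines suggest.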

@joris-de
Author

Thank you for the quick answer.

  1. So the kNN module finds the nearest keys for every query using their coordinates p_k and p_Q? What is done with the values V in the kNN query? (Figure 3 in the paper shows that V is also an input to the kNN query.)
  2. I was confused because in the original Transformer the cross-attention mechanism feeds into the second attention layer of the decoder, but you use the query generator not for the cross-attention but for the inputs of the decoder. Thank you for clarifying! By "global feature of the encoder", do you mean key or value?

@yuxumin
Owner

yuxumin commented Apr 24, 2022

  1. Yes, we use p_k and p_Q to determine the neighborhood of a given token, and use the max-pooling of the Vs in this neighborhood to update that token's feature.
  2. The global feature of the encoder is the max-pooling result over all the output features of the encoder.

@yuxumin
Owner

yuxumin commented May 5, 2022

Closing this since there has been no response. Feel free to re-open it if the problem still exists.

@yuxumin yuxumin closed this as completed May 5, 2022