
Questions about kNN query and query generator #48

Closed
joris-de opened this issue Apr 23, 2022 · 4 comments

Comments

@joris-de

Hi,
thank you for the great paper. I have some questions about the kNN query and the query generator:

  1. What exactly is the operation in the kNN query layer? In the paper you write "Given the query coordinates p_Q, we query the features of the nearest keys according to the key coordinates p_k." Could you describe this in more detail? Are you still computing attention weights on the nearest neighbors?
  2. In my understanding, the job of the query generator is to make sure that M missing point proxies are produced by the decoder. They are generated from the output of the encoder using a linear projection and a max-pooling operation. But then these coordinates are concatenated with the global features of the encoder. Why is this step needed? Is the query from the previous decoder self-attention not used at all? Or are the coordinates generated from the previous decoder queries?

Thank you for your help.

@yuxumin
Owner

yuxumin commented Apr 23, 2022

Hi, thanks for your interest in our paper.

  1. Each token in the Transformer encoder and decoder corresponds to a specific coordinate. (The coordinates for tokens in the encoder are the center points of the point proxies; the coordinates for tokens in the decoder are the predicted missing center points.) Given the coordinates of these tokens, we can perform kNN-based feature fusion alongside attention-based feature fusion. In detail, we perform a max-pooling operation over each neighborhood. (https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L162-L165)
  2. The queries for the decoder are the initialization of the missing point proxies, which are expected to contain geometric information about the local structures at the given coordinates. We first use the output features of the encoder to predict the coarse coordinates of the missing parts (as in https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L375). Then we initialize the missing proxies from the coarse coordinates with an MLP. To maintain semantic information, we concatenate the coordinates with the global feature before sending them into the MLP.
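A minimal sketch of this query-generation idea, assuming hypothetical names and dimensions for illustration (this is not the repository's exact code; see the linked `Transformer.py` for the real implementation):

```python
import torch
import torch.nn as nn

class QueryGenerator(nn.Module):
    """Sketch: build decoder queries (missing point proxies) from encoder output.

    All layer names and sizes here are assumptions for illustration.
    """
    def __init__(self, dim=384, num_queries=224):
        super().__init__()
        # predict M coarse missing-point coordinates from the global feature
        self.coord_head = nn.Sequential(
            nn.Linear(dim, 1024), nn.GELU(), nn.Linear(1024, num_queries * 3))
        # MLP that initializes proxies from (coarse coords, global feature)
        self.proxy_mlp = nn.Sequential(
            nn.Linear(3 + dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.num_queries = num_queries

    def forward(self, enc_feat):                        # enc_feat: (B, N, dim)
        B = enc_feat.shape[0]
        # global feature = max-pooling over all encoder output tokens
        global_feat = enc_feat.max(dim=1).values        # (B, dim)
        coarse = self.coord_head(global_feat).view(B, self.num_queries, 3)
        # concatenate each coarse coordinate with the global feature,
        # then run the MLP to get the initial decoder queries
        g = global_feat.unsqueeze(1).expand(-1, self.num_queries, -1)
        queries = self.proxy_mlp(torch.cat([coarse, g], dim=-1))  # (B, M, dim)
        return coarse, queries
```

Concatenating the global feature with each coordinate is what carries the semantic information into the per-proxy initialization; the coordinates alone would only encode geometry.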

> Is the query from the previous decoder self-attention not used at all? Or are the coordinates generated from the previous decoder queries?

Sorry, I'm not sure I understand what you are confused about. We only initialize the queries for the first decoder layer, and the attention results of the self-attention layer will be fused with the results from the kNN-based module (https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L166-L167). For details, please refer to the source code (https://github.com/yuxumin/PoinTr/blob/master/models/Transformer.py#L229-L391)
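The kNN query described above can be sketched as follows. This is a simplified illustration under assumed shapes (function name and arguments are hypothetical), not the repository's actual kNN implementation:

```python
import torch

def knn_feature_fusion(q_coords, k_coords, v, k=8):
    """Sketch of the kNN query: for each query coordinate p_Q, find the k
    nearest keys by their coordinates p_k, gather those keys' values V,
    and max-pool them into one feature per query token."""
    # q_coords: (B, M, 3), k_coords: (B, N, 3), v: (B, N, C)
    dist = torch.cdist(q_coords, k_coords)              # (B, M, N) pairwise distances
    idx = dist.topk(k, dim=-1, largest=False).indices   # (B, M, k) nearest-key indices
    B, M, _ = idx.shape
    batch = torch.arange(B, device=v.device).view(B, 1, 1)
    neighborhood = v[batch, idx]                        # (B, M, k, C) gathered values
    return neighborhood.max(dim=2).values               # (B, M, C) max-pooled feature
```

The result of this kNN branch would then be fused with the self-attention output, e.g. by concatenation followed by a linear layer, as the linked lines suggest.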

@joris-de
Author

Thank you for the quick answer.

  1. So the kNN module finds the nearest keys for every query using their coordinates p_k and p_Q? What is done with the values V in the kNN query? (Figure 3 in the paper shows that V is also an input to the kNN query.)
  2. I was confused because in the original Transformer the cross-attention mechanism feeds into the second attention layer of the decoder, but you use the query generator not for the cross-attention but for the inputs of the decoder. Thank you for clarifying! By "global feature of the encoder", do you mean key or value?

@yuxumin
Owner

yuxumin commented Apr 24, 2022

  1. Yes, we use p_k and p_Q to determine the neighborhood of a given token, and use the max-pooling of the Vs in this neighborhood to update that token's feature.
  2. The global feature of the encoder is the max-pooling result over all the output features of the encoder.

@yuxumin
Owner

yuxumin commented May 5, 2022

Closing this since there has been no response. Feel free to re-open it if the problem still exists.

@yuxumin yuxumin closed this as completed May 5, 2022