Some questions regarding the DiscoBox paper #18

Open
tianyufang1958 opened this issue Mar 29, 2023 · 2 comments
Comments

@tianyufang1958

Thanks for the nice work; it runs smoothly with Docker. I have some questions about the paper and hope you can help.

  1. You mention both YOLACT++ and SOLOv2, but it is not clear which one is used in Figure 2. Does this mean DiscoBox can use either of them, as long as it has a mask head?
  2. In Section 3.2, f_i and f_k represent the RoI features of pixels i and k. Could you please clarify what RoI means? Is it the mask area within the bounding box? Also, what are the features: RGB values and spatial information?
  3. For the structured teacher, T_c is the cross-image potential. Does it mean comparing one bounding box with all other bounding boxes in other images that have the same label?
  4. For the self-ensembling loss, you mention that self-consistency between the task and teacher networks is computed, but I am not clear how L_nce is computed. Does it compare the mask features within bounding boxes across the teacher and task networks?
  5. In the structured teacher, the Gibbs energy is defined with unary potentials, pairwise potentials and cross-image pairwise potentials, but the loss function in the learning section does not contain them. So how does learning relate to this energy and minimize it? Apologies, I am not very familiar with standard mean-field inference.
@Chrisding
Collaborator

Chrisding commented Mar 30, 2023

Hi @tianyufang1958

  1. Figure 2 is meant to be a high-level abstraction of both YOLACT++ and SOLOv2. The two methods are similar at a high level (both adopt a two-branch architecture) despite differences in detail.
  2. RoI means region of interest, i.e. the bounding-box region of an object. The features come from the FPN feature maps used in DiscoBox; pixel colors are denoted "I" in the paper instead. Please also refer to the attached image for details.
  3. Yes, it is with other bboxes of the same label. But because there are too many of them and we don't want to recompute features, we store RoI features in a constantly updated memory bank and directly retrieve a subset of them whenever we need to form an intra-class pair.
  4. L_nce uses the dense correspondences to obtain positive and negative pairs of features from two bboxes with the same label, then pulls features in positive pairs closer together and pushes features in negative pairs apart.
  5. Inference and learning are separate. Inference is only responsible for creating the structured teacher by minimizing the Gibbs energy. Minimizing the energy yields the structurally refined masks and dense correspondences, which are then used in learning; learning covers only the teacher-student distillation part.
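The memory-bank idea in answer 3 can be illustrated with a per-class FIFO queue. This is a minimal sketch, not the DiscoBox implementation: it assumes a fixed capacity per class, uniform random sampling, and features represented as plain lists of floats. The class name `RoIMemoryBank` and its parameters are hypothetical.

```python
import random
from collections import defaultdict, deque


class RoIMemoryBank:
    """Hypothetical per-class store of RoI features (illustrative sketch only)."""

    def __init__(self, capacity_per_class=64, seed=0):
        self.capacity = capacity_per_class
        # One bounded queue per class label; deque(maxlen=...) silently drops
        # the oldest entry once the queue is full, so the bank keeps updating.
        self.bank = defaultdict(lambda: deque(maxlen=self.capacity))
        self.rng = random.Random(seed)

    def push(self, label, feature):
        # Store a feature computed in the current forward pass; it is reused
        # later instead of recomputing features for other images.
        self.bank[label].append(feature)

    def sample(self, label, k):
        # Retrieve up to k stored features of the same class to form
        # intra-class pairs with the current RoI.
        pool = list(self.bank[label])
        return self.rng.sample(pool, min(k, len(pool)))
```

For example, after pushing five features for class `"person"` into a bank with `capacity_per_class=3`, only the three most recent remain, and `sample("person", 2)` returns two of them. A bounded queue keeps memory constant while letting stale features age out as the network's features drift during training.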

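Answer 4 describes a contrastive (InfoNCE-style) loss built from dense correspondences. The sketch below shows the standard InfoNCE form under stated assumptions: cosine similarity, a temperature of 0.07, and correspondences given as an index list `matches` where `matches[i]` is the pixel in box B matched to pixel i in box A; every other pixel of box B serves as a negative. These choices and the function name `info_nce` are hypothetical and may differ from the exact L_nce in the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def info_nce(feats_a, feats_b, matches, temperature=0.07):
    """Average InfoNCE loss over the pixels of box A (hypothetical sketch).

    feats_a[i], feats_b[j]: feature vectors of pixels in two boxes of the same class.
    matches[i]: index j in box B corresponding to pixel i (the positive pair);
    all other pixels of box B act as negatives.
    """
    a = [normalize(f) for f in feats_a]
    b = [normalize(f) for f in feats_b]
    total = 0.0
    for i, j_pos in enumerate(matches):
        logits = [dot(a[i], bj) / temperature for bj in b]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # -log softmax of the positive: minimizing it pulls the positive pair
        # together and pushes the negatives apart through the denominator.
        total += log_denominator - logits[j_pos]
    return total / len(matches)
```

With `feats = [[1.0, 0.0], [0.0, 1.0]]`, `info_nce(feats, feats, [0, 1])` is close to zero, while the swapped matching `[1, 0]` gives a large loss, which is the behavior a contrastive objective should have.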
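Answer 5 separates inference (minimizing the Gibbs energy) from learning. As background for "standard mean field", here is a minimal sketch of mean-field inference for an energy with unary potentials plus a Potts pairwise term on a neighbour graph. It is a generic textbook version, not DiscoBox's exact potentials: the cross-image term is omitted, and the function name `mean_field`, the Potts form, and the synchronous update schedule are assumptions.

```python
import math


def mean_field(unary, edges, weight=1.0, iters=10):
    """Approximate P(y) proportional to exp(-E(y)) with a factorized Q (sketch).

    unary[i][l]: unary energy of label l at pixel i, e.g. -log p_i(l) from the network.
    edges: (i, j) neighbour pairs with Potts energy weight * [y_i != y_j].
    Returns Q[i][l], the refined soft labels.
    """
    n, num_labels = len(unary), len(unary[0])
    neighbours = [[] for _ in range(n)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def softmax_neg(energies):
        # Q(l) proportional to exp(-energy(l)), shifted by the min for stability.
        m = min(energies)
        exps = [math.exp(-(e - m)) for e in energies]
        s = sum(exps)
        return [x / s for x in exps]

    q = [softmax_neg(u) for u in unary]  # initialize from the unaries alone
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # Unary energy plus the expected Potts energy under the neighbours'
            # current Q: sum_j weight * P(y_j != l) = weight * (1 - Q_j(l)).
            energies = [
                unary[i][l] + weight * sum(1.0 - q[j][l] for j in neighbours[i])
                for l in range(num_labels)
            ]
            new_q.append(softmax_neg(energies))
        q = new_q  # synchronous (parallel) update of all pixels
    return q
```

On a five-pixel chain where four pixels are confidently foreground and the middle one weakly prefers background, the pairwise term flips the middle pixel to foreground. In the terms of answer 5, the returned Q would play the role of the refined masks: it is computed without any gradient and then used as a fixed target for the task network, so the energy never appears in the training loss itself.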

@tianyufang1958
Author

@Chrisding Could you please also explain what "dense correspondence" means here?
