
Does the code use the same hyper parameters as the paper described? #1

Closed
ssssholmes opened this issue Oct 24, 2021 · 3 comments


ssssholmes commented Oct 24, 2021

Thanks for the great work!

I read the code in the MeanField class, and it seems inconsistent with the paper.

[Screenshot: "Screen Shot 2021-10-24 15 18 19"]

```python
import numpy as np
import torch
import torch.nn as nn


class MeanField(nn.Module):
    # feature_map: RGB map of shape [N, 3, H, W], where N is the number of objects
    # @autocast(enabled=False)
    def __init__(self, feature_map, kernel_size=3, require_grad=False, theta0=0.5,
                 theta1=30, theta2=10, alpha0=3, iter=20, base=0.45, gamma=0.01):
        super(MeanField, self).__init__()
        self.require_grad = require_grad
        self.kernel_size = kernel_size
        with torch.no_grad():
            self.unfold = torch.nn.Unfold(kernel_size, stride=1, padding=kernel_size // 2)
            # Shift real pixels away from the zero-valued border padding.
            feature_map = feature_map + 10
            unfold_feature_map = self.unfold(feature_map).view(
                feature_map.size(0), feature_map.size(1), kernel_size ** 2, -1)
            self.feature_map = feature_map
            self.theta0 = theta0
            self.theta1 = theta1
            self.theta2 = theta2
            self.alpha0 = alpha0
            self.gamma = gamma
            self.base = base
            # Squared spatial distance from the window center for each kernel position.
            self.spatial = torch.tensor(
                (np.arange(kernel_size ** 2) // kernel_size - kernel_size // 2) ** 2 +
                (np.arange(kernel_size ** 2) % kernel_size - kernel_size // 2) ** 2
            ).to(feature_map.device).float()
            # Pairwise kernel: Gaussian in color (theta0) times Gaussian in space (theta1).
            self.kernel = alpha0 * torch.exp(
                (-(unfold_feature_map - feature_map.view(
                    feature_map.size(0), feature_map.size(1), 1, -1)) ** 2).sum(1)
                / (2 * self.theta0 ** 2)
                - self.spatial.view(1, -1, 1) / (2 * self.theta1 ** 2))
            self.kernel = self.kernel.unsqueeze(1)
            self.iter = iter

    # x: input of shape [N, 1, H, W]
    # @autocast(enabled=False)
    def forward(self, x, targets, inter_img_mask=None):
        with torch.no_grad():
            x = x * targets
            x = (x > 0.5).float() * (1 - self.base * 2) + self.base
            U = torch.cat([1 - x, x], 1)
            U = U.view(-1, 1, U.size(2), U.size(3))
            if inter_img_mask is not None:
                # Flatten the inter-image mask to [*, 1, H, W].
                inter_img_mask = inter_img_mask.reshape(
                    -1, 1, inter_img_mask.shape[2], inter_img_mask.shape[3])
            ret = U
            for _ in range(self.iter):
                ret = self.simple_forward(ret, targets, inter_img_mask)
            ret = ret.view(-1, 2, ret.size(2), ret.size(3))
            ret = ret[:, 1:]
            ret = (ret > 0.5).float()
            count = ret.reshape(ret.shape[0], -1).sum(1)
            # A refined mask is valid if it covers between 5% and 95% of the region.
            valid = (count >= ret.shape[2] * ret.shape[3] * 0.05) * \
                    (count <= ret.shape[2] * ret.shape[3] * 0.95)
            valid = valid.float()
        return ret, valid

    # @autocast(enabled=False)
    def simple_forward(self, x, targets, inter_img_mask):
        h, w = x.size(2), x.size(3)
        # Message passing: aggregate -log(q) over each pixel's neighborhood.
        unfold_x = self.unfold(-torch.log(x)).view(
            x.size(0) // 2, 2, self.kernel_size ** 2, -1)
        aggre = (unfold_x * self.kernel).sum(2)
        aggre = aggre.view(-1, 1, h, w)
        f = torch.exp(-aggre)
        f = f.view(-1, 2, h, w)
        if inter_img_mask is not None:
            f += inter_img_mask * self.gamma
        f[:, 1:] *= targets  # restrict foreground to the box region
        f = f + 1e-6
        f = f / f.sum(1, keepdim=True)  # normalize over {background, foreground}
        f = (f > 0.5).float() * (1 - self.base * 2) + self.base
        f = f.view(-1, 1, h, w)
        return f
```
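For reference, here is a minimal sketch of how this module might be exercised in isolation. The tensor shapes, random values, and box region below are illustrative assumptions, not values taken from the DiscoBox training pipeline:

```python
import torch

# Hypothetical inputs: 4 objects, RGB values in [0, 1], at 32 x 32 resolution.
feature_map = torch.rand(4, 3, 32, 32)   # per-object color map
coarse_mask = torch.rand(4, 1, 32, 32)   # coarse foreground probabilities
box_mask = torch.zeros(4, 1, 32, 32)     # binary box mask ("targets")
box_mask[:, :, 4:28, 4:28] = 1           # assume every box covers this region

mf = MeanField(feature_map)              # the pairwise kernel is precomputed here
refined, valid = mf(coarse_mask, box_mask)
print(refined.shape, valid)              # torch.Size([4, 1, 32, 32]) and 0/1 flags
```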

ssssholmes changed the title from "Does the code use the same parameter as the paper described?" to "Does the code use the same hyper parameters as the paper described?" on Oct 24, 2021
@ssssholmes (Author)

I have a few questions:

  1. Why is 10 added to feature_map in the codebase (line 760)?
  2. targets in the forward function seems to be generated from gt_masks. Does this mean that DiscoBox still needs gt_masks as a supervision signal?
  3. x in the forward function does not seem to be an RoI feature cropped from the feature map.

@voidrank (Contributor)

Hi @ssssholmes,

Thank you for your interest in our work. We are happy to answer your questions:

  1. +10 is only a trick to better isolate padded pixels from the real pixels inside the image. It does not change the mean field inference on real pixels, since adding a constant does not change the difference values in the kernel computation. We apply the pairwise potential in a convolution-like manner, so we need to pad some pixels outside the image border; the RGB values of those padded pixels default to 0. Adding 10 to each real pixel enlarges its difference from the padded pixels, so the padded pixels have less influence on the real pixels. A small numeric sketch of this effect follows after this list.

  2. gt_masks actually refers to treating the boxes as foreground masks. It's a naming problem and we will change it; sorry for the confusion. To be more specific, if the output mask size is H x W, then gt_masks is defined as a binary H x W matrix G, where G_{i,j} = 1 if pixel (i, j) is inside the bounding box of the target object and G_{i,j} = 0 otherwise. A minimal construction is sketched after this list.

  3. You are right. In the paper we consider both YOLACT and SOLOv2, and there are subtle differences between their codebases. The description in the paper mainly follows YOLACT, which takes in cropped RoI features. For SOLOv2, we mostly follow the original implementation and take in the whole feature map, to keep the codebase clean and close to the original. However, it should be mentioned that we do use the bounding boxes as a mask to restrict the losses, the mean field inference, and the correspondence computation to the inside of the boxes, so everything still effectively happens at the RoI level.
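To make point 1 concrete, here is a small numeric sketch (the pixel values are made up for illustration): the shift cancels in the pairwise difference between real pixels, but blows up the distance to the zero-valued padding, so the color Gaussian assigns padded neighbors a near-zero weight.

```python
import torch

real_a, real_b, pad = torch.tensor(0.2), torch.tensor(0.7), torch.tensor(0.0)
theta0 = 0.5

# The pairwise difference between two real pixels is unchanged by the shift:
print((real_a - real_b) ** 2)                 # tensor(0.2500)
print(((real_a + 10) - (real_b + 10)) ** 2)   # tensor(0.2500)

# But the kernel weight between a real pixel and a padded (zero) pixel collapses:
print(torch.exp(-(real_a - pad) ** 2 / (2 * theta0 ** 2)))         # ~0.92
print(torch.exp(-((real_a + 10) - pad) ** 2 / (2 * theta0 ** 2)))  # ~0.0
```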
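And for point 2, a minimal sketch of how such a box-derived gt_masks can be constructed (the box coordinates here are hypothetical):

```python
import torch

H, W = 28, 28
x1, y1, x2, y2 = 5, 7, 20, 24   # hypothetical bounding box of the target object

# G[i, j] = 1 inside the box, 0 outside; the box itself is the only
# "mask" supervision -- no pixel-level annotation is involved.
G = torch.zeros(H, W)
G[y1:y2, x1:x2] = 1
```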

@ssssholmes (Author)

@voidrank Got it! Thank you!
