
Does the code use the same hyper parameters as the paper described? #1

Closed
ssssholmes opened this issue Oct 24, 2021 · 3 comments


ssssholmes commented Oct 24, 2021

Thanks for the great work!

I read the code in the MeanField class, and it seems inconsistent with the paper.

[Screenshot: "Screen Shot 2021-10-24 15 18 19"]

```python
import numpy as np
import torch
import torch.nn as nn


class MeanField(nn.Module):
    # feature_map: RGB map of shape [N, 3, H, W], where N is the number of objects
    # @autocast(enabled=False)
    def __init__(self, feature_map, kernel_size=3, require_grad=False, theta0=0.5,
                 theta1=30, theta2=10, alpha0=3, iter=20, base=0.45, gamma=0.01):
        super(MeanField, self).__init__()
        self.require_grad = require_grad
        self.kernel_size = kernel_size
        with torch.no_grad():
            self.unfold = torch.nn.Unfold(kernel_size, stride=1, padding=kernel_size // 2)
            # Shift real pixels away from the zero-valued border padding.
            feature_map = feature_map + 10
            unfold_feature_map = self.unfold(feature_map).view(
                feature_map.size(0), feature_map.size(1), kernel_size ** 2, -1)
            self.feature_map = feature_map
            self.theta0 = theta0
            self.theta1 = theta1
            self.theta2 = theta2
            self.alpha0 = alpha0
            self.gamma = gamma
            self.base = base
            # Squared spatial distance from the window center for each kernel position.
            self.spatial = torch.tensor(
                (np.arange(kernel_size ** 2) // kernel_size - kernel_size // 2) ** 2 +
                (np.arange(kernel_size ** 2) % kernel_size - kernel_size // 2) ** 2
            ).to(feature_map.device).float()
            # Pairwise kernel: Gaussian in color (theta0) times Gaussian in space (theta1).
            self.kernel = alpha0 * torch.exp(
                (-(unfold_feature_map - feature_map.view(
                    feature_map.size(0), feature_map.size(1), 1, -1)) ** 2).sum(1)
                / (2 * self.theta0 ** 2)
                - self.spatial.view(1, -1, 1) / (2 * self.theta1 ** 2))
            self.kernel = self.kernel.unsqueeze(1)
            self.iter = iter

    # x: input of shape [N, 1, H, W]
    # @autocast(enabled=False)
    def forward(self, x, targets, inter_img_mask=None):
        with torch.no_grad():
            x = x * targets
            x = (x > 0.5).float() * (1 - self.base * 2) + self.base
            U = torch.cat([1 - x, x], 1)
            U = U.view(-1, 1, U.size(2), U.size(3))
            if inter_img_mask is not None:
                # Flatten the inter-image mask to [*, 1, H, W].
                inter_img_mask = inter_img_mask.reshape(
                    -1, 1, inter_img_mask.shape[2], inter_img_mask.shape[3])
            ret = U
            for _ in range(self.iter):
                ret = self.simple_forward(ret, targets, inter_img_mask)
            ret = ret.view(-1, 2, ret.size(2), ret.size(3))
            ret = ret[:, 1:]
            ret = (ret > 0.5).float()
            count = ret.reshape(ret.shape[0], -1).sum(1)
            # A refined mask is valid if it covers between 5% and 95% of the region.
            valid = (count >= ret.shape[2] * ret.shape[3] * 0.05) * \
                    (count <= ret.shape[2] * ret.shape[3] * 0.95)
            valid = valid.float()
        return ret, valid

    # @autocast(enabled=False)
    def simple_forward(self, x, targets, inter_img_mask):
        h, w = x.size(2), x.size(3)
        # Message passing: aggregate -log(q) over each pixel's neighborhood.
        unfold_x = self.unfold(-torch.log(x)).view(
            x.size(0) // 2, 2, self.kernel_size ** 2, -1)
        aggre = (unfold_x * self.kernel).sum(2)
        aggre = aggre.view(-1, 1, h, w)
        f = torch.exp(-aggre)
        f = f.view(-1, 2, h, w)
        if inter_img_mask is not None:
            f += inter_img_mask * self.gamma
        f[:, 1:] *= targets  # restrict foreground to the box region
        f = f + 1e-6
        f = f / f.sum(1, keepdim=True)  # normalize over {background, foreground}
        f = (f > 0.5).float() * (1 - self.base * 2) + self.base
        f = f.view(-1, 1, h, w)
        return f
```
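For reference, here is a minimal sketch of how this module might be exercised in isolation. The tensor shapes, random values, and box region below are illustrative assumptions, not values taken from the DiscoBox training pipeline:

```python
import torch

# Hypothetical inputs: 4 objects, RGB values in [0, 1], at 32 x 32 resolution.
feature_map = torch.rand(4, 3, 32, 32)   # per-object color map
coarse_mask = torch.rand(4, 1, 32, 32)   # coarse foreground probabilities
box_mask = torch.zeros(4, 1, 32, 32)     # binary box mask ("targets")
box_mask[:, :, 4:28, 4:28] = 1           # assume every box covers this region

mf = MeanField(feature_map)              # the pairwise kernel is precomputed here
refined, valid = mf(coarse_mask, box_mask)
print(refined.shape, valid)              # torch.Size([4, 1, 32, 32]) and 0/1 flags
```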

ssssholmes changed the title from "Does the code use the same parameter as the paper described?" to "Does the code use the same hyper parameters as the paper described?" on Oct 24, 2021
@ssssholmes (Author)

I have a few questions:

  1. Why is 10 added to feature_map in the codebase (line 760)?
  2. targets in the forward function seems to be generated from gt_masks. Does this mean that DiscoBox still needs gt_masks as a supervision signal?
  3. x in the forward function does not seem to be an RoI feature cropped from the feature map.

@voidrank (Contributor)

Hi @ssssholmes,

Thank you for your interest in our work. We are happy to answer your questions:

  1. +10 is only a trick to better isolate padded pixels from the real pixels inside the image. It does not change the mean field inference on real pixels, since adding a constant does not change the difference values in the kernel computation. We apply the pairwise potential in a convolution-like manner, so we need to pad some pixels outside the image border; the RGB values of those padded pixels default to 0. Adding 10 to each real pixel enlarges its difference from the padded pixels, so the padded pixels have less influence on the real pixels. A small numeric sketch of this effect follows after this list.

  2. gt_masks actually refers to treating the boxes as foreground masks. It's a naming problem and we will change it; sorry for the confusion. To be more specific, if the output mask size is H x W, then gt_masks is defined as a binary H x W matrix G, where G_{i,j} = 1 if pixel (i, j) is inside the bounding box of the target object and G_{i,j} = 0 otherwise. A minimal construction is sketched after this list.

  3. You are right. In the paper we consider both YOLACT and SOLOv2, and there are subtle differences between their codebases. The description in the paper mainly follows YOLACT, which takes in cropped RoI features. For SOLOv2, we mostly follow the original implementation and take in the whole feature map, to keep the codebase clean and close to the original. However, it should be mentioned that we do use the bounding boxes as a mask to restrict the losses, the mean field inference, and the correspondence computation to the inside of the boxes, so everything still effectively happens at the RoI level.
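To make point 1 concrete, here is a small numeric sketch (the pixel values are made up for illustration): the shift cancels in the pairwise difference between real pixels, but blows up the distance to the zero-valued padding, so the color Gaussian assigns padded neighbors a near-zero weight.

```python
import torch

real_a, real_b, pad = torch.tensor(0.2), torch.tensor(0.7), torch.tensor(0.0)
theta0 = 0.5

# The pairwise difference between two real pixels is unchanged by the shift:
print((real_a - real_b) ** 2)                 # tensor(0.2500)
print(((real_a + 10) - (real_b + 10)) ** 2)   # tensor(0.2500)

# But the kernel weight between a real pixel and a padded (zero) pixel collapses:
print(torch.exp(-(real_a - pad) ** 2 / (2 * theta0 ** 2)))         # ~0.92
print(torch.exp(-((real_a + 10) - pad) ** 2 / (2 * theta0 ** 2)))  # ~0.0
```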
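And for point 2, a minimal sketch of how such a box-derived gt_masks can be constructed (the box coordinates here are hypothetical):

```python
import torch

H, W = 28, 28
x1, y1, x2, y2 = 5, 7, 20, 24   # hypothetical bounding box of the target object

# G[i, j] = 1 inside the box, 0 outside; the box itself is the only
# "mask" supervision -- no pixel-level annotation is involved.
G = torch.zeros(H, W)
G[y1:y2, x1:x2] = 1
```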

@ssssholmes (Author)

@voidrank Got it! Thank you!
