CN117036711A - Weakly supervised semantic segmentation method based on attention adjustment
- Publication number: CN117036711A
- Application number: CN202311064941.7A
- Authority: CN (China)
- Prior art keywords: attention, block, class, semantic segmentation, model
- Prior art date: 2023-08-23
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/26—Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06N3/042—Knowledge-based neural networks; logical representations of neural networks
- G06N3/0455—Auto-encoder networks; encoder-decoder networks
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The invention discloses a weakly supervised semantic segmentation method based on attention adjustment, which explores the application of the Transformer to the weakly supervised semantic segmentation task. Transformer-based methods optimize the class activation map using attention, but the optimized class activation map still suffers from incomplete activation because part of the class-to-block attention is erroneous. To solve this problem, the invention provides a novel weakly supervised semantic segmentation framework in which an attention adjustment strategy is designed: the class-to-block attention is adjusted according to the block-to-block attention, and the adjusted attention can activate more of the target region. Compared with the latest methods, the method provided by the invention achieves the best results on the PASCAL VOC 2012 and MS COCO 2014 datasets.
Description
Technical Field
The invention belongs to the technical field of image segmentation, and particularly relates to a weakly supervised semantic segmentation method based on attention adjustment.
Background
Semantic segmentation is one of the fundamental and challenging tasks in computer vision; its goal is to classify each pixel in an image and assign it to a specific semantic class. Semantic segmentation is widely applied in many fields, such as image recognition, autonomous driving, medical image analysis, scene understanding and video analysis, and helps computers better understand the content of images, thereby enabling automated scene understanding and decision making. In recent years, thanks to the vigorous development of deep learning, semantic segmentation has made remarkable progress; in particular, fully supervised semantic segmentation models are widely applied and perform excellently. However, training a fully supervised semantic segmentation model usually requires large-scale pixel-level annotation, and obtaining pixel-level annotation is difficult, time-consuming and labor-intensive. To address this problem, much work has turned to weakly supervised semantic segmentation, in which the segmentation network is trained with weak labels such as bounding-box labels, point labels, scribble labels or image-level labels. Image-level labels are the most convenient to acquire and are therefore the most widely studied in weakly supervised semantic segmentation.
Although image-level annotations are very convenient to acquire, they do not provide sufficient localization supervision: they only state which object classes an image contains, without indicating where those objects are located. The development of class activation maps (CAMs) provides an efficient way to obtain location information using only image-level labels. For weakly supervised semantic segmentation with image-level labels, most existing approaches follow this procedure: 1) train a convolutional neural network (CNN) with image-level labels and generate class activation maps from it to obtain seed regions; 2) expand the seed regions under certain constraints to obtain pseudo-labels; 3) train a fully supervised semantic segmentation network using the pseudo-labels as ground truth. However, the class activation maps generated by convolutional neural networks tend to activate only local, discriminative regions while ignoring the full object extent, leading to incomplete activation. Research has shown that this is caused by an inherent characteristic of convolutional neural networks: the convolution operation can only capture short-range feature dependencies and cannot explore global feature relationships. The activated object region is therefore too small, which degrades the quality of the generated pseudo-labels and ultimately makes it difficult to obtain an ideal weakly supervised semantic segmentation result.
At present, transform has enjoyed tremendous success in many computer vision tasks, mainly due to its own attention mechanisms. The transducer's attention mechanism can model global feature relationships and overcome the above-described drawbacks of convolutional neural networks. Some researchers have started weak-supervision semantic segmentation studies using transformers, which typically use a Transformer structure to extract image features and generate class activation graphs, and then use attention to optimize the class activation graphs to obtain a more complete class activation graph. Although the existing weak supervision semantic segmentation method based on the Transformer generally uses attention to optimize the class activation graph, the class activation graph still cannot completely activate the object region after being subjected to attention optimization due to errors between the attention middle classification generated by the Transformer and the attention between the blocks.
Disclosure of Invention
The invention aims to solve the problem that the target region cannot be completely activated in weakly supervised semantic segmentation, and provides a weakly supervised semantic segmentation method based on attention adjustment.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme: a weakly supervised semantic segmentation method based on attention adjustment, comprising the following steps:

step 1, data preparation: acquire an annotated image dataset and divide it into a training set, a validation set and a test set;

step 2, data preprocessing: apply random horizontal flipping and color jittering to the image, normalize it, randomly crop it, and take the cropped image as the input of the weakly supervised semantic segmentation model;

step 3, model construction: build the weakly supervised semantic segmentation model with DeiT-S pre-trained on ImageNet as its backbone;

step 4, model training: optimize the weakly supervised semantic segmentation model with the Adam optimizer and multi-label cross entropy as the loss function, train the model on the training set for a set number of epochs, and generate class activation maps with the trained model;

step 5, assign a class to each pixel position according to the values of the class activation map to generate pixel-level pseudo-labels, and then train the semantic segmentation network DeepLabV2 with the pixel-level pseudo-labels; input the pictures of the validation set and the test set into the trained model to obtain the final segmentation maps.
Further, the model construction in step 3 comprises:

step 3.1, constructing a weakly supervised semantic segmentation framework based on attention fusion: segment the preprocessed image into N non-overlapping blocks, construct N block tokens through linear mapping, and concatenate C class tokens with the N block tokens to obtain the input tokens of the framework;

step 3.2, feeding the input tokens into the Transformer encoding layers of the framework to obtain output tokens; then extracting the last N block tokens from the output tokens to form the output block tokens Tp_out, and applying reshape and convolution operations to Tp_out to obtain the initial class activation map Original-CAM;

step 3.3, when the input tokens pass through a Transformer encoding layer, the attention module computes the attention of the input tokens to generate Attention, calculated as

Attention = softmax(QK^T / √d_k)

where Q and K respectively denote the Query matrix and the Key matrix obtained by linear projection of the input tokens in the Transformer encoding layer, T denotes matrix transposition, and d_k denotes the scaling factor;

step 3.4, dividing Attention further into class-to-block attention A_c2p and block-to-block attention A_p2p, and then adjusting the class-to-block attention A_c2p according to the block-to-block attention A_p2p;

step 3.5, optimizing the initial class activation map using the class-to-block attention A_c2p and the block-to-block attention A_p2p.
Further, the class-to-block attention A_c2p and the block-to-block attention A_p2p are expressed as follows:

A_c2p = Attention[1:C, C+1:C+N]

A_p2p = Attention[C+1:C+N, C+1:C+N]

The attention between class c and block i is adjusted as follows:

First, according to the block-to-block attention with block i, sort all blocks in descending order of attention value and select the top p% of the sorted blocks.

Then, the attention between class c and the selected blocks is taken out and averaged to obtain the attention adjustment factor between class c and block i:

r(c,i) = (1/S) · Σ_{j∈U} A_c2p(c,j)

where r(c,i) denotes the attention adjustment factor between class c and block i in A_c2p; c ∈ {1,2,…,C}, where C is the total number of classes in the dataset; i and j denote blocks, i ∈ {1,2,…,N}, j ∈ U, where U is the set of the top p% of blocks with the greatest attention to block i and S is the number of blocks in U; A_c2p(c,j) denotes the attention between class c and block j.

The attention adjustment factor r(c,i) is then added to the attention between class c and block i:

A_c2p(c,i) = A_c2p(c,i) + α·r(c,i)

where A_c2p(c,i) denotes the attention between class c and block i, and α denotes the attention adjustment coefficient.
Further, using the class-to-block attention A_c2p and the block-to-block attention A_p2p to optimize the initial class activation map in step 3.5 comprises:

multiplying the initial class activation map Original-CAM by the class-to-block attention to obtain a preliminarily optimized adjusted class activation map;

then further optimizing by matrix multiplication between the block-to-block attention and the adjusted class activation map to obtain the final class activation map.
Further, the model training process in step 4 is as follows:

step 4.1, set the hyperparameters of the weakly supervised semantic segmentation model: the number of training epochs Epoch, the initial learning rate and the training batch size batch_size; the optimizer used in training is the Adam optimizer and the loss function is the multi-label cross entropy loss;

step 4.2, train the weakly supervised semantic segmentation model for multiple rounds, and save the parameters corresponding to the round with the highest training mIoU value;

step 4.3, after the weakly supervised semantic segmentation model is trained, load the saved best parameters into the model, input the training set data into the model, and the trained model generates complete class activation maps.
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:

The invention mainly solves the problem of incomplete activation of the class activation map in weakly supervised semantic segmentation. Taking the Transformer as the basic network structure, a simple and effective weakly supervised semantic segmentation framework is provided. In this framework, an attention adjustment strategy is first designed: the class-to-block attention is adjusted according to the block-to-block attention, which effectively reduces the probability of erroneous associations between classes and blocks. The class activation map is then optimized with the adjusted attention, so that the target region in the resulting class activation map is activated more completely and accurately, better resolving the incomplete activation problem.
Drawings
FIG. 1 is a diagram of the overall weakly supervised semantic segmentation framework based on attention fusion.

FIG. 2 shows example segmentation results on the PASCAL VOC 2012 validation set.

FIG. 3 shows example segmentation results on the MS COCO 2014 validation set.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
The invention discloses a weakly supervised semantic segmentation method based on attention adjustment, which provides a novel Transformer-based framework for the weakly supervised semantic segmentation task under image-level annotation. The overall structure of the framework is shown in FIG. 1 and mainly comprises three parts: 1) extracting features with a Transformer and generating the initial class activation map; 2) an attention adjustment module, which adjusts the class-to-block attention according to the block-to-block attention, effectively improving the accuracy of the class-to-block attention; 3) optimizing the class activation map with the attention to obtain a more complete and accurate class activation map. The method comprises the following steps:
step 1: data preparation.
The invention uses the PASCAL VOC 2012 dataset and the MS COCO 2014 dataset. The PASCAL VOC 2012 dataset has 21 categories, comprising 20 object classes and one background class; the MS COCO 2014 dataset has 81 categories, comprising 80 object classes and one background class. The PASCAL VOC 2012 dataset is divided into three parts: a training set (1464 images), a validation set (1449 images) and a test set (1456 images), where the training set is typically augmented with additional data to 10582 images. The MS COCO 2014 dataset is divided into two parts: a training set (82081 images) and a validation set (40137 images).
Step 2: and (5) preprocessing data.
And carrying out random horizontal overturn and color dithering treatment on the image, and setting the brightness, contrast and saturation values of the image to 0.3. The image was normalized using transform.normal to be 256×256 in size, and then randomly cropped using transform.random crop to be 224×224 in size. The cropped image is input into the model.
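By way of illustration only, a minimal PyTorch-style sketch of this preprocessing pipeline follows. The garbled transform names above are read as torchvision's Normalize and RandomCrop; the ImageNet normalization statistics and the transform ordering are assumptions not fixed by the text.

```python
import torchvision.transforms as T

# A sketch of the preprocessing described in step 2, assuming torchvision.
preprocess = T.Compose([
    T.Resize((256, 256)),                                         # resize before random cropping
    T.RandomHorizontalFlip(),                                     # random horizontal flip
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # jitter values per the text
    T.RandomCrop((224, 224)),                                     # random crop to model input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],                       # assumed ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```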
Step 3: and (5) building a model.
Step 3.1: and (3) constructing a weak supervision semantic segmentation framework based on attention fusion, segmenting the image preprocessed in the step (2) into N non-overlapping blocks, constructing N block tokens through linear mapping, and splicing C class tokens and N block tokens to obtain an input token of the framework.
Step 3.2: the input token is input to a Transfomer encoding layer in the framework to obtain an output token. The last N block tokens are then extracted from the output tokens to form an output block token Tp_out, which is subjected to a reorganization (Reshape) and convolution (Conv) operation to obtain an initial class activation map Original-CAM.
Original-CAM=Conv(Reshape(Tp_out))
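A minimal sketch of this step is given below; the convolution kernel size and the channel counts (384 for the DeiT-S token dimension, 20 for the PASCAL VOC object classes, N = 14×14 for a 224×224 input with 16×16 patches) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def initial_cam(tp_out: torch.Tensor, h: int, w: int, conv: nn.Conv2d) -> torch.Tensor:
    """tp_out: [B, N, D] output block tokens, with N = h * w blocks."""
    b, n, d = tp_out.shape
    fmap = tp_out.transpose(1, 2).reshape(b, d, h, w)  # Reshape: token sequence -> feature map
    return conv(fmap)                                  # Conv: D channels -> C class activation maps

conv_head = nn.Conv2d(384, 20, kernel_size=3, padding=1)         # kernel size is an assumption
cam = initial_cam(torch.randn(2, 196, 384), 14, 14, conv_head)   # -> [2, 20, 14, 14]
```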
Step 3.3: when the input token passes through the Transfomer coding layer, the Attention module calculates the Attention of the input token to generate Attention, the shape is [ C+N, C+N ], and the calculation formula is as follows:
wherein Q, K represents a matrix array and a Key matrix obtained by linear projection of an input token when the input token passes through a transducer coding layer, T represents matrix transposition, and d k Representing the scaling factor.
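The following is a minimal single-head sketch of this computation; multi-head handling and batching are omitted.

```python
import torch
import torch.nn.functional as F

def attention_map(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: [C+N, d_k] Query/Key projections of the input tokens.
    Returns the [C+N, C+N] attention map of the encoding layer."""
    d_k = q.shape[-1]
    return F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
```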
Step 3.4: attention can be further divided into class-to-block Attention A c2p Sum block-to-block attention a p2p Wherein A is c2p =Attention[1:C,C+1:C+N],A p2p =Attention[C+1:C+N,C+1:C+N]. Then pass the attention A from block to block p2p To pay attention between classes and blocksForce A c2p And adjusting. If the attention between the class c and the block i is to be adjusted, firstly, sorting the blocks according to the order of the attention values from big to small according to the attention between the blocks, then selecting some blocks which are ranked 30% before sorting, and then calculating the attention between the blocks to obtain the attention adjustment factor between the class c and the block i:
wherein r (c, i) represents A c2p Attention regulator between class C and block i, C e {1,2, …, C } represents the total number of data set classes, i, j represents a block, i e {1,2, …, N }, j e U, U represents a set of blocks of greater attention to block i, S represents the number of blocks in U. Attention adjustment factor r (c, i) is then added to the attention between class c and block i to adjust:
A c2p (c,i)=A c2p (c,i)+α*r(c,i)
wherein A is c2p (c, i) represents the attention between class c and block i, and α represents the attention regulator coefficient.
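The attention adjustment strategy of step 3.4 can be sketched as follows. The vectorized form, the default α = 1.0 and the averaging form of r(c,i) reconstructed above are assumptions, not the exact implementation.

```python
import torch

def adjust_class_to_block_attention(a_c2p: torch.Tensor,
                                    a_p2p: torch.Tensor,
                                    p: float = 0.30,
                                    alpha: float = 1.0) -> torch.Tensor:
    """a_c2p: [C, N] class-to-block attention; a_p2p: [N, N] block-to-block attention."""
    n = a_p2p.shape[-1]
    s = max(1, int(n * p))                # S: size of the selected set U
    top = a_p2p.topk(s, dim=-1).indices   # [N, S]: for each block i, the top p% blocks (set U)
    r = a_c2p[:, top].mean(dim=-1)        # r(c, i): mean of A_c2p(c, j) over j in U(i) -> [C, N]
    return a_c2p + alpha * r              # A_c2p(c, i) + alpha * r(c, i)
```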
Step 3.5: using class-to-block attention A c2p Sum block-to-block attention a p2p To optimize the initial class activation map. The initial class activation diagram initial-CAM is multiplied by class-to-block attention to obtain a preliminary optimized adjustment class activation diagram, and then the adjustment class activation diagram is further optimized by matrix multiplication between the block-to-block attention and the adjustment class activation diagram to obtain a final class activation diagram.
Step 4: and (5) model training.
Step 4.1: setting relevant super parameters of a weak supervision semantic segmentation model, setting the model training frequency Epoch to 60, setting the model training batch batch_size to 64, setting an optimizer used during training to be an Adam optimizer, wherein the loss function is multi-label cross entropy loss, and setting the initial learning rate to 5e-4.
Step 4.2: and carrying out multi-round training on the weak supervision semantic segmentation model, and storing parameters corresponding to the best round of training result (the highest training mIoU value) by observing the training result.
Step 4.3: after the weak supervision semantic segmentation model is trained, the stored best parameters are loaded into the model, then training set data are input into the model, and the trained model can generate a complete class activation diagram.
Step 5: and (3) assigning a class to each pixel position according to the value of the class activation graph to generate a pixel-level pseudo tag, and then training the existing semantic segmentation network deep V2 by using the pixel-level pseudo tag. The pictures in the verification set and the test set are input into the trained model to obtain a final segmentation map, as shown in fig. 2 and 3, the second column is a real segmentation map, the third column is a prediction segmentation map of the invention, and the model prediction segmentation map of the invention is found to be very close to the real segmentation map.
Claims (5)
1. A weakly supervised semantic segmentation method based on attention adjustment, characterized by comprising the following steps:

step 1, data preparation: acquiring an annotated image dataset and dividing it into a training set, a validation set and a test set;

step 2, data preprocessing: applying random horizontal flipping and color jittering to the image, normalizing it, randomly cropping it, and taking the cropped image as the input of the weakly supervised semantic segmentation model;

step 3, model construction: building the weakly supervised semantic segmentation model with DeiT-S pre-trained on ImageNet as its backbone;

step 4, model training: optimizing the weakly supervised semantic segmentation model with the Adam optimizer and multi-label cross entropy as the loss function, training the model on the training set for a set number of epochs, and generating class activation maps with the trained model;

step 5, assigning a class to each pixel position according to the values of the class activation map to generate pixel-level pseudo-labels, and then training the semantic segmentation network DeepLabV2 with the pixel-level pseudo-labels; inputting the pictures of the validation set and the test set into the trained model to obtain the final segmentation maps.
2. The weakly supervised semantic segmentation method based on attention adjustment according to claim 1, wherein the model construction in step 3 comprises:

step 3.1, constructing a weakly supervised semantic segmentation framework based on attention fusion: segmenting the preprocessed image into N non-overlapping blocks, constructing N block tokens through linear mapping, and concatenating C class tokens with the N block tokens to obtain the input tokens of the framework;

step 3.2, feeding the input tokens into the Transformer encoding layers of the framework to obtain output tokens; then extracting the last N block tokens from the output tokens to form the output block tokens Tp_out, and applying reshape and convolution operations to Tp_out to obtain the initial class activation map Original-CAM;

step 3.3, when the input tokens pass through a Transformer encoding layer, the attention module computes the attention of the input tokens to generate Attention, calculated as

Attention = softmax(QK^T / √d_k)

where Q and K respectively denote the Query matrix and the Key matrix obtained by linear projection of the input tokens in the Transformer encoding layer, T denotes matrix transposition, and d_k denotes the scaling factor;

step 3.4, dividing Attention further into class-to-block attention A_c2p and block-to-block attention A_p2p, and then adjusting the class-to-block attention A_c2p according to the block-to-block attention A_p2p;

step 3.5, optimizing the initial class activation map using the class-to-block attention A_c2p and the block-to-block attention A_p2p.
3. The weakly supervised semantic segmentation method based on attention adjustment according to claim 2, wherein the class-to-block attention A_c2p and the block-to-block attention A_p2p are expressed as follows:

A_c2p = Attention[1:C, C+1:C+N]

A_p2p = Attention[C+1:C+N, C+1:C+N]

the attention between class c and block i is adjusted as follows:

first, according to the block-to-block attention with block i, sorting all blocks in descending order of attention value and selecting the top p% of the sorted blocks;

then, taking out the attention between class c and the selected blocks and averaging it to obtain the attention adjustment factor between class c and block i:

r(c,i) = (1/S) · Σ_{j∈U} A_c2p(c,j)

where r(c,i) denotes the attention adjustment factor between class c and block i in A_c2p; c ∈ {1,2,…,C}, where C is the total number of classes in the dataset; i and j denote blocks, i ∈ {1,2,…,N}, j ∈ U, where U is the set of the top p% of blocks with the greatest attention to block i and S is the number of blocks in U; A_c2p(c,j) denotes the attention between class c and block j;

the attention adjustment factor r(c,i) is then added to the attention between class c and block i:

A_c2p(c,i) = A_c2p(c,i) + α·r(c,i)

where A_c2p(c,i) denotes the attention between class c and block i, and α denotes the attention adjustment coefficient.
4. The weakly supervised semantic segmentation method based on attention adjustment according to claim 2, wherein optimizing the initial class activation map using the class-to-block attention A_c2p and the block-to-block attention A_p2p in step 3.5 comprises:

multiplying the initial class activation map Original-CAM by the class-to-block attention to obtain a preliminarily optimized adjusted class activation map;

then further optimizing by matrix multiplication between the block-to-block attention and the adjusted class activation map to obtain the final class activation map.
5. The weakly supervised semantic segmentation method based on attention adjustment according to any one of claims 1 to 4, wherein the model training process in step 4 is as follows:

step 4.1, setting the hyperparameters of the weakly supervised semantic segmentation model: the number of training epochs Epoch, the initial learning rate and the training batch size batch_size; the optimizer used in training is the Adam optimizer and the loss function is the multi-label cross entropy loss;

step 4.2, training the weakly supervised semantic segmentation model for multiple rounds and saving the parameters corresponding to the round with the highest training mIoU value;

step 4.3, after the weakly supervised semantic segmentation model is trained, loading the saved best parameters into the model, inputting the training set data into the model, and generating complete class activation maps with the trained model.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311064941.7A (CN117036711A) | 2023-08-23 | 2023-08-23 | Weakly supervised semantic segmentation method based on attention adjustment
Publications (1)

Publication Number | Publication Date
---|---
CN117036711A | 2023-11-10
Family
ID=88641034
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202311064941.7A (CN117036711A, pending) | Weakly supervised semantic segmentation method based on attention adjustment | 2023-08-23 | 2023-08-23
Country Status (1)

Country | Link
---|---
CN | CN117036711A (en)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117593517A (en) * | 2024-01-19 | 2024-02-23 | 南京信息工程大学 | Camouflage target detection method based on complementary perception cross-view fusion network |
CN117593517B (en) * | 2024-01-19 | 2024-04-16 | 南京信息工程大学 | Camouflage target detection method based on complementary perception cross-view fusion network |
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination