CN113066074A - Visual saliency prediction method based on binocular parallax offset fusion - Google Patents
Visual saliency prediction method based on binocular parallax offset fusion
- Publication number
- CN113066074A CN113066074A CN202110385471.9A CN202110385471A CN113066074A CN 113066074 A CN113066074 A CN 113066074A CN 202110385471 A CN202110385471 A CN 202110385471A CN 113066074 A CN113066074 A CN 113066074A
- Authority
- CN
- China
- Prior art keywords
- layer
- module
- convolution
- input
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20228—Disparity calculation for image-based rendering
Abstract
The invention discloses a visual saliency prediction method based on binocular parallax offset fusion, and relates to the field of deep learning. In the training stage, a convolutional neural network is constructed that comprises a feature extraction layer and an upsampling layer. The feature extraction layer comprises two twin networks whose framework follows the ResNet34 architecture, each consisting of a feature extraction part with 5 convolution blocks; the upsampling layer comprises 4 parts: a GCM module, a CAM module, a feature cascade transfer module and an SPSM module. The NCTU binocular image data set is input into the convolutional neural network for training to obtain a single-channel saliency target prediction map; the loss function value between the prediction map of each training-set image and the real saliency target label map is then calculated to obtain the optimal weight vector and bias term of the convolutional neural network training model. The method improves the efficiency and accuracy of saliency target prediction.
Description
Technical Field
The invention relates to the field of deep learning, in particular to a binocular parallax offset fusion-based visual saliency prediction method.
Background
With the massive amount of data brought by the development of the internet, rapidly acquiring key information from huge volumes of image and video data has become a key problem in the field of computer vision. Visual saliency detection has important application value in this respect, for example in object identification, 3D display, visual comfort evaluation and 3D visual quality measurement. When facing a natural scene, the human visual system can quickly search for and locate objects of interest, focusing on the most prominent areas of the image while ignoring other areas. This visual attention mechanism is extremely important for processing visual image information in daily life.
Deep-learning-based visual saliency prediction can predict saliency regions end-to-end at the pixel level: the images and labels of a training set are input into a model framework for training to obtain the weights and the model, the model is then evaluated on a test set to verify its quality, the best prediction model is obtained through continuous tuning, and finally this model is used to predict images from the real world and obtain their visual saliency prediction results. The core of such a prediction method is a binocular parallax offset fusion visual saliency prediction network built from a convolutional neural network, whose multilayer structure can automatically learn features at multiple levels. Visual attention models mainly fall into two categories: bottom-up and top-down. Bottom-up refers to visual attention elicited by the intrinsic features of an image, driven by low-level perceptual data such as color, brightness and orientation; different regions exhibit strong feature differences in the underlying image data, and the saliency of an image region can be computed by measuring the difference between a target region and its surrounding pixels. The top-down strategy is a task-driven attention mechanism that uses task experience and prior knowledge to predict the salient target region of the current image. For example, when you are looking for a friend wearing a black hat, you will first notice the prominent feature of the black hat.
Most existing visual saliency prediction methods adopt deep learning, using models that combine convolutional layers, batch normalization layers and pooling layers; better frameworks and models are obtained through different combinations of these layers.
Disclosure of Invention
In view of the above, the invention provides a visual saliency prediction method based on binocular parallax offset fusion, which achieves accurate and fast prediction.
In order to achieve the purpose, the invention adopts the following technical scheme:
a visual saliency prediction method based on binocular parallax offset fusion comprises the following steps:
selecting a plurality of binocular views of natural scenes and movie scenes to form an image data training set;
constructing a convolutional neural network framework, wherein the neural network framework enables high-level semantic information and low-level detail information to be combined with each other;
training the convolutional neural network framework: inputting the binocular views into the convolutional neural network framework, which outputs a grayscale map; the loss function of the convolutional neural network framework adopts the root mean square error, CC-Loss and KLDivLoss;
training multiple times to obtain a convolutional neural network prediction training model.
Preferably, the specific connection relationship of the neural network framework is as follows:
the left viewpoint of the input layer is input to the 1st, 2nd, 3rd, 4th and 5th convolution blocks in sequence; the 1st convolution block is input to the 2nd SPSM module, the 2nd convolution block is input to the 1st SPSM module, the 3rd convolution block is input to the 3rd GCM module, the 4th convolution block is input to the 2nd GCM module, and the 5th convolution block is input to the 1st GCM module; the right viewpoint of the input layer is sequentially connected to the 6th, 7th, 8th, 9th and 10th convolution blocks; the 6th convolution block is input to the 2nd SPSM module, the 7th convolution block is input to the 1st SPSM module, the 8th convolution block is input to the 3rd CAM module, the 9th convolution block is input to the 2nd CAM module, and the 10th convolution block is input to the 1st CAM module; the 1st GCM module is input to the 1st feature cascade transfer module, the 2nd GCM module is input to the 2nd feature cascade transfer module, and the 3rd GCM module is input to the 3rd feature cascade transfer module; the 1st feature cascade transfer module outputs to the 1st CAM module, the 2nd feature cascade transfer module and the 3rd feature cascade transfer module; the 2nd feature cascade transfer module outputs to the 2nd CAM module; the 3rd feature cascade transfer module outputs to the 3rd CAM module; the 1st CAM module outputs to the 2nd feature cascade transfer module, the 2nd CAM module outputs to the 3rd feature cascade transfer module, and the 3rd CAM module outputs to the 1st SPSM module and the 1st high-level convolution block; the 1st SPSM module outputs to the 2nd SPSM module, and the 2nd and 3rd high-level convolution blocks are connected in sequence after the 1st high-level convolution block; the outputs of the 2nd SPSM module and the 3rd high-level convolution block reach the output layer via the concatenation layer.
Preferably, the specific input-output relationship of the SPSM module is as follows:
the left viewpoint features and the right viewpoint features are respectively input to a parallax fusion layer; the parallax fusion layer outputs to a high-level feature fusion layer; the high-level features from the preceding layer are also input to the high-level feature fusion layer; the high-level feature fusion layer outputs to a convolution block, and the output of the convolution block is the output of the SPSM module.
Preferably, the specific input-output relationship of the GCM module is as follows:
the convolution feature maps are respectively input to the 1st convolution layer, the 1st dilated convolution layer, the 2nd dilated convolution layer and the 3rd dilated convolution layer, then input to the 2nd convolution layer, and the output of the 2nd convolution layer is input to the splicing layer.
Preferably, the specific input-output relationship of the characteristic cascade transfer module is as follows:
the 1st GCM module is input to the 1st convolution layer, and the 1st convolution layer outputs respectively to the 1st pixel-wise convolution layer, the 3rd pixel-wise convolution layer, the 1st feature splicing layer and the 1st CAM module; the 1st CAM module outputs to the 2nd pixel-wise convolution layer; the 2nd GCM module is input to the 1st pixel-wise convolution layer, the 1st pixel-wise convolution layer is input to the 1st feature splicing layer, the 1st feature splicing layer is input to the 2nd pixel-wise convolution layer, the 2nd pixel-wise convolution layer is input to the 2nd convolution layer, and the 2nd convolution layer outputs to the 2nd CAM module; the 1st pixel-wise convolution layer outputs to the 3rd pixel-wise convolution layer, the 2nd convolution layer outputs to the 2nd feature splicing layer, and the 2nd CAM module outputs to the 4th pixel-wise convolution layer; the 3rd GCM module outputs to the 3rd pixel-wise convolution layer, the 3rd pixel-wise convolution layer outputs to the 2nd feature splicing layer, the 2nd feature splicing layer outputs to the 4th pixel-wise convolution layer, the 4th pixel-wise convolution layer outputs to the 3rd convolution layer, and the 3rd convolution layer is input to the 3rd CAM module.
Compared with the prior art, the visual saliency prediction method based on binocular parallax offset fusion has the following beneficial effects:
1. The invention constructs a convolutional neural network architecture; a picture data set sampled from the real world is input into the convolutional neural network for training to obtain a convolutional neural network prediction model. The picture to be predicted is then input into the network, and a prediction map of its visually salient region is obtained. In the network architecture, the method of the invention combines high-level semantic information and low-level detail information with each other, which effectively improves the accuracy of salient region prediction.
2. The method of the invention uses a convolutional neural network to construct two parts, an encoding layer and a decoding layer: the encoding layer extracts the high-level semantic features and low-level detail features of the image, and the decoding layer transfers the high-level semantic features upward step by step and supplements information by combining the low-level detail features. This alleviates the loss of detail features that occurs when image features are extracted by a layer-by-layer encoding structure, and the extracted high-level features can locate the region of the salient target more accurately.
3. In the upward transfer process of the decoding layer, the method adopts a feature step-by-step upward transfer module that makes full use of the high-level features, gradually locates the position of the salient target, and transfers it to the preceding layers one by one; a sub-pixel shift module is adopted to make full use of the mutual fusion of high-level and low-level features, ensuring the utilization of the features and predicting the visually salient region of the image with maximum accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram of the model structure of the present invention;
FIG. 2 is a schematic diagram of an SPSM module of the present invention;
FIG. 3 is a schematic diagram of a GCM module according to the present invention;
FIG. 4 is a schematic diagram of a feature cascade transfer module of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a visual saliency prediction method based on binocular parallax offset fusion; its overall implementation block diagram is shown in fig. 1, and it comprises a model training stage and a model testing stage;
the specific steps of the model training process are as follows:
step 1_ 1: and selecting Q binocular images of natural scenes and movie scenes, namely images with a left view point and a right view point, to form a training image data set. And the qth graph in the training set is denoted as { I }q(I, j) }, training set and { I }q(i, j) } corresponding true visual saliency prediction mapsThe images in the natural scene and the movie scene are both RGB three-channel binocular color images, Q is a positive integer, Q is more than or equal to 200, Q is 332, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, I is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents the width of W which is 480, and H represents { I { (I) } is a positive integer, Q isq(I, j) } e.g. take W640, H480, Iq(I, j) represents { IqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),to representThe middle coordinate position is the pixel value of the pixel point of (i, j);
step 1_ 2: convolutional neural network architecture: the convolutional neural network architecture of the present invention is mainly composed of two parts, namely a feature extraction part (coding layer) and an upsampling part (decoding layer).
In the feature extraction part, because the adopted data set is a binocular vision data set and each sample has a left viewpoint and a right viewpoint, the feature extraction part comprises two encoding layers with the same architecture, a left-view feature encoding layer and a right-view feature encoding layer, which extract visual features from the left-view and right-view pictures respectively; each encoding layer comprises 5 convolution blocks. That is, the feature extraction section includes left-viewpoint feature extraction and right-viewpoint feature extraction, and each comprises a 1st, 2nd, 3rd, 4th and 5th convolution block. The outputs of the first two convolution blocks are defined as shallow features, and the outputs of the last three convolution blocks are defined as high-level features. The outputs of the 3rd, 4th and 5th convolution blocks of the left viewpoint correspond to the global context modules GCM3, GCM2 and GCM1, respectively; the outputs of the 3rd, 4th and 5th convolution blocks of the right viewpoint correspond to the attention fusion modules CAM3, CAM2 and CAM1, respectively.
The upsampling part includes five components: a global context module (comprising three GCM units, GCM1, GCM2 and GCM3), a feature cascade transfer module (FCM), a channel attention fusion module (comprising three CAM units, CAM1, CAM2 and CAM3), a sub-pixel shift module (SPSM), and a high-level feature convolution component. The high-level feature convolution consists of three convolution blocks; each convolution block contains a 3 × 3 convolution kernel with stride 1 and padding 1, followed by a batch normalization layer and an activation function (ReLU).
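For illustration, a minimal PyTorch sketch of such a high-level convolution block might look as follows; the class name and the channel arguments are placeholders introduced here, not identifiers from the original disclosure.

```python
import torch.nn as nn

class HighLevelConvBlock(nn.Module):
    """3x3 convolution -> batch normalization -> ReLU, as described for the
    high-level feature convolution component (a sketch, not the original code)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```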
First, the input to the network consists of the left and right viewpoints of each picture in the binocular data set (the width and height of the picture are W = 256 and H = 256, and the channels are the R, G and B channel components); visual feature information is extracted from both viewpoints by feature convolution blocks of the same architecture. The feature extraction part adopts the ResNet34 specification and comprises 5 convolution blocks. The first convolution block comprises a first convolution layer (Conv), a first activation layer (Act) and a first max pooling layer (Max Pool). The convolution layer uses a convolution kernel (kernel_size) of 7, a stride of 2 and edge padding of 3; the convolved feature map is then normalized by a batch normalization layer and passed through the non-linear transformation of the activation function (rectified linear unit, ReLU), and the feature map of the first convolution block is finally output by the max pooling layer. The feature maps of this layer are denoted F1^R for the right viewpoint and F1^L for the left viewpoint. The first convolution block outputs 64 feature maps, which constitute the left-viewpoint feature map set P1^L and the right-viewpoint feature map set P1^R, each feature map having a correspondingly reduced width and height.
The second convolution block is composed of a second convolution layer (Conv), a second activation layer (Act) and a second max pooling layer (Max Pool). The left-view convolution block takes as input the left-viewpoint feature map set P1^L output by the first left-view convolution block, and the right-view convolution block takes as input the right-viewpoint feature map set P1^R output by the first right-view convolution block. The relevant parameters of the convolution layer are: kernel size 3 × 3, stride 1 and edge padding 1. The number of convolution kernels, i.e. the number of feature maps output by the convolution layer, is 64. The convolution layer is followed by a batch normalization layer, a non-linear activation function (ReLU) and finally a max pooling output. Starting from the second block, the convolution feature maps are processed with a residual structure: if the input of the convolution block is X and the output of the convolution branch is Y, the final output of the convolution block is Y' = X + Y. The purpose is that, during convolution, the original information and the convolved information are combined, so that more abstract features can be extracted while the original information is retained to the greatest extent. The feature maps of this layer are denoted F2^R for the right viewpoint and F2^L for the left viewpoint. The second convolution block outputs 64 feature maps, which constitute the left-viewpoint feature map set P2^L and the right-viewpoint feature map set P2^R, each feature map in P2 having a correspondingly reduced width and height.
The third convolution block is composed of a third convolution layer (Conv), a third activation layer (Act) and a third max pooling layer (Max Pool). The left-view convolution block takes as input the left-viewpoint feature map set P2^L output by the second left-view convolution block, and the right-view convolution block takes as input the right-viewpoint feature map set P2^R output by the second right-view convolution block. The relevant parameters of the convolution layer are: kernel size 3 × 3, stride 1 and edge padding 1. The number of convolution kernels, i.e. the number of feature maps output by the convolution layer, is 128. The convolution layer is followed by a batch normalization layer, a non-linear activation function (ReLU) and finally a max pooling layer (Max Pooling) output. The feature maps of this layer are denoted F3^R for the right viewpoint and F3^L for the left viewpoint. The third convolution block outputs 128 feature maps, which constitute the left-viewpoint feature map set P3^L and the right-viewpoint feature map set P3^R, each feature map in P3 having a correspondingly reduced width and height.

The fourth convolution block is composed of a fourth convolution layer (Conv), a fourth activation layer (Act) and a fourth max pooling layer (Max Pool). The left-view convolution block takes as input the left-viewpoint feature map set P3^L output by the third left-view convolution block, and the right-view convolution block takes as input the right-viewpoint feature map set P3^R output by the third right-view convolution block. The relevant parameters of the convolution layer are: kernel size 3 × 3, stride 1 and edge padding 1. The number of convolution kernels is 256. The convolution layer is followed by a batch normalization layer, a non-linear activation function (ReLU) and finally a max pooling layer output. The feature maps of this layer are denoted F4^R for the right viewpoint and F4^L for the left viewpoint. The fourth convolution block outputs 256 feature maps, which constitute the left-viewpoint feature map set P4^L and the right-viewpoint feature map set P4^R, each feature map in P4 having a correspondingly reduced width and height.

The fifth convolution block is composed of a fifth convolution layer (Conv), a fifth activation layer (Act) and a fifth max pooling layer (Max Pool). The left-view convolution block takes as input the left-viewpoint feature map set P4^L output by the fourth left-view convolution block, and the right-view convolution block takes as input the right-viewpoint feature map set P4^R output by the fourth right-view convolution block. The relevant parameters of the convolution layer are: kernel size 3 × 3, stride 1 and edge padding 1. The number of convolution kernels is 512. The convolution layer is followed by a batch normalization layer, a non-linear activation function (ReLU) and finally a max pooling layer output. The feature maps of this layer are denoted F5^R for the right viewpoint and F5^L for the left viewpoint. The fifth convolution block outputs 512 feature maps, which constitute the left-viewpoint feature map set P5^L and the right-viewpoint feature map set P5^R, each feature map in P5 having a correspondingly reduced width and height.
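For reference only, the twin five-block encoder described above could be sketched in PyTorch roughly as below, reusing torchvision's ResNet-34 stages. The mapping of the five convolution blocks onto the ResNet-34 stages, the class name and the use of randomly initialized weights are assumptions made for illustration.

```python
import torch.nn as nn
from torchvision.models import resnet34

class ViewpointEncoder(nn.Module):
    """One encoder branch; two such branches (left and right) form the twin network."""
    def __init__(self):
        super().__init__()
        backbone = resnet34()  # ResNet34 specification; pretrained weights could be loaded here
        # Block 1: 7x7 convolution (stride 2, padding 3) + BN + ReLU + max pooling
        self.block1 = nn.Sequential(backbone.conv1, backbone.bn1,
                                    backbone.relu, backbone.maxpool)
        # Blocks 2-5: residual stages producing 64 / 128 / 256 / 512 feature maps
        self.block2 = backbone.layer1
        self.block3 = backbone.layer2
        self.block4 = backbone.layer3
        self.block5 = backbone.layer4

    def forward(self, x):
        f1 = self.block1(x)   # shallow feature -> 2nd SPSM
        f2 = self.block2(f1)  # shallow feature -> 1st SPSM
        f3 = self.block3(f2)  # high-level feature -> GCM3 / CAM3
        f4 = self.block4(f3)  # high-level feature -> GCM2 / CAM2
        f5 = self.block5(f4)  # high-level feature -> GCM1 / CAM1
        return f1, f2, f3, f4, f5

left_encoder, right_encoder = ViewpointEncoder(), ViewpointEncoder()
```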
For the feature extraction part of the encoding layer, after the feature map of each convolution block has been extracted for the left- and right-view data sets, further processing is required. For the left viewpoint, the feature maps output by the third to fifth convolution blocks are each passed through a Global Context Module (GCM), which has 4 branches; each branch extracts features under a different receptive field by using dilated convolutions of different rates (dilation rates 1, 3, 5 and 7), so as to locate salient objects of different sizes. Each branch contains a basic convolution block: a convolution layer with a 3 × 3 kernel, a batch normalization layer and an activation layer (ReLU).
In this embodiment, the network architecture includes 3 GCM modules in total. As shown in fig. 3, they are placed after the third, fourth and fifth convolution blocks of the feature extraction part, respectively: the GCM module corresponding to the third convolution block is GCM3, the GCM module corresponding to the fourth convolution block is GCM2, and the GCM module corresponding to the fifth convolution block is GCM1. The input and output sizes of the feature maps of each GCM module are unchanged.
The input of each GCM module is the output F_i^L of one of the last three convolution blocks of the left viewpoint, where i ∈ {3, 4, 5}. The four feature maps produced by the dilated convolutions with rates 1, 3, 5 and 7 are concatenated and passed through a convolution layer consisting of a convolution with a 3 × 3 kernel, a batch normalization layer and a non-linear activation layer (ReLU) to obtain a feature map G. G is then concatenated with the input feature F_i^L of the GCM module and passed through another convolution layer of the same structure, and the result is the output of the module, where i ∈ {3, 4, 5}.
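A possible reading of the GCM structure is sketched below. The helper name, the equal channel width of the four branches and the output channel count are assumptions; the text only fixes the dilation rates (1, 3, 5, 7), the 3 × 3 convolution / batch normalization / ReLU composition and the two concatenation steps.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, dilation=1):
    """3x3 convolution + batch normalization + ReLU; padding keeps the spatial size."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class GCM(nn.Module):
    """Global context module sketch: four dilated branches (rates 1, 3, 5, 7),
    branch concatenation + convolution giving G, then concatenation of G with
    the input feature and a second convolution of the same structure."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [conv_bn_relu(channels, channels, d) for d in (1, 3, 5, 7)])
        self.merge = conv_bn_relu(4 * channels, channels)  # concat of the 4 branches -> G
        self.fuse = conv_bn_relu(2 * channels, channels)   # concat of G with the input feature

    def forward(self, f_left):
        g = self.merge(torch.cat([branch(f_left) for branch in self.branches], dim=1))
        return self.fuse(torch.cat([g, f_left], dim=1))
```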
In the feature extraction section, the right-viewpoint feature maps are also further processed after extraction. For the right view, the feature maps output by the third to fifth convolution blocks are passed through a channel attention fusion module (CAM). The inputs of this module are the right-viewpoint feature F_i^R, where i ∈ {3, 4, 5}, and a feature map output by the feature cascade transfer module. The right-viewpoint feature map F_i^R and the feature map output by the feature cascade transfer module are concatenated, the weight of each pixel is then computed through an activation layer (softmax), and the per-pixel weights are multiplied pixel-wise with the right-viewpoint feature F_i^R to obtain a fused feature map F_softmax. Hierarchical channel attention (LayerNorm) is then computed: F_softmax is passed through an activation layer (softmax) to obtain the weight of each layer, and the layer weights are multiplied pixel-wise with the right-viewpoint feature F_i^R to obtain the feature map F_LayerNorm. This feature map is then multiplied pixel-wise with the feature from the feature cascade transfer module to obtain the feature map F_fusion, which is finally output through a non-linear activation function (Sigmoid) and recorded as the module output, where i ∈ {1, 2, 3}. This feature map represents features from the two viewing angles of the left and right viewpoints, and the features obtained from the different viewpoints are fully exploited at both the pixel level and the layer level for feature interaction.
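Because the text leaves the exact weight computations open, the following CAM sketch is only one plausible interpretation: the 1 × 1 convolution and the global average pooling used to obtain the pixel-level and layer-level weights are assumptions, while the softmax-weighted products with the right-view feature, the product with the cascade feature and the final Sigmoid follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM(nn.Module):
    """Channel attention fusion module sketch. f_right is the right-view feature,
    f_cascade the feature from the feature cascade transfer module; both are
    assumed to have the same shape."""
    def __init__(self, channels):
        super().__init__()
        self.pixel_weight = nn.Conv2d(2 * channels, 1, kernel_size=1)  # assumed reduction to one map
        self.layer_weight = nn.Linear(channels, channels)              # assumed per-channel reduction

    def forward(self, f_right, f_cascade):
        b, c, h, w = f_right.shape
        # Pixel-level weights: softmax over the spatial positions of the concatenated features
        logits = self.pixel_weight(torch.cat([f_right, f_cascade], dim=1)).view(b, 1, -1)
        w_pix = F.softmax(logits, dim=-1).view(b, 1, h, w)
        f_softmax = w_pix * f_right                  # pixel-level product with the right-view feature
        # Layer-level (channel) weights derived from the pixel-attended feature
        w_layer = F.softmax(self.layer_weight(f_softmax.mean(dim=(2, 3))), dim=1)
        f_layernorm = w_layer.view(b, c, 1, 1) * f_right
        # Product with the cascade feature, then squash to (0, 1)
        return torch.sigmoid(f_layernorm * f_cascade)
```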
Similarly, the network architecture of the invention comprises 3 CAM modules in total, placed after the third, fourth and fifth convolution blocks of the feature extraction part: the CAM module corresponding to the third convolution block is CAM3, the CAM module corresponding to the fourth convolution block is CAM2, and the CAM module corresponding to the fifth convolution block is CAM1. The input and output sizes of the feature maps of each module are unchanged.
In the upsampling section, as shown in fig. 4, the inputs of the feature cascade transfer module (FCM) come from the left-view features output by the GCM modules and the right-view features output by the CAM modules. The FCM is divided into three stages, corresponding to the 5th, 4th and 3rd convolution blocks of the feature extraction part. The first stage performs high-level feature processing: the GCM feature map is convolved to obtain the feature F3, which is output to CAM3. In the second stage, the next GCM output and F3 are multiplied pixel by pixel and the result is concatenated with the feature derived from F3; the concatenated feature is then multiplied pixel by pixel with the feature map output by the CAM module, so that the features processed by the CAM module are fully fused to obtain the localization information of the object, and a convolution produces the feature F2, which is output to the CAM2 module. In the third stage, the GCM output is multiplied pixel by pixel with F2 upsampled by a factor of 2 and with F3 upsampled by a factor of 4, concatenated with the feature derived from F2, multiplied with the feature output by the CAM module, and finally passed through a convolution layer to obtain the feature map that is fed into the corresponding CAM module. The localization information of the object is learned during this upward transfer, and the salient region is supplemented step by step by combining the features of the preceding layer.
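One stage of the FCM could be sketched as follows; the bilinear upsampling, the equal channel counts of the three inputs and the 3 × 3 fusion convolutions are assumptions, while the pixel-wise products, the concatenation and the convolution feeding the next CAM module follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FCMStage(nn.Module):
    """One stage of the feature cascade transfer module (sketch): the upsampled
    higher-level feature is multiplied pixel-wise with the current GCM feature,
    concatenated with it, multiplied with the previous CAM output and convolved
    into the feature passed to the next CAM module."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = conv_bn_relu(2 * channels, channels)
        self.out_conv = conv_bn_relu(channels, channels)

    def forward(self, gcm_feat, higher_feat, cam_feat):
        up = F.interpolate(higher_feat, size=gcm_feat.shape[2:],
                           mode='bilinear', align_corners=False)
        x = self.reduce(torch.cat([gcm_feat * up, up], dim=1))    # pixel-wise product + splice
        cam_up = F.interpolate(cam_feat, size=gcm_feat.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.out_conv(x * cam_up)                          # product with the previous CAM output
```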
The feature cascade transfer module (FCM) outputs to the CAM2 module, and the cascaded feature is output from the CAM2 module through an activation function (Sigmoid). The output of the CAM2 module is divided into two parts: the first part passes through three convolution layers in series, and the second part passes through a sub-pixel shift module (SPSM).
As shown in FIG. 2, the input of the sub-pixel shift module (SPSM) is the left-viewpoint feature map set P_i^L and the right-viewpoint feature map set P_i^R, where i ∈ {1, 2}. Each sub-pixel shift module receives these two inputs and adds the corresponding pixels of the left- and right-view feature maps, so as to compensate for the positional difference of the same object between the two viewpoints. The result is then multiplied pixel-wise with the high-level feature from the preceding layer to obtain the localization information of the salient object, and the object boundary is supplemented by combining the offset feature map. The output of a convolution layer then serves as the feature input of the next layer. The network framework contains two sub-pixel shift modules (SPSM), corresponding to the features of the 1st and 2nd convolution blocks of the feature extraction part. The shallow features are used for detail supplementation, so that the features of all levels are further fully utilized.
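A minimal sketch of the SPSM, under the assumption that the left-view, right-view and high-level features share the same channel count, is given below; the bilinear resizing of the high-level feature is likewise an assumption added so the example runs end to end.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPSM(nn.Module):
    """Sub-pixel shift module sketch: pixel-wise sum of the left- and right-view
    shallow features (parallax offset fusion), pixel-wise product with the
    high-level feature from the preceding layer, then a convolution block."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_left, f_right, f_high):
        shifted = f_left + f_right                          # compensate the left/right position offset
        f_high = F.interpolate(f_high, size=shifted.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.conv(shifted * f_high)                  # localization + boundary/detail supplement
```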
The outputs of the sub-pixel shift module (SPSM) and of the series convolution layers are concatenated at the channel level and then passed through a convolution layer, a batch normalization layer and an activation function (Sigmoid) to output the single-channel visual saliency prediction map.
Step 1_3: model input and output: the input of the model is the binocular data set, i.e. the two RGB three-channel color images of the left viewpoint and the right viewpoint, and the output is a single-channel grayscale image in which each pixel value lies between 0 and 255.
Step 1_4: model loss function. The loss function of the model adopts three parts: the root mean square error, CC-Loss and KLDivLoss. The root mean square error evaluates the difference between each pixel of the label map and of the prediction map; CC-Loss (Channel Correlation Loss) can constrain the specific relationship between classes and channels and maintain separability within and between classes; the KL divergence (Kullback-Leibler divergence), also called relative entropy, measures the degree of difference between two probability distributions. These three loss terms are used in the invention to compute the loss of the constructed convolutional neural network.
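A hedged sketch of this three-part loss is given below. The correlation term is implemented as a negative Pearson correlation coefficient, which is one common reading of CC-Loss; the exact formula and the weighting of the three terms are not specified in the text and are therefore assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_loss(pred, target, eps=1e-8):
    """Root mean square error + correlation term + KL divergence (a sketch).
    pred and target are single-channel maps with values in (0, 1)."""
    # Root mean square error between prediction and label
    rmse = torch.sqrt(F.mse_loss(pred, target) + eps)
    # Correlation term: negative Pearson coefficient (assumed form of CC-Loss)
    p = pred - pred.mean()
    t = target - target.mean()
    cc = (p * t).sum() / (p.norm() * t.norm() + eps)
    # KL divergence between the maps normalized to spatial probability distributions
    p_dist = (pred / (pred.sum() + eps)).clamp_min(eps)
    t_dist = (target / (target.sum() + eps)).clamp_min(eps)
    kld = F.kl_div(p_dist.log(), t_dist, reduction='sum')
    return rmse + (1.0 - cc) + kld
```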
Step 1_5: training process and optimal parameter selection: following the model framework and calculation process of step 1_2, the output is obtained from the input of step 1_3 and the model loss is calculated with the loss function of step 1_4. This process is repeated V times to obtain the convolutional neural network prediction training model and Q loss function values; the minimum of these loss values is found, and the weight matrix and bias matrix corresponding to that loss value are taken as the optimal weights and biases of the convolutional neural network model, denoted W_best and b_best, where V > 1; in this example V = 300.
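The selection of the optimal parameters can be sketched as a standard keep-the-best-checkpoint loop; the Adam optimizer, the learning rate and the two-input model signature are assumptions, since the text only states that the loss is minimized over V = 300 repetitions.

```python
import copy
import torch

def train(model, loader, loss_fn, epochs=300, lr=1e-4, device='cuda'):
    """Repeat the training process V times (here V = epochs = 300) and keep the
    parameters W_best / b_best that give the smallest loss (a sketch; the
    optimizer, learning rate and model(left, right) signature are assumptions)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    model.to(device)
    for _ in range(epochs):
        epoch_loss = 0.0
        for left, right, label in loader:
            left, right, label = left.to(device), right.to(device), label.to(device)
            pred = model(left, right)
            loss = loss_fn(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss < best_loss:            # keep the weights and biases with the minimum loss
            best_loss = epoch_loss
            best_state = copy.deepcopy(model.state_dict())
    return best_state
```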
The specific steps of the model test process are as follows:
step 2_ 1: the test set contains 95 binocular images in totalRepresenting a test image with prediction; wherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' representsWidth of (A), H' representsThe height of (a) of (b),to representAnd the middle coordinate position is the pixel value of the pixel point of (i, j).
Step 2_2: the R, G and B channels of the left and right viewpoints of each picture in the binocular test set are input into the convolutional neural network, and prediction is performed with the optimal parameters W_best and b_best to obtain a single-channel visual saliency prediction map for each picture.
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
A convolutional neural network architecture is built with the PyTorch deep learning library based on the Python language. Training is performed using the NCTU data set.
TABLE 1 evaluation results on test sets using the method of the invention
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (5)
1. A visual saliency prediction method based on binocular parallax offset fusion is characterized by comprising the following steps:
selecting a plurality of binocular views of natural scenes and movie scenes to form an image data training set;
constructing a convolutional neural network framework, wherein the neural network framework enables high-level semantic information and low-level detail information to be combined with each other;
training the convolutional neural network framework;
training multiple times to obtain a convolutional neural network prediction training model.
2. The visual saliency prediction method based on binocular parallax offset fusion according to claim 1, wherein the specific connection relations of the neural network framework are as follows:
the left viewpoint of the input layer is input to the 1st, 2nd, 3rd, 4th and 5th convolution blocks in sequence; the 1st convolution block is input to the 2nd SPSM module, the 2nd convolution block is input to the 1st SPSM module, the 3rd convolution block is input to the 3rd GCM module, the 4th convolution block is input to the 2nd GCM module, and the 5th convolution block is input to the 1st GCM module; the right viewpoint of the input layer is sequentially connected to the 6th, 7th, 8th, 9th and 10th convolution blocks; the 6th convolution block is input to the 2nd SPSM module, the 7th convolution block is input to the 1st SPSM module, the 8th convolution block is input to the 3rd CAM module, the 9th convolution block is input to the 2nd CAM module, and the 10th convolution block is input to the 1st CAM module; the 1st GCM module is input to the 1st feature cascade transfer module, the 2nd GCM module is input to the 2nd feature cascade transfer module, and the 3rd GCM module is input to the 3rd feature cascade transfer module; the 1st feature cascade transfer module outputs to the 1st CAM module, the 2nd feature cascade transfer module and the 3rd feature cascade transfer module; the 2nd feature cascade transfer module outputs to the 2nd CAM module; the 3rd feature cascade transfer module outputs to the 3rd CAM module; the 1st CAM module outputs to the 2nd feature cascade transfer module, the 2nd CAM module outputs to the 3rd feature cascade transfer module, and the 3rd CAM module outputs to the 1st SPSM module and the 1st high-level convolution block; the 1st SPSM module outputs to the 2nd SPSM module, and the 2nd and 3rd high-level convolution blocks are connected in sequence after the 1st high-level convolution block; the outputs of the 2nd SPSM module and the 3rd high-level convolution block reach the output layer via the concatenation layer.
3. The visual saliency prediction method based on binocular parallax offset fusion according to claim 2, wherein the specific input-output relationship of the SPSM module is as follows:
the left viewpoint features and the right viewpoint features are respectively input to a parallax fusion layer; the parallax fusion layer outputs to a high-level feature fusion layer; the high-level features from the preceding layer are also input to the high-level feature fusion layer; the high-level feature fusion layer outputs to a convolution block, and the output of the convolution block is the output of the SPSM module.
4. The visual saliency prediction method based on binocular parallax offset fusion according to claim 2, wherein the specific input-output relationship of the GCM module is as follows:
the convolution feature maps are respectively input to the 1st convolution layer, the 1st dilated convolution layer, the 2nd dilated convolution layer and the 3rd dilated convolution layer, then input to the 2nd convolution layer, and the output of the 2nd convolution layer is input to the splicing layer.
5. The visual saliency prediction method based on binocular parallax offset fusion according to claim 2, wherein the specific input-output relationship of the feature cascade transfer module is as follows:
the 1st GCM module is input to the 1st convolution layer, and the 1st convolution layer outputs respectively to the 1st pixel-wise convolution layer, the 3rd pixel-wise convolution layer, the 1st feature splicing layer and the 1st CAM module; the 1st CAM module outputs to the 2nd pixel-wise convolution layer; the 2nd GCM module is input to the 1st pixel-wise convolution layer, the 1st pixel-wise convolution layer is input to the 1st feature splicing layer, the 1st feature splicing layer is input to the 2nd pixel-wise convolution layer, the 2nd pixel-wise convolution layer is input to the 2nd convolution layer, and the 2nd convolution layer outputs to the 2nd CAM module; the 1st pixel-wise convolution layer outputs to the 3rd pixel-wise convolution layer, the 2nd convolution layer outputs to the 2nd feature splicing layer, and the 2nd CAM module outputs to the 4th pixel-wise convolution layer; the 3rd GCM module outputs to the 3rd pixel-wise convolution layer, the 3rd pixel-wise convolution layer outputs to the 2nd feature splicing layer, the 2nd feature splicing layer outputs to the 4th pixel-wise convolution layer, the 4th pixel-wise convolution layer outputs to the 3rd convolution layer, and the 3rd convolution layer is input to the 3rd CAM module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110385471.9A CN113066074A (en) | 2021-04-10 | 2021-04-10 | Visual saliency prediction method based on binocular parallax offset fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110385471.9A CN113066074A (en) | 2021-04-10 | 2021-04-10 | Visual saliency prediction method based on binocular parallax offset fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113066074A true CN113066074A (en) | 2021-07-02 |
Family
ID=76566592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110385471.9A Pending CN113066074A (en) | 2021-04-10 | 2021-04-10 | Visual saliency prediction method based on binocular parallax offset fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113066074A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113538379A (en) * | 2021-07-16 | 2021-10-22 | 河南科技学院 | Double-stream coding fusion significance detection method based on RGB and gray level image |
CN113538379B (en) * | 2021-07-16 | 2022-11-22 | 河南科技学院 | Double-stream coding fusion significance detection method based on RGB and gray level images |
CN113409319A (en) * | 2021-08-17 | 2021-09-17 | 点内(上海)生物科技有限公司 | Rib fracture detection model training system, method, detection system and detection method |
CN118657962A (en) * | 2024-08-19 | 2024-09-17 | 南昌航空大学 | Binocular stereo matching method, equipment, medium and product |
CN118657962B (en) * | 2024-08-19 | 2024-10-22 | 南昌航空大学 | Binocular stereo matching method, equipment, medium and product |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 