CN110175986A - Stereo image visual saliency detection method based on a convolutional neural network - Google Patents
Stereo image visual saliency detection method based on a convolutional neural network
- Publication number
- CN110175986A (application number CN201910327556.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06T7/0002 — Image analysis; inspection of images, e.g. flaw detection
- G06T7/50 — Image analysis; depth or shape recovery
Abstract
The invention discloses a stereo image visual saliency detection method based on a convolutional neural network. A convolutional neural network comprising an input layer, a hidden layer and an output layer is constructed; the input layer comprises an RGB image input layer and a depth image input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework consists of an RGB feature extraction module, a depth feature extraction module and a feature fusion module. The left viewpoint image and depth image of every stereo image in a training set are input into the convolutional neural network for training, and a saliency image is obtained for every stereo image in the training set; the loss function value between each saliency image and the corresponding real human-eye fixation image is calculated, and after this process is repeated several times a trained convolutional neural network model is obtained. The left viewpoint image and depth image of a stereo image to be tested are then input into the trained model, which predicts a saliency prediction image. The advantage of the method is its high visual saliency detection accuracy.
Description
Technical Field
The invention relates to a visual saliency detection technology, in particular to a stereo image visual saliency detection method based on a convolutional neural network.
Background
Visual saliency has been a popular research topic in recent years in fields such as neuroscience, robotics, and computer vision. Research on visual saliency detection falls into two broad categories: eye-fixation prediction and salient object detection. The former predicts the points a person fixates on when viewing a natural scene, while the latter aims to accurately extract the objects of interest. In general, visual saliency detection algorithms can also be divided into top-down and bottom-up approaches. Top-down approaches are task-driven and require supervised learning, whereas bottom-up methods typically use low-level cues such as color features, distance features, and heuristic saliency features. One of the most common heuristic saliency features is contrast, e.g. pixel-based or block-based contrast. Previous research on visual saliency detection has focused on two-dimensional images. However, it has been found that, first, three-dimensional data is more suitable than two-dimensional data for practical applications, and second, as visual scenes become more complex, two-dimensional data alone is no longer sufficient for extracting salient objects. In recent years, with the progress of three-dimensional data acquisition technologies such as Time-of-Flight sensors and Microsoft Kinect, depth data has become much easier to capture, improving the ability to distinguish between different objects with similar appearance. Depth data is independent of illumination and provides geometric cues that can improve the prediction of visual saliency. Owing to the complementarity of RGB data and depth data, many methods have been proposed that combine RGB images with their paired depth images for visual saliency detection. Previous work has focused mainly on constructing low-level saliency features from domain-specific prior knowledge, for example the observation that humans tend to focus on closer objects; however, such observations are difficult to generalize to all scenes. In most previous work, the multi-modal fusion problem was addressed either by directly concatenating the RGB-D channels, or by processing each modality independently and then combining the decisions of the two modalities. While these strategies bring considerable improvement, they have difficulty fully exploiting cross-modal complementarity. In recent years, with the success of convolutional neural networks (CNNs) in learning discriminative features from RGB data, more and more work has used CNNs to explore more powerful RGB-D representations and efficient multi-modal combination. Most of these works are based on a two-stream architecture, in which RGB data and depth data are learned in separate bottom-up streams and their features are jointly inferred at an early or late stage. As the most popular solution, the two-stream architecture achieves a significant improvement over work based on hand-crafted RGB-D features; however, the most critical issue remains: how to effectively exploit multi-modal complementary information in the bottom-up process. Therefore, further research on RGB-D image visual saliency detection is necessary to improve the accuracy of visual saliency detection.
Disclosure of Invention
The invention aims to provide a stereo image visual saliency detection method based on a convolutional neural network, which has higher visual saliency detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select N original stereo images with width W and height H; then form a training set from all the selected original stereo images together with the left viewpoint image, depth image and real human-eye fixation image of each original stereo image, and denote the n-th original stereo image in the training set as {In(x, y)} and its depth image as {Dn(x, y)}; wherein N is a positive integer, N ≥ 300, W and H are both evenly divisible by 2, n is a positive integer whose initial value is 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) denotes the pixel value of the pixel whose coordinate position is (x, y) in {In(x, y)}, Dn(x, y) denotes the pixel value of the pixel whose coordinate position is (x, y) in {Dn(x, y)}, and the left viewpoint image and the real human-eye fixation image of {In(x, y)} are denoted analogously;
Step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB image input layer and a depth map input layer; the hidden layer comprises an encoding framework and a decoding framework; the encoding framework consists of an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th maximum pooling layers, and the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the stride is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid';
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
For the RGB feature extraction module: the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, and the output end of the 1st neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P1. The input end of the 1st downsampling block receives all feature maps in P1, and the output end of the 1st downsampling block outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted X1. The input end of the 2nd neural network block receives all feature maps in X1, and the output end of the 2nd neural network block outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P2. The input end of the 2nd downsampling block receives all feature maps in P2, and the output end of the 2nd downsampling block outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted X2. The input end of the 3rd neural network block receives all feature maps in X2, and the output end of the 3rd neural network block outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P3. The input end of the 3rd downsampling block receives all feature maps in P3, and the output end of the 3rd downsampling block outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted X3. The input end of the 4th neural network block receives all feature maps in X3, and the output end of the 4th neural network block outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P4.
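For illustration, the following minimal PyTorch sketch traces the feature-map sizes of the RGB feature extraction module described above; the class name RGBBranchSketch is hypothetical, and each neural network block / downsampling block is stood in for by a single convolution, since the actual block structures are specified later in this description.

```python
import torch
import torch.nn as nn

class RGBBranchSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Conv2d(3, 64, 3, stride=1, padding=1)     # stand-in for the 1st neural network block
        self.down1  = nn.Conv2d(64, 64, 3, stride=2, padding=1)    # stand-in for the 1st downsampling block
        self.block2 = nn.Conv2d(64, 128, 3, stride=1, padding=1)   # 2nd neural network block
        self.down2  = nn.Conv2d(128, 128, 3, stride=2, padding=1)  # 2nd downsampling block
        self.block3 = nn.Conv2d(128, 256, 3, stride=1, padding=1)  # 3rd neural network block
        self.down3  = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # 3rd downsampling block
        self.block4 = nn.Conv2d(256, 512, 3, stride=1, padding=1)  # 4th neural network block

    def forward(self, rgb):          # rgb: batch x 3 x H x W
        p1 = self.block1(rgb)        # 64  x H   x W
        x1 = self.down1(p1)          # 64  x H/2 x W/2
        p2 = self.block2(x1)         # 128 x H/2 x W/2
        x2 = self.down2(p2)          # 128 x H/4 x W/4
        p3 = self.block3(x2)         # 256 x H/4 x W/4
        x3 = self.down3(p3)          # 256 x H/8 x W/8
        p4 = self.block4(x3)         # 512 x H/8 x W/8
        return p1, p2, p3, p4

feats = RGBBranchSketch()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])      # 64@224x224, 128@112x112, 256@56x56, 512@28x28
```

The depth feature extraction module described next follows the same channel and size progression, applied to the one depth input.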
For the depth feature extraction module: the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 5th neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P5. The input end of the 4th downsampling block receives all feature maps in P5, and the output end of the 4th downsampling block outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted X4. The input end of the 6th neural network block receives all feature maps in X4, and the output end of the 6th neural network block outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P6. The input end of the 5th downsampling block receives all feature maps in P6, and the output end of the 5th downsampling block outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted X5. The input end of the 7th neural network block receives all feature maps in X5, and the output end of the 7th neural network block outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P7. The input end of the 6th downsampling block receives all feature maps in P7, and the output end of the 6th downsampling block outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted X6. The input end of the 8th neural network block receives all feature maps in X6, and the output end of the 8th neural network block outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P8.
For the feature fusion module: the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, and the output end of the 9th neural network block outputs 3 feature maps with width W and height H; the set of all output feature maps is denoted P9. The input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 10th neural network block outputs 3 feature maps with width W and height H; the set of all output feature maps is denoted P10. All feature maps in P9 and all feature maps in P10 are combined by element-wise summation, which outputs 3 feature maps with width W and height H; the set of all output feature maps is denoted E1. The input end of the 11th neural network block receives all feature maps in E1, and the output end of the 11th neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P11. All feature maps in P1, all feature maps in P5 and all feature maps in P11 are combined by element-wise summation, which outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted E2. The input end of the 1st maximum pooling layer receives all feature maps in E2, and the output end of the 1st maximum pooling layer outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted Z1. The input end of the 12th neural network block receives all feature maps in Z1, and the output end of the 12th neural network block outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P12. All feature maps in P2, all feature maps in P6 and all feature maps in P12 are combined by element-wise summation, which outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted E3. The input end of the 2nd maximum pooling layer receives all feature maps in E3, and the output end of the 2nd maximum pooling layer outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted Z2. The input end of the 13th neural network block receives all feature maps in Z2, and the output end of the 13th neural network block outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P13. All feature maps in P3, all feature maps in P7 and all feature maps in P13 are combined by element-wise summation, which outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted E4. The input end of the 3rd maximum pooling layer receives all feature maps in E4, and the output end of the 3rd maximum pooling layer outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted Z3. The input end of the 14th neural network block receives all feature maps in Z3, and the output end of the 14th neural network block outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P14. All feature maps in P4, all feature maps in P8 and all feature maps in P14 are combined by element-wise summation, which outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted E5. The input end of the 4th maximum pooling layer receives all feature maps in E5, and the output end of the 4th maximum pooling layer outputs 512 feature maps with width W/16 and height H/16; the set of all output feature maps is denoted Z4. The input end of the 15th neural network block receives all feature maps in Z4, and the output end of the 15th neural network block outputs 1024 feature maps with width W/16 and height H/16; the set of all output feature maps is denoted P15.
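For illustration, the following minimal PyTorch sketch shows one fusion step of the feature fusion module: element-wise summation of the corresponding RGB-branch, depth-branch and fusion-stream feature maps, followed by 2 × 2 maximum pooling; the tensor sizes are example values and the variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

p1  = torch.randn(1, 64, 224, 224)   # from the RGB feature extraction module (P1)
p5  = torch.randn(1, 64, 224, 224)   # from the depth feature extraction module (P5)
p11 = torch.randn(1, 64, 224, 224)   # from the fusion stream itself (P11)

e2 = p1 + p5 + p11                               # element-wise summation: 64 x W x H (E2)
z1 = F.max_pool2d(e2, kernel_size=2, stride=2)   # 64 x W/2 x H/2, fed to the 12th block (Z1)
print(e2.shape, z1.shape)
```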
For the decoding framework: the input end of the 1st upsampling layer receives all feature maps in P15, and the output end of the 1st upsampling layer outputs 1024 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted S1. The input end of the 16th neural network block receives all feature maps in S1, and the output end of the 16th neural network block outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P16. The input end of the 2nd upsampling layer receives all feature maps in P16, and the output end of the 2nd upsampling layer outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted S2. The input end of the 17th neural network block receives all feature maps in S2, and the output end of the 17th neural network block outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P17. The input end of the 3rd upsampling layer receives all feature maps in P17, and the output end of the 3rd upsampling layer outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted S3. The input end of the 18th neural network block receives all feature maps in S3, and the output end of the 18th neural network block outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P18. The input end of the 4th upsampling layer receives all feature maps in P18, and the output end of the 4th upsampling layer outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted S4. The input end of the 19th neural network block receives all feature maps in S4, and the output end of the 19th neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P19.
For the output layer: the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; and the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image, the width of the saliency image being W and its height being H;
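For illustration, the output layer described above can be sketched in PyTorch as a 3 × 3 convolution with a single kernel, batch normalization and a Sigmoid activation; the variable names are hypothetical.

```python
import torch
import torch.nn as nn

output_layer = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),  # first convolution layer: 1 kernel, 3x3, stride 1, padding 1
    nn.BatchNorm2d(1),                                      # first batch normalization layer
    nn.Sigmoid(),                                           # first activation layer
)

p19 = torch.randn(1, 64, 224, 224)   # the 64 feature maps output by the 19th neural network block
saliency = output_layer(p19)         # one-channel W x H saliency image with values in (0, 1)
print(saliency.shape)
```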
Step 1_3: take the left viewpoint image of each original stereo image in the training set as a training left viewpoint image and the depth image of each original stereo image in the training set as a training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set, the pixel value of the saliency image of {In(x, y)} at coordinate position (x, y) being the predicted saliency of that pixel;
Step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye fixation image; the loss function value between the saliency image of {In(x, y)} and its real human-eye fixation image is obtained by using the mean square error loss function;
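For illustration, the loss of Step 1_4 corresponds to the mean square error between a predicted saliency image and the corresponding real human-eye fixation image, as in the following minimal sketch (tensor sizes and names are hypothetical).

```python
import torch
import torch.nn.functional as F

pred = torch.rand(1, 1, 224, 224)   # saliency image predicted by the network
gt   = torch.rand(1, 1, 224, 224)   # real human-eye fixation image (ground truth)

loss = F.mse_loss(pred, gt)         # mean square error loss function
print(loss.item())
```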
Step 1_5: repeat Step 1_3 and Step 1_4 V times to obtain a trained convolutional neural network model and N × V loss function values; then find the smallest loss function value among the N × V loss function values; and take the weight vector and bias term corresponding to this smallest loss function value as the optimal weight vector and optimal bias term of the trained convolutional neural network model, denoted Wbest and bbest respectively; wherein V > 1;
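For illustration, the following minimal sketch (assuming a PyTorch model, a data loader of left-viewpoint/depth/fixation triples, and the Adam optimizer, none of which are specified in the original description) shows how Steps 1_3 to 1_5 can be organized so that the parameters with the smallest loss value are retained as Wbest and bbest.

```python
import copy
import torch
import torch.nn.functional as F

def train(model, loader, V=50, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for epoch in range(V):                      # repeat Step 1_3 and Step 1_4 V times
        for left, depth, fixation in loader:    # the N training stereo images per pass
            pred = model(left, depth)
            loss = F.mse_loss(pred, fixation)   # one of the N x V loss function values
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:         # keep the parameters of the smallest loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss                # best_state plays the role of Wbest / bbest
```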
the test stage process comprises the following specific steps:
Step 2_1: let the stereo image to be tested have width W' and height H'; record its left viewpoint image and depth image correspondingly; wherein 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', and the pixel value of the pixel whose coordinate position is (x', y') in each of these images is defined in the same way as for the training images;
Step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the trained convolutional neural network model, make a prediction using Wbest and bbest, and obtain the saliency prediction image of the stereo image to be tested, whose pixel value at coordinate position (x', y') is the predicted saliency of that pixel.
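For illustration, the test stage can be sketched as follows, assuming the same hypothetical PyTorch model and the saved best parameters from the training stage.

```python
import torch

def predict_saliency(model, best_state, left_test, depth_test):
    model.load_state_dict(best_state)        # use Wbest and bbest
    model.eval()
    with torch.no_grad():
        return model(left_test, depth_test)  # 1 x 1 x H' x W' saliency prediction image
```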
In Step 1_2, the 1st to 8th neural network blocks have the same structure, each consisting of a first dilated (hole) convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, wherein the input end of the first dilated convolution layer is the input end of the neural network block in which it is located, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block in which it is located; wherein the numbers of convolution kernels of the first and second dilated convolution layers are 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th neural network blocks, 256 in each of the 3rd and 7th neural network blocks, and 512 in each of the 4th and 8th neural network blocks; the convolution kernel sizes of the first and second dilated convolution layers in each of the 1st to 8th neural network blocks are 3 × 3, the strides are 1, the dilation rates are 2 and the paddings are 2; and the activation mode of the second activation layer in each of the 1st to 8th neural network blocks is 'ReLU';
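For illustration, the following minimal PyTorch sketch reflects the structure of the 1st to 8th neural network blocks just described (dilated 3 × 3 convolution with dilation 2 and padding 2, batch normalization, ReLU, residual block, second dilated convolution, batch normalization); the class name DilatedBlock is hypothetical, and the residual block is assumed to preserve the channel count and spatial size.

```python
import torch.nn as nn

class DilatedBlock(nn.Module):
    def __init__(self, in_ch, out_ch, residual_block):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=2, dilation=2)  # first dilated convolution
        self.bn1 = nn.BatchNorm2d(out_ch)                                           # second batch normalization
        self.relu = nn.ReLU(inplace=True)                                           # second activation layer
        self.res = residual_block                                                   # first residual block
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=2, dilation=2)  # second dilated convolution
        self.bn2 = nn.BatchNorm2d(out_ch)                                            # third batch normalization

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.res(x)
        return self.bn2(self.conv2(x))
```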
The 9th and 10th neural network blocks have the same structure, each consisting of a second convolution layer and a fourth batch normalization layer arranged in sequence, wherein the input end of the second convolution layer is the input end of the neural network block in which it is located, the input end of the fourth batch normalization layer receives all feature maps output by the output end of the second convolution layer, and the output end of the fourth batch normalization layer is the output end of the neural network block in which it is located; the number of convolution kernels of the second convolution layer in each of the 9th and 10th neural network blocks is 3, the convolution kernel sizes are 7 × 7, the strides are 1 and the paddings are 3;
The 11th and 12th neural network blocks have the same structure, each consisting of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence, wherein the input end of the third convolution layer is the input end of the neural network block in which it is located, the input end of the fifth batch normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block in which it is located; the number of convolution kernels of the third and fourth convolution layers is 64 in the 11th neural network block and 128 in the 12th neural network block; the convolution kernel sizes of the third and fourth convolution layers in both the 11th and 12th neural network blocks are 3 × 3, the strides are 1 and the paddings are 1; and the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is 'ReLU';
The 13th to 19th neural network blocks have the same structure, each consisting of a fifth convolution layer, a seventh batch normalization layer, a fourth activation layer, a sixth convolution layer, an eighth batch normalization layer, a fifth activation layer, a seventh convolution layer and a ninth batch normalization layer arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block in which it is located, the input end of the seventh batch normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh batch normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth batch normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth batch normalization layer, the input end of the seventh convolution layer receives all feature maps output by the output end of the fifth activation layer, the input end of the ninth batch normalization layer receives all feature maps output by the output end of the seventh convolution layer, and the output end of the ninth batch normalization layer is the output end of the neural network block in which it is located; the numbers of convolution kernels of the fifth, sixth and seventh convolution layers are 256, 256 and 256 in the 13th neural network block, 512, 512 and 512 in the 14th, 1024, 1024 and 1024 in the 15th, 512, 512 and 256 in the 16th, 256, 256 and 128 in the 17th, 128, 128 and 64 in the 18th, and 64, 64 and 64 in the 19th; the convolution kernel sizes of the fifth, sixth and seventh convolution layers in each of the 13th to 19th neural network blocks are 3 × 3, the strides are 1 and the paddings are 1; and the activation modes of the fourth and fifth activation layers in each of the 13th to 19th neural network blocks are 'ReLU'.
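For illustration, the 13th to 19th neural network blocks described above can be sketched in PyTorch as three 3 × 3 convolutions with batch normalization, and ReLU after the first two; the function name conv_bn_block is hypothetical, and the example channel counts shown for the 16th block follow the figures given above.

```python
import torch.nn as nn

def conv_bn_block(in_ch, c1, c2, c3):
    return nn.Sequential(
        nn.Conv2d(in_ch, c1, 3, stride=1, padding=1), nn.BatchNorm2d(c1), nn.ReLU(inplace=True),
        nn.Conv2d(c1, c2, 3, stride=1, padding=1),    nn.BatchNorm2d(c2), nn.ReLU(inplace=True),
        nn.Conv2d(c2, c3, 3, stride=1, padding=1),    nn.BatchNorm2d(c3),
    )

block16 = conv_bn_block(1024, 512, 512, 256)   # example: the 16th neural network block (input is the 1024-channel S1)
```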
In Step 1_2, the 1st to 6th downsampling blocks have the same structure, each consisting of a second residual block, the input end of the second residual block being the input end of the downsampling block in which it is located and the output end of the second residual block being the output end of the downsampling block in which it is located.
The first residual block and the second residual block have the same structure, each comprising 3 convolution layers, 3 batch normalization layers and 3 activation layers, wherein the input end of the 1st convolution layer is the input end of the residual block in which it is located, the input end of the 1st batch normalization layer receives all feature maps output by the output end of the 1st convolution layer, the input end of the 1st activation layer receives all feature maps output by the output end of the 1st batch normalization layer, the input end of the 2nd convolution layer receives all feature maps output by the output end of the 1st activation layer, the input end of the 2nd batch normalization layer receives all feature maps output by the output end of the 2nd convolution layer, the input end of the 2nd activation layer receives all feature maps output by the output end of the 2nd batch normalization layer, the input end of the 3rd convolution layer receives all feature maps output by the output end of the 2nd activation layer, the input end of the 3rd batch normalization layer receives all feature maps output by the output end of the 3rd convolution layer, and all feature maps received by the input end of the 1st convolution layer are added to all feature maps output by the output end of the 3rd batch normalization layer and passed through the 3rd activation layer, whose output is the output of the residual block in which it is located; wherein the number of convolution kernels of each convolution layer in the first residual block is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th neural network blocks, 256 in each of the 3rd and 7th neural network blocks, and 512 in each of the 4th and 8th neural network blocks; in the first residual block in each of the 1st to 8th neural network blocks, the convolution kernel sizes of the 1st and 3rd convolution layers are 1 × 1 with stride 1, and the convolution kernel size of the 2nd convolution layer is 3 × 3 with stride 1 and padding 1; the number of convolution kernels of each convolution layer in the second residual block is 64 in each of the 1st and 4th downsampling blocks, 128 in each of the 2nd and 5th downsampling blocks, and 256 in each of the 3rd and 6th downsampling blocks; in the second residual block in each of the 1st to 6th downsampling blocks, the convolution kernel sizes of the 1st and 3rd convolution layers are 1 × 1 with stride 1, and the convolution kernel size of the 2nd convolution layer is 3 × 3 with stride 2 and padding 1; and the activation modes of the 3 activation layers are all 'ReLU'.
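For illustration, the residual blocks described above can be sketched in PyTorch as follows; the stride-2 variant used in the downsampling blocks halves the spatial size, and since the text does not specify how the skip path is resized in that case, the strided 1 × 1 projection on the shortcut is an assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1, stride=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=stride, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1, stride=1), nn.BatchNorm2d(channels),
        )
        # identity shortcut for the first residual block (stride 1);
        # assumed strided 1x1 projection for the stride-2 variant in the downsampling blocks
        self.shortcut = nn.Identity() if stride == 1 else nn.Conv2d(channels, channels, 1, stride=stride)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))   # skip connection added before the 3rd activation
```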
In Step 1_2, the pooling window sizes of the 1st to 4th maximum pooling layers are all 2 × 2 and the strides are all 2.
In Step 1_2, the sampling mode of the 1st to 4th upsampling layers is bilinear interpolation and the scaling factors are all 2.
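For illustration, the maximum pooling layers and upsampling layers described above correspond to the following standard PyTorch layers; the tensor size is an example value.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)                                # 2x2 max pooling, stride 2
up   = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)    # bilinear interpolation, scale 2

x = torch.randn(1, 64, 224, 224)
print(pool(x).shape)   # 1 x 64 x 112 x 112
print(up(x).shape)     # 1 x 64 x 448 x 448
```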
Compared with the prior art, the invention has the advantages that:
1) Through the encoding framework provided in the constructed convolutional neural network, the method trains a separate module for RGB images and for depth images (namely the RGB feature extraction module and the depth feature extraction module) to learn RGB and depth features at different levels, and provides a dedicated module for fusing the RGB and depth features, namely the feature fusion module, which fuses the two kinds of features from low level to high level; this helps to make full use of cross-modal information to form new discriminative features and improves the accuracy of stereoscopic visual saliency prediction.
2) The downsampling blocks in the RGB feature extraction module and the depth feature extraction module of the constructed convolutional neural network use residual blocks with stride 2 instead of the maximum pooling layers used in prior work, which helps the model select feature information adaptively and avoids losing important information through the maximum pooling operation.
3) The RGB feature extraction module and the depth feature extraction module of the constructed convolutional neural network introduce residual blocks preceded and followed by dilated (hole) convolution layers, which enlarges the receptive field of the convolution kernels and helps the constructed convolutional neural network attend more to global information and learn richer content.
Drawings
FIG. 1 is a schematic diagram of the composition of a convolutional neural network constructed by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides a stereo image visual saliency detection method based on a convolutional neural network, which comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select N original stereo images with width W and height H; then form a training set from all the selected original stereo images together with the left viewpoint image, depth image and real human-eye fixation image of each original stereo image, and denote the n-th original stereo image in the training set as {In(x, y)} and its depth image as {Dn(x, y)}; wherein N is a positive integer, N ≥ 300 (for example N = 600), W and H are both evenly divisible by 2, n is a positive integer whose initial value is 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) denotes the pixel value of the pixel whose coordinate position is (x, y) in {In(x, y)}, Dn(x, y) denotes the pixel value of the pixel whose coordinate position is (x, y) in {Dn(x, y)}, and the left viewpoint image and the real human-eye fixation image of {In(x, y)} are denoted analogously.
Step 1_2: construct a convolutional neural network: as shown in FIG. 1, the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB image input layer and a depth map input layer; the hidden layer comprises an encoding framework and a decoding framework; the encoding framework consists of three parts, namely an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th maximum pooling layers, and the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the stride is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid'.
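For illustration, the overall data flow of the constructed convolutional neural network can be sketched as follows, assuming that the RGB branch, depth branch, feature fusion module, decoder and output layer are available as separate modules; all module and argument names here are hypothetical stand-ins for the blocks defined in this description.

```python
import torch.nn as nn

class SaliencyNetSketch(nn.Module):
    def __init__(self, rgb_branch, depth_branch, fusion_module, decoder, output_layer):
        super().__init__()
        self.rgb_branch = rgb_branch       # 1st-4th neural network blocks + 1st-3rd downsampling blocks
        self.depth_branch = depth_branch   # 5th-8th neural network blocks + 4th-6th downsampling blocks
        self.fusion = fusion_module        # 9th-15th neural network blocks + 1st-4th max pooling layers
        self.decoder = decoder             # 16th-19th neural network blocks + 1st-4th upsampling layers
        self.output_layer = output_layer   # 3x3 convolution + batch normalization + Sigmoid

    def forward(self, left, depth):
        rgb_feats = self.rgb_branch(left)                          # P1..P4
        depth_feats = self.depth_branch(depth)                     # P5..P8
        fused = self.fusion(left, depth, rgb_feats, depth_feats)   # P15
        decoded = self.decoder(fused)                              # P19
        return self.output_layer(decoded)                          # W x H saliency image
```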
For the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; here, the width of the left viewpoint image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; the training depth image has a width W and a height H.
For the RGB feature extraction module: the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, and the output end of the 1st neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P1. The input end of the 1st downsampling block receives all feature maps in P1, and the output end of the 1st downsampling block outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted X1. The input end of the 2nd neural network block receives all feature maps in X1, and the output end of the 2nd neural network block outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P2. The input end of the 2nd downsampling block receives all feature maps in P2, and the output end of the 2nd downsampling block outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted X2. The input end of the 3rd neural network block receives all feature maps in X2, and the output end of the 3rd neural network block outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P3. The input end of the 3rd downsampling block receives all feature maps in P3, and the output end of the 3rd downsampling block outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted X3. The input end of the 4th neural network block receives all feature maps in X3, and the output end of the 4th neural network block outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P4.
For the depth feature extraction module: the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 5th neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P5. The input end of the 4th downsampling block receives all feature maps in P5, and the output end of the 4th downsampling block outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted X4. The input end of the 6th neural network block receives all feature maps in X4, and the output end of the 6th neural network block outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P6. The input end of the 5th downsampling block receives all feature maps in P6, and the output end of the 5th downsampling block outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted X5. The input end of the 7th neural network block receives all feature maps in X5, and the output end of the 7th neural network block outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P7. The input end of the 6th downsampling block receives all feature maps in P7, and the output end of the 6th downsampling block outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted X6. The input end of the 8th neural network block receives all feature maps in X6, and the output end of the 8th neural network block outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P8.
For the feature fusion module: the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, and the output end of the 9th neural network block outputs 3 feature maps with width W and height H; the set of all output feature maps is denoted P9. The input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 10th neural network block outputs 3 feature maps with width W and height H; the set of all output feature maps is denoted P10. All feature maps in P9 and all feature maps in P10 are combined by element-wise summation, which outputs 3 feature maps with width W and height H; the set of all output feature maps is denoted E1. The input end of the 11th neural network block receives all feature maps in E1, and the output end of the 11th neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P11. All feature maps in P1, all feature maps in P5 and all feature maps in P11 are combined by element-wise summation, which outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted E2. The input end of the 1st maximum pooling layer receives all feature maps in E2, and the output end of the 1st maximum pooling layer outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted Z1. The input end of the 12th neural network block receives all feature maps in Z1, and the output end of the 12th neural network block outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P12. All feature maps in P2, all feature maps in P6 and all feature maps in P12 are combined by element-wise summation, which outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted E3. The input end of the 2nd maximum pooling layer receives all feature maps in E3, and the output end of the 2nd maximum pooling layer outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted Z2. The input end of the 13th neural network block receives all feature maps in Z2, and the output end of the 13th neural network block outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P13. All feature maps in P3, all feature maps in P7 and all feature maps in P13 are combined by element-wise summation, which outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted E4. The input end of the 3rd maximum pooling layer receives all feature maps in E4, and the output end of the 3rd maximum pooling layer outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted Z3. The input end of the 14th neural network block receives all feature maps in Z3, and the output end of the 14th neural network block outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P14. All feature maps in P4, all feature maps in P8 and all feature maps in P14 are combined by element-wise summation, which outputs 512 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted E5. The input end of the 4th maximum pooling layer receives all feature maps in E5, and the output end of the 4th maximum pooling layer outputs 512 feature maps with width W/16 and height H/16; the set of all output feature maps is denoted Z4. The input end of the 15th neural network block receives all feature maps in Z4, and the output end of the 15th neural network block outputs 1024 feature maps with width W/16 and height H/16; the set of all output feature maps is denoted P15.
For the decoding framework: the input end of the 1st upsampling layer receives all feature maps in P15, and the output end of the 1st upsampling layer outputs 1024 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted S1. The input end of the 16th neural network block receives all feature maps in S1, and the output end of the 16th neural network block outputs 256 feature maps with width W/8 and height H/8; the set of all output feature maps is denoted P16. The input end of the 2nd upsampling layer receives all feature maps in P16, and the output end of the 2nd upsampling layer outputs 256 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted S2. The input end of the 17th neural network block receives all feature maps in S2, and the output end of the 17th neural network block outputs 128 feature maps with width W/4 and height H/4; the set of all output feature maps is denoted P17. The input end of the 3rd upsampling layer receives all feature maps in P17, and the output end of the 3rd upsampling layer outputs 128 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted S3. The input end of the 18th neural network block receives all feature maps in S3, and the output end of the 18th neural network block outputs 64 feature maps with width W/2 and height H/2; the set of all output feature maps is denoted P18. The input end of the 4th upsampling layer receives all feature maps in P18, and the output end of the 4th upsampling layer outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted S4. The input end of the 19th neural network block receives all feature maps in S4, and the output end of the 19th neural network block outputs 64 feature maps with width W and height H; the set of all output feature maps is denoted P19.
For the output layer: the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; and the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image, the width of the saliency image being W and its height being H.
Step 1_3: take the left viewpoint image of each original stereo image in the training set as a training left viewpoint image and the depth image of each original stereo image in the training set as a training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set, the pixel value of the saliency image of {In(x, y)} at coordinate position (x, y) being the predicted saliency of that pixel.
Step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye fixation image; the loss function value between the saliency image of {In(x, y)} and its real human-eye fixation image is obtained by using the mean square error loss function.
Step 1_5: repeat Step 1_3 and Step 1_4 V times to obtain a trained convolutional neural network model and N × V loss function values; then find the smallest loss function value among the N × V loss function values; and take the weight vector and bias term corresponding to this smallest loss function value as the optimal weight vector and optimal bias term of the trained convolutional neural network model, denoted Wbest and bbest respectively; wherein V > 1, for example V = 50.
The test stage process comprises the following specific steps:
Step 2_1: let the stereo image to be tested have width W' and height H'; record its left viewpoint image and depth image correspondingly; wherein 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', and the pixel value of the pixel whose coordinate position is (x', y') in each of these images is defined in the same way as for the training images.
Step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the trained convolutional neural network model, make a prediction using Wbest and bbest, and obtain the saliency prediction image of the stereo image to be tested, whose pixel value at coordinate position (x', y') is the predicted saliency of that pixel.
In this embodiment, in Step 1_2, the 1st to 8th neural network blocks have the same structure, each consisting of a first dilated (hole) convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, wherein the input end of the first dilated convolution layer is the input end of the neural network block in which it is located, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block in which it is located; wherein the numbers of convolution kernels of the first and second dilated convolution layers are 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th neural network blocks, 256 in each of the 3rd and 7th neural network blocks, and 512 in each of the 4th and 8th neural network blocks; the convolution kernel sizes of the first and second dilated convolution layers in each of the 1st to 8th neural network blocks are 3 × 3, the strides are 1, the dilation rates are 2 and the paddings are 2; and the activation mode of the second activation layer in each of the 1st to 8th neural network blocks is 'ReLU'.
The 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3.
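A minimal sketch of the 9th and 10th neural network blocks under the same assumptions (PyTorch; the function name is illustrative):

```python
import torch.nn as nn

# Blocks 9 and 10: a single 7x7 convolution with 3 kernels (stride 1,
# padding 3, so width and height are preserved) followed by batch normalization.
def make_block_9_or_10(in_channels=3):
    return nn.Sequential(
        nn.Conv2d(in_channels, 3, kernel_size=7, stride=1, padding=3),
        nn.BatchNorm2d(3),
    )
```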
The 11th and 12th neural network blocks have the same structure and are composed of a third convolutional layer, a fifth batch normalization layer, a third activation layer, a fourth convolutional layer and a sixth batch normalization layer which are arranged in sequence, wherein the input end of the third convolutional layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all the feature maps output by the output end of the third convolutional layer, the input end of the third activation layer receives all the feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolutional layer receives all the feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all the feature maps output by the output end of the fourth convolutional layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolutional layer and the fourth convolutional layer in the 11th neural network block is 64, the number of convolution kernels of the third convolutional layer and the fourth convolutional layer in the 12th neural network block is 128, the convolution kernel sizes of the third convolutional layer and the fourth convolutional layer in each of the 11th and 12th neural network blocks are both 3 × 3, the steps are both 1, and the paddings are both 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is "ReLU".
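A minimal sketch of the 11th and 12th neural network blocks (illustrative PyTorch; the function name is an assumption):

```python
import torch.nn as nn

# Blocks 11 and 12: conv -> BN -> ReLU -> conv -> BN, all 3x3 kernels with
# stride 1 and padding 1; block 11 uses 64 kernels, block 12 uses 128.
def make_block_11_or_12(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_channels),
    )
```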
The 13 th to 19 th neural network blocks have the same structure, and are composed of a fifth convolution layer, a seventh normalization layer, a fourth activation layer, a sixth convolution layer, an eighth normalization layer, a fifth activation layer, a seventh convolution layer and a ninth normalization layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth normalization layer, the input end of the seventh convolutional layer receives all the characteristic graphs output by the output end of the fifth activation layer, the input end of the ninth normalization layer receives all the characteristic graphs output by the output end of the seventh convolutional layer, and the output end of the ninth normalization layer is the output end of the neural network block where the ninth normalization layer is located; wherein, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 13 th neural network block is 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 14 th neural network block is 512, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 15 th neural network block is 1024, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 16 th neural network block is 512, 512 and 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 17 th neural network block is 256, 256 and 128, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernel in the 18 th neural network block is 128, 128 and 64, the number of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernels in the 19 th neural network block is 64, convolution kernel sizes of a fifth convolution layer, a sixth convolution layer and a seventh convolution layer in each of the 13 th to 19 th neural network blocks are all 3 × 3, steps are all 1, padding is all 1, and activation modes of a fourth activation layer and a fifth activation layer in each of the 13 th to 19 th neural network blocks are all 'ReLU'.
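A minimal sketch of the 13th to 19th neural network blocks (illustrative PyTorch; the per-block kernel counts c1, c2, c3 are passed in following the description above, e.g. (512, 512, 256) for block 16 and (64, 64, 64) for block 19):

```python
import torch.nn as nn

# Blocks 13-19: three 3x3 convolutions (stride 1, padding 1), each followed by
# batch normalization, with ReLU activations after the first two.
def make_block_13_to_19(in_channels, c1, c2, c3):
    return nn.Sequential(
        nn.Conv2d(in_channels, c1, 3, stride=1, padding=1), nn.BatchNorm2d(c1), nn.ReLU(inplace=True),
        nn.Conv2d(c1, c2, 3, stride=1, padding=1), nn.BatchNorm2d(c2), nn.ReLU(inplace=True),
        nn.Conv2d(c2, c3, 3, stride=1, padding=1), nn.BatchNorm2d(c3),
    )
```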
In this embodiment, in step 1_2, the 1st to 6th downsampling blocks have the same structure and each consists of a second residual block; the input end of the second residual block is the input end of the downsampling block where it is located, and the output end of the second residual block is the output end of the downsampling block where it is located.
In this specific embodiment, the first residual block and the second residual block have the same structure, and include 3 convolutional layers, 3 batch normalization layers, and 3 active layers, where the input end of the 1 st convolutional layer is the input end of the residual block where it is located, the input end of the 1 st batch normalization layer receives all the feature maps output by the output end of the 1 st convolutional layer, the input end of the 1 st active layer receives all the feature maps output by the output end of the 1 st batch normalization layer, the input end of the 2 nd convolutional layer receives all the feature maps output by the output end of the 1 st active layer, the input end of the 2 nd batch normalization layer receives all the feature maps output by the output end of the 2 nd convolutional layer, the input end of the 2 nd active layer receives all the feature maps output by the output end of the 2 nd batch normalization layer, the input end of the 3 rd convolutional layer receives all the feature maps output by the output end of the 2 nd active layer, the input end of the 3 rd batch of normalization layers receives all the feature maps output by the output end of the 3 rd convolution layer, all the feature maps received by the input end of the 1 st convolution layer are added with all the feature maps output by the output end of the 3 rd batch of normalization layers, and after passing through the 3 rd activation layer, all the feature maps output by the output end of the 3 rd activation layer are used as all the feature maps output by the output end of the residual block where the feature maps are located; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 1 × 1 and step length is 1, the sizes of convolution kernels of the 2 nd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 3 × 3, the sizes of convolution kernels are both 1 and step length are 1, and the padding is both 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1 st and 4 th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2 nd and 5 th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3 rd and 6 th downsampling blocks is 256, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 1 × 1 and 1 step, the sizes of convolution kernels of the 2 nd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 3 × 3, the steps are both 2 and 1 
filling, and the activation modes of the 3 activation layers are all "ReLU".
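A hedged PyTorch sketch of the residual blocks described above. Note that the text adds the block input directly to the output of the third batch normalization layer, so the projection shortcut used below for the stride-2 (second) residual block, and whenever the channel count changes, is an assumption added only to keep the addition shape-compatible.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the first/second residual block: 1x1 conv -> BN -> ReLU ->
    3x3 conv -> BN -> ReLU -> 1x1 conv -> BN, with the block input added to
    the last BN output and passed through a final ReLU.  stride=1 gives the
    first residual block, stride=2 the second (downsampling) one."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1, stride=1), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut (an assumption) used when stride or channel
        # count changes; otherwise the identity matches the claim exactly.
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch
                     else nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride),
                                        nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))
```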
In this embodiment, in step 1_2, the pooling window sizes of the 1st to 4th maximum pooling layers are all 2 × 2 and the steps are all 2.
In this embodiment, in step 1_2, the sampling modes of the 1st to 4th upsampling layers are all bilinear interpolation, and the scaling factors are all 2.
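For illustration, the corresponding PyTorch layers would be as follows (the framework and the align_corners choice are assumptions):

```python
import torch.nn as nn

# Each maximum pooling layer halves the width and height (2x2 window, stride 2);
# each upsampling layer doubles them with bilinear interpolation (scale factor 2).
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```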
To verify the feasibility and effectiveness of the method of the invention, experiments were performed.
Here, the accuracy and stability of the method of the present invention were analyzed using the three-dimensional human eye tracking database (NCTU-3DFixation) provided by National Chiao Tung University, Taiwan. Four common objective parameters for evaluating visual saliency extraction methods are used as evaluation indices, namely the Linear Correlation Coefficient (CC), the Kullback-Leibler Divergence (KLD), the Area Under the ROC Curve (AUC) and the Normalized Scanpath Saliency (NSS).
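As an illustrative aid, the following NumPy sketch shows common definitions of three of these indices (CC, KLD and NSS); the exact variants used in the experiments, and the AUC (Borji) computation, are not fully specified here, so these formulas follow standard saliency-evaluation conventions rather than the authors' exact implementation.

```python
import numpy as np

def cc(sal, fix_map):
    """Linear correlation coefficient between a predicted saliency map and a
    continuous fixation density map."""
    s = (sal - sal.mean()) / (sal.std() + 1e-12)
    f = (fix_map - fix_map.mean()) / (fix_map.std() + 1e-12)
    return float((s * f).mean())

def kld(sal, fix_map, eps=1e-12):
    """Kullback-Leibler divergence, treating both maps as probability
    distributions over pixels (lower is better)."""
    p = fix_map / (fix_map.sum() + eps)
    q = sal / (sal.sum() + eps)
    return float(np.sum(p * np.log(eps + p / (q + eps))))

def nss(sal, fix_points):
    """Normalized scanpath saliency: mean of the standardized saliency map
    at the binary fixation locations."""
    s = (sal - sal.mean()) / (sal.std() + 1e-12)
    return float(s[fix_points > 0].mean())
```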
The method of the invention is used to obtain the saliency predicted image of each three-dimensional image in the three-dimensional human eye tracking database provided by National Chiao Tung University, and that image is compared with the subjective visual saliency map of the same three-dimensional image, i.e. the real human eye gazing image (provided in the three-dimensional human eye tracking database); the higher the CC, AUC and NSS values and the lower the KLD value, the better the consistency between the saliency predicted image obtained by the method of the invention and the subjective visual saliency map. The CC, KLD, AUC and NSS indices reflecting the saliency extraction performance of the method of the invention are listed in Table 1.
TABLE 1 accuracy and stability of the saliency predicted images and subjective visual saliency maps obtained by the method of the invention
Performance index | CC | KLD | AUC(Borji) | NSS |
---|---|---|---|---|
Performance index value | 0.7583 | 0.4868 | 0.8789 | 2.0692 |
As can be seen from the data listed in Table 1, the consistency between the saliency predicted images obtained by the method of the invention and the subjective visual saliency maps is good in terms of both accuracy and stability, indicating that the objective detection results agree well with subjective human visual perception, which is sufficient to illustrate the feasibility and effectiveness of the method of the invention.
Claims (6)
1. A stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting N original stereo images with width W and height H; then all the selected original stereo images and their respective left viewpoint images, depth images and real human eye gazing images form a training set, and the nth original stereo image in the training set is marked as {I_n(x, y)}, with the left viewpoint image, the depth image and the real human eye gazing image of {I_n(x, y)} correspondingly recorded, the depth image being denoted {D_n(x, y)}; wherein N is a positive integer, N is more than or equal to 300, W and H can both be evenly divided by 2, n is a positive integer with an initial value of 1, n is more than or equal to 1 and less than or equal to N, x is more than or equal to 1 and less than or equal to W, y is more than or equal to 1 and less than or equal to H, I_n(x, y) represents the pixel value of the pixel point whose coordinate position in {I_n(x, y)} is (x, y), D_n(x, y) represents the pixel value of the pixel point whose coordinate position in {D_n(x, y)} is (x, y), and the pixel values of the pixel points whose coordinate position is (x, y) in the left viewpoint image and in the real human eye gazing image are denoted correspondingly;
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer, wherein the input layer comprises an RGB (red, green and blue) map input layer and a depth map input layer, and the hidden layer comprises a coding framework and a decoding framework; the coding framework comprises an RGB feature extraction module, a depth feature extraction module and a feature fusion module, wherein the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, and the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th maximum pooling layers; the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolutional layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolutional layer is 3 × 3, the step is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is "Sigmoid";
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input end of the 1 st neural network block receives the left viewpoint image for training output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature images with width W and height H, and the set formed by all the output feature images is recorded as P1(ii) a The input of the 1 st downsampling block receives P1Of the 1 st downsampling block, 64 output widths ofAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as X1(ii) a The input of the 2 nd neural network block receives X1The output end of the 2 nd neural network block outputs 128 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P2(ii) a The input of the 2 nd downsampling block receives P2Of the 2 nd downsampling block, the output of the 2 nd downsampling block has 128 widthsAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as X2(ii) a The input of the 3 rd neural network block receives X2The output end of the 3 rd neural network block outputs 256 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P3(ii) a The input of the 3 rd downsampling block receives P3Of 256 widths at the output of the 3 rd downsampling blockAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as X3(ii) a The input of the 4 th neural network block receives X3The output end of the 4 th neural network block outputs 512 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P4;
For the depth feature extraction module, the input end of the 5 th neural network block receives the training depth image output by the output end of the depth map input layer, the output end of the 5 th neural network block outputs 64 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as P5(ii) a The input of the 4 th downsampling block receives P5Of 64 width at the output of the 4 th downsampling blockAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as X4(ii) a The input of the 6 th neural network block receives X4The output end of the 6 th neural network block outputs 128 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P6(ii) a The input of the 5 th downsampling block receives P6Of the output of the 5 th downsampling block, of 128 widthsAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as X5(ii) a The input of the 7 th neural network block receives X5All feature maps in (1), output of the 7 th neural network blockOutput 256 widths ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P7(ii) a The input of the 6 th downsampling block receives P7Of 256 widths at the output of the 6 th downsampling blockAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as X6(ii) a The input of the 8 th neural network block receives X6The output end of the 8 th neural network block outputs 512 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P8;
For the feature fusion module, the input end of the 9 th neural network block receives the left viewpoint image for training output by the output end of the RGB image input layer, the output end of the 9 th neural network block outputs 3 feature images with width W and height H, and the set formed by all the output feature images is recorded as P9(ii) a The input end of the 10 th neural network block receives the training depth image output by the output end of the depth map input layer, the output end of the 10 th neural network block outputs 3 feature maps with the width W and the height H, and the set formed by all the output feature maps is recorded as P10(ii) a To P9All feature maps and P in (1)10All of (1)Element-wise Summation operation is carried out on the feature maps, 3 feature maps with width W and height H are output after the Element-wise Summation operation, and a set formed by all the output feature maps is recorded as E1(ii) a The input of the 11 th neural network block receives E1The output end of the 11 th neural network block outputs 64 feature maps with the width W and the height H, and the set of all the output feature maps is marked as P11(ii) a To P1All characteristic maps, P in5All feature maps and P in (1)11After the Element-wise Summation operation, 64 feature maps with width W and height H are output, and the set of all the output feature maps is recorded as E2(ii) a Input of the 1 st max pooling layer receives E2The output end of the 1 st maximum pooling layer outputs 64 widthAnd has a height ofThe feature map of (1) is a set of all feature maps output as Z1(ii) a Input of 12 th neural network block receives Z1The output end of the 12 th neural network block outputs 128 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P12(ii) a To P2All characteristic maps, P in6All feature maps and P in (1)12All the feature maps in the table are subjected to Element-wise Summation operation, and 128 pieces of output width are obtained after the Element-wise Summation operationAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as E3(ii) a Input of 2 nd largest pooling layer receives E3The output end of the 2 nd maximum pooling layer outputs 128 pieces of feature maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps output as Z2(ii) a Input of the 13 th neural network block receives Z2The output end of the 13 th neural network block outputs 256 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P13(ii) a To P3All characteristic maps, P in7All feature maps and P in (1)13All the feature maps in the table are subjected to Element-wise Summation operation, and 256 pieces of feature maps with the width of 256 are output after the Element-wise Summation operationAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as E4(ii) a Input of the 3 rd largest pooling layer receives E4The output end of the 3 rd maximum pooling layer outputs 256 width mapsAnd has a height ofThe feature map of (1) is a set of all feature maps output as Z3(ii) a The input of the 14 th neural network block receives Z3The output end of the 14 th neural network block outputs 512 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is 
denoted as P14(ii) a To P4All characteristic maps, P in8All feature maps and P in (1)14All the feature maps in the table are subjected to Element-wise Summation operation, and 512 output images with the width of 512 are output after the Element-wise Summation operationAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as E5(ii) a Input terminal of 4 th max pooling layer receives E5The output end of the 4 th maximum pooling layer outputs 512 widthAnd has a height ofThe feature map of (1) is a set of all feature maps output as Z4(ii) a Input of the 15 th neural network block receives Z4All feature maps in (1), the 15 th neural networkThe output end of the block outputs 1024 pieces of widthAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P15;
For the decoding framework, the input of the 1 st upsampling layer receives P15The output end of the 1 st up-sampling layer outputs 1024 widthAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, denoted as S1(ii) a The input of the 16 th neural network block receives S1The output end of the 16 th neural network block outputs 256 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P16(ii) a The input of the 2 nd up-sampling layer receives P16The output end of the 2 nd up-sampling layer outputs 256 width characteristic mapsAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, denoted as S2(ii) a The input of the 17 th neural network block receives S2The output end of the 17 th neural network block outputs 128 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P17(ii) a The input of the 3 rd up-sampling layer receives P17The output end of the 3 rd up-sampling layer outputs 128 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, denoted as S3(ii) a The input of the 18 th neural network block receives S3The output end of the 18 th neural network block outputs 64 characteristic maps with the width ofAnd has a height ofThe feature map of (1) is a set of all feature maps of (1) output, and is denoted as P18(ii) a The input of the 4 th up-sampling layer receives P18The 4 th up-sampling layer outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as S4(ii) a The input of the 19 th neural network block receives S4The output end of the 19 th neural network block outputs 64 feature maps with the width W and the height H, and the set of all the output feature maps is marked as P19;
for the output layer, the input end of the first convolutional layer receives all the feature maps in P19, and the output end of the first convolutional layer outputs a feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolutional layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the left viewpoint image for training; wherein the width of the saliency image is W and the height is H;
step 1_3: taking the left viewpoint image of each original stereo image in the training set as the left viewpoint image for training and the depth image of each original stereo image in the training set as the depth image for training, inputting them into the convolutional neural network for training, and obtaining the saliency image of each original stereo image in the training set, the saliency image of {I_n(x, y)} being recorded correspondingly; wherein the value of the saliency image at the coordinate position (x, y) is the pixel value of the pixel point at (x, y);
step 1_4: calculating the loss function value between the saliency image of each original stereo image in the training set and its real human eye gazing image, the loss function value between the saliency image of {I_n(x, y)} and its real human eye gazing image being obtained by using the mean square error loss function;
step 1_5: repeatedly executing step 1_3 and step 1_4 a total of V times to obtain a convolutional neural network training model, and obtaining N multiplied by V loss function values; then finding the loss function value with the minimum value among the N multiplied by V loss function values; and then taking the weight vector and the bias term corresponding to that minimum loss function value as the optimal weight vector and the optimal bias term of the convolutional neural network training model, correspondingly marked as W_best and b_best; wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_1: let the stereo image to be tested have width W' and height H', and record its left viewpoint image and its depth image correspondingly; wherein x' is more than or equal to 1 and less than or equal to W', y' is more than or equal to 1 and less than or equal to H', and the pixel values of the pixel points whose coordinate position is (x', y') in the stereo image to be tested, in its left viewpoint image and in its depth image are denoted correspondingly;
step 2_2: inputting the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model, and using W_best and b_best to make a prediction, thereby obtaining the saliency predicted image of the stereo image to be tested; wherein the value of the saliency predicted image at the coordinate position (x', y') is the pixel value of the corresponding pixel point.
2. The method according to claim 1, wherein in step 1_2, the 1 st to 8 th neural network blocks have the same structure and are composed of a first hole convolution layer, a second normalization layer, a second activation layer, a first residual block, a second hole convolution layer, and a third normalization layer, which are sequentially arranged, wherein an input end of the first hole convolution layer is an input end of the neural network block where the first hole convolution layer is located, an input end of the second normalization layer receives all feature maps output by an output end of the first hole convolution layer, an input end of the second activation layer receives all feature maps output by an output end of the second normalization layer, an input end of the first residual block receives all feature maps output by an output end of the second activation layer, and an input end of the second hole convolution layer receives all feature maps output by an output end of the first residual block, the input end of the third batch of normalization layers receives all characteristic graphs output by the output end of the second cavity convolution layer, and the output end of the third batch of normalization layers is the output end of the neural network block where the third batch of normalization layers is located; wherein, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st to 8 th neural network blocks are both 3 × 3 and steps are both 1, the holes are all 2, the fillings are all 2, and the activation modes of the second activation layers in the 1 st to 8 th neural network blocks are all 'ReLU';
the 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3;
the 11th and 12th neural network blocks have the same structure and are composed of a third convolutional layer, a fifth batch normalization layer, a third activation layer, a fourth convolutional layer and a sixth batch normalization layer which are arranged in sequence, the input end of the third convolutional layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all the feature maps output by the output end of the third convolutional layer, the input end of the third activation layer receives all the feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolutional layer receives all the feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all the feature maps output by the output end of the fourth convolutional layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolutional layer and the fourth convolutional layer in the 11th neural network block is 64, the number of convolution kernels of the third convolutional layer and the fourth convolutional layer in the 12th neural network block is 128, the convolution kernel sizes of the third convolutional layer and the fourth convolutional layer in each of the 11th and 12th neural network blocks are both 3 × 3, the steps are both 1, and the paddings are both 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is "ReLU";
the 13 th to 19 th neural network blocks have the same structure, and are composed of a fifth convolution layer, a seventh normalization layer, a fourth activation layer, a sixth convolution layer, an eighth normalization layer, a fifth activation layer, a seventh convolution layer and a ninth normalization layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth normalization layer, the input end of the seventh convolutional layer receives all the characteristic graphs output by the output end of the fifth activation layer, the input end of the ninth normalization layer receives all the characteristic graphs output by the output end of the seventh convolutional layer, and the output end of the ninth normalization layer is the output end of the neural network block where the ninth normalization layer is located; wherein, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 13 th neural network block is 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 14 th neural network block is 512, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 15 th neural network block is 1024, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 16 th neural network block is 512, 512 and 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 17 th neural network block is 256, 256 and 128, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernel in the 18 th neural network block is 128, 128 and 64, the number of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernels in the 19 th neural network block is 64, convolution kernel sizes of a fifth convolution layer, a sixth convolution layer and a seventh convolution layer in each of the 13 th to 19 th neural network blocks are all 3 × 3, steps are all 1, padding is all 1, and activation modes of a fourth activation layer and a fifth activation layer in each of the 13 th to 19 th neural network blocks are all 'ReLU'.
3. The method for detecting visual saliency of stereoscopic images based on a convolutional neural network as claimed in claim 2, wherein in step 1_2, the 1st to 6th downsampling blocks have the same structure and each consists of a second residual block, the input end of the second residual block being the input end of the downsampling block where it is located and the output end of the second residual block being the output end of the downsampling block where it is located.
4. The method according to claim 3, wherein the first residual block and the second residual block have the same structure, and include 3 convolutional layers, 3 batch normalization layers, and 3 active layers, an input of a 1 st convolutional layer is an input of the residual block, an input of a 1 st batch normalization layer receives all feature maps output by an output of the 1 st convolutional layer, an input of a 1 st active layer receives all feature maps output by an output of the 1 st batch normalization layer, an input of a 2 nd convolutional layer receives all feature maps output by an output of the 1 st active layer, an input of a 2 nd batch normalization layer receives all feature maps output by an output of the 2 nd convolutional layer, an input of a 2 nd active layer receives all feature maps output by an output of the 2 nd batch normalization layer, the input end of the 3 rd convolutional layer receives all the feature maps output by the output end of the 2 nd active layer, the input end of the 3 rd batch of normalization layers receives all the feature maps output by the output end of the 3 rd convolutional layer, all the feature maps received by the input end of the 1 st convolutional layer are added with all the feature maps output by the output end of the 3 rd batch of normalization layers, and all the feature maps output by the output end of the 3 rd active layer after passing through the 3 rd active layer are used as all the feature maps output by the output end of the residual block; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 1 × 1 and step length is 1, the sizes of convolution kernels of the 2 nd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 3 × 3, the sizes of convolution kernels are both 1 and step length are 1, and the padding is both 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1 st and 4 th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2 nd and 5 th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3 rd and 6 th downsampling blocks is 256, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 1 × 1 and 1 step, the sizes of convolution kernels of the 2 nd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 3 × 3, the steps are both 2 and 1 filling, and the activation modes of the 3 activation layers are both "ReLU".
5. The method for detecting the visual saliency of stereoscopic images based on convolutional neural network as claimed in any one of claims 1 to 4, wherein in step 1_2, the sizes of the pooling windows of the 1 st to 4 th maximum pooling layers are all 2 x 2 and the steps are all 2.
6. The method for detecting visual saliency of stereoscopic images based on a convolutional neural network as claimed in claim 5, wherein in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are all bilinear interpolation, and the scaling factor is all 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910327556.4A CN110175986B (en) | 2019-04-23 | 2019-04-23 | Stereo image visual saliency detection method based on convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910327556.4A CN110175986B (en) | 2019-04-23 | 2019-04-23 | Stereo image visual saliency detection method based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110175986A true CN110175986A (en) | 2019-08-27 |
CN110175986B CN110175986B (en) | 2021-01-08 |
Family
ID=67689881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910327556.4A Active CN110175986B (en) | 2019-04-23 | 2019-04-23 | Stereo image visual saliency detection method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175986B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555434A (en) * | 2019-09-03 | 2019-12-10 | 浙江科技学院 | method for detecting visual saliency of three-dimensional image through local contrast and global guidance |
CN110782458A (en) * | 2019-10-23 | 2020-02-11 | 浙江科技学院 | Object image 3D semantic prediction segmentation method of asymmetric coding network |
CN111369506A (en) * | 2020-02-26 | 2020-07-03 | 四川大学 | Lens turbidity grading method based on eye B-ultrasonic image |
CN111582316A (en) * | 2020-04-10 | 2020-08-25 | 天津大学 | RGB-D significance target detection method |
CN111612832A (en) * | 2020-04-29 | 2020-09-01 | 杭州电子科技大学 | Method for improving depth estimation accuracy by utilizing multitask complementation |
CN112528900A (en) * | 2020-12-17 | 2021-03-19 | 南开大学 | Image salient object detection method and system based on extreme down-sampling |
CN112528899A (en) * | 2020-12-17 | 2021-03-19 | 南开大学 | Image salient object detection method and system based on implicit depth information recovery |
WO2021096806A1 (en) * | 2019-11-14 | 2021-05-20 | Zoox, Inc | Depth data model training with upsampling, losses, and loss balancing |
CN113192073A (en) * | 2021-04-06 | 2021-07-30 | 浙江科技学院 | Clothing semantic segmentation method based on cross fusion network |
US11157774B2 (en) * | 2019-11-14 | 2021-10-26 | Zoox, Inc. | Depth data model training with upsampling, losses, and loss balancing |
CN113592795A (en) * | 2021-07-19 | 2021-11-02 | 深圳大学 | Visual saliency detection method of stereoscopic image, thumbnail generation method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106462771A (en) * | 2016-08-05 | 2017-02-22 | 深圳大学 | 3D image significance detection method |
CN106778687A (en) * | 2017-01-16 | 2017-05-31 | 大连理工大学 | Method for viewing points detecting based on local evaluation and global optimization |
US20170351941A1 (en) * | 2016-06-03 | 2017-12-07 | Miovision Technologies Incorporated | System and Method for Performing Saliency Detection Using Deep Active Contours |
CN109146944A (en) * | 2018-10-30 | 2019-01-04 | 浙江科技学院 | A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth |
CN109376611A (en) * | 2018-09-27 | 2019-02-22 | 方玉明 | A kind of saliency detection method based on 3D convolutional neural networks |
CN109598268A (en) * | 2018-11-23 | 2019-04-09 | 安徽大学 | A kind of RGB-D well-marked target detection method based on single flow depth degree network |
CN109635822A (en) * | 2018-12-07 | 2019-04-16 | 浙江科技学院 | The significant extracting method of stereo-picture vision based on deep learning coding and decoding network |
-
2019
- 2019-04-23 CN CN201910327556.4A patent/CN110175986B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170351941A1 (en) * | 2016-06-03 | 2017-12-07 | Miovision Technologies Incorporated | System and Method for Performing Saliency Detection Using Deep Active Contours |
CN106462771A (en) * | 2016-08-05 | 2017-02-22 | 深圳大学 | 3D image significance detection method |
CN106778687A (en) * | 2017-01-16 | 2017-05-31 | 大连理工大学 | Method for viewing points detecting based on local evaluation and global optimization |
CN109376611A (en) * | 2018-09-27 | 2019-02-22 | 方玉明 | A kind of saliency detection method based on 3D convolutional neural networks |
CN109146944A (en) * | 2018-10-30 | 2019-01-04 | 浙江科技学院 | A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth |
CN109598268A (en) * | 2018-11-23 | 2019-04-09 | 安徽大学 | A kind of RGB-D well-marked target detection method based on single flow depth degree network |
CN109635822A (en) * | 2018-12-07 | 2019-04-16 | 浙江科技学院 | The significant extracting method of stereo-picture vision based on deep learning coding and decoding network |
Non-Patent Citations (3)
Title |
---|
CHEN, HAO 等: "RGB-D Saliency Detection by Multi-stream Late Fusion Network", 《COMPUTER VISION SYSTEMS》 * |
XINGYU CAI 等: "Saliency detection for stereoscopic 3D images in the quaternion frequency domain", 《3D RESEARCH》 * |
李荣 等: "利用卷积神经网络的显著性区域预测方法", 《重庆邮电大学学报( 自然科学版)》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555434A (en) * | 2019-09-03 | 2019-12-10 | 浙江科技学院 | method for detecting visual saliency of three-dimensional image through local contrast and global guidance |
CN110555434B (en) * | 2019-09-03 | 2022-03-29 | 浙江科技学院 | Method for detecting visual saliency of three-dimensional image through local contrast and global guidance |
CN110782458A (en) * | 2019-10-23 | 2020-02-11 | 浙江科技学院 | Object image 3D semantic prediction segmentation method of asymmetric coding network |
CN110782458B (en) * | 2019-10-23 | 2022-05-31 | 浙江科技学院 | Object image 3D semantic prediction segmentation method of asymmetric coding network |
US11681046B2 (en) | 2019-11-14 | 2023-06-20 | Zoox, Inc. | Depth data model training with upsampling, losses and loss balancing |
WO2021096806A1 (en) * | 2019-11-14 | 2021-05-20 | Zoox, Inc | Depth data model training with upsampling, losses, and loss balancing |
US11157774B2 (en) * | 2019-11-14 | 2021-10-26 | Zoox, Inc. | Depth data model training with upsampling, losses, and loss balancing |
CN111369506A (en) * | 2020-02-26 | 2020-07-03 | 四川大学 | Lens turbidity grading method based on eye B-ultrasonic image |
CN111582316A (en) * | 2020-04-10 | 2020-08-25 | 天津大学 | RGB-D significance target detection method |
CN111582316B (en) * | 2020-04-10 | 2022-06-28 | 天津大学 | RGB-D significance target detection method |
CN111612832A (en) * | 2020-04-29 | 2020-09-01 | 杭州电子科技大学 | Method for improving depth estimation accuracy by utilizing multitask complementation |
CN111612832B (en) * | 2020-04-29 | 2023-04-18 | 杭州电子科技大学 | Method for improving depth estimation accuracy by utilizing multitask complementation |
CN112528899B (en) * | 2020-12-17 | 2022-04-12 | 南开大学 | Image salient object detection method and system based on implicit depth information recovery |
CN112528900B (en) * | 2020-12-17 | 2022-09-16 | 南开大学 | Image salient object detection method and system based on extreme down-sampling |
CN112528899A (en) * | 2020-12-17 | 2021-03-19 | 南开大学 | Image salient object detection method and system based on implicit depth information recovery |
CN112528900A (en) * | 2020-12-17 | 2021-03-19 | 南开大学 | Image salient object detection method and system based on extreme down-sampling |
CN113192073A (en) * | 2021-04-06 | 2021-07-30 | 浙江科技学院 | Clothing semantic segmentation method based on cross fusion network |
CN113592795A (en) * | 2021-07-19 | 2021-11-02 | 深圳大学 | Visual saliency detection method of stereoscopic image, thumbnail generation method and device |
CN113592795B (en) * | 2021-07-19 | 2024-04-12 | 深圳大学 | Visual saliency detection method for stereoscopic image, thumbnail generation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110175986B (en) | 2021-01-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110175986B (en) | Stereo image visual saliency detection method based on convolutional neural network | |
CN110555434B (en) | Method for detecting visual saliency of three-dimensional image through local contrast and global guidance | |
CN108520535B (en) | Object classification method based on depth recovery information | |
CN109615582B (en) | Face image super-resolution reconstruction method for generating countermeasure network based on attribute description | |
CN108520503B (en) | Face defect image restoration method based on self-encoder and generation countermeasure network | |
CN110032926B (en) | Video classification method and device based on deep learning | |
CN107977932B (en) | Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network | |
CN110059728B (en) | RGB-D image visual saliency detection method based on attention model | |
CN107154023B (en) | Based on the face super-resolution reconstruction method for generating confrontation network and sub-pix convolution | |
CN110619638A (en) | Multi-mode fusion significance detection method based on convolution block attention module | |
CN110689599B (en) | 3D visual saliency prediction method based on non-local enhancement generation countermeasure network | |
CN111563418A (en) | Asymmetric multi-mode fusion significance detection method based on attention mechanism | |
CN112734915A (en) | Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning | |
CN110210492B (en) | Stereo image visual saliency detection method based on deep learning | |
CN110929736A (en) | Multi-feature cascade RGB-D significance target detection method | |
CN116309648A (en) | Medical image segmentation model construction method based on multi-attention fusion | |
CN110705566B (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
Wei et al. | Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo | |
CN112149662A (en) | Multi-mode fusion significance detection method based on expansion volume block | |
CN110458178A (en) | The multi-modal RGB-D conspicuousness object detection method spliced more | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN111882516B (en) | Image quality evaluation method based on visual saliency and deep neural network | |
Luo et al. | Bi-GANs-ST for perceptual image super-resolution | |
CN113066074A (en) | Visual saliency prediction method based on binocular parallax offset fusion | |
CN107909565A (en) | Stereo-picture Comfort Evaluation method based on convolutional neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |