CN113963182A

CN113963182A - Hyperspectral image classification method based on a multi-scale dilated convolution attention network

Info

Publication number: CN113963182A
Application number: CN202111230835.2A
Authority: CN
Inventors: 杨琪
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2021-10-22
Filing date: 2021-10-22
Publication date: 2022-01-21

Abstract

The invention discloses a hyperspectral image classification method based on a multi-scale dilated convolution attention network. The method efficiently extracts features from hyperspectral images and markedly improves the accuracy of ground-object classification, with clear advantages over traditional models such as 2D-CNN and 3D-CNN in remote sensing image classification. Traditional hyperspectral image classification algorithms (such as 2D-CNN and 3D-CNN) cannot make full use of the effective spectral-spatial information and instead pick up redundant, interfering information during feature extraction. By introducing a shallow feature pre-extraction module, multi-scale filters, dilated convolution and an attention module, the multi-scale dilated convolution attention network model obtains richer spatial-spectral discriminative features, focuses the attention of the feature map on features that carry a large amount of useful information, and provides better pixel-level attention for high-level features, thereby greatly improving the classification accuracy of the model.

Description

Hyperspectral image classification method based on a multi-scale dilated convolution attention network

Technical Field

The invention relates to the field of hyperspectral remote sensing image classification, and in particular to a remote sensing image segmentation algorithm based on a multi-scale dilated convolution attention network.

Background

In recent years, deep learning has attracted the attention of a large number of researchers. Deep learning is a branch of machine learning. With growing computing power, better-optimized algorithms and larger volumes of available data, the performance of deep learning methods has improved greatly, and they are now widely applied in many fields. Deep learning has also shown high effectiveness and strong robustness in computer vision tasks such as image classification and image segmentation, which has triggered a surge of research in hyperspectral image classification (HSIC). Stacked autoencoders (SAEs), deep belief networks (DBNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and generative adversarial networks (GANs) are increasingly used in the HSIC field.

Convolutional neural networks (CNNs) use the information of a spatial pixel and its neighborhood to extract features effectively from the raw data and abstract them at a high level, which yields excellent classification results. Many 2D-CNN-based hyperspectral classification algorithms have therefore been proposed for HSIC tasks, improving classification performance by making better use of the spatial information of the hyperspectral image (HSI). However, as the depth of a 2D-CNN increases, the network suffers from degradation, which affects the classification results to some extent. Deeper convolutional neural network models with better-optimized structures, ResNet and DenseNet, were therefore proposed. These networks effectively alleviate overfitting, have attracted wide attention in the image classification field, and are among the most widely studied and applied models in HSI classification today. Song et al. studied 2D-CNN feature fusion based on residual learning and built a network with better discriminative ability for the hyperspectral data classification task.

However, a 2D-CNN loses effective spectral information when classifying hyperspectral remote sensing images, which can harm classification performance in some complex scenes, so researchers have tried applying 3D-CNNs to HSIC. The SSRN algorithm proposed by Zilong Zhong et al. combines a residual network with a 3D-CNN, feeds HSI images directly into the SSRN without any preprocessing, and extracts spatial-spectral features jointly for classification, achieving good results. Mou et al. proposed a fully conv-deconv network based on residual learning, aiming at end-to-end unsupervised spectral-spatial feature learning.

In addition to overfitting, HSIC also faces a shortage of labeled samples. Whether 2D-CNN or 3D-CNN, a basic CNN model usually learns the intrinsic features of the patterns directly when exploring an HSI in the spectral-spatial domain, and the features produced by its convolutional layers may contain a great deal of useless or interfering information. How to process the feature maps after the convolutional layers and focus attention on the features that carry a large amount of useful information is therefore another key to the HSI classification task.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a hyperspectral image classification method based on a multi-scale dilated convolution attention network that overcomes the shortcomings of the prior art.

Technical solution: the hyperspectral image classification method based on a multi-scale dilated convolution attention network disclosed by the invention comprises the following steps:

S1: Reduce the spectral dimension of the hyperspectral image using principal component analysis (PCA). In the experiments, PCA compresses the spectral dimension of the original data set to 3, which greatly reduces the data volume while retaining the effective spectral information, thereby shortening the training time. Standard-deviation normalization is then applied to the dimension-reduced data, as shown in formula (1):

$$x' = \frac{x - \mu}{\sigma} \tag{1}$$

where x' is the output value after standard-deviation normalization, x is the input data value after dimensionality reduction, μ is the sample mean of the data set, and σ is the standard deviation.
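Step S1 can be illustrated with a minimal NumPy sketch. The function names `pca_reduce` and `zscore`, the SVD-based PCA, the synthetic cube, and the use of a single global mean and standard deviation are all assumptions made for illustration; the patent does not state how PCA is computed or whether the statistics are taken per component.

```python
import numpy as np

def pca_reduce(cube, n_components=3):
    """Project an (H, W, B) hyperspectral cube onto its top principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)   # one row per pixel (astype copies)
    flat -= flat.mean(axis=0)                       # centre each spectral band
    # SVD of the centred data: the rows of vt are the principal axes
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:n_components].T
    return reduced.reshape(h, w, n_components)

def zscore(data):
    """Standard-deviation normalization: x' = (x - mu) / sigma.

    Assumption: mu and sigma are computed over the whole reduced data set.
    """
    mu = data.mean()
    sigma = data.std()
    return (data - mu) / sigma

rng = np.random.default_rng(0)
cube = rng.random((8, 8, 20))          # synthetic stand-in for a hyperspectral image
normalized = zscore(pca_reduce(cube))  # shape (8, 8, 3), zero mean, unit variance
```

After this step the normalized data has three channels per pixel, matching the 3-dimensional spectral compression described in S1.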

S2: Each pixel of the normalized data is treated as a pixel to be classified, and an image block (Patch) of the corresponding size, centred on that pixel, is extracted. The image blocks are fed to the multi-scale dilated convolution attention network model, which improves on the traditional multi-scale CNN model and comprises near-remote feature extraction modules and a channel attention module. The model architecture is shown in FIG. 1.

The near-remote feature extraction module comprises a shallow feature pre-extraction module and a multi-scale feature extraction module, as shown in FIG. 2. The shallow feature pre-extraction module is a set of convolutions with kernel sizes of 1 × 1, 3 × 3 and 5 × 5, and each convolutional layer is followed by a batch normalization (BN) layer and a ReLU activation function to accelerate training and introduce nonlinearity. Balancing computational cost against classification accuracy, the multi-scale feature extraction module stacks 4 filters of different sizes to obtain rich deep feature information from the image. Ordinary convolutions with kernel sizes of 1 × 1 and 3 × 3 capture short-range feature information, dilated convolutions with kernel sizes of 5 × 5 (dilation rate 2) and 7 × 7 (dilation rate 3) capture long-range feature information, and the results are combined by feature fusion. The near-remote feature extraction module makes full use of the feature maps extracted by the convolutional layers at every level, avoiding loss of effective information as far as possible, and because dilated convolution is used, the number of trainable parameters is reduced and the whole model is lighter.

The method uses 3 near-remote feature extraction modules, with 32, 64 and 128 convolution kernels in sequence. Borrowing the idea of the transition layer in DenseNet, a 1 × 1 convolutional layer is added between adjacent feature maps to connect the feature extraction layers, which greatly reduces the computation of the network while preserving the complete transfer of feature information. An average pooling layer with a 2 × 2 window is added after the 1 × 1 convolutional layer to suppress overfitting and keep the number of trainable parameters low. Finally, the feature map produced by the last near-remote feature extraction module is fed to the channel attention module, which reconstructs the feature information so that the feature information of each pixel in the feature map is more discriminative.
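The relation between dilation rate and coverage can be checked with a few lines of arithmetic. The sketch below rests on an assumption: that the 5 × 5 (rate 2) and 7 × 7 (rate 3) sizes in the text are the effective extents of 3 × 3 kernels, using the standard relation k_eff = k + (k - 1)(d - 1). The text does not confirm this reading, and the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)

def conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a 2-D convolution (bias excluded); dilation does not change this."""
    return kernel_size * kernel_size * in_channels * out_channels

# Extent of a 3 x 3 kernel at dilation rates 1, 2 and 3
extents = [effective_kernel_size(3, rate) for rate in (1, 2, 3)]

# Weights of a 3 x 3 kernel versus a dense 7 x 7 kernel, for 32 -> 32 channels
dilated_weights = conv_weights(3, 32, 32)
dense_weights = conv_weights(7, 32, 32)
```

Under this reading, a dilated 3 × 3 kernel covers the same 7 × 7 window as a dense kernel with roughly one fifth of the weights, which is consistent with the statement that dilated convolution makes the model lighter.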

S3: Train the model by feeding the training samples into it. During training, a dropout method randomly drops part of the neurons in the fully connected layer (drop probability 0.5). This effectively suppresses overfitting, provides a degree of regularization, and strengthens control over the output of the activation function.
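The dropout step in S3 can be sketched in NumPy as follows. The function name `dropout` and the inverted-dropout rescaling by 1 / (1 - p) are assumptions for illustration; the patent only gives the drop probability of 0.5.

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout: zero units with probability drop_prob and rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations                        # inference: pass through unchanged
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescale kept units so the expected activation is unchanged (assumed convention)
    return activations * keep_mask / (1.0 - drop_prob)

x = np.ones((4, 128))                             # stand-in for fully connected outputs
train_out = dropout(x, drop_prob=0.5, rng=np.random.default_rng(0))
test_out = dropout(x, training=False)
```

With a drop probability of 0.5, each unit in `train_out` is either zeroed or doubled, while `test_out` is identical to the input.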

S4: After the multi-scale dilated convolution attention network has been trained on the training samples, the test samples are fed into the network for classification. In the method, a BN layer and a ReLU activation function follow each convolutional layer, and the final label is produced by a Softmax classifier. The model uses cross entropy as the loss function and SGD with momentum as the optimization algorithm so that the loss converges quickly toward its minimum.
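The Softmax classifier, cross-entropy loss and SGD-with-momentum update named in S4 can be sketched in plain Python. The toy problem (optimizing three logits directly), the momentum coefficient of 0.9 and the helper names are assumptions; only the learning rate of 0.01 comes from the experiments described in this document.

```python
import math

def softmax(logits):
    """Convert logits to class probabilities."""
    peak = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return -math.log(probs[label])

def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: v <- m*v - lr*g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy problem: learn the logits directly so that class 2 becomes the most probable
logits, velocity, label = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 2
initial_loss = cross_entropy(softmax(logits), label)
for _ in range(200):
    probs = softmax(logits)
    # Gradient of cross entropy w.r.t. the logits is (probs - one_hot(label))
    grads = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    logits, velocity = sgd_momentum_step(logits, grads, velocity)
final_loss = cross_entropy(softmax(logits), label)
```

In a real network the same update would be applied to every trainable weight, with gradients obtained by backpropagation rather than the closed form used here.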

Advantageous effects:

The feature dimension is effectively reduced, features that carry useful information for the classification task are adaptively enhanced, and rich, effective feature information is extracted. Classification performance is better with small samples, and the classification ability and accuracy of the model are clearly superior to those of the traditional 2D-CNN and 3D-CNN models.

Drawings

FIG. 1 shows the architecture of the multi-scale dilated convolution attention network (PMACNN);

FIG. 2 is a schematic diagram of a near-remote feature extraction module;

FIG. 3 is a diagram illustrating the classification results of three data sets with different initial learning rates (lr);

FIG. 4 is a diagram illustrating the classification results of three data sets with different Patch sizes;

In FIG. 5, 5a is the ground-object gray-scale map of the Indian Pines data set; 5b is the classification result map of the SVM on the Indian Pines data set; 5c is the classification result map of the 2D-CNN on the Indian Pines data set; 5d is the classification result map of the 3D-CNN on the Indian Pines data set; 5e is the classification result map of HybridSN on the Indian Pines data set; 5f is the classification result map of the RSSAN on the Indian Pines data set; 5g is the classification result map of the PMACNN on the Indian Pines data set;

In FIG. 6, 6a is the ground-object gray-scale map of the Pavia University data set; 6b is the classification result map of the SVM on the Pavia University data set; 6c is the classification result map of the 2D-CNN on the Pavia University data set; 6d is the classification result map of the 3D-CNN on the Pavia University data set; 6e is the classification result map of HybridSN on the Pavia University data set; 6f is the classification result map of the RSSAN on the Pavia University data set; 6g is the classification result map of the PMACNN on the Pavia University data set;

In FIG. 7, 7a is the ground-object gray-scale map of the Salinas data set; 7b is the classification result map of the SVM on the Salinas data set; 7c is the classification result map of the 2D-CNN on the Salinas data set; 7d is the classification result map of the 3D-CNN on the Salinas data set; 7e is the classification result map of HybridSN on the Salinas data set; 7f is the classification result map of the RSSAN on the Salinas data set; 7g is the classification result map of the PMACNN on the Salinas data set;

In FIG. 8, 8a shows the influence of the number of training samples on the classification result for the Indian Pines data set; 8b shows the influence of the number of training samples on the classification result for the Pavia University data set; 8c shows the influence of the number of training samples on the classification result for the Salinas data set.

Detailed Description

The technical solution of the present invention will be further described with reference to the following detailed description and the accompanying drawings.

This embodiment discloses a hyperspectral image classification method based on a multi-scale dilated convolution attention network, comprising steps S1 to S4 as set out above.

The experimental simulation process and results are described below.

1. Experimental images

The experiments use three hyperspectral remote sensing data sets: Indian Pines, Pavia University and Salinas. The test environment is an Intel Core i5-7200U CPU, a 64-bit Windows 10 operating system and a Tesla T4 GPU. The experiments were implemented in the Keras deep learning framework with Python 3.6.6.

2. Procedure of experiment

In the experiments, three data sets were used: Indian Pines (IN), Pavia University (PU) and Salinas (SA). Table 1 gives the basic information for the three data sets.

TABLE 1 data set characterization

Tables 2-1, 2-2 and 2-3 show the numbers of training samples and test samples for each data set. For the Indian Pines data set, we selected 10% of the total samples as training samples and 90% as test samples. For the Pavia University and Salinas data sets, we selected 5% of the total samples as training samples and 95% as test samples.
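A random training/test split of this kind can be sketched in plain Python. The per-class (stratified) sampling, the rule of keeping at least one training sample per class, the function name `stratified_split` and the synthetic labels are assumptions; the text only states the overall 10% and 5% training fractions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction of each class at random."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: keep at least one training sample for every class
        n_train = max(1, round(train_fraction * len(indices)))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return train_idx, test_idx

labels = [0] * 100 + [1] * 50 + [2] * 20       # synthetic class labels, not real data
train_idx, test_idx = stratified_split(labels, train_fraction=0.10)
```

Sampling within each class keeps rare classes represented in the training set, which matters for data sets such as Indian Pines where some classes have very few labeled pixels.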

TABLE 2-1 training set and test set counts for Indian Pines data sets

TABLE 2-2 training set and test set quantities for the University of Pavia dataset

TABLE 2-3 training set and test set counts for Salinas dataset

For experimental parameter selection, several factors that influence network training and final classification performance were analyzed: the learning rate, the Patch size and the number of training samples. First, experiments were run with initial learning rates of {0.01, 0.03, 0.001, 0.003, 0.0001, 0.0003}. The results are shown in FIG. 3: the classification effect is best when the initial learning rate lr is 0.01, so lr = 0.01 is used for the experiments on all three data sets. Next, experiments were run with Patch sizes of 5, 11, 17, 21 and 25. The results are shown in FIG. 4: the overall classification accuracy on the three hyperspectral image data sets rises and falls as the Patch size increases, and the classification effect is best when the Patch size is 25. The comparison experiments on the three data sets therefore use an input Patch size of 25.
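Patch extraction around a pixel can be sketched in NumPy as follows. The zero padding at the image borders and the function name `extract_patch` are assumptions; the text does not say how edge pixels are handled. The sketch expects an odd patch size, as in the sizes tested above.

```python
import numpy as np

def extract_patch(image, row, col, patch_size):
    """Return the patch_size x patch_size neighbourhood centred on (row, col)."""
    margin = patch_size // 2
    # Assumption: zero-pad the borders so edge pixels also get full-size patches
    padded = np.pad(image, ((margin, margin), (margin, margin), (0, 0)), mode="constant")
    # Pixel (row, col) sits at (row + margin, col + margin) in the padded image
    return padded[row:row + patch_size, col:col + patch_size, :]

image = np.arange(10 * 10 * 3, dtype=np.float32).reshape(10, 10, 3)  # synthetic 3-band image
patch = extract_patch(image, row=0, col=4, patch_size=5)
```

A larger patch gives the network more spatial context per pixel at the cost of more computation per sample, which is the trade-off explored in the Patch size experiment.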

To highlight the advantages of the method, we compare its quantitative classification results with five other typical methods: a support vector machine (SVM), a 2D-CNN (to demonstrate the superiority of the network designed here, the 2D-CNN structure is the same as the PMACNN network except that no channel attention module is added), a 3D-CNN, HybridSN and RSSAN. To keep the comparison fair, the input Patch sizes and parameter choices of the 2D-CNN, 3D-CNN, HybridSN and RSSAN are kept the same as those of the multi-scale dilated convolution attention network (PMACNN). Other settings follow the related literature. On the three hyperspectral images, we tested the performance of the methods with the training sample size fixed.

In the comparative experiment on Indian Pines, 10% of the samples were randomly selected for training and the remaining 90% were used for testing. FIG. 5 shows the classification maps of the 6 methods and the ground-object gray-scale map (GT, ground truth). FIG. 5 clearly shows that the SVM classification map is the worst and contains a lot of noise: the SVM is a shallow classification model with poor generalization ability, which is not enough to cope with the complex spectral-spatial distribution of a hyperspectral image. Deep learning algorithms such as PMACNN, RSSAN, 3D-CNN and 2D-CNN all outperform the traditional SVM algorithm, because the SVM can extract only spatial features and cannot combine spectral features with spatial features, so the features it extracts from the image are insufficient. The 2D-CNN and 3D-CNN extract spatial-spectral feature information only from the spatial dimensions, and the extracted features have weak expressive power, so their classification accuracy is not high. HybridSN can extract the spatial-spectral and spectral feature information of the image at the same time and achieves a better classification effect, but its network is computationally expensive and treats all spectral feature information of the HSI uniformly. Different spectra contribute differently to different ground objects, which weakens the classification performance of HybridSN to some extent. RSSAN adds an attention mechanism to SSRN, which improves the model to some extent.

Compared with RSSAN, PMACNN combines ordinary and dilated convolution in its feature extraction network to capture near-remote information of the spatial features, uses convolution kernels of different sizes to extract different features from the image, and fuses these features along the spectral dimension to obtain a multi-level feature map rich in features. The final channel attention module strengthens the representation of useful feature information, so the model as a whole shows a clear advantage, and the experiments show that it gives the best classification effect on the HSIC task among the compared methods. As Table 3 shows, compared with RSSAN, a current leading classification model in the field, PMACNN improves OA, AA and Kappa by 0.66%, 0.61% and 9.87%, respectively. PMACNN is thus the better-performing model.
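The three evaluation indexes can be computed from a confusion matrix, as in the plain-Python sketch below. It uses the standard definitions of overall accuracy (OA), average accuracy (AA, the mean per-class recall) and Cohen's kappa; the patent does not spell these out, and the function name and confusion matrix are made up for illustration and are not results from the experiments.

```python
def classification_metrics(confusion):
    """Compute OA, AA and Cohen's kappa from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    oa = correct / total
    # AA: mean of the per-class recall values
    aa = sum(confusion[i][i] / sum(confusion[i]) for i in range(n_classes)) / n_classes
    # Chance agreement from the row (true) and column (predicted) marginals
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

confusion = [          # illustrative counts only
    [50, 2, 0],
    [3, 40, 5],
    [0, 1, 19],
]
oa, aa, kappa = classification_metrics(confusion)
```

OA weights every sample equally, AA weights every class equally, and kappa discounts the agreement expected by chance, so the three indexes can move differently on imbalanced data sets.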

TABLE 3 results of six methods on the Indian Pines data set (%)

In the comparative experiments on the Pavia University and Salinas data sets, 5% of the samples were randomly selected for training in each case, and the remaining 95% were used for testing. FIG. 6 and FIG. 7 show the classification maps obtained by the different classification methods and the ground-object gray-scale maps of the two data sets, and Table 4 and Table 5 give the quantitative results of the methods. In these two comparison tests, the OA of the method (PMACNN) on the PU and SA data sets reached 99.73% and 99.97%, respectively, the highest among all compared methods. Overall, PMACNN achieves the best accuracy on all three indexes on the IN, PU and SA data sets.

TABLE 4 results of six method classifications on the University of Pavia dataset (%)

TABLE 5 Salinas data set results of six methods of classification (%)

To verify the performance of the method with small samples, 5%, 7%, 10% and 15% of the samples of the IN data set were selected as training sets, with the remaining samples used as test sets. For the PU and SA data sets, 0.5%, 1%, 2% and 5% of the sample data were randomly selected as training sets. The experimental results on the IN, PU and SA data sets are shown in FIG. 8(a), (b) and (c), respectively.

As the figure shows, on all three data sets the classification accuracy of the different methods tends to increase as the number of training samples increases. Once there are enough training samples, the accuracy rises more slowly and the classification results stabilize. FIG. 8 also shows that PMACNN gives the best classification with small samples.

In addition, to verify the effect of the shallow feature pre-extraction module in the method, a comparison experiment was carried out with four configurations: (a) a 3x3 convolution; (b) a 1x1 convolution; (c) a combination of 1x1, 3x3 and 5x5 convolution kernels (the module adopted here); and (d) no shallow feature pre-extraction module. As Table 6 shows, with configuration (c) as the shallow feature pre-extraction module, the multi-scale feature extraction module works best, and the combination of the two gives the best classification performance of the model.

TABLE 6 influence of different shallow feature pre-extraction modules on classification results

To verify the effectiveness of the channel attention module in the method, we added the attention module at different positions in the model (after each of the three near-remote feature extraction modules in turn) and recorded the classification results, which are shown in Table 7.

TABLE 7 influence of different position channel attention modules on classification results

It can be seen that the model with the attention module classifies better than the model without it, and that the result is best when the attention module is added after the last feature extraction module. After the image has passed through the third feature extraction module, the network has extracted more similar features from the image and the feature expression is more complete. Adding the attention module at this point focuses the attention of the feature map on features that carry a large amount of useful information and provides better pixel-level attention for high-level features, which effectively improves the classification accuracy of the algorithm.

Claims

1. A hyperspectral image classification method based on a multi-scale dilated convolution attention network, characterized in that the method comprises the following steps:

S1: reducing the spectral dimension of the hyperspectral image using principal component analysis (PCA), wherein in the experiments PCA compresses the spectral dimension of the original data set to 3, greatly reducing the data volume while retaining the effective spectral information so as to shorten the training time, and then applying standard-deviation normalization to the dimension-reduced data, as shown in formula (1):

$$x' = \frac{x - \mu}{\sigma} \tag{1}$$

where x' is the output value after standard-deviation normalization, x is the input data value after dimensionality reduction, μ is the sample mean of the data set, and σ is the standard deviation;

S2: treating each pixel of the normalized data as a pixel to be classified and extracting an image block (Patch) of the corresponding size centred on that pixel, the image blocks being fed to the multi-scale dilated convolution attention network model, which improves on the traditional multi-scale CNN model and comprises near-remote feature extraction modules and a channel attention module, the model architecture being shown in FIG. 1; the near-remote feature extraction module comprises a shallow feature pre-extraction module and a multi-scale feature extraction module, as shown in FIG. 2; the shallow feature pre-extraction module is a set of convolutions with kernel sizes of 1 × 1, 3 × 3 and 5 × 5, each convolutional layer being followed by a batch normalization (BN) layer and a ReLU activation function to accelerate training and introduce nonlinearity; balancing computational cost against classification accuracy, the multi-scale feature extraction module stacks 4 filters of different sizes to obtain rich deep feature information from the image, wherein ordinary convolutions with kernel sizes of 1 × 1 and 3 × 3 capture short-range feature information, dilated convolutions with kernel sizes of 5 × 5 (dilation rate 2) and 7 × 7 (dilation rate 3) capture long-range feature information, and the results are combined by feature fusion; the near-remote feature extraction module makes full use of the feature maps extracted by the convolutional layers at every level, avoiding loss of effective information as far as possible, and the use of dilated convolution reduces the number of trainable parameters so that the whole model is lighter; the method uses 3 near-remote feature extraction modules with 32, 64 and 128 convolution kernels in sequence and, borrowing the idea of the transition layer in DenseNet, adds a 1 × 1 convolutional layer between adjacent feature maps to connect the feature extraction layers, which greatly reduces the computation of the network while preserving the complete transfer of feature information; an average pooling layer with a 2 × 2 window is added after the 1 × 1 convolutional layer to suppress overfitting and keep the number of trainable parameters low; and finally the feature map produced by the last near-remote feature extraction module is fed to the channel attention module, which reconstructs the feature information so that the feature information of each pixel in the feature map is more discriminative;

S3: training the model by feeding the training samples into it, wherein during training a dropout method randomly drops part of the neurons in the fully connected layer (drop probability 0.5), which effectively suppresses overfitting, provides a degree of regularization, and strengthens control over the output of the activation function;

S4: after the multi-scale dilated convolution attention network has been trained on the training samples, feeding the test samples into the network for classification, wherein a BN layer and a ReLU activation function follow each convolutional layer, the final label is produced by a Softmax classifier, cross entropy is used as the loss function of the model, and SGD with momentum is used as the optimization algorithm so that the loss converges quickly toward its minimum.