CN114022748B - Target identification method, apparatus, device, and storage medium - Google Patents
Target identification method, apparatus, device, and storage medium
- Publication number: CN114022748B
- Application number: CN202210007427.9A
- Authority: CN (China)
- Prior art keywords
- network
- image
- feature
- image features
- pyramid
- Prior art date
- Legal status: Active
Classifications
- G06F18/253: Pattern recognition; Analysing; Fusion techniques of extracted features
- G06F18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045: Neural networks; Architecture; Combinations of networks
Abstract
The disclosure provides a target identification method, apparatus, device, and storage medium, applied in the field of image processing. The image recognition network used for target recognition comprises a feature extraction network, a pooling network, a dilated convolution network and a path aggregation network, wherein the path aggregation network comprises a feature pyramid network and a memory network for storing image features. The method comprises the following steps: in the image recognition network, feature extraction, pooling, dilated convolution and feature aggregation are performed in sequence on an image to be recognized through the feature extraction network, the pooling network, the dilated convolution network and the feature pyramid network, so as to obtain a recognition result for a target object, where the image features stored in the memory network are used during feature aggregation. By introducing a dilated convolution network, which extracts global image information while reducing the computational load of the model, and a memory network, which stores image information, the method balances image recognition accuracy against image recognition speed.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a target identification method, apparatus, device, and storage medium.
Background
With the development of image processing technology, deep learning networks are increasingly applied to various aspects of image processing, including the identification of objects on images.
Taking coal gangue identification as an example, traditional manual identification is inefficient and unsafe, its accuracy varies from person to person, and it cannot scale to large volumes, so a deep learning network can be used instead to identify coal gangue and overcome these problems. However, in the field of image recognition, a deep learning network with a complex structure has a low recognition speed and a large computation cost, and cannot meet the requirements of fast or real-time recognition.
It can be seen that current deep learning networks struggle to balance recognition accuracy and recognition speed.
Disclosure of Invention
The present disclosure provides a target identification method, apparatus, device, and storage medium to address the difficulty deep learning networks have in balancing recognition accuracy and speed.
In a first aspect, the present disclosure provides a target identification method, where an image identification network includes a feature extraction network, a pooling network, a dilated convolution network, and a path aggregation network, where the feature extraction network, the pooling network, the dilated convolution network, and the path aggregation network are connected in sequence, and the path aggregation network includes a feature pyramid network and a memory network for storing image features;
the target identification method comprises the following steps:
acquiring an image to be identified;
extracting the features of the image through the feature extraction network;
pooling image features from the feature extraction network through the pooling network;
performing dilated convolution processing on the pooled image features through the dilated convolution network;
performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network;
and obtaining the recognition result of the target object in the image according to the image features from the path aggregation network.
In one possible implementation, the feature extraction network includes a two-channel network and a shared network, the two-channel network including a first network and a second network;
the performing feature extraction on the image through the feature extraction network includes:
determining a gray scale map corresponding to the image;
performing feature extraction on the image through the first network;
performing feature extraction on the gray-scale image through the second network;
performing feature fusion and feature extraction on the image features from the first network and the image features from the second network through the shared network.
In one possible implementation, the pooling network is a spatial pyramid pooling network, and pooling image features from the feature extraction network through the pooling network includes:
performing maximum pooling on a plurality of image features of different scales from the feature extraction network through the spatial pyramid pooling network to obtain image features of the same scale.
In a possible implementation, the output layer of the feature pyramid network is connected with a corresponding memory network;
the performing, by the feature pyramid network, feature aggregation processing on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network, includes:
performing feature aggregation processing on the image features from the feature extraction network and the image features from the dilated convolution network through the feature pyramid network in combination with the image features stored in the corresponding memory network;
in the memory network, image features from the corresponding feature pyramid network are stored.
In one possible implementation, the path aggregation network includes at least three of the feature pyramid networks;
the performing feature aggregation processing on the image features from the feature extraction network and the image features from the dilated convolution network through the feature pyramid network in combination with the image features stored in the corresponding memory network includes:
in a first feature pyramid network of the path aggregation network, performing up-sampling processing and feature aggregation processing on the image features from the dilated convolution network in combination with the image features stored in the memory network located at the top of the first feature pyramid network;
in the remaining feature pyramid networks other than the first feature pyramid network, performing up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the previous feature pyramid network in combination with the image features stored in the memory network located at the top of each remaining feature pyramid network;
wherein, in the path aggregation network, the up-sampling processing and the down-sampling processing are performed alternately.
In a second aspect, the present disclosure provides a model determination method, where an image recognition network includes a feature extraction network, a pooling network, a dilated convolution network, and a path aggregation network, where the feature extraction network, the pooling network, the dilated convolution network, and the path aggregation network are connected in sequence, and the path aggregation network includes a feature pyramid network and a memory network for storing image features;
the model determination method comprises the following steps:
acquiring training data, wherein the training data comprises training images marked with target objects;
training the image recognition network multiple times according to the training data to obtain a trained image recognition network;
wherein, a training process of the image recognition network comprises the following steps:
extracting the features of the training images through the feature extraction network;
pooling image features from the feature extraction network through the pooling network;
performing dilated convolution processing on the pooled image features through the dilated convolution network;
performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network;
determining the recognition result of the target object in the training image according to the image features from the path aggregation network;
and adjusting the model parameters of the image recognition network according to the target object marked on the training image and the recognition result of the target object in the training image.
In a third aspect, the present disclosure provides a target identification apparatus, where an image identification network includes a feature extraction network, a pooling network, a dilated convolution network, and a path aggregation network, where the feature extraction network, the pooling network, the dilated convolution network, and the path aggregation network are connected in sequence, and the path aggregation network includes a feature pyramid network and a memory network for storing image features;
the object recognition apparatus includes:
the acquisition module is used for acquiring an image to be identified;
the processing module is used for performing feature extraction on the image through the feature extraction network; pooling image features from the feature extraction network through the pooling network; performing dilated convolution processing on the pooled image features through the dilated convolution network; performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network; and obtaining the recognition result of the target object in the image according to the image features from the path aggregation network.
In a fourth aspect, the present disclosure provides a model determination apparatus, where the image recognition network includes a feature extraction network, a pooling network, a dilated convolution network, and a path aggregation network, where the feature extraction network, the pooling network, the dilated convolution network, and the path aggregation network are connected in sequence, and the path aggregation network includes a feature pyramid network and a memory network for storing image features;
the model determination apparatus includes:
the acquisition module is used for acquiring training data, and the training data comprises a training image marked with a target object;
the training module is used for training the image recognition network multiple times according to the training data to obtain the trained image recognition network;
wherein, a training process of the image recognition network comprises the following steps:
extracting the features of the training images through the feature extraction network;
pooling image features from the feature extraction network through the pooling network;
performing dilated convolution processing on the pooled image features through the dilated convolution network;
performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network;
determining the recognition result of the target object in the training image according to the image features from the path aggregation network;
and adjusting the model parameters of the image recognition network according to the target object marked on the training image and the recognition result of the target object in the training image.
In a fifth aspect, the present disclosure provides an electronic device comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the object recognition method of the first aspect or the model determination method of the second aspect as described above.
In a sixth aspect, the present disclosure provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the object recognition method according to the first aspect or the model determination method according to the second aspect when executed by a processor.
In a seventh aspect, the present disclosure provides a computer program product which, when executed by a processor, implements the object recognition method according to the first aspect or the model determination method according to the second aspect.
The target identification method, apparatus, device, and storage medium provided by the disclosure identify a target object in an image to be identified through an image identification model. The image recognition model comprises a feature extraction network, a pooling network, a dilated convolution network and a path aggregation network, where the path aggregation network comprises a feature pyramid network and a memory network for storing image features. On the one hand, the dilated convolution network extracts global image features while reducing the computational load of the model, so introducing it into the image identification model improves both the speed and the accuracy of image identification; on the other hand, introducing the memory network into the path aggregation network adds more image information to the recognition process, improving recognition accuracy without extra computational cost. The method thus improves both the recognition accuracy and the recognition speed of the image recognition network, achieving a balance between the two.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic view of an application scenario provided by an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure;
fig. 3 is a first schematic flow chart of a target identification method according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a second target identification method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure;
fig. 7 is a third schematic flowchart of a target identification method according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure;
fig. 9 is a schematic flow chart of a model determination method provided in an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an object recognition apparatus provided in an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a model determining apparatus provided in an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the field of image recognition, a deep learning network with a complex structure has a low recognition speed and a large computation cost and cannot meet the requirements of fast, real-time recognition, while a lightweight deep learning network has a high recognition speed but low accuracy and cannot meet recognition accuracy requirements. Deep learning networks therefore have difficulty achieving a balance between recognition accuracy and recognition speed. This is particularly true for coal gangue identification, which places high demands on both the recognition accuracy and the recognition speed of the deep learning network.
To solve the above problems, the present disclosure provides a target identification method, apparatus, device, and storage medium. In the present disclosure, the image recognition network includes a feature extraction network, a pooling network, a dilated convolution network, and a path aggregation network, where the path aggregation network includes a feature pyramid network and a memory network for storing image features. The dilated convolution network can extract wider global image features while reducing the computational load, and the memory network can contribute additional image information; by introducing the dilated convolution network into the image recognition network and the memory network into the path aggregation network, both the recognition accuracy and the recognition speed of the image recognition network are improved, balancing image recognition accuracy against speed.
A specific application scenario of the embodiments of the present disclosure may be a target recognition scenario, in which a target object in an acquired image is recognized and processed by using an image recognition network.
Fig. 1 is an exemplary diagram of an application scenario provided by an embodiment of the present disclosure, and as shown in fig. 1, in the application scenario, a device involved includes an image processing device 110. In this scenario, a target object on an image may be identified on the image processing device 110 through an image recognition network.
Fig. 1 illustrates an image processing apparatus 110 as a server. The server is a device for providing data processing and data set storage. Illustratively, the servers may be unitary servers or distributed servers across multiple computers or computer data centers, and may be various types of servers such as, but not limited to, web servers, application servers, or database servers, or proxy servers.
In some embodiments, the server may include hardware, software, or embedded logic components, or a combination of two or more such components, for performing the appropriate functions supported or implemented by the server. The server may be, for example, a blade server or a cloud server, or a server group consisting of a plurality of servers, and may include one or more of the above-mentioned categories of servers.
In some embodiments, the devices involved in the application scenario also include image capture device 120. The image capturing device 120 may be the same device as the image processing device 110 or a different device. The image processing device 110 may perform online real-time processing on the image acquired by the image acquisition device 120, or may perform offline processing.
The image capturing device 120 may be a terminal device, such as a camera, a smart phone, a portable computer, a tablet computer, a handheld computer, a wearable device, a virtual reality device, an augmented reality device, or any combination thereof, which is not limited herein. Fig. 1 illustrates a terminal device as a camera as an example.
Wherein the image capturing device 120 may communicate with the image processing device 110 via a wireless or wired network. The wireless network may be a 2G, 3G, 4G, or 5G communication network, or may be a wireless local area network, which is not limited herein.
Optionally, the target identification scene applicable to the embodiment of the present disclosure is a coal gangue identification scene, and in the scene, the coal gangue in the image is identified by using an image identification network, that is, the target object is the coal gangue.
The following describes the technical solutions of the present disclosure and how to solve the above technical problems in specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure. As shown in fig. 2, the image recognition network includes a feature extraction network, a pooling network, a dilated convolution network (Dilated Convolution) and a Path Aggregation Network (PANet), and the feature extraction network, the pooling network, the dilated convolution network and the path aggregation network are connected in sequence. The path aggregation network includes a Feature Pyramid Network (FPN) and a memory network for storing image features.
As shown in fig. 2, the feature extraction network is also connected to the path aggregation network, so that the path aggregation network can perform feature aggregation processing on the image features from the feature extraction network together with the image features from the dilated convolution network, enriching the image features and improving image recognition accuracy.
The feature extraction network comprises a plurality of feature extraction layers (not shown in fig. 2) and performs feature extraction on images input into the image recognition network; the pooling network includes a pooling layer (not shown in fig. 2) for pooling image features from the feature extraction network; the dilated convolution network comprises dilated convolution layers (not shown in fig. 2) and performs dilated convolution processing on the image features from the pooling network; and the path aggregation network performs path aggregation processing on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network.
As shown in fig. 2, the image recognition network further includes an output layer connected to the path aggregation network.
Referring to fig. 3, fig. 3 is a first schematic flow chart of a target identification method according to an embodiment of the present disclosure. As shown in fig. 3, on the basis of the image recognition network shown in fig. 2, the object recognition method includes:
s301, acquiring an image to be identified.
In one example, the image to be recognized may be an image captured by a camera in real time or one or more frames of images obtained from a video captured by the camera in real time.
In yet another example, the image to be recognized may be an image input or selected by a user. For example, the user inputs or selects an image to be recognized on a display interface of the terminal, or the server receives the user-input or user-selected image sent by the terminal.
In this embodiment, one or more images to be identified may be acquired. For example, in a scene in which a target object is recognized in a batch, a preset number of images to be recognized may be acquired.
And S302, extracting the features of the image through a feature extraction network.
In this embodiment, the image to be recognized may be input to the image recognition network directly, or preprocessing operations such as cropping and denoising may be performed first and the preprocessed image input instead. After the image enters the image recognition network, a series of deep convolution operations is applied to it by the feature extraction layers of the feature extraction network (that is, a series of deep features is extracted from the image), producing the image features output by the feature extraction network.
And S303, performing pooling processing on the image features from the feature extraction network through a pooling network.
In this embodiment, in the image recognition model, the image features from the feature extraction network are input to the pooling network, where they are pooled to extract more abstract image features, thereby enlarging the receptive field of the image recognition network and reducing its computational load.
S304, performing dilated convolution processing on the pooled image features through the dilated convolution network.
Dilated convolution is also called hole (atrous) convolution: holes are injected into the convolution kernel, that is, pixels with a value of 0 are inserted between the elements of the kernel, so that a wider receptive field is obtained when image features are convolved with the kernel.
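To make the mechanism concrete, the following PyTorch sketch (an illustration, not part of the patent text; the layer and tensor sizes are assumptions) shows that a dilation rate spreads a 3 × 3 kernel over a wider window without adding parameters:

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution sees a 3x3 window. With dilation=2 the same
# 9 weights are spread over a 5x5 window, widening the receptive field
# without adding any parameters.
conv_standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 32, 32)   # (batch, channels, height, width)
print(conv_standard(x).shape)    # torch.Size([1, 64, 32, 32])
print(conv_dilated(x).shape)     # padding=2 keeps the same spatial size

# Both layers have exactly the same number of weights:
n_params = lambda m: sum(p.numel() for p in m.parameters())
assert n_params(conv_standard) == n_params(conv_dilated)
```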
In this embodiment, a dilated convolution network is introduced into the image recognition network to process the image features from the pooling network. On the one hand, the dilated convolution network enlarges the receptive field of the convolution layers over the image, extracting more image features and wider global image information and thus improving image recognition accuracy; on the other hand, it retains the most important information in the image features and discards secondary information, reducing the computational burden of image recognition and improving image recognition speed.
Optionally, in the image identification network, the dilated convolution network is a hybrid dilated convolution (HDC) network, which improves the effect of the dilated convolution processing on the image features and further improves image identification accuracy.
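A minimal sketch of the HDC idea, assuming PyTorch and illustrative channel counts: stacking dilated convolutions whose rates share no common factor greater than 1 (here 1, 2, 5) covers the enlarged receptive field without the gridding artifact that repeated equal rates would cause.

```python
import torch.nn as nn

class HybridDilatedConv(nn.Module):
    """Stack of dilated 3x3 convolutions with co-prime rates (HDC)."""
    def __init__(self, channels: int, rates=(1, 2, 5)):
        super().__init__()
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                # padding=r keeps the spatial size for a 3x3 kernel
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1),
            )
            for r in rates
        ])

    def forward(self, x):
        return self.blocks(x)
```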
S305, performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network.
The path aggregation network may include a bottom-up feature pyramid network and/or a top-down feature pyramid network: the bottom-up feature pyramid network up-samples the image features so that their spatial size gradually increases, and the top-down feature pyramid network down-samples the image features so that their spatial size gradually decreases.
The memory network can store image features from earlier passes through the network. For example, if the path aggregation network processed the image features of 10 images before processing the current image, the image features associated with those 10 images may be saved in the memory network. Each time the path aggregation network processes image features, it can therefore combine them with the image features already in the memory network to obtain more semantic information related to the image. On the one hand, this improves the detectability of fine detail in the image and thus the accuracy of image recognition; on the other hand, it brings no extra computational cost to image recognition.
In this embodiment, the image features from the feature extraction network and the image features from the dilated convolution network are input into the path aggregation network, where the feature pyramid networks apply up-sampling/down-sampling processing and feature aggregation processing, with feature aggregation performed at each layer of each feature pyramid network. At the same time, these image features can be combined with the image features stored in the memory network, increasing the information available to the path aggregation network when processing image features.
Optionally, in the image recognition network, the memory network is a Long Short-Term Memory (LSTM) network.
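As an illustrative sketch only (the patent does not give this code), such a memory network might be realized as a module that keeps the features of earlier passes in a buffer and fuses them with the incoming pyramid output; a full LSTM cell would replace the plain buffer update shown here.

```python
import torch
import torch.nn as nn

class FeatureMemory(nn.Module):
    """Stores past pyramid outputs and fuses them with the current ones.

    Simplified stand-in for the patent's LSTM memory network: a real LSTM
    would update its state through gates rather than a plain copy.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.register_buffer("state", torch.zeros(0))  # empty until first use

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        if self.state.numel() == 0:
            self.state = torch.zeros_like(feat)
        # Fuse the stored features with the incoming ones (concat + 1x1 conv).
        fused = self.fuse(torch.cat([feat, self.state], dim=1))
        self.state = fused.detach()  # remember this pass for the next image
        return fused
```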
S306, obtaining the recognition result of the target object in the image according to the image features from the path aggregation network.
In this embodiment, the image features from the path aggregation network may be input to an output layer in the image recognition network, which summarizes them to obtain the recognition result for the target object in the image; the recognition result may include the position of the target object in the image and may also include its category. The way the output layer summarizes the image features is not limited; for example, the output layer may be a fully connected layer, or an activation function followed by a fully connected layer.
In the embodiment of the disclosure, introducing the dilated convolution network into the image recognition network and the memory network into the path aggregation network improves both image recognition accuracy and image recognition speed, balancing the two.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure. As shown in fig. 4, on the basis of the image recognition network shown in fig. 2, the feature extraction network includes a dual-channel network and a shared network, the dual-channel network includes a first network and a second network, the first network is used for performing feature extraction on an image to be recognized, and the second network is used for performing feature extraction on a gray-scale image of the image.
The first network and the second network are respectively connected with a shared network, and the shared network is used for carrying out feature fusion and feature extraction on the image features from the first network and the image features from the second network.
The first network, the second network, and the shared network each comprise a plurality of feature extraction layers, and the image features output by the last feature extraction layer in the first network and those output by the last feature extraction layer in the second network can be aggregated and then input to the first feature extraction layer in the shared network.
Optionally, the number of the feature extraction layers in the first network is the same as the number of the feature extraction layers in the second network, and the feature scale of the feature extraction layer in the first network is the same as the feature scale of the feature extraction layer at the same network depth in the second network, so that image features with the same granularity are extracted from the image to be recognized and the gray scale image of the image, and the accuracy of feature processing is improved.
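A minimal sketch of such a two-branch backbone with a shared trunk, assuming PyTorch; the depths and channel widths are illustrative, not taken from the patent:

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class DualChannelBackbone(nn.Module):
    """Separate shallow branches for the RGB image and its grayscale map,
    fused by concatenation and refined by a shared deep trunk."""
    def __init__(self):
        super().__init__()
        self.rgb_branch = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.gray_branch = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        self.shared = nn.Sequential(conv_block(128, 128), conv_block(128, 256))

    def forward(self, rgb: torch.Tensor, gray: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_branch(rgb), self.gray_branch(gray)], dim=1)
        return self.shared(fused)

x = torch.randn(1, 3, 256, 256)
g = x.mean(dim=1, keepdim=True)  # stand-in for a real grayscale conversion
print(DualChannelBackbone()(x, g).shape)  # torch.Size([1, 256, 16, 16])
```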
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a second method for identifying a target according to an embodiment of the present disclosure. As shown in fig. 5, on the basis of the image recognition network shown in fig. 4, the object recognition method includes:
s501, obtaining an image to be identified.
The implementation principle and the technical effect of S501 may refer to the foregoing embodiments, and are not described again.
And S502, determining a gray scale map corresponding to the image.
In this embodiment, the image to be recognized is a multi-channel image, and graying can be performed on each channel of the image to obtain the grayscale map of the image to be recognized on each image channel.
Optionally, the image to be recognized is an RGB image with 3 channels (an R channel, a G channel, and a B channel), and the grayscale maps corresponding to the 3 channels can be generated.
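For illustration (the patent does not prescribe a conversion formula), a weighted-sum grayscale conversion might look as follows; the ITU-R BT.601 luma weights are an assumption:

```python
import torch

def rgb_to_gray(img: torch.Tensor) -> torch.Tensor:
    """img: (batch, 3, H, W) RGB tensor -> (batch, 1, H, W) grayscale map."""
    weights = torch.tensor([0.299, 0.587, 0.114], device=img.device)
    return (img * weights.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)
```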
And S503, extracting the features of the image through the first network.
In this embodiment, an image to be recognized is input to a first network, and a series of feature extractions are performed on the image through a plurality of feature extraction layers in the first network.
And S504, performing feature extraction on the gray-scale image through a second network.
In this embodiment, a grayscale image of an image to be recognized is input to the second network, and a series of feature extractions are performed on the grayscale image by a plurality of feature extraction layers in the second network.
And S505, carrying out feature fusion and feature extraction on the image features from the first network and the image features from the second network through the shared network.
In this embodiment, in the shared network, the image features output by the first network and those output by the second network may be fused, the fused image features are input to the first feature extraction layer in the shared network, and a series of feature extractions is then performed through the feature extraction layers of the shared network; the image features output by the last feature extraction layer of the shared network are the image features output by the feature extraction network of the image recognition network.
And S506, performing pooling processing on the image features from the feature extraction network through a pooling network.
And S507, performing dilated convolution processing on the pooled image features through the dilated convolution network.
And S508, performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network.
S509, obtaining the recognition result of the target object in the image according to the image features from the path aggregation network.
The implementation principles and technical effects of S506 to S509 may refer to the foregoing embodiments, and are not described in detail.
In the embodiment of the disclosure, at the image input stage, the image recognition network adopts a dual-channel input of the image and its corresponding grayscale map: shallow preliminary feature extraction is performed separately on the image and on the grayscale map through different networks, the preliminarily extracted image features of the image and of the grayscale map are fused, and deep feature extraction is performed on the fused features. The image features are then further processed by the pooling network, the dilated convolution network, and the path aggregation network to obtain the recognition result of the target object in the image. Recognition accuracy is thus improved by the improved feature extraction network and the introduced memory network, and both accuracy and speed are improved by the introduced dilated convolution network, so that image recognition accuracy and speed are balanced.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an image recognition network according to an embodiment of the present disclosure. As shown in fig. 6, on the basis of any of the above networks, taking the image recognition network shown in fig. 4 as an example, the output layer of each feature pyramid network is connected to a corresponding memory network, and the memory network stores the image features output by the corresponding feature pyramid network. Before storing the image features output by the feature pyramid network, the memory network also fuses its stored image features with the newly output ones, providing more information for the path aggregation processing of the image features.
Referring to fig. 7, fig. 7 is a third schematic flowchart of a target identification method according to an embodiment of the present disclosure. As shown in fig. 7, on the basis of the image recognition network shown in fig. 6, the object recognition method includes:
and S701, acquiring an image to be identified.
And S702, extracting the features of the image through a feature extraction network.
And S703, performing pooling processing on the image features from the feature extraction network through a pooling network.
And S704, performing dilated convolution processing on the pooled image features through the dilated convolution network.
The implementation principles and technical effects of S701 to S704 can refer to the foregoing embodiments, and are not described in detail.
S705, performing feature aggregation processing on the image features from the feature extraction network and the image features from the dilated convolution network through the feature pyramid network in combination with the image features stored in the corresponding memory network.
And S706, storing the image features from the corresponding feature pyramid network in the memory network.
In this embodiment, within the path aggregation network, a feature pyramid network up-samples or down-samples the image features input into it to obtain the image features output by its last network layer; these are input into the memory network corresponding to that feature pyramid network (i.e., the memory network connected to its output layer) and fused with the image features already stored there, enriching the image features input to the memory network this time; the memory network also stores the image features input this time.
In one possible implementation, the path aggregation network includes at least three feature pyramid networks, and S705 includes: in a first feature pyramid network of the path aggregation network, performing up-sampling processing and feature aggregation processing on the image features from the dilated convolution network in combination with the image features stored in the memory network located at the top of the first feature pyramid network; and in the remaining feature pyramid networks other than the first feature pyramid network, performing up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the previous feature pyramid network in combination with the image features stored in the memory network located at the top of each remaining feature pyramid network. In this way, at least one feature pyramid is added to the path aggregation network as a feature enhancement network beyond the "double-tower structure"; for example, the "double-tower structure" is extended to a "three-tower structure" to improve image recognition accuracy.
In the path aggregation network, up-sampling processing and down-sampling processing are performed alternately. In other words, the path aggregation network includes bottom-up feature pyramid networks and top-down feature pyramid networks, which alternate in position.
Optionally, each feature pyramid has at least 4 layers, so that image recognition accuracy is improved by increasing the number of network layers in the feature pyramid.
In one example, as shown in fig. 6, the path aggregation network includes three feature pyramid networks, a first feature pyramid network is a bottom-up feature pyramid network, a second feature pyramid network is a top-down feature pyramid network, and a third feature pyramid network is a bottom-up feature pyramid network. The number of layers of the three feature pyramid networks is 4. In the bottom-up feature pyramid network, each layer of network performs up-sampling processing, convolution processing and feature aggregation processing on image features. In the feature pyramid network from top to bottom, each layer of network carries out down-sampling processing, convolution processing and feature aggregation processing on image features.
As shown in fig. 6, the image features from the dilated convolution network are input to the first network layer of the first feature pyramid network, the image features from the feature extraction network are input to the intermediate network layers of the first feature pyramid network, and after up-sampling, convolution, and feature aggregation processing by the first feature pyramid network, its output image features are input into the corresponding memory network. The image features output by that memory network are then input to the first network layer of the second feature pyramid network, and the image features from the intermediate network layers of the first feature pyramid network are input to the intermediate network layers of the second feature pyramid network; after down-sampling, convolution, and feature aggregation processing, the image features output by the second feature pyramid network are input into its corresponding memory network. The image features from that memory network are then input to the first network layer of the third feature pyramid network, the image features from the intermediate network layers of the second feature pyramid network are input to the intermediate network layers of the third feature pyramid network, and after up-sampling, convolution, and feature aggregation processing, the image features output by the third feature pyramid network are input to the output layer of the image recognition network.
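To make the alternating flow concrete, here is a heavily simplified sketch of a three-tower aggregation over two scales, assuming PyTorch; the scale count, channel widths, fusion details, and the omission of the memory networks are all simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate(a: torch.Tensor, b: torch.Tensor, fuse: nn.Module) -> torch.Tensor:
    """Resize a to b's spatial size, then fuse by concat + 1x1 conv."""
    a = F.interpolate(a, size=b.shape[-2:], mode="nearest")
    return fuse(torch.cat([a, b], dim=1))

class ThreeTowerAggregation(nn.Module):
    """Alternating up / down / up aggregation over two feature scales."""
    def __init__(self, c: int):
        super().__init__()
        self.fuse1 = nn.Conv2d(2 * c, c, 1)  # tower 1: up-sampling path
        self.fuse2 = nn.Conv2d(2 * c, c, 1)  # tower 2: down-sampling path
        self.fuse3 = nn.Conv2d(2 * c, c, 1)  # tower 3: up-sampling path

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor):
        # deep: coarse features (e.g. from the dilated convolution network)
        # shallow: finer features (from the feature extraction network)
        p1 = aggregate(deep, shallow, self.fuse1)  # up-sample deep -> fine
        p2 = aggregate(p1, deep, self.fuse2)       # down-sample back -> coarse
        p3 = aggregate(p2, p1, self.fuse3)         # up-sample again -> fine
        return p3

m = ThreeTowerAggregation(64)
deep = torch.randn(1, 64, 8, 8)
shallow = torch.randn(1, 64, 16, 16)
print(m(deep, shallow).shape)  # torch.Size([1, 64, 16, 16])
```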
And S707, obtaining the recognition result of the target object in the image according to the image features from the path aggregation network.
The implementation principle and technical effect of S707 refer to the foregoing embodiments, and are not described again.
In the embodiment of the disclosure, within the path aggregation network, past image features are stored and supplied by the memory network corresponding to each feature pyramid network, which enriches the features available for each recognition pass and thereby improves image recognition accuracy. In addition, the number of feature pyramid networks and the number of layers in each can be increased in the path aggregation network, further improving recognition accuracy.
Based on any of the foregoing embodiments, in one possible implementation, the pooling network is a Spatial Pyramid Pooling (SPP) network, and pooling the image features from the feature extraction network through the pooling network includes: performing maximum pooling, through the spatial pyramid pooling network, on a plurality of image features of different scales from the feature extraction network to obtain image features of the same scale. The spatial pyramid pooling network is thus used to improve the pooling of image features.
For example, as shown in fig. 6, image features of three different scales, 5 × 5, 9 × 9, and 13 × 13, are pooled in the spatial pyramid pooling network.
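A compact SPP sketch in the spirit of the YOLO-family SPP block; the kernel sizes follow the 5/9/13 example above, and the concatenation-based output is an assumption based on common SPP implementations:

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Max-pool the same feature map at several kernel sizes and concatenate.

    stride=1 with padding=k//2 keeps the spatial size, so features pooled
    at different scales line up and can be concatenated channel-wise.
    """
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 256, 19, 19)
print(SpatialPyramidPooling()(x).shape)  # torch.Size([1, 1024, 19, 19])
```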
Based on any of the foregoing embodiments, in one possible implementation, referring to fig. 8, fig. 8 is a schematic structural diagram of an image recognition network provided in an embodiment of the present disclosure. As shown in fig. 8, the image recognition network is YOLO v4 (You Only Look Once, version 4). In this case: the feature extraction network is CSPDarknet53, in which the changes in the image gradient can be fully integrated into the feature map; a network layer for convolution processing is connected between CSPDarknet53 and the spatial pyramid pooling network, i.e., the image features output by CSPDarknet53 pass through this network layer before entering the spatial pyramid pooling network; a network layer for convolution processing and feature aggregation is connected between the dilated convolution network and the path aggregation network; a network layer for feature aggregation and convolution processing is connected between the first feature pyramid and the second feature pyramid; and the output layer of the image recognition network is a YOLO head. Fig. 8 takes two YOLO heads as an example: the input of one YOLO head is the image features output by the network layer between the second feature pyramid and the third feature pyramid, and the input of the other is the image features output by the third feature pyramid. Different YOLO heads output the positions of target objects occupying areas of different sizes on the image; for example, one outputs coal gangue occupying a larger area of the image, and the other outputs coal gangue occupying a smaller area.
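Putting the pieces together, a high-level forward pass of such a network might be wired as below; every component is a stand-in for the modules sketched earlier, not the patent's exact YOLO v4 configuration:

```python
import torch.nn as nn

class GangueRecognitionNet(nn.Module):
    """Backbone -> SPP -> dilated convs -> memory-augmented aggregation -> heads."""
    def __init__(self, backbone, spp, hdc, aggregation, heads):
        super().__init__()
        self.backbone = backbone        # e.g. DualChannelBackbone / CSPDarknet53
        self.spp = spp                  # e.g. SpatialPyramidPooling
        self.hdc = hdc                  # e.g. HybridDilatedConv
        self.aggregation = aggregation  # e.g. ThreeTowerAggregation + FeatureMemory
        self.heads = nn.ModuleList(heads)  # one YOLO-style head per object scale

    def forward(self, rgb, gray):
        shallow = self.backbone(rgb, gray)       # backbone features
        deep = self.hdc(self.spp(shallow))       # pooled + dilated-conv features
        fused = self.aggregation(deep, shallow)  # path aggregation with memory
        return [head(fused) for head in self.heads]
```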
Referring to fig. 9, fig. 9 is a schematic flowchart of a model determination method provided in an embodiment of the present disclosure. As shown in fig. 9, based on the image recognition network provided in any of the above embodiments, the model determining method for training the image recognition network includes:
s901, training data are obtained, wherein the training data comprise training images marked with target objects.
In this embodiment, training data acquired in advance is acquired, and the training data may include a plurality of training images labeled with target objects.
And S902, training the image recognition network multiple times according to the training data to obtain the trained image recognition network.
In this embodiment, the training images marked with target objects in the training data may be used to perform supervised training on the image recognition network multiple times, until the number of training iterations exceeds a threshold, or until the difference between the recognition result output by the image recognition network for the target object on an image and the target object marked on that image is less than a preset threshold, yielding the trained image recognition model. The trained image recognition network can be applied in the image recognition method provided by any of the foregoing embodiments.
In this embodiment, one training pass of the image recognition network includes: performing feature extraction on the training image through the feature extraction network; pooling the image features from the feature extraction network through the pooling network; performing dilated convolution processing on the pooled image features through the dilated convolution network; performing feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and storing the image features from the feature aggregation processing in the memory network; determining the recognition result of the target object in the training image according to the image features from the path aggregation network; and adjusting the model parameters of the image recognition network according to the target object marked on the training image and the recognition result of the target object in the training image.
The processing of the training image by the image recognition network may refer to the processing of the image to be recognized by the image recognition network in the foregoing embodiment, and is not described in detail.
When adjusting the model parameters of the image recognition network according to the target object marked on the training image and the recognition result of the target object on the training image, the loss value of the loss function can be computed from the difference between the marked position of the target object on the training image and the position, in the recognition result, of the nearest target object of the same category; the model parameters of the image recognition model are then adjusted based on the loss value and a preset optimization algorithm.
Here, the loss function and the optimization algorithm are not limited.
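Under these caveats, one supervised training step might look as follows; the smooth-L1 position loss, the single head, and the optimizer passed in by the caller are illustrative choices, and the matching of labels to nearest same-category predictions is omitted:

```python
import torch.nn.functional as F

def train_step(model, optimizer, image, gray, target_boxes):
    """One parameter update: forward pass, position loss, backward pass."""
    optimizer.zero_grad()
    pred_boxes = model(image, gray)[0]  # predictions from one head
    # Illustrative loss: smooth-L1 between predicted and labeled positions.
    loss = F.smooth_l1_loss(pred_boxes, target_boxes)
    loss.backward()
    optimizer.step()                    # preset optimization algorithm
    return loss.item()
```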
The following are embodiments of the disclosed apparatus that may be used to perform corresponding method embodiments of the present disclosure. For details which are not disclosed in the embodiments of the apparatus of the present disclosure, reference is made to corresponding method embodiments of the present disclosure.
Fig. 10 is a schematic structural diagram of an object recognition device according to an embodiment of the present disclosure. As shown in fig. 10, based on the network structure of the image recognition network shown in fig. 2, the object recognition apparatus provided in this embodiment includes:
an obtaining module 1001 configured to obtain an image to be identified;
the processing module 1002 is configured to: perform feature extraction on the image through the feature extraction network; pool the image features from the feature extraction network through the pooling network; perform dilated convolution processing on the pooled image features through the dilated convolution network; perform feature aggregation processing, through the feature pyramid network, on the image features from the feature extraction network and the image features from the dilated convolution network in combination with the image features stored in the memory network, and store the image features from the feature aggregation processing in the memory network; and obtain the recognition result of the target object in the image according to the image features from the path aggregation network.
In one possible implementation, the feature extraction network includes a dual-channel network and a shared network, and the dual-channel network includes a first network and a second network; the processing module 1002 is specifically configured to: determine a grayscale map corresponding to the image; perform feature extraction on the image through the first network; perform feature extraction on the grayscale map through the second network; and perform feature fusion and feature extraction on the image features from the first network and the image features from the second network through the shared network.
In a possible implementation manner, the pooling network is a spatial pyramid pooling network, and the processing module 1002 is specifically configured to: perform maximum pooling on a plurality of image features of different scales from the feature extraction network through the spatial pyramid pooling network to obtain image features of the same scale.
In a feasible implementation manner, the output layer of the feature pyramid network is connected to a corresponding memory network; the processing module 1002 is specifically configured to: perform feature aggregation processing on the image features from the feature extraction network and the image features from the dilated convolution network through the feature pyramid network in combination with the image features stored in the corresponding memory network; and store, in the memory network, the image features from the corresponding feature pyramid network.
In one possible implementation, the path aggregation network includes at least three feature pyramid networks. The processing module 1002 is specifically configured to: in the first feature pyramid network of the path aggregation network, perform up-sampling processing and feature aggregation processing on the image features from the expansion convolution network, in combination with the image features stored in the memory network located at the top of the first feature pyramid network; in each remaining feature pyramid network other than the first, perform up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the preceding feature pyramid network, in combination with the image features stored in the memory network located at the top of that feature pyramid network; where, in the path aggregation network, up-sampling processing and down-sampling processing are performed alternately.
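Under those assumptions, the alternation across pyramid networks could look like the sketch below, reusing the hypothetical MemoryFPNLevel above; nearest-neighbor interpolation and the factor-2 scale changes are illustrative placeholders.

```python
import torch.nn.functional as F

def path_aggregation(levels, context):
    """Sketch: the first pyramid up-samples the expansion-convolution output;
    each remaining pyramid alternates down- and up-sampling of the features
    from the preceding pyramid. `levels` pairs a MemoryFPNLevel-style module
    with the backbone features feeding it (shapes assumed consistent)."""
    feat, up = context, True  # the first feature pyramid network up-samples
    for level, backbone_feat in levels:
        feat = F.interpolate(feat, scale_factor=2.0 if up else 0.5, mode="nearest")
        feat = level(backbone_feat, feat)  # aggregation with the memory at this pyramid's top
        up = not up                        # alternate up- and down-sampling
    return feat
```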
Fig. 11 is a schematic structural diagram of a model determining apparatus according to an embodiment of the present disclosure. As shown in fig. 11, based on the network structure of the image recognition network provided in any of the foregoing embodiments, the model determining apparatus provided in this embodiment includes:
an obtaining module 1101, configured to obtain training data, where the training data includes a training image labeled with a target object;
the training module 1102 is configured to train the image recognition network multiple times according to the training data to obtain a trained image recognition network;
the one-time training process of the image recognition network comprises the following steps: extracting the features of the training images through a feature extraction network; pooling image features from the feature extraction network via a pooling network; performing expansion convolution processing on the pooled image features through an expansion convolution network; performing feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network by combining the image features stored in the memory network through the feature pyramid network, and storing the image features in the feature aggregation processing in the memory network; determining the recognition result of the target object in the training image according to the image characteristics from the path aggregation network; and adjusting the model parameters of the image recognition network according to the target object marked on the training image and the recognition result of the target object in the training image.
It should be noted that the apparatuses provided in the above embodiments may be used to perform the steps of the corresponding methods provided in the foregoing embodiments; the specific implementations and technical effects are similar and are not repeated here.
The above apparatus embodiments of the present disclosure are merely exemplary, and the division into modules is only one division of logical functions; other divisions are possible in actual implementation. For example, multiple modules may be combined, or may be integrated into another system. The couplings between the modules may be through interfaces, which are typically electrical communication interfaces, but mechanical or other forms of interface are not excluded. Thus, modules described as separate components may or may not be physically separate, and may be located in one place or distributed in different locations on the same or different devices.
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 12, the electronic device may include: at least one processor 1201 and a memory 1202. Fig. 12 illustrates the electronic device taking one processor as an example.
The memory 1202 stores a program to be executed by the processor 1201. In particular, the program may include program code, and the program code includes computer operating instructions.
The processor 1201 is configured to execute the computer program stored in the memory 1202 to implement the steps of the target identification method in the above method embodiments.
The processor 1201 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
Alternatively, the memory 1202 may be separate from, or integrated with, the processor 1201. When the memory 1202 is a device separate from the processor 1201, the electronic device may further include a bus 1203 for connecting the processor 1201 and the memory 1202. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and the like, and this classification does not mean that there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 1202 and the processor 1201 are implemented integrally on a single chip, the memory 1202 and the processor 1201 may communicate through an internal interface.
The present disclosure also provides a computer-readable storage medium, which may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. Specifically, the computer-readable storage medium stores a computer program; when at least one processor of an electronic device executes the computer program, the electronic device performs the steps of the target identification method provided by the above embodiments.
Embodiments of the present disclosure also provide a computer program product comprising a computer program stored in a readable storage medium. The computer program may be read from the readable storage medium by at least one processor of the electronic device, and execution of the computer program by the at least one processor causes the electronic device to perform the steps of the target identification method provided by the various embodiments described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (8)
1. A target identification method, characterized in that an image recognition network comprises a feature extraction network, a pooling network, an expansion convolution network and a path aggregation network, which are connected in sequence, and the path aggregation network comprises a feature pyramid network and a memory network for storing image features;
the target identification method comprises the following steps:
acquiring an image to be identified;
extracting the features of the image through the feature extraction network;
pooling the image features from the feature extraction network through the pooling network;
performing expansion convolution processing on the pooled image features through the expansion convolution network;
performing, through the feature pyramid network and in combination with the image features stored in the memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network, and storing the image features produced in the feature aggregation processing in the memory network;
obtaining a recognition result of a target object in the image according to the image features from the path aggregation network;
wherein the output layer of the feature pyramid network is connected to a corresponding memory network, and the performing, through the feature pyramid network and in combination with the image features stored in the memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network, and storing the image features produced in the feature aggregation processing in the memory network comprises:
performing, through the feature pyramid network and in combination with the image features stored in the corresponding memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network;
storing, in the memory network, the image features from the corresponding feature pyramid network;
wherein the path aggregation network comprises at least three feature pyramid networks, and the performing, through the feature pyramid network and in combination with the image features stored in the corresponding memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network comprises:
in a first feature pyramid network of the path aggregation network, performing up-sampling processing and feature aggregation processing on the image features from the expansion convolution network, in combination with the image features stored in the memory network located at the top of the first feature pyramid network;
in each remaining feature pyramid network other than the first feature pyramid network, performing up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the preceding feature pyramid network, in combination with the image features stored in the memory network located at the top of that feature pyramid network;
wherein, in the path aggregation network, the up-sampling processing and the down-sampling processing are performed alternately.
2. The target identification method of claim 1, wherein the feature extraction network comprises a two-channel network and a shared network, and the two-channel network comprises a first network and a second network;
the performing feature extraction on the image through the feature extraction network includes:
determining a grayscale image corresponding to the image;
performing feature extraction on the image through the first network;
performing feature extraction on the grayscale image through the second network;
performing feature fusion and feature extraction on the image features from the first network and the image features from the second network through the shared network.
3. The target identification method of claim 1, wherein the pooling network is a spatial pyramid pooling network, and the pooling the image features from the feature extraction network through the pooling network comprises:
performing max pooling on a plurality of image features of different scales from the feature extraction network through the spatial pyramid pooling network to obtain image features of the same scale.
4. A model determination method, characterized in that an image recognition network comprises a feature extraction network, a pooling network, an expansion convolution network and a path aggregation network, which are connected in sequence, and the path aggregation network comprises a feature pyramid network and a memory network for storing image features;
the model determination method comprises the following steps:
acquiring training data, wherein the training data comprises a training image labeled with a target object;
training the image recognition network multiple times according to the training data to obtain a trained image recognition network;
wherein a single training pass of the image recognition network comprises the following steps:
performing feature extraction on the training image through the feature extraction network;
pooling the image features from the feature extraction network through the pooling network;
performing expansion convolution processing on the pooled image features through the expansion convolution network;
performing, through the feature pyramid network and in combination with the image features stored in the memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network, and storing the image features produced in the feature aggregation processing in the memory network;
determining a recognition result of the target object in the training image according to the image features from the path aggregation network;
adjusting model parameters of the image recognition network according to the target object labeled on the training image and the recognition result of the target object in the training image;
wherein the output layer of the feature pyramid network is connected to a corresponding memory network, and the performing, through the feature pyramid network and in combination with the image features stored in the memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network, and storing the image features produced in the feature aggregation processing in the memory network comprises:
performing, through the feature pyramid network and in combination with the image features stored in the corresponding memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network;
storing, in the memory network, the image features from the corresponding feature pyramid network;
wherein the path aggregation network comprises at least three feature pyramid networks, and the performing, through the feature pyramid network and in combination with the image features stored in the corresponding memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network comprises:
in a first feature pyramid network of the path aggregation network, performing up-sampling processing and feature aggregation processing on the image features from the expansion convolution network, in combination with the image features stored in the memory network located at the top of the first feature pyramid network;
in each remaining feature pyramid network other than the first feature pyramid network, performing up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the preceding feature pyramid network, in combination with the image features stored in the memory network located at the top of that feature pyramid network;
wherein, in the path aggregation network, the up-sampling processing and the down-sampling processing are performed alternately.
5. A target identification device, characterized in that an image recognition network comprises a feature extraction network, a pooling network, an expansion convolution network and a path aggregation network, which are connected in sequence, and the path aggregation network comprises a feature pyramid network and a memory network for storing image features;
the object recognition apparatus includes:
an acquisition module, configured to acquire an image to be identified;
a processing module, configured to: perform feature extraction on the image through the feature extraction network; pool the image features from the feature extraction network through the pooling network; perform expansion convolution processing on the pooled image features through the expansion convolution network; perform, through the feature pyramid network and in combination with the image features stored in the memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network, and store the image features produced in the feature aggregation processing in the memory network; and obtain a recognition result of a target object in the image according to the image features from the path aggregation network;
wherein the output layer of the feature pyramid network is connected to a corresponding memory network, and the processing module is specifically configured to: perform, through the feature pyramid network and in combination with the image features stored in the corresponding memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network; and store, in the memory network, the image features from the corresponding feature pyramid network;
wherein the path aggregation network comprises at least three feature pyramid networks, and the processing module is specifically configured to: in a first feature pyramid network of the path aggregation network, perform up-sampling processing and feature aggregation processing on the image features from the expansion convolution network, in combination with the image features stored in the memory network located at the top of the first feature pyramid network; in each remaining feature pyramid network other than the first feature pyramid network, perform up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the preceding feature pyramid network, in combination with the image features stored in the memory network located at the top of that feature pyramid network; wherein, in the path aggregation network, the up-sampling processing and the down-sampling processing are performed alternately.
6. A model determination device, characterized in that an image recognition network comprises a feature extraction network, a pooling network, an expansion convolution network and a path aggregation network, which are connected in sequence, and the path aggregation network comprises a feature pyramid network and a memory network for storing image features;
the model determination apparatus includes:
an acquisition module, configured to acquire training data, wherein the training data comprises a training image labeled with a target object;
a training module, configured to train the image recognition network multiple times according to the training data to obtain a trained image recognition network;
wherein a single training pass of the image recognition network comprises the following steps:
performing feature extraction on the training image through the feature extraction network;
pooling the image features from the feature extraction network through the pooling network;
performing expansion convolution processing on the pooled image features through the expansion convolution network;
performing, through the feature pyramid network and in combination with the image features stored in the memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network, and storing the image features produced in the feature aggregation processing in the memory network;
determining a recognition result of the target object in the training image according to the image features from the path aggregation network;
adjusting model parameters of the image recognition network according to the target object labeled on the training image and the recognition result of the target object in the training image;
wherein the output layer of the feature pyramid network is connected to a corresponding memory network, and the training module is specifically configured to: perform, through the feature pyramid network and in combination with the image features stored in the corresponding memory network, feature aggregation processing on the image features from the feature extraction network and the image features from the expansion convolution network; and store, in the memory network, the image features from the corresponding feature pyramid network;
wherein the path aggregation network comprises at least three feature pyramid networks, and the training module is specifically configured to: in a first feature pyramid network of the path aggregation network, perform up-sampling processing and feature aggregation processing on the image features from the expansion convolution network, in combination with the image features stored in the memory network located at the top of the first feature pyramid network; in each remaining feature pyramid network other than the first feature pyramid network, perform up-sampling processing and feature aggregation processing, or down-sampling processing and feature aggregation processing, on the image features from the preceding feature pyramid network, in combination with the image features stored in the memory network located at the top of that feature pyramid network; wherein, in the path aggregation network, the up-sampling processing and the down-sampling processing are performed alternately.
7. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the target identification method of any one of claims 1 to 3, or the model determination method of claim 4.
8. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the target identification method of any one of claims 1 to 3, or the model determination method of claim 4.