CN115795078A - Training method of image retrieval model, image retrieval method and device

Info

Publication number: CN115795078A
Application number: CN202211572076.2A
Authority: CN (China)
Prior art keywords: image, sample, size, feature, sample image
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 王倩芸, 丁昆, 刘朋樟, 张屹峰, 李阁, 周梦迪, 朱阳光, 包勇军
Assignee (current and original): Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd; priority to CN202211572076.2A

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a training method of an image retrieval model, an image retrieval method and a device, and relates to artificial intelligence technologies such as deep learning and image processing. The training method of the image retrieval model includes the following steps: obtaining a plurality of sample images, wherein each sample image is labeled with the category of the sample object it contains; inputting each sample image into a target detection network in an image retrieval model to obtain sample image blocks in the corresponding sample images and size information of the sample image blocks; inputting each sample image block and its size information into a feature extraction network in the image retrieval model to obtain a predicted feature vector of the sample object in the corresponding sample image; and determining a loss value based on the predicted feature vectors and the categories of the sample objects, and adjusting model parameters of the target detection network and the feature extraction network based on the loss value. The method realizes training of the image retrieval model, and the trained model achieves high retrieval accuracy when used for image retrieval.

Description

Training method of image retrieval model, image retrieval method and device
Technical Field
The application relates to artificial intelligence technologies such as deep learning and image processing, and in particular to a training method of an image retrieval model, an image retrieval method and an image retrieval device.
Background
Image retrieval techniques are widely applied in many fields. For example, in e-commerce scenarios such as similar-item recommendation, attribute management, and category merging, images similar to a known image need to be retrieved through image retrieval technology.
In the image retrieval techniques in the related art, in the model training stage, the ROI (region of interest) in the image is generally scaled to a fixed size and then directly input into the network. However, the sizes of image ROIs vary widely, and the size of an ROI reflects important attributes of the subject object in the image; for example, the aspect ratio (i.e., the ratio of width to height) of an ROI reflects the length of a garment. Scaling the ROI to a fixed size deforms the subject object in the image, so a model trained on ROIs directly scaled to a fixed size cannot learn the real form of the subject object. As a result, images irrelevant to the known image are easily retrieved, and retrieval accuracy is poor.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
The application provides a training method of an image retrieval model, an image retrieval method and a device, to solve the technical problem that an image retrieval model trained in the related art easily retrieves images irrelevant to the known image, resulting in poor retrieval accuracy.
An embodiment of a first aspect of the present application provides a training method for an image retrieval model, including: acquiring a plurality of sample images, wherein each sample image is labeled according to the category of a sample object included in the sample image; inputting each sample image into a target detection network in an image retrieval model to obtain sample image blocks in corresponding sample images and size information of the sample image blocks, wherein the sample image blocks comprise sample objects in the sample images; inputting sample image blocks in each sample image and size information of the sample image blocks into a feature extraction network in the image retrieval model to obtain predicted feature vectors corresponding to sample objects in the sample images; and determining a loss value based on the predicted feature vector of the sample object in each sample image and the class of the sample object, and adjusting model parameters of the target detection network and the feature extraction network based on the loss value. Therefore, the training of the image retrieval model can be realized, and the retrieval accuracy of the image retrieval model when the image retrieval model is used for image retrieval is high.
An embodiment of a second aspect of the present application provides an image retrieval method, including: acquiring a retrieval image to be retrieved; inputting the retrieval image into a target detection network in an image retrieval model to obtain target image blocks in the retrieval image and size information of the target image blocks, wherein the target image blocks comprise target objects in the retrieval image; inputting target image blocks in the retrieval image and size information of the target image blocks into a feature extraction network in the image retrieval model to obtain a predicted feature vector of the target object, wherein the image retrieval model is obtained based on the method of the embodiment of the first aspect; determining a target image from a plurality of candidate images based on the predicted feature vector of the target object. Therefore, the accuracy of image retrieval is improved.
An embodiment of a third aspect of the present application provides a training apparatus for an image retrieval model, including: the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of sample images, and each sample image is labeled according to the category of a sample object included in the sample image; the first processing module is used for inputting each sample image into a target detection network in an image retrieval model so as to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, wherein the sample image blocks comprise sample objects in the sample image; the second processing module is used for inputting the sample image blocks in the sample images and the size information of the sample image blocks into a feature extraction network in the image retrieval model so as to obtain prediction feature vectors corresponding to sample objects in the sample images; and the model parameter adjusting module is used for determining a loss value based on the predicted feature vector of the sample object in each sample image and the class of the sample object, and adjusting the model parameters of the target detection network and the feature extraction network based on the loss value.
An embodiment of a fourth aspect of the present application provides an image retrieval apparatus, including: the second acquisition module is used for acquiring a retrieval image to be retrieved; the third processing module is used for inputting the retrieval image into a target detection network in an image retrieval model so as to obtain a target image block in the retrieval image and size information of the target image block, wherein the target image block comprises a target object in the retrieval image; a fourth processing module, configured to input a target image block in the search image and size information of the target image block into a feature extraction network in the image search model to obtain a predicted feature vector of the target object, where the image search model is obtained based on the method in the embodiment of the first aspect; a determination module for determining a target image from a plurality of candidate images based on the predicted feature vector of the target object.
An embodiment of a fifth aspect of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform a method of training an image retrieval model as set forth in embodiments of the first aspect of the application, or to perform an image retrieval method as set forth in embodiments of the second aspect of the application.
An embodiment of a sixth aspect of the present application proposes a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method of training an image retrieval model as proposed in an embodiment of the first aspect of the present application, or to perform an image retrieval method as proposed in an embodiment of the second aspect of the present application.
An embodiment of the seventh aspect of the present application proposes a computer program product including a computer program that, when executed by a processor, implements the training method of an image retrieval model proposed in the embodiment of the first aspect of the present application.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart illustrating a training method of an image retrieval model according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a training method of an image retrieval model according to an embodiment of the present application;
fig. 3 is a flowchart illustrating a training method of an image retrieval model according to a second embodiment of the present application;
fig. 4 is a structural diagram of a feature extraction network according to a second embodiment of the present application;
fig. 5 is a structural diagram of a first fusion module provided in the second embodiment of the present application;
fig. 6 is a schematic flowchart of a training method of an image retrieval model according to a third embodiment of the present application;
fig. 7 is a structural diagram of a feature extraction network provided in the third embodiment of the present application;
fig. 8 is a schematic flowchart of an image retrieval method according to a fourth embodiment of the present application;
fig. 9 is another schematic flowchart of an image retrieval method according to a fourth embodiment of the present application;
FIG. 10 is a diagram illustrating an example of a ROI processing method according to a fourth embodiment of the present application;
fig. 11 is an exemplary diagram of an image retrieval result provided in the fourth embodiment of the present application;
fig. 12 is a schematic structural diagram of a training apparatus for an image retrieval model according to a fifth embodiment of the present application;
fig. 13 is a schematic structural diagram of an image retrieval apparatus according to a sixth embodiment of the present application;
FIG. 14 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present application and should not be construed as limiting the present application.
It should be noted that, in the technical solution of the present application, the acquisition, storage, and application of users' personal information all comply with relevant laws and regulations and do not violate public order and good customs.
In the image retrieval technology in the related art, the ROI in the image is generally scaled to a fixed size and then directly input into the network in the model training stage; a model trained in this way easily retrieves images irrelevant to the known image, and the accuracy of the retrieval results is poor.
In view of the technical problem that an image retrieval model trained in the related art easily retrieves images irrelevant to the known image and yields poor retrieval accuracy, the application provides a training method of an image retrieval model, an image retrieval method, an image retrieval device, an electronic device, a storage medium, and a computer program product.
The training method of the image retrieval model comprises the following steps: obtaining a plurality of sample images, wherein each sample image is labeled according to the category of a sample object included in the sample image; inputting each sample image into a target detection network in an image retrieval model to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, wherein the sample image blocks comprise sample objects in the sample image; inputting sample image blocks in each sample image and size information of the sample image blocks into a feature extraction network in an image retrieval model to obtain predicted feature vectors corresponding to sample objects in the sample images; and determining a loss value based on the predicted feature vector of the sample object in each sample image and the class of the sample object, and adjusting model parameters of the target detection network and the feature extraction network based on the loss value. Therefore, the training of the image retrieval model can be realized, and the retrieval accuracy of the image retrieval model when the image retrieval model is used for image retrieval is high.
A training method of an image retrieval model, an image retrieval method, an apparatus, an electronic device, a storage medium, and a computer program product according to embodiments of the present application are described below with reference to the accompanying drawings.
First, a training method of an image retrieval model provided in an embodiment of the present application is explained.
The method for training the image retrieval model provided by the embodiment of the present application is executed by a training device for the image retrieval model. The training device of the image retrieval model can be an electronic device, and can also be configured in the electronic device, so that the training of the image retrieval model is realized by executing the training method of the image retrieval model provided by the embodiment of the application, and the retrieval accuracy of the image retrieval model is high when the image retrieval model is used for image retrieval.
The electronic device may be a personal computer (PC), a cloud device, a mobile device, a server, or the like, and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a vehicle-mounted device, which is not limited in this application.
Fig. 1 is a flowchart illustrating a training method of an image retrieval model according to an embodiment of the present disclosure. As shown in FIG. 1, the training method of the image retrieval model may include the following steps 101-104.
Step 101, obtaining a plurality of sample images, wherein each sample image is labeled according to the category of a sample object included in the sample image.
The sample image is a training sample for training the image retrieval model. The sample object included in the sample image is a subject object in the sample image, and may be a commodity, a person, an animal, an article, or the like.
In an example embodiment, sample images may be selected as training samples according to the needs of the application scenario. For example, when the image retrieval model is used in a similar-item recommendation scenario in the e-commerce field, images containing commodities may be used as sample images for training the image retrieval model.
It should be noted that, in the embodiment of the present application, the training data for training the image retrieval model may be provided by a user and authorized to be used, or obtained from a public data set, or obtained by other manners that meet the requirements of relevant laws and regulations, which is not limited in this application.
In addition, in the embodiment of the present application, when performing category classification on sample objects in all sample images, the same sample object may be classified into one category, that is, the same sample object belongs to the same category, and different sample objects belong to different categories.
For example, suppose the plurality of sample images includes 1000 sample images: the sample images numbered 1-200 are images of an item A photographed from multiple angles, the sample images numbered 201-400 are images of an item B photographed from multiple angles, the sample images numbered 401-650 are images of an item C photographed from multiple angles, and the sample images numbered 651-1000 are images of an item D photographed from multiple angles. The sample objects in these 1000 sample images then fall into 4 categories: item A belongs to category a, item B to category b, item C to category c, and item D to category d. Accordingly, the sample images numbered 1-200 are labeled with category a, those numbered 201-400 with category b, those numbered 401-650 with category c, and those numbered 651-1000 with category d.
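As a minimal sketch of this labeling scheme (purely illustrative, simply mirroring the numbered example above rather than any real dataset), the mapping from sample-image number to category label could be built as follows:

```python
# A sketch of the example labeling above: images of the same item share one
# category label (the ranges and label names follow the example, not real data).
labels = {}
for n in range(1, 1001):
    if n <= 200:
        labels[n] = "a"   # item A
    elif n <= 400:
        labels[n] = "b"   # item B
    elif n <= 650:
        labels[n] = "c"   # item C
    else:
        labels[n] = "d"   # item D
```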
Step 102, inputting each sample image into a target detection network in the image retrieval model to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, wherein the sample image blocks comprise sample objects in the sample image.
The image retrieval model is a neural network model with an image retrieval function. The image retrieval model comprises a target detection network and a feature extraction network which are sequentially connected.
The target detection network is used for detecting an object in an arbitrary image to acquire an ROI in the arbitrary image and size information of the ROI, wherein the ROI comprises a main object in the arbitrary image. The size information of the ROI may include the height and width of the ROI.
In the embodiment of the present application, referring to fig. 2, for each sample image, the sample image may be input to the target detection network 21 in the image retrieval model, and the ROI output by the target detection network 21 is a sample image block that includes a sample object in the sample image, and meanwhile, the target detection network 21 may also output size information of the sample image block.
Step 103, inputting the sample image blocks in each sample image and the size information of the sample image blocks into a feature extraction network in the image retrieval model to obtain the predicted feature vectors of the sample objects in the corresponding sample images.
The feature extraction network is configured to extract a feature vector of a main object in an arbitrary image based on the ROI in the arbitrary image and size information of the ROI, where the feature vector represents attribute features of the main object, such as color, pattern, style, size, and the like of a commodity.
In the embodiment of the present application, referring to fig. 2, for each sample image, a sample image block (ROI) in the sample image and size information of the sample image block may be input to the feature extraction network 22 in the image retrieval model, and the feature extraction network 22 may output a feature vector of a sample object in the sample image.
And 104, determining a loss value based on the predicted feature vector of the sample object in each sample image and the class of the sample object, and adjusting model parameters of the target detection network and the feature extraction network based on the loss value.
In an embodiment of the present application, referring to fig. 2, the loss calculation may be performed according to the predicted feature vector of the sample object in each sample image output by the feature extraction network 22. Specifically, the class of the sample object in each sample image may be predicted according to the predicted feature vector of the sample object in each sample image output by the feature extraction network 22, the predicted class and the labeled class are substituted into the loss function to determine a loss value, and then the model parameters of the target detection network and the feature extraction network in the image retrieval model are adjusted according to the loss value, and the trained image retrieval model is obtained through multiple iterative optimization.
The loss function may be set as needed; for example, it may be set as a cross-entropy loss function, a mean squared error (MSE) loss function, or another loss function, which is not limited in this application.
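For concreteness, one training iteration can be sketched as follows, assuming a PyTorch implementation in which the target detection network and the feature extraction network are differentiable modules and class prediction uses a linear classification head with a cross-entropy loss; all module and function names here are illustrative stand-ins, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDetector(nn.Module):
    """Stand-in for the target detection network: returns sample image blocks
    (ROIs) plus their size information. Here every ROI is just the whole image."""
    def forward(self, images):
        n, _, h, w = images.shape
        sizes = torch.tensor([[w, h]], dtype=torch.float32).repeat(n, 1)
        return images, sizes

class ToyExtractor(nn.Module):
    """Stand-in for the feature extraction network: fuses an image feature
    with a size feature to produce the predicted feature vector."""
    def __init__(self, dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(3, dim))
        self.size_mlp = nn.Linear(2, dim)
    def forward(self, blocks, sizes):
        return self.backbone(blocks) + self.size_mlp(sizes)

def train_step(detector, extractor, classifier, optimizer, images, labels):
    blocks, sizes = detector(images)           # step 102: ROIs + size info
    features = extractor(blocks, sizes)        # step 103: predicted feature vectors
    logits = classifier(features)              # predict the class of each sample object
    loss = F.cross_entropy(logits, labels)     # step 104: loss from predicted vs labeled class
    optimizer.zero_grad()
    loss.backward()                            # adjust model parameters via the loss value
    optimizer.step()
    return loss.item()

detector, extractor = ToyDetector(), ToyExtractor()
classifier = nn.Linear(32, 4)                  # 4 categories, as in the example above
optimizer = torch.optim.SGD(
    list(extractor.parameters()) + list(classifier.parameters()), lr=0.01)
loss = train_step(detector, extractor, classifier, optimizer,
                  torch.rand(8, 3, 24, 24), torch.randint(0, 4, (8,)))
```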
In the embodiment of the application, when the image retrieval model is trained, the sample image blocks and their size information are input into the feature extraction network together for model training. Compared with directly inputting ROIs scaled to a fixed size into the network, this allows the image retrieval model to learn the real form of the subject object in the image, such as the length of a garment, so that the accuracy of retrieval results is improved when the image retrieval model is used for image retrieval.
To sum up, the training method of the image retrieval model provided in the embodiment of the present application obtains a plurality of sample images, each sample image is labeled with a category to which a sample object included therein belongs, each sample image is input to the target detection network in the image retrieval model to obtain a sample image block in a corresponding sample image and size information of the sample image block, the sample image block includes a sample object in the sample image, the size information of the sample image block and the sample image block in each sample image is input to the feature extraction network in the image retrieval model to obtain a predicted feature vector of the sample object in the corresponding sample image, a loss value is determined based on the predicted feature vector of the sample object in each sample image and the category to which the sample object belongs, and model parameters of the target detection network and the feature extraction network are adjusted based on the loss value. Therefore, the training of the image retrieval model can be realized, and the retrieval accuracy of the image retrieval model when the image retrieval model is used for image retrieval is high.
In one possible implementation form, the feature extraction network may include a first size feature extraction module, an image feature extraction module, and a first fusion module connected to the first size feature extraction module and the image feature extraction module. With reference to fig. 3, the following further explains a process of inputting sample image blocks in each sample image and size information of the sample image blocks into the feature extraction network in the image retrieval model in the training method of the image retrieval model in the embodiment of the present application for the feature extraction network with the above structure to obtain predicted feature vectors of sample objects in corresponding sample images.
Fig. 3 is a flowchart illustrating a training method of an image retrieval model according to a second embodiment of the present application. As shown in fig. 3, the training method of the image retrieval model may include the following steps 301-306.
Step 301, a plurality of sample images are obtained, and each sample image is labeled according to the category of the sample object included in the sample image.
Step 302, inputting each sample image into a target detection network in the image retrieval model to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, wherein the sample image blocks include sample objects in the sample image.
For the specific implementation process and principle of steps 301 to 302, reference may be made to the description of other embodiments, which are not described herein again.
Step 303, for each sample image, inputting the size information of the sample image block therein into a first size feature extraction module included in the feature extraction network, so as to obtain a first size feature of the sample image block based on the size information.
Referring to fig. 4, the feature extraction network may include a first size feature extraction module 41, an image feature extraction module 42, and a first fusion module 43 connected to the first size feature extraction module 41 and the image feature extraction module 42. The first size feature extraction module 41 is configured to extract a size feature of an arbitrary image block. The image feature extraction module 42 is configured to extract an image feature vector of any image block. And a first fusion module 43, configured to fuse the size feature and the image feature vector to obtain a predicted feature vector of the subject object in the image.
In the embodiment of the present application, for each sample image, the size information of the sample image block in the sample image is input to the first size feature extraction module 41, so that the first size feature of the sample image block can be obtained.
The size information of the sample image block may include the width and the height of the sample image block, where $I_{iw}$ denotes the width of the i-th sample image block and $I_{ih}$ denotes its height.

In one possible implementation form, for the i-th sample image block, the width $I_{iw}$ and the height $I_{ih}$ of the sample image block may be directly used as the first size feature of the sample image block. In this case, the dimension of the first size feature is 2.

In another possible implementation form, for the i-th sample image block, the ratio of the width to the height of the sample image block,

$$r_i = \frac{I_{iw}}{I_{ih}},$$

i.e., the aspect ratio, may be determined as the first size feature of the sample image block. In this case, the dimension of the first size feature is 1.
In another possible implementation form, the width-to-height ratio of the sample image block in each sample image may be calculated separately, and the mean $\mu_r$ and standard deviation $\sigma_r$ of these ratios determined; then, for the i-th sample image block, the first size feature of the sample image block may be determined by the following formula (1). In this case, the dimension of the first size feature is 1.

$$f_{ratio} = \frac{r_i - \mu_r}{\sigma_r} \tag{1}$$

where $f_{ratio}$ denotes the first size feature of the i-th sample image block; $r_i$ denotes the ratio of the width to the height of the i-th sample image block; $\mu_r$ denotes the mean of the width-to-height ratios of all sample image blocks; and $\sigma_r$ denotes the standard deviation of the width-to-height ratios of all sample image blocks.
In another possible implementation form, the logarithmic ratio of the width to the height of the sample image block in each sample image may be calculated separately, and the mean $\mu_{la}$ and standard deviation $\sigma_{la}$ of these logarithmic ratios determined; then, for the i-th sample image block, the first size feature may be determined by the following formula (2), where the logarithmic ratio is the logarithm of the width-to-height ratio. In this case, the dimension of the first size feature is 1.

$$f_{ratio} = \frac{\log r_i - \mu_{la}}{\sigma_{la}} \tag{2}$$

where $f_{ratio}$ denotes the first size feature of the i-th sample image block; $\log r_i$ denotes the logarithmic ratio of the width to the height of the i-th sample image block; $\mu_{la}$ denotes the mean of the logarithmic ratios of all sample image blocks; and $\sigma_{la}$ denotes the standard deviation of the logarithmic ratios of all sample image blocks.
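The four variants above can be sketched in plain Python as follows (a minimal sketch; the function name is illustrative, and the population statistics are taken over all sample image blocks as described above):

```python
import math

def first_size_features(widths, heights):
    """Compute the four first-size-feature variants for each sample image block:
    raw (width, height), aspect ratio r_i, normalized ratio per formula (1),
    and normalized log-ratio per formula (2)."""
    ratios = [w / h for w, h in zip(widths, heights)]
    log_ratios = [math.log(r) for r in ratios]

    def mean(xs):
        return sum(xs) / len(xs)

    def std(xs):
        m = mean(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

    mu_r, sigma_r = mean(ratios), std(ratios)
    mu_la, sigma_la = mean(log_ratios), std(log_ratios)
    return [
        {
            "width_height": (w, h),                     # 2-dimensional variant
            "ratio": r,                                 # aspect-ratio variant
            "norm_ratio": (r - mu_r) / sigma_r,         # formula (1)
            "norm_log_ratio": (lr - mu_la) / sigma_la,  # formula (2)
        }
        for (w, h), r, lr in zip(zip(widths, heights), ratios, log_ratios)
    ]

features = first_size_features([120, 80, 200], [60, 160, 100])
```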
Step 304, inputting the sample image block into an image feature extraction module included in the feature extraction network to perform image feature extraction on the sample image block, so as to obtain an image feature vector of the sample image block.
In the embodiment of the present application, the sample image blocks may be input to the image feature extraction module to perform size scaling on the sample image blocks to obtain target sample image blocks of a preset size, and perform image feature extraction on the target sample image blocks of the preset size to obtain image feature vectors of sample objects in the sample image blocks.
In one possible implementation form, referring to fig. 4, the image feature extraction module 42 may include an image processing sub-module 421 and a first backbone network 422 connected in sequence. The image processing sub-module 421 is configured to perform size scaling on any image block to obtain an image block of a preset size and obtain the tensor corresponding to the image block. The first backbone network 422 is configured to perform feature extraction on any image block to obtain an image feature vector; it may be any network capable of feature extraction, and this application is not limited thereto.
Referring to fig. 4, for any sample image block, the sample image block may be input to the image processing sub-module 421 to be scaled to a first image block of a preset size, obtaining a first tensor corresponding to the first image block; the first tensor is then input to the first backbone network 422 for image feature extraction based on the first tensor, yielding the image feature vector of the sample image block.
The preset size may be set as needed, which is not limited in this application. In general, the preset size may be set to be equal in width and height, such as 24 × 24 pixels.
The first tensor may include the pixel values of each pixel in the sample image block across multiple channels (for example, a red channel, a blue channel, and a green channel).
It should be noted that, in order to enable the first backbone network to extract image features from the first tensor, the number of input channels of the first convolution kernel in the first backbone network needs to be equal to the number of channels of the first tensor. For example, when the size of the first tensor is 24 × 24 × 3, that is, the number of channels of the first tensor is 3, the size of the first convolution kernel in the first backbone network is [64, 3, 7, 7], where 64 denotes the number of output channels of the convolution kernel and 3 denotes the number of input channels of the convolution kernel.
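A minimal sketch of this module in PyTorch, assuming bilinear scaling to the 24 × 24 example size and a toy backbone whose first convolution takes 3 input channels (all layer choices and names are illustrative, not the patent's actual network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PRESET = 24  # the example preset size above (24 x 24 pixels)

def extract_image_features(block: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """Image processing sub-module + first backbone network: scale the sample
    image block to the preset size, then extract an image feature vector."""
    first_tensor = F.interpolate(block.unsqueeze(0), size=(PRESET, PRESET),
                                 mode="bilinear", align_corners=False)
    return backbone(first_tensor)

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # weight shape [64, 3, 7, 7]
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
features = extract_image_features(torch.rand(3, 180, 90), backbone)  # shape (1, 64)
```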
Step 305, inputting the first size feature and the image feature vector into a first fusion module connected with the first size feature extraction module and the image feature extraction module in the feature extraction network, so as to fuse the first size feature and the image feature vector to obtain a predicted feature vector of the sample object in the sample image.
In a possible implementation form, with reference to fig. 5, the first fusion module 43 comprises a multilayer perceptron sub-module 51 and a fusion sub-module 52 connected in sequence. The multi-layer perceptron sub-module 51 may perform dimension expansion on the size features to obtain the size features with preset dimensions. The fusion sub-module 52 may fuse the size feature of the preset dimension and the image feature vector to obtain a predicted feature vector.
Referring to fig. 5, for any sample image block, the first size feature of the sample image block may be input to the multi-layered perceptron sub-module 51 to perform dimension expansion on the first size feature to obtain a second size feature, and then the second size feature and the image feature vector are input to the fusion sub-module 52 to fuse the second size feature and the image feature vector to obtain a predicted feature vector of the sample object in the sample image.
The second size feature and the image feature vector can be fused in the following ways to obtain the predicted feature vector of the sample object in the sample image.
For example, the second-size feature and the image feature vector may be concatenated or added to obtain a predicted feature vector of the sample object in the sample image. Alternatively, the second size feature and the image feature vector may be fused based on a self-attention mechanism to obtain a predicted feature vector of the sample object in the sample image.
For example, assuming that the second-size feature is a 128-dimensional feature and the image feature vector is a 2048-dimensional feature, the second-size feature and the image feature vector are spliced to obtain a 2176-dimensional predicted feature vector.
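Under the concatenation variant, the first fusion module can be sketched as follows (dimensions follow the example above: a 1-dimensional first size feature expanded to 128 dimensions, then concatenated with a 2048-dimensional image feature vector to give 2176 dimensions; the MLP's hidden width is an illustrative assumption):

```python
import torch
import torch.nn as nn

class FirstFusionModule(nn.Module):
    """Multi-layer perceptron sub-module + fusion sub-module (concatenation)."""
    def __init__(self, size_in=1, size_out=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(size_in, 64), nn.ReLU(),
                                 nn.Linear(64, size_out))
    def forward(self, first_size_feature, image_feature_vector):
        second_size_feature = self.mlp(first_size_feature)  # dimension expansion
        return torch.cat([second_size_feature, image_feature_vector], dim=-1)

fusion = FirstFusionModule()
predicted = fusion(torch.rand(1, 1), torch.rand(1, 2048))   # shape (1, 2176)
```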
In the embodiment of the present application, the multi-layer perceptron sub-module 51 may be configured, in the manner shown in Table 1 below, with the output feature size (i.e., dimension) corresponding to different numbers of layers $N_{layer}$, where the input feature size of the multi-layer perceptron sub-module 51 is $InFeatDim = 1$ and the output feature size satisfies $OutFeatDim \in [16, 1024]$.

TABLE 1. Layer-number design of the multi-layer perceptron sub-module
(Table 1 is reproduced as an image in the original publication; its contents are not recoverable here.)
Step 306, determining a loss value based on the predicted feature vector of the sample object in each sample image and the category to which the sample object belongs, and adjusting model parameters of the target detection network and the feature extraction network based on the loss value.
The specific implementation process and principle of step 306 may refer to the description of other embodiments, and are not described herein again.
Therefore, the training of the image retrieval model can be realized, and the retrieval accuracy of the image retrieval model when the image retrieval model is used for image retrieval is high.
In one possible implementation form, the feature extraction network may include a second size feature extraction module, an image processing module, a second fusion module connected to the second size feature extraction module and the image processing module, and a second backbone network connected to the second fusion module. With reference to fig. 6, the following further explains a process of inputting sample image blocks in each sample image and size information of the sample image blocks into the feature extraction network in the image retrieval model in the training method of the image retrieval model in the embodiment of the present application for the feature extraction network with the above structure to obtain predicted feature vectors of sample objects in corresponding sample images.
Fig. 6 is a flowchart illustrating a training method of an image retrieval model according to a third embodiment of the present application. As shown in fig. 6, the training method of the image retrieval model may include the following steps 601-607.
Step 601, a plurality of sample images are obtained, and each sample image is labeled according to the category of the sample object included in the sample image.
Step 602, inputting each sample image into a target detection network in the image retrieval model to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, where the sample image blocks include sample objects in the sample image.
The specific implementation process and principle of steps 601-602 may refer to the description of other embodiments, and are not described herein again.
Step 603, for each sample image, inputting the size information of the sample image block in the sample image into a second size feature extraction module included in the feature extraction network, so as to obtain a third size feature of the sample image block based on the size information.
Referring to fig. 7, the feature extraction network may include a second size feature extraction module 71, an image processing module 72, a second fusion module 73 connected to the second size feature extraction module 71 and the image processing module 72, and a second backbone network 74 connected to the second fusion module 73. The second size feature extraction module 71 is configured to extract size features of any image block. The image processing module 72 is configured to perform size scaling on any image block to obtain an image block with a preset size, and obtain a tensor corresponding to the image block. And a second fusion module 73, configured to fuse the size features and the tensor to obtain a fusion feature vector.
In the embodiment of the present application, for each sample image, the size information of the sample image block in the sample image is input to the second size feature extraction module 71, so as to obtain the third size feature of the sample image block.
The size information of the sample image block may include a width and a height of the sample image block. The manner of obtaining the third size feature of the sample image block based on the size information may refer to the manner of obtaining the first size feature of the sample image block based on the size information in other embodiments, which are similar to each other and are not described herein again.
Step 604, inputting the sample image block into an image processing module included in the feature extraction network, so as to perform size scaling on the sample image block, obtain a second image block with a preset size, and obtain a second tensor corresponding to the second image block.
The preset size may be set as needed, which is not limited in this application. In general, the preset size may be set to be equal in width and height, such as 24 × 24 pixels.
The second tensor can include pixel values of pixel points in the sample image block in multiple channels (such as a red channel, a blue channel, and a green channel).
Referring to fig. 7, the sample image block is input to the image processing module 72 included in the feature extraction network, so that the sample image block may be scaled in size to obtain a second image block with a preset size, and a second tensor corresponding to the second image block is obtained.
And 605, inputting the third size feature and the second tensor into a second fusion module connected with the second size feature extraction module and the image processing module in the feature extraction network, so as to fuse the third size feature and the second tensor to obtain a fusion feature vector.
Referring to fig. 7, the third dimensional feature and the second tensor are input to a second fusion module 73 connected to the second dimensional feature extraction module 71 and the image processing module 72 in the feature extraction network, so that the third dimensional feature and the second tensor can be fused to obtain a fused feature vector.
In an embodiment of the present application, the third size feature and the second tensor can be spliced along the channel dimension to obtain a fusion feature vector.
Specifically, the third size feature and the second tensor can be spliced along the channel dimension in the following manner to obtain the fusion feature vector, that is, step 605 can be implemented in the following manner:
inputting the third size feature and the second tensor into the second fusion module, and, for each pixel in the sample image block, concatenating the third size feature with the pixel's values across the multiple channels to obtain the pixel's feature vector over the multiple channels; and generating the fusion feature vector based on the feature vectors of all pixels in the sample image block over the multiple channels.
For example, assume that the dimension of the third size feature is 1 and the size of the second tensor is 24 × 24 × 3, that is, the second tensor includes the pixel values of the 24 × 24 pixels in the sample image block in 3 channels (red, blue, and green). For each pixel in the sample image block, the third size feature may be concatenated with the pixel's values in the 3 channels to obtain a feature vector over 4 channels; that is, each pixel's feature vector includes the pixel values of the red, blue, and green channels plus the third size feature. A 24 × 24 × 4-dimensional fusion feature vector can thus be obtained from the feature vectors of the 24 × 24 pixels in the sample image block over the 4 channels.
It should be noted that the range of pixel values in the image is [-128, 128], while the third size feature, when one-dimensional, is a small fractional value. If this small value were concatenated directly with the pixel values across channels, the resulting fusion feature vector could cause fluctuations in the model. The third size feature may therefore first be normalized to approximately the pixel-value range and then concatenated with the pixel values across channels, so as to ensure the retrieval accuracy of the model.
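A minimal sketch of this channel-wise fusion, assuming a scalar third size feature that has already been normalized toward the pixel-value range as suggested above (shapes follow the 24 × 24, 3-channel example; all names are illustrative):

```python
import torch

def fuse_size_channel(second_tensor: torch.Tensor, size_feature: float) -> torch.Tensor:
    """Second fusion module: broadcast the scalar third size feature to a
    constant extra channel and concatenate along the channel dimension."""
    # second_tensor: (3, 24, 24) pixel values -> returns (4, 24, 24)
    size_channel = torch.full_like(second_tensor[:1], size_feature)
    return torch.cat([second_tensor, size_channel], dim=0)

fused = fuse_size_channel(torch.rand(3, 24, 24), size_feature=47.4)
```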
And 606, inputting the fusion feature vector into a second backbone network connected with the second fusion module in the feature extraction network, so as to extract features based on the fusion feature vector and obtain a predicted feature vector of the sample object in the sample image.

The second backbone network is configured to perform feature extraction on the fusion feature vector to obtain a predicted feature vector; it may be any network capable of feature extraction, and this application is not limited thereto.
Referring to fig. 7, inputting the fused feature vector into the second backbone network 74 connected to the second fusion module 73 in the feature extraction network can implement feature extraction based on the fused feature vector, and obtain a predicted feature vector of a sample object in a sample image.
It should be noted that, in order to enable the second backbone network to extract features from the fusion feature vector, the number of input channels of the first convolution kernel in the second backbone network needs to be equal to the number of channels of the fusion feature vector. For example, when the fusion feature vector is 24 × 24 × 4-dimensional, that is, its number of channels is 4, the size of the first convolution kernel in the second backbone network is [64, 4, 7, 7], where 64 denotes the number of output channels of the convolution kernel and 4 denotes the number of input channels of the convolution kernel.
The weights of the first convolution kernel in the second backbone network may be initialized in an orthogonal manner, or in other manners, which is not limited in this application.
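As a sketch, matching the first convolution of the second backbone network to the 4-channel fused input and initializing its weights orthogonally could look like this (kernel size, stride, and padding are illustrative assumptions):

```python
import torch.nn as nn

first_conv = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=7,
                       stride=2, padding=3)   # weight shape [64, 4, 7, 7]
nn.init.orthogonal_(first_conv.weight)        # orthogonal weight initialization
```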
Step 607, determining a loss value based on the predicted feature vector of the sample object in each sample image and the category to which the sample object belongs, and adjusting the model parameters of the target detection network and the feature extraction network based on the loss value.
For the specific implementation process and principle of step 607, reference may be made to the description of other embodiments, which are not described herein again.
Therefore, the training of the image retrieval model can be realized, and the retrieval accuracy of the image retrieval model is high when the image retrieval model is used for image retrieval.
Based on the training method of the image retrieval model in the above embodiment, an image retrieval method is provided in the embodiment of the present application. The following describes an image retrieval method provided in an embodiment of the present application.
The image retrieval method provided by the embodiment of the present application is executed by an image retrieval device. The image retrieval device can be an electronic device, and can also be configured in the electronic device, so that the accuracy of image retrieval is improved by executing the image retrieval method provided by the embodiment of the application.
The electronic device may be a personal computer (PC), a cloud device, a mobile device, a server, or the like, and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a vehicle-mounted device, which is not limited in this application.
Fig. 8 is a flowchart illustrating an image retrieval method according to a fourth embodiment of the present application. As shown in fig. 8, the image retrieval method may include the following steps 801-804.
Step 801, obtaining a retrieval image to be retrieved.
The retrieval image comprises a target object, and the target object is a main object in the retrieval image and can be a commodity, a person, an animal, an article and the like.
Step 802, inputting the search image into a target detection network in the image search model to obtain a target image block in the search image and size information of the target image block, wherein the target image block comprises a target object in the search image.
The image retrieval model is a neural network model with an image retrieval function. The image retrieval model comprises a target detection network and a feature extraction network which are sequentially connected. The image retrieval model is obtained by training based on the training method of the image retrieval model of any one of the embodiments.
The target detection network is used for detecting an object in an arbitrary image to acquire an ROI in the arbitrary image and size information of the ROI, wherein the ROI comprises a main object in the arbitrary image.
In the embodiment of the present application, referring to fig. 9, a search image is input into the target detection network 21 in the image search model, and the ROI output by the target detection network 21 is a target image block including a target object in the search image, and meanwhile, the target detection network 21 may also output size information of the target image block.
The size information of the target image block may include a height and a width of the target image block.
Step 803, inputting the target image block in the search image and the size information of the target image block into the feature extraction network in the image search model to obtain the predicted feature vector of the target object.
The feature extraction network is configured to extract a feature vector of a main object in an arbitrary image based on the ROI in the arbitrary image and size information of the ROI, where the feature vector represents attribute features of the main object, such as color, pattern, style, size, and the like of a commodity.
In the embodiment of the present application, referring to fig. 9, the target image block in the search image and the size information of the target image block are input to the feature extraction network 22 in the image search model, and the feature extraction network 22 can output the feature vector of the target object in the search image.
The specific implementation process of steps 802 to 803 is similar to the process of obtaining sample image blocks in sample images, size information of the sample image blocks, and prediction feature vectors of sample objects in the model training process, and is not described here again.
Step 804, determining a target image from the plurality of candidate images based on the predicted feature vector of the target object.
In one embodiment of the present application, step 804 may be obtained by: acquiring candidate characteristic vectors of candidate objects in each candidate image; determining the similarity between the predicted characteristic vector of the target object and the candidate characteristic vectors of the candidate objects in the candidate images; and determining the target image from the candidate images based on the similarity between the predicted characteristic vector of the target object and the candidate characteristic vector of the candidate object in each candidate image.
The number of target images may be preset, and thus a preset number of candidate images having the highest similarity to the predicted feature vector of the target object may be used as the target images.
Referring to fig. 9, the similarity between the predicted feature vector of the target object and the candidate feature vectors of the candidate objects in the respective candidate images may be calculated by the retrieval module to determine the target image from the plurality of candidate images.
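The retrieval module's behavior can be sketched as follows, assuming cosine similarity as the similarity measure (the patent does not fix a particular measure) and a preset number k of target images; the function name and k are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve(query_vector: torch.Tensor, candidate_vectors: torch.Tensor, k: int = 5):
    """Return the indices and similarities of the k candidate images whose
    candidate feature vectors are most similar to the query's predicted vector."""
    sims = F.cosine_similarity(query_vector.unsqueeze(0), candidate_vectors, dim=1)
    top = torch.topk(sims, k=min(k, candidate_vectors.shape[0]))
    return top.indices.tolist(), top.values.tolist()

indices, scores = retrieve(torch.rand(2176), torch.rand(1000, 2176))
```

The candidate feature vectors here would typically be precomputed offline, as described next.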
In an embodiment of the present application, before obtaining the candidate feature vectors of the candidate objects in each candidate image, the method further includes: for each candidate image, inputting the candidate image into a target detection network to obtain candidate image blocks in the candidate image and size information of the candidate image blocks, wherein the candidate image blocks comprise candidate objects in the candidate image; and inputting the candidate image blocks in the candidate image and the size information of the candidate image blocks into a feature extraction network to obtain the predicted feature vectors of the candidate objects.
The process of obtaining the candidate image blocks in the candidate image, the size information of the candidate image blocks and the predicted feature vectors of the candidate objects is similar to the process of obtaining the sample image blocks in the sample image, the size information of the sample image blocks and the predicted feature vectors of the sample objects in the model training process, and is not repeated here.
It can be understood that, in the image retrieval technology in the related art, in the model training stage the ROI in the image is generally scaled to a square of fixed size and then directly input into the network (hereinafter referred to as method 1). A model trained in this way easily retrieves images irrelevant to the known image (i.e., the query image), and the accuracy of the retrieval results is poor. The processing of the ROI by method 1 is shown in the third column of images in fig. 10.
In addition, to improve the retrieval accuracy of image retrieval models, the following methods 2, 3 and 4 have also been proposed in the related art. In method 2, the aspect ratio of the ROI is kept in the data processing stage and the remaining part is padded with 0-valued pixels to obtain a square image, from which features are extracted, so that the image input to the network reflects the aspect-ratio information of the image. In method 3, the aspect ratio of the ROI is kept in the data processing stage and the remaining part is padded with content from the original image to obtain a square image, from which features are extracted, again so that the input to the network reflects the aspect-ratio information. In method 4, the ROI is scaled to a square of fixed size and directly input into the network in the model training stage; in the image retrieval stage, the aspect-ratio feature and the image feature of the query image are fused, and the fused feature is used for retrieval, so as to improve the retrieval accuracy of the image retrieval model. The processing of the ROI by method 2 is shown in the fourth column of images in fig. 10, and by method 3 in the fifth column of images in fig. 10.
In methods 2 and 3, the square image used for feature extraction is padded with 0-valued pixels or with original-image content, which reduces the effective image resolution and loses detail information, so the image features extracted by the model are inaccurate and the retrieval accuracy of the model is poor. In method 4, the feature scales of the image retrieval model differ between the model training stage and the image retrieval stage, so the feature scales must be adjusted when the trained model is used for retrieval; the debugging process is cumbersome, the workload of bringing the model online increases, and the retrieval accuracy of the model is poor.
In the method provided by the embodiment of the application, the feature scales of the image retrieval model are the same in the model training stage and the image retrieval stage, so the model trained in the training stage can be used directly in the retrieval stage; this improves the retrieval accuracy of the model, avoids adjusting feature scales, and reduces the workload of bringing the model online. In addition, compared with methods 2 and 3, the training methods of the embodiments shown in fig. 3 and fig. 6 do not preserve the aspect ratio of the ROI by padding, but scale the ROI to the preset size while encoding its size information separately, so that the detail information of the image is retained to the maximum extent and the retrieval accuracy of the image retrieval model is improved.
Referring to table 2 and fig. 11 below: table 2 compares the retrieval results of the image retrieval method provided in the embodiments of the present application with those of the related-art methods on a dataset of 310,000 images, and fig. 11 compares the target images obtained by the image retrieval method of the embodiments of the present application with those obtained using a model trained by method 1.
TABLE 2 Experimental results

Method                    Accuracy (%)
Method 1                  95.94
Method 2                  95.56
Method 3                  95.14
Method shown in fig. 3    96.24
Method shown in fig. 6    96.29
As can be seen from table 2 and fig. 11, the image retrieval method provided in the embodiment of the present application can improve the retrieval accuracy compared to the related art.
In summary, the image retrieval method provided by the embodiments of the present application obtains a retrieval image to be retrieved and inputs it into the target detection network of the image retrieval model to obtain a target image block in the retrieval image and the size information of that block, where the target image block contains the target object in the retrieval image. The target image block and its size information are then input into the feature extraction network of the image retrieval model to obtain the predicted feature vector of the target object, and the target image is determined from a plurality of candidate images based on that predicted feature vector, thereby improving the accuracy of image retrieval.
Fig. 12 is a schematic structural diagram of a training apparatus for an image retrieval model according to a fifth embodiment of the present application.
As shown in fig. 12, the training apparatus 1200 for the image retrieval model may include: a first obtaining module 1210, a first processing module 1220, a second processing module 1230, and a model parameter adjusting module 1240.
The first obtaining module 1210 is configured to obtain a plurality of sample images, where each sample image is labeled with a category to which a sample object included in the sample image belongs;
the first processing module 1220 is configured to input each sample image into a target detection network in the image retrieval model, so as to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, where the sample image blocks include sample objects in the sample image;
the second processing module 1230 is configured to input the sample image blocks in each sample image and the size information of the sample image blocks into a feature extraction network in the image retrieval model, so as to obtain predicted feature vectors corresponding to sample objects in the sample images;
and the model parameter adjusting module 1240 is configured to determine a loss value based on the predicted feature vector of the sample object in each sample image and the category to which the sample object belongs, and adjust the model parameters of the target detection network and the feature extraction network based on the loss value (a training-step sketch of these four modules follows).
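As a rough illustration of how the four modules interact, the following is a minimal PyTorch-style training-step sketch; the detector, extractor and classifier are placeholders, and using a classification head with cross-entropy over the labeled categories is an assumption for illustration (the patent only requires a loss computed from the predicted feature vectors and the categories):

```python
import torch.nn.functional as F

def train_step(detector, extractor, classifier, optimizer, images, labels):
    # First processing module: sample image blocks plus their size info.
    blocks, sizes = detector(images)
    # Second processing module: predicted feature vectors for the objects.
    features = extractor(blocks, sizes)
    # Model parameter adjusting module: loss from predicted features and
    # the labeled categories, then a joint update of both networks.
    logits = classifier(features)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer here would be built over the parameters of both networks, e.g. `torch.optim.Adam(list(detector.parameters()) + list(extractor.parameters()) + list(classifier.parameters()))`, so that a single loss value adjusts the target detection network and the feature extraction network together.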
The training apparatus for the image retrieval model provided in the embodiments of the present application may be an electronic device, or may be configured in an electronic device. It implements the training of the image retrieval model described above, and the resulting image retrieval model has high retrieval accuracy when used for image retrieval.
The electronic device may be a PC, a cloud device, a mobile device, a server, and the like, and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and a vehicle-mounted device, which is not limited in this application.
In a possible implementation manner of this embodiment of the present application, the second processing module 1230 includes:
a first size feature extraction unit, configured to, for each sample image, input size information of a sample image block therein to a first size feature extraction module included in the feature extraction network to obtain a first size feature of the sample image block based on the size information;
the image feature extraction unit is used for inputting the sample image blocks into an image feature extraction module included in the feature extraction network so as to extract the image features of the sample image blocks to obtain image feature vectors of the sample image blocks;
and the first fusion unit is used for inputting the first size features and the image feature vectors into a first fusion module connected with the first size feature extraction module and the image feature extraction module in the feature extraction network, so as to fuse the first size features and the image feature vectors to obtain the predicted feature vectors of the sample objects in the sample images (a sketch of this variant follows).
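A minimal sketch of this first variant of the feature extraction network, assuming the size information arrives as a 2-dimensional tensor of width and height and using illustrative feature dimensions; concatenation followed by a linear layer stands in for the fusion, which the patent leaves open:

```python
import torch
import torch.nn as nn

class FeatureExtractionNetV1(nn.Module):
    def __init__(self, backbone: nn.Module, img_dim: int = 512, size_dim: int = 64):
        super().__init__()
        self.size_branch = nn.Linear(2, size_dim)           # first size feature
        self.backbone = backbone                            # image feature module
        self.fuse = nn.Linear(img_dim + size_dim, img_dim)  # first fusion module

    def forward(self, block: torch.Tensor, size: torch.Tensor) -> torch.Tensor:
        s = self.size_branch(size)    # (N, size_dim) from (N, 2) size info
        v = self.backbone(block)      # (N, img_dim) image feature vector
        # Fuse size and image features into the predicted feature vector.
        return self.fuse(torch.cat([s, v], dim=-1))
```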
In a possible implementation manner of the embodiment of the application, the image feature extraction module comprises an image processing sub-module and a first backbone network which are connected in sequence;
an image feature extraction unit configured to:
inputting the sample image block into an image processing sub-module to perform size scaling on the sample image block to obtain a first image block with a preset size, and acquiring a first tensor corresponding to the first image block;
and inputting the first tensor into a first backbone network to perform image feature extraction based on the first tensor to obtain an image feature vector of the sample image block (a small sketch of this sub-module follows).
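A minimal torchvision-style sketch of this image processing sub-module, assuming (illustratively) a 224×224 preset size and standard tensor conversion:

```python
from torchvision import transforms

# Image processing sub-module: scale the sample image block to the preset
# size and convert it to the first tensor, which the backbone then consumes.
to_first_tensor = transforms.Compose([
    transforms.Resize((224, 224)),  # size scaling to the preset size
    transforms.ToTensor(),          # first tensor, shape (3, 224, 224)
])
```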
In a possible implementation manner of the embodiment of the application, the first fusion module includes a multilayer perceptron sub-module and a fusion sub-module which are connected in sequence;
a first fusion unit configured to:
inputting the first size feature into a multilayer perceptron submodule to perform dimension expansion on the first size feature to obtain a second size feature;
and inputting the second size features and the image feature vectors into a fusion sub-module to fuse the second size features and the image feature vectors to obtain the predicted feature vectors of the sample objects in the sample images (one possible construction of this fusion module is sketched below).
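Sketched below, under the same assumptions as above, is one way the first fusion module could be built: the multilayer perceptron submodule expands the first size feature to the image-feature dimension (the second size feature), and element-wise addition stands in for the fusion submodule, which the patent does not pin down:

```python
import torch.nn as nn

class FirstFusionModule(nn.Module):
    def __init__(self, size_in: int = 2, img_dim: int = 512):
        super().__init__()
        # Multilayer perceptron submodule: dimension expansion of the
        # first size feature up to the image feature dimension.
        self.mlp = nn.Sequential(
            nn.Linear(size_in, img_dim),
            nn.ReLU(),
            nn.Linear(img_dim, img_dim),
        )

    def forward(self, first_size_feature, image_feature_vector):
        second_size_feature = self.mlp(first_size_feature)
        # Fusion submodule (assumed here to be element-wise addition).
        return image_feature_vector + second_size_feature
```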
In a possible implementation manner of the embodiment of the present application, the size information includes a width and a height of the sample image block, and the first size feature extraction unit is configured to determine the first size feature in one of the following ways (a small sketch of these alternatives follows this list):
determining the width and the height of the sample image block as the first size feature;
or determining the ratio of the width to the height of the sample image block as the first size feature;
or determining the ratio of the width to the height of the sample image block in each sample image, determining the mean value and the standard deviation of the ratios, and determining the first size feature based on the ratio, the mean value and the standard deviation;
or determining the logarithm of the ratio of the width to the height of the sample image block in each sample image, determining the mean value and the standard deviation of the log ratios, and determining the first size feature based on the log ratio, the mean value and the standard deviation.
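A small sketch of these alternatives in plain Python; per the text, the normalization statistics are computed over the sample image blocks of all sample images:

```python
import math

def size_feature_raw(w: float, h: float):
    # Option 1: use the width and height themselves.
    return (w, h)

def size_feature_ratio(w: float, h: float) -> float:
    # Option 2: use the width-to-height ratio alone.
    return w / h

def size_feature_standardized(w: float, h: float, all_sizes, use_log: bool = False) -> float:
    # Options 3 and 4: standardize the (log-)ratio using the mean and
    # standard deviation over all sample image blocks.
    def rep(width, height):
        r = width / height
        return math.log(r) if use_log else r
    xs = [rep(wi, hi) for wi, hi in all_sizes]
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return (rep(w, h) - mean) / std
```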
In a possible implementation manner of this embodiment of the present application, the second processing module 1230 includes:
a second size feature extraction unit, configured to, for each sample image, input size information of a sample image block in the sample image into a second size feature extraction module included in the feature extraction network to obtain a third size feature of the sample image block based on the size information;
the image processing unit is used for inputting the sample image blocks into an image processing module included in the feature extraction network so as to perform size scaling on the sample image blocks to obtain second image blocks with preset sizes and obtain second tensors corresponding to the second image blocks;
the second fusion unit is used for inputting the third size feature and the second tensor into a second fusion module connected with the second size feature extraction module and the image processing module in the feature extraction network, so as to fuse the third size feature and the second tensor to obtain a fusion feature vector;
and the fusion feature extraction unit is used for inputting the fusion feature vector into a second backbone network connected with the second fusion module in the feature extraction network, so as to extract features based on the fusion feature vector and obtain the predicted feature vector of the sample object in the sample image.
In a possible implementation manner of the embodiment of the application, the second tensor includes pixel values of each pixel point in the sample image block in a plurality of channels;
a second fusion unit configured to:
inputting the third size feature and the second tensor into the second fusion module, and splicing, for each pixel point in the sample image block, the third size feature with the pixel values of the pixel point in the multiple channels to obtain feature vectors of the pixel points in the multiple channels;
and generating the fusion feature vector based on the feature vectors of all the pixel points in the sample image block in the multiple channels (one plausible reading of this splicing is sketched below).
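The sketch below shows this pixel-wise splicing in PyTorch; the channel layout and the broadcast are assumptions, since the patent only requires that the size feature be concatenated to each pixel's channel values:

```python
import torch

def fuse_size_into_channels(second_tensor: torch.Tensor,
                            third_size_feature: torch.Tensor) -> torch.Tensor:
    # second_tensor: (N, C, H, W) pixel values in multiple channels;
    # third_size_feature: (N, K) size feature per sample image block.
    n, _, h, w = second_tensor.shape
    # Broadcast the size feature over the spatial grid, then splice it
    # onto every pixel's channel values to form the fusion feature vector.
    size_maps = third_size_feature[:, :, None, None].expand(n, -1, h, w)
    return torch.cat([second_tensor, size_maps], dim=1)  # (N, C + K, H, W)
```

Under this reading, the first layer of the second backbone network would need to accept C + K input channels rather than C.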
It should be noted that the explanation in the foregoing embodiment of the training method for image retrieval models is also applicable to the training apparatus for image retrieval models in this embodiment, and details are not repeated here.
The training apparatus for the image retrieval model according to the embodiments of the present application obtains a plurality of sample images, each labeled with the category to which the sample object it contains belongs. Each sample image is input into the target detection network of the image retrieval model to obtain a sample image block in the corresponding sample image and the size information of that block, where the sample image block contains the sample object. The sample image block and its size information are input into the feature extraction network of the image retrieval model to obtain the predicted feature vector of the sample object, a loss value is determined based on the predicted feature vectors and the categories of the sample objects, and the model parameters of the target detection network and the feature extraction network are adjusted based on the loss value. In this way the training of the image retrieval model is realized, and the model has high retrieval accuracy when used for image retrieval.
Fig. 13 is a schematic structural diagram of an image retrieval device according to a sixth embodiment of the present application.
As shown in fig. 13, the image retrieval apparatus 1300 may include: a second obtaining module 1310, a third processing module 1320, a fourth processing module 1330, and a determining module 1340.
The second obtaining module 1310 is configured to obtain a retrieval image to be retrieved;
a third processing module 1320, configured to input the retrieval image into a target detection network in the image retrieval model to obtain a target image block in the retrieval image and size information of the target image block, where the target image block includes a target object in the retrieval image;
a fourth processing module 1330, configured to input the target image block in the retrieval image and the size information of the target image block into a feature extraction network in the image retrieval model to obtain a predicted feature vector of the target object, where the image retrieval model is obtained by training based on any one of the aforementioned training methods of the image retrieval model;
the determining module 1340 is configured to determine a target image from the plurality of candidate images based on the predicted feature vector of the target object.
In a possible implementation manner of the embodiment of the present application, the determining module 1340 includes:
the acquisition unit is used for acquiring candidate characteristic vectors of candidate objects in each candidate image;
a first determination unit configured to determine a similarity between the predicted feature vector of the target object and the candidate feature vectors of the candidate objects in the respective candidate images;
a second determining unit configured to determine the target image from the plurality of candidate images based on the similarity between the predicted feature vector of the target object and the candidate feature vector of the candidate object in each of the candidate images (a compact sketch of this module follows).
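A compact sketch of this determining module, assuming (which the patent does not specify) that cosine similarity is the similarity measure and that the candidate feature vectors are stacked into one matrix:

```python
import torch
import torch.nn.functional as F

def determine_target_images(query_feature: torch.Tensor,
                            candidate_features: torch.Tensor,
                            top_k: int = 1) -> torch.Tensor:
    # query_feature: (D,) predicted feature vector of the target object;
    # candidate_features: (M, D) candidate feature vectors of M candidates.
    sims = F.cosine_similarity(query_feature[None, :], candidate_features, dim=1)
    # Return the indices of the candidate images most similar to the query.
    return torch.topk(sims, k=top_k).indices
```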
In a possible implementation manner of the embodiment of the present application, the image retrieval apparatus 1300 may further include:
the third acquisition module is used for inputting the candidate images into the target detection network for each candidate image so as to acquire candidate image blocks in the candidate images and size information of the candidate image blocks, wherein the candidate image blocks comprise candidate objects in the candidate images;
and the fifth processing module is used for inputting the candidate image blocks in the candidate image and the size information of the candidate image blocks into the feature extraction network so as to obtain the predicted feature vectors of the candidate objects (an offline indexing sketch follows).
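A small sketch of this offline step under the same assumptions as the sketches above; the detector and extractor are placeholders:

```python
import torch

def build_candidate_index(detector, extractor, candidate_images):
    # For each candidate image: detect the candidate image block and its
    # size information, then extract the candidate feature vector.
    feats = []
    with torch.no_grad():
        for img in candidate_images:
            block, size = detector(img)
            feats.append(extractor(block, size))
    return torch.stack(feats)  # (M, D) matrix used for similarity ranking
```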
The image retrieval device provided in the embodiments of the present application may be an electronic device, or may be configured in an electronic device, so as to improve the accuracy of image retrieval by executing the image retrieval method provided in the embodiments of the present application.
The electronic device may be a PC, a cloud device, a mobile device, a server, and the like, and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and a vehicle-mounted device, which is not limited in this application.
It should be noted that the explanation in the foregoing embodiment of the image retrieval method is also applicable to the image retrieval apparatus of this embodiment, and is not repeated here.
The image retrieval device acquires a retrieval image to be retrieved and inputs it into the target detection network of the image retrieval model to obtain a target image block in the retrieval image and the size information of that block, where the target image block contains the target object in the retrieval image. The target image block and its size information are input into the feature extraction network of the image retrieval model to obtain the predicted feature vector of the target object, and the target image is determined from a plurality of candidate images based on that predicted feature vector, so that the accuracy of image retrieval is improved.
In order to implement the above embodiments, the present application also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training an image retrieval model as set forth in any one of the embodiments of the present application or to perform an image retrieval method as set forth in any one of the embodiments of the present application.
In order to achieve the above embodiments, the present application also proposes a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute a training method of an image retrieval model as proposed in any of the foregoing embodiments of the present application or execute an image retrieval method as proposed in any of the foregoing embodiments of the present application.
In order to implement the above embodiments, the present application also proposes a computer program product comprising a computer program which, when executed by a processor, implements a training method of an image retrieval model as proposed in any of the previous embodiments of the present application, or implements an image retrieval method as proposed in any of the previous embodiments of the present application.
FIG. 14 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. The electronic device 1400 shown in fig. 14 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 14, the electronic device 1400 is embodied in the form of a general purpose computing device. The components of the electronic device 1400 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 1400 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 1400 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The electronic device 1400 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in fig. 14, and commonly referred to as a "hard drive"). Although not shown in fig. 14, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of the embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The electronic device 1400 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 1400, and/or with any devices (e.g., a network card, a modem, etc.) that enable the electronic device 1400 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the electronic device 1400 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 1400 via the bus 18. It should be appreciated that although not shown in fig. 14, other hardware and/or software modules may be used in conjunction with the electronic device 1400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (22)

1. A method for training an image retrieval model, the method comprising:
obtaining a plurality of sample images, wherein each sample image is labeled according to the category of a sample object included in the sample image;
inputting each sample image into a target detection network in an image retrieval model to obtain sample image blocks in corresponding sample images and size information of the sample image blocks, wherein the sample image blocks comprise sample objects in the sample images;
inputting sample image blocks in each sample image and size information of the sample image blocks into a feature extraction network in the image retrieval model to obtain predicted feature vectors corresponding to sample objects in the sample images;
and determining a loss value based on the predicted feature vector of the sample object in each sample image and the class of the sample object, and adjusting model parameters of the target detection network and the feature extraction network based on the loss value.
2. The method according to claim 1, wherein the inputting sample image blocks in each sample image and size information of the sample image blocks into a feature extraction network in the image retrieval model to obtain predicted feature vectors corresponding to sample objects in the sample images comprises:
for each sample image, inputting the size information of a sample image block in the sample image into a first size feature extraction module included in the feature extraction network so as to obtain a first size feature of the sample image block based on the size information;
inputting the sample image blocks into an image feature extraction module included in the feature extraction network to perform image feature extraction on the sample image blocks to obtain image feature vectors of the sample image blocks;
and inputting the first size feature and the image feature vector into a first fusion module connected with the first size feature extraction module and the image feature extraction module in the feature extraction network, so as to fuse the first size feature and the image feature vector to obtain a predicted feature vector of a sample object in the sample image.
3. The method according to claim 2, wherein the image feature extraction module comprises an image processing sub-module and a first backbone network connected in sequence;
the inputting the sample image blocks into an image feature extraction module included in the feature extraction network to perform image feature extraction on the sample image blocks to obtain image feature vectors of the sample image blocks includes:
inputting the sample image block into the image processing sub-module to perform size scaling on the sample image block to obtain a first image block with a preset size, and acquiring a first tensor corresponding to the first image block;
and inputting the first tensor into the first backbone network to perform image feature extraction based on the first tensor to obtain an image feature vector of the sample image block.
4. The method of claim 2, wherein the first fusion module comprises a multilayer perceptron sub-module and a fusion sub-module connected in series;
the inputting the first size feature and the image feature vector into a first fusion module connected to the first size feature extraction module and the image feature extraction module in the feature extraction network to fuse the first size feature and the image feature vector to obtain a predicted feature vector of a sample object in the sample image, includes:
inputting the first size feature into the multilayer perceptron submodule to perform dimension expansion on the first size feature to obtain a second size feature;
and inputting the second size features and the image feature vectors into the fusion sub-module to fuse the second size features and the image feature vectors to obtain the predicted feature vectors of the sample objects in the sample images.
5. The method according to any of claims 2-4, wherein the size information comprises a width and a height of the sample image block; the obtaining a first size feature of the sample image block based on the size information includes:
determining the width and the height of the sample image block as the first size feature;
or, determining the ratio of the width to the height of the sample image block as the first size feature;
or determining the ratio of the width to the height of the sample image block in each sample image, determining the mean value and the standard deviation of the ratios, and determining the first size feature based on the ratio, the mean value and the standard deviation;
or determining the logarithm of the ratio of the width to the height of the sample image block in each sample image, determining the mean value and the standard deviation of the log ratios, and determining the first size feature based on the log ratio, the mean value and the standard deviation.
6. The method according to claim 1, wherein the inputting sample image blocks in each of the sample images and size information of the sample image blocks into a feature extraction network in the image retrieval model to obtain predicted feature vectors corresponding to sample objects in the sample images comprises:
for each sample image, inputting size information of sample image blocks in the sample image into a second size feature extraction module included in the feature extraction network to obtain third size features of the sample image blocks based on the size information;
inputting the sample image blocks into an image processing module included in the feature extraction network to perform size scaling on the sample image blocks to obtain second image blocks with preset sizes, and acquiring second tensors corresponding to the second image blocks;
inputting the third size feature and the second tensor into a second fusion module connected with the second size feature extraction module and the image processing module in the feature extraction network, so as to fuse the third size feature and the second tensor to obtain a fusion feature vector;
and inputting the fusion feature vector into a second backbone network connected with the second fusion module in the feature extraction network, so as to extract features based on the fusion feature vector and obtain a predicted feature vector of a sample object in the sample image.
7. The method of claim 6, wherein the second tensor comprises pixel values of pixels in the sample image block in a plurality of channels;
the inputting the third size feature and the second tensor into a second fusion module connected to the second size feature extraction module and the image processing module in the feature extraction network to fuse the third size feature and the second tensor to obtain a fusion feature vector includes:
inputting the third size features and the second tensor into the second fusion module, so as to splice the third size features and pixel values of the pixel points in the multiple channels for each pixel point in the sample image block, and obtain feature vectors of the pixel points in the multiple channels;
and generating the fusion feature vector based on the feature vectors of all the pixel points in the sample image block in the channels.
8. An image retrieval method, characterized in that the method comprises:
acquiring a retrieval image to be retrieved;
inputting the retrieval image into a target detection network in an image retrieval model to obtain target image blocks in the retrieval image and size information of the target image blocks, wherein the target image blocks comprise target objects in the retrieval image;
inputting target image blocks in the retrieval image and size information of the target image blocks into a feature extraction network in the image retrieval model to obtain a predicted feature vector of the target object, wherein the image retrieval model is trained based on the method of any one of claims 1-7;
determining a target image from a plurality of candidate images based on the predicted feature vector of the target object.
9. The method of claim 8, wherein determining a target image from a plurality of candidate images based on the predicted feature vector of the target object comprises:
obtaining candidate characteristic vectors of candidate objects in each candidate image;
determining similarity between the predicted feature vector of the target object and the candidate feature vectors of the candidate objects in each candidate image;
determining the target image from the plurality of candidate images based on a similarity between the predicted feature vector of the target object and the candidate feature vector of the candidate object in each of the candidate images.
10. The method of claim 9, wherein before obtaining the candidate feature vector of the candidate object in each of the candidate images, the method further comprises:
for each candidate image, inputting the candidate image into the target detection network to obtain candidate image blocks in the candidate image and size information of the candidate image blocks, wherein the candidate image blocks comprise candidate objects in the candidate image;
and inputting the candidate image blocks in the candidate image and the size information of the candidate image blocks into the feature extraction network to obtain the predicted feature vectors of the candidate objects.
11. An apparatus for training an image search model, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of sample images, and each sample image is labeled according to the category of a sample object included in the sample image;
the first processing module is used for inputting each sample image into a target detection network in an image retrieval model so as to obtain sample image blocks in the corresponding sample image and size information of the sample image blocks, wherein the sample image blocks comprise sample objects in the sample image;
the second processing module is used for inputting sample image blocks in each sample image and size information of the sample image blocks into a feature extraction network in the image retrieval model so as to obtain predicted feature vectors corresponding to sample objects in the sample images;
and the model parameter adjusting module is used for determining a loss value based on the predicted feature vector of the sample object in each sample image and the class of the sample object, and adjusting the model parameters of the target detection network and the feature extraction network based on the loss value.
12. The apparatus of claim 11, wherein the second processing module comprises:
a first size feature extraction unit, configured to, for each sample image, input size information of a sample image block therein to a first size feature extraction module included in the feature extraction network, so as to obtain a first size feature of the sample image block based on the size information;
the image feature extraction unit is used for inputting the sample image blocks into an image feature extraction module included in the feature extraction network so as to perform image feature extraction on the sample image blocks to obtain image feature vectors of the sample image blocks;
and the first fusion unit is used for inputting the first size feature and the image feature vector into a first fusion module connected with the first size feature extraction module and the image feature extraction module in the feature extraction network, so as to fuse the first size feature and the image feature vector and obtain a predicted feature vector of the sample object in the sample image.
13. The apparatus of claim 12, wherein the image feature extraction module comprises an image processing sub-module and a first backbone network connected in sequence;
the image feature extraction unit is configured to:
inputting the sample image block into the image processing sub-module to perform size scaling on the sample image block to obtain a first image block with a preset size, and acquiring a first tensor corresponding to the first image block;
and inputting the first tensor into the first backbone network to perform image feature extraction based on the first tensor to obtain an image feature vector of the sample image block.
14. The apparatus of claim 12, wherein the first fusion module comprises a multilayer perceptron sub-module and a fusion sub-module connected in series;
the first fusion unit is configured to:
inputting the first size feature into the multilayer perceptron submodule to perform dimension expansion on the first size feature to obtain a second size feature;
and inputting the second size features and the image feature vectors into the fusion sub-module to fuse the second size features and the image feature vectors to obtain the predicted feature vectors of the sample objects in the sample images.
15. The apparatus according to any of claims 12-14, wherein the size information comprises a width and a height of the sample image block; the first size feature extraction unit is configured to:
determining the width and the height of the sample image block as the first size feature;
or, determining the ratio of the width to the height of the sample image block as the first size feature;
or determining the ratio of the width to the height of the sample image block in each sample image, determining the mean value and the standard deviation of the ratios, and determining the first size feature based on the ratio, the mean value and the standard deviation;
or determining the logarithm of the ratio of the width to the height of the sample image block in each sample image, determining the mean value and the standard deviation of the log ratios, and determining the first size feature based on the log ratio, the mean value and the standard deviation.
16. The apparatus of claim 11, wherein the second processing module comprises:
a second size feature extraction unit, configured to, for each sample image, input size information of a sample image block in the sample image to a second size feature extraction module included in the feature extraction network, so as to obtain a third size feature of the sample image block based on the size information;
the image processing unit is used for inputting the sample image blocks into an image processing module included in the feature extraction network, so as to perform size scaling on the sample image blocks to obtain second image blocks with preset sizes and obtain second tensors corresponding to the second image blocks;
a second fusion unit, configured to input the third size feature and the second tensor into a second fusion module connected to the second size feature extraction module and the image processing module in the feature extraction network, so as to fuse the third size feature and the second tensor to obtain a fusion feature vector;
and the fusion feature extraction unit is used for inputting the fusion feature vector into a second backbone network connected with the second fusion module in the feature extraction network, so as to perform feature extraction based on the fusion feature vector to obtain a predicted feature vector of the sample object in the sample image.
17. The apparatus of claim 16, wherein the second tensor comprises pixel values of pixels in the sample image block in a plurality of channels;
the second fusion unit is configured to:
inputting the third size features and the second tensor into the second fusion module, so as to splice the third size features and the pixel values of the pixel points in the multiple channels for each pixel point in the sample image block, and obtain feature vectors of the pixel points in the multiple channels;
and generating the fusion feature vector based on the feature vectors of all the pixel points in the sample image block in the plurality of channels.
18. An image retrieval apparatus, characterized in that the apparatus comprises:
the second acquisition module is used for acquiring a retrieval image to be retrieved;
a third processing module, configured to input the retrieval image into a target detection network in an image retrieval model to obtain a target image block in the retrieval image and size information of the target image block, where the target image block includes a target object in the retrieval image;
a fourth processing module, configured to input a target image block in the retrieval image and size information of the target image block into a feature extraction network in the image retrieval model to obtain a predicted feature vector of the target object, where the image retrieval model is trained based on the method according to any one of claims 1 to 7;
a determination module for determining a target image from a plurality of candidate images based on the predicted feature vector of the target object.
19. The apparatus of claim 18, wherein the determining means comprises:
an obtaining unit, configured to obtain candidate feature vectors of candidate objects in each of the candidate images;
a first determination unit configured to determine a similarity between the predicted feature vector of the target object and the candidate feature vectors of the candidate objects in the respective candidate images;
a second determining unit configured to determine the target image from the plurality of candidate images based on a similarity between the predicted feature vector of the target object and a candidate feature vector of a candidate object in each of the candidate images.
20. The apparatus of claim 19, further comprising:
a third obtaining module, configured to, for each candidate image, input the candidate image into the target detection network to obtain candidate image blocks in the candidate image and size information of the candidate image blocks, where the candidate image blocks include candidate objects in the candidate image;
and the fifth processing module is used for inputting the candidate image blocks in the candidate image and the size information of the candidate image blocks into the feature extraction network so as to obtain the predicted feature vectors of the candidate objects.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7 or to perform the method of any one of claims 8-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7 or perform the method of any one of claims 8-10.
CN202211572076.2A 2022-12-08 2022-12-08 Training method of image retrieval model, image retrieval method and device Pending CN115795078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211572076.2A CN115795078A (en) 2022-12-08 2022-12-08 Training method of image retrieval model, image retrieval method and device

Publications (1)

Publication Number Publication Date
CN115795078A (en) 2023-03-14

Family

ID=85417900

Country Status (1)

Country Link
CN (1) CN115795078A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination