CN112200031A - Network model training method and device for generating a text description corresponding to an image
- Publication number: CN112200031A
- Application number: CN202011033394.2A
- Authority: CN (China)
- Prior art keywords: image, coding vector, network model, sample, training
- Prior art date: 2020-09-27
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands (Recognition of biometric, human-related or animal-related patterns in image or video data)
- G06N3/045—Combinations of networks (Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology)
- G06N3/084—Backpropagation, e.g. using gradient descent (Computing arrangements based on biological models; Neural networks; Learning methods)
Abstract
Compared with the prior art, the network model training method and device for generating a text description corresponding to an image first acquire an image training set; then, based on the samples in the image training set, determine an image coding vector for the image in each sample and a text coding vector for the text description corresponding to that image; next, associate the image coding vector with the text coding vector to obtain an associated feature vector for the sample; and finally, input the associated feature vector into a network model for training, completing the training when the loss function of the network model satisfies a preset condition, so as to obtain a trained network model. In this way, a network model that can automatically annotate images with text descriptions is obtained, which greatly reduces the labor cost of manual text annotation. Applying this network model to the pedestrian re-identification task can improve re-identification performance, so the method has practical application value.
Description
Technical Field
The application relates to the technical field of computer vision processing, in particular to a technology for generating a text description corresponding to an image.
Background
Pedestrian re-identification technology in the field of computer vision is used to identify and match the same target pedestrian across images acquired by different surveillance devices, and is of great significance for research and application in fields such as intelligent security. In real scenes, owing to factors such as human body posture, changes in shooting angle and lighting conditions, accurately identifying and matching a target pedestrian is a very challenging problem in pedestrian re-identification.
In recent years, with the successful application of deep-learning algorithms in computer vision and the continuous emergence of large-scale datasets, many pedestrian re-identification methods use auxiliary information such as human body postures, attributes of body parts, and text descriptions of images, in addition to the features extracted from the images themselves, to improve re-identification performance.
A text description can provide specific and comprehensive information, so it is semantically richer than visual attributes, and the text descriptions of different images of the same target pedestrian are generally more consistent. Therefore, a pedestrian re-identification method that utilizes the text descriptions of images can reduce the problem of pedestrian appearance variation and improve identification and matching accuracy.
However, manually annotating text descriptions for the many images used in pedestrian re-identification, and especially for the image datasets used by deep-learning-based re-identification methods, requires high labor cost and is inefficient, which limits the application of image text descriptions in pedestrian re-identification.
Disclosure of Invention
The application aims to provide a method and device for training a network model for generating a text description corresponding to an image, so as to solve the technical problems in the prior art that manually annotating text descriptions on image datasets for deep-learning-based pedestrian re-identification is labor-intensive and inefficient.
According to one aspect of the application, a network model training method for generating a text description corresponding to an image is provided, wherein the method comprises the following steps:
acquiring an image training set, wherein each sample in the image training set comprises an image and a text description corresponding to the image;
determining an image coding vector of an image in the sample and a text coding vector of a text description corresponding to the image based on the sample in the image training set;
associating the image coding vector with the text coding vector to obtain an associated feature vector of the sample;
and inputting the associated feature vectors into a neural network for training, and finishing the training of the neural network when the loss function of the neural network meets a preset condition so as to obtain a trained network model.
Optionally, wherein the method further comprises:
acquiring a target image for which a text description is to be generated;
determining an image coding vector of the target image based on the target image;
and inputting the image coding vector into the trained network model to obtain the text description corresponding to the target image.
Optionally, wherein before the determining an image coding vector of the target image based on the target image, the method further comprises:
unifying the picture style of the target image with the picture style in the image training set.
Optionally, the inputting the image coding vector into the trained network model to obtain the caption corresponding to the target image includes:
inputting the image coding vector into the trained network model, and extracting attribute characteristic information of the target image;
generating scattered text description information based on the attribute characteristic information of the target image;
and integrating the scattered text description information to obtain the text description corresponding to the target image.
According to another aspect of the present application, there is also provided a network model training apparatus for generating a corresponding caption of an image, wherein the apparatus includes:
the device comprises a first device, a second device and a third device, wherein the first device is used for acquiring an image training set, and each sample in the image training set comprises an image and a word description corresponding to the image;
the second device is used for determining an image coding vector of an image in the sample and a text coding vector of a text description corresponding to the image based on the sample in the image training set;
the third device is used for associating the image coding vector with the text coding vector to obtain an associated feature vector of the sample;
and the fourth device is used for inputting the associated feature vectors into a neural network for training, and finishing the training of the neural network when the loss function of the neural network meets a preset condition so as to obtain a trained network model.
Compared with the prior art, the network model training method and device for generating a text description corresponding to an image first acquire an image training set; then, based on the samples in the image training set, determine an image coding vector for the image in each sample and a text coding vector for the text description corresponding to that image; next, associate the image coding vector with the text coding vector to obtain an associated feature vector for the sample; and finally, input the associated feature vector into a network model for training, completing the training when the loss function of the network model satisfies a preset condition, so as to obtain a trained network model. In this way, a network model that can automatically annotate images with text descriptions is obtained, which greatly reduces the labor cost of manual text annotation. Applying this network model to the pedestrian re-identification task can improve re-identification performance, so the method has practical application value.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 illustrates a flow diagram of a network model training method for generating image-corresponding text descriptions, according to one aspect of the subject application;
FIG. 2 illustrates a schematic diagram of a network model architecture for generating corresponding textual descriptions of images, in accordance with an aspect of the subject application;
FIG. 3 illustrates the structure of a convolutional layer of the CNN module in a network model for generating a text description corresponding to an image, according to an aspect of the subject application;
FIG. 4 illustrates a schematic diagram of a network model training apparatus for generating image corresponding text descriptions, according to an aspect of the subject application;
the same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, each module and trusted party of the system includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
In order to further explain the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.
FIG. 1 illustrates a flow diagram of a network model training method for generating image-corresponding text descriptions in one aspect of the application, wherein the method of an embodiment comprises:
s11, acquiring an image training set, wherein each sample in the image training set comprises an image and a text description corresponding to the image;
s12, determining image coding vectors of images in the samples and character coding vectors of the character descriptions corresponding to the images based on the samples in the image training set;
s13, correlating the image coding vector and the character coding vector to obtain a correlation characteristic vector of the sample;
s14, inputting the associated feature vectors into a neural network for training, and finishing the training of the neural network when the loss function of the neural network meets a preset condition to obtain a trained network model.
In the present application, the method is performed by a device 1, where the device 1 is a computer device and/or a cloud. The computer device includes, but is not limited to, a personal computer, a notebook computer, an industrial computer, a network host, a single network server, or a set of multiple network servers. The cloud is made up of a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing: a virtual supercomputer consisting of a collection of loosely coupled computers.
The computer device and/or cloud are merely examples, and other existing or future devices and/or resource sharing platforms, as applicable to the present application, are also intended to be included within the scope of the present application and are hereby incorporated by reference.
In this embodiment, in step S11, the device 1 acquires an image training set, where each sample in the image training set includes an image and a text description corresponding to the image, and the text description is usually saved as an annotation file corresponding to the image. The image training set may be an existing pedestrian re-identification dataset in which each sample includes an image and its corresponding text description, for example the CUHK-PEDES dataset well known in the field of pedestrian re-identification. Alternatively, an image dataset suitable for a pedestrian re-identification task may be collected and then preprocessed, that is, each image in the dataset is traversed and manually annotated with a text description, each image together with its corresponding text description is taken as one sample, and all samples form the image training set.
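For illustration only, the following is a minimal sketch of how such an image training set could be loaded, assuming PyTorch and a directory layout in which each image file has a same-named .txt annotation file containing its text description; the class name, paths and file extensions are assumptions for the example, not part of the application.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class ImageTextDataset(Dataset):
    """Each sample pairs an image with the text description stored in its annotation file."""

    def __init__(self, image_dir, annotation_dir, transform=None):
        self.image_dir = image_dir
        self.annotation_dir = annotation_dir
        self.transform = transform
        self.names = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(os.path.join(self.image_dir, name + ".jpg")).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        with open(os.path.join(self.annotation_dir, name + ".txt"), encoding="utf-8") as f:
            description = f.read().strip()
        return image, description
```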
Continuing in this embodiment, in step S12, after the device 1 acquires the image training set, each sample in the image training set is processed, wherein the images in the samples are encoded to obtain image encoding vectors of the images.
For example, the images in the samples are input into a trained neural network model consisting of a VGG16 module combined with a ReLU activation function to perform image coding, so as to obtain the image coding vectors corresponding to the images; the length of the image coding vector can be set according to actual needs, for example to 512 dimensions. The trained neural network model of VGG16 combined with the ReLU activation function can be obtained by training according to the actual situation, for example by pre-training the VGG16 network combined with the ReLU activation function on the existing ImageNet image dataset, which gives the network parameters a better initialization so that the training process converges faster; a dropout operation can also be added to the network to reduce the overfitting that may occur during training.
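As an illustrative sketch only (not the application's implementation), an image encoder of this kind could be assembled from torchvision's pretrained VGG16 with a ReLU and dropout on top; the 512-dimensional output and the dropout rate are the example settings mentioned above.

```python
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """VGG16 backbone followed by a ReLU-activated projection to a fixed-length image coding vector."""

    def __init__(self, out_dim=512, dropout=0.5):
        super().__init__()
        vgg = models.vgg16(pretrained=True)        # ImageNet pre-training for a better initialization
        self.features = vgg.features               # convolutional backbone
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, out_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),                    # reduce overfitting during training
        )

    def forward(self, x):                           # x: (batch, 3, H, W)
        return self.head(self.pool(self.features(x)))  # -> (batch, out_dim)
```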
After the device 1 acquires the image training set, each sample in the image training set is processed. In addition to encoding the images in the samples to obtain the image coding vectors of the images, the text descriptions corresponding to the images in the samples are encoded character by character or word by word to obtain the text coding vectors of the text descriptions corresponding to the images.
For example, the text description corresponding to the image in the sample is input into the basic RNN neural network model for text coding, so as to obtain a text coding vector of the text description corresponding to the image, and the length of the text coding vector can be set according to actual needs, but should be the same as the length of the image coding vector of the image, for example, set to 512 dimensions.
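A minimal sketch of such a text encoder, assuming an already-tokenized description and PyTorch's basic nn.RNN; the vocabulary size, embedding size and 512-dimensional output are illustrative.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a tokenized text description into a fixed-length text coding vector with a basic RNN."""

    def __init__(self, vocab_size, embed_dim=256, out_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, out_dim, batch_first=True)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)              # hidden: (1, batch, out_dim)
        return hidden.squeeze(0)                    # (batch, out_dim), same length as the image coding vector
```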
In this embodiment, in step S13, the device 1 associates the image coding vector of the obtained image with the text coding vector of the text description corresponding to the image, and obtains an associated feature vector of the corresponding sample.
Associating the image coding vector of the obtained image with the text coding vector of the text description corresponding to the image may be done in several ways. The two vectors may be spliced, for example splicing a 512-dimensional image coding vector and a 512-dimensional text coding vector to obtain a 1024-dimensional associated feature vector for the sample. Alternatively, corresponding channels of the two vectors may be added, for example adding the 512-dimensional image coding vector and the 512-dimensional text coding vector channel by channel to obtain a 512-dimensional associated feature vector for the sample. The two vectors may also be associated based on a weighted mapping (attention) method, for example associating the 512-dimensional image coding vector and the 512-dimensional text coding vector by weighted mapping to obtain a 512-dimensional associated feature vector for the sample.
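The three association options can be sketched as follows (PyTorch assumed; the simple per-channel softmax weighting in the attention branch is only one possible reading of the weighted-mapping option):

```python
import torch
import torch.nn.functional as F

def associate(image_vec, text_vec, mode="concat"):
    """Combine a (batch, 512) image coding vector and a (batch, 512) text coding vector."""
    if mode == "concat":                            # splicing: 512 + 512 -> 1024 dimensions
        return torch.cat([image_vec, text_vec], dim=1)
    if mode == "add":                               # corresponding channel addition: stays 512-dimensional
        return image_vec + text_vec
    if mode == "attention":                         # weighted mapping: per-channel weights over the two vectors
        stacked = torch.stack([image_vec, text_vec], dim=1)   # (batch, 2, 512)
        weights = F.softmax(stacked, dim=1)
        return (weights * stacked).sum(dim=1)                 # (batch, 512)
    raise ValueError(f"unknown mode: {mode}")
```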
Continuing in this embodiment, in step S14, the device 1 inputs the associated feature vector into a neural network for training, and when the loss function of the neural network satisfies a preset condition, completes the training of the neural network to obtain a trained network model. And constructing the association relation between the images in the sample of the image training set and the corresponding text descriptions of the images through a neural network.
The preset condition includes that the value of the loss function of the neural network reaches a preset threshold, or that within a preset number of training epochs the value of the loss function no longer decreases even though it has not reached the preset threshold.
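A small sketch of that stopping rule (the threshold and patience values are illustrative):

```python
def should_stop(loss_history, threshold=0.01, patience=10):
    """Stop when the loss reaches the preset threshold, or when it has stopped
    decreasing over the last `patience` epochs despite not reaching the threshold."""
    if not loss_history:
        return False
    if loss_history[-1] <= threshold:
        return True
    if len(loss_history) > patience:
        return min(loss_history[-patience:]) >= min(loss_history[:-patience])
    return False
```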
The trained network model is used for automatically labeling the images in the image data set of the pedestrian re-recognition task with the corresponding text descriptions of the images.
A network model structure for generating a text description corresponding to an image according to one embodiment is shown in FIG. 2. For example, the device 1 acquires an image training set in which the text description annotated on the image of one sample is "a woman in a red shirt". The image is input into the trained neural network model of VGG16 combined with the ReLU activation function for image coding, so as to obtain a 512-dimensional image coding vector. The text description annotation corresponding to the image is acquired and input into a basic RNN neural network model for text coding, so as to obtain a 512-dimensional text coding vector of the text description corresponding to the image. Then the 512-dimensional image coding vector and the 512-dimensional text coding vector are spliced into a 1024-dimensional associated feature vector, which is input into a neural network composed of a CNN module and a classification layer for training. The CNN module consists of 3 convolutional layers with GLU activation functions, and the structure of each convolutional layer is shown in FIG. 3; weight normalization, residual connections and dropout operations are added to each convolutional layer so that the neural network trains better. The CNN module outputs a 512-dimensional vector as the input of the classification layer, the classification layer downsamples the input 512-dimensional vector to obtain a 256-dimensional vector, and the feature vector is classified through a normalized softmax function.
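A rough sketch of that architecture follows, assuming PyTorch; the kernel size, dropout rate and number of output categories are not specified in the application and are placeholders here, and the softmax normalization is left to the loss during training or applied explicitly at inference.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUBlock(nn.Module):
    """One convolutional layer with weight normalization, a GLU activation, dropout and a residual connection."""

    def __init__(self, dim=512, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv = nn.utils.weight_norm(
            nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                # x: (batch, dim, length)
        out = F.glu(self.conv(self.dropout(x)), dim=1)   # GLU halves the channels back to dim
        return out + x                                   # residual connection

class CaptionNetwork(nn.Module):
    """CNN module of three ConvGLU blocks plus a classification layer that
    downsamples 512-d features to 256-d and outputs raw class logits."""

    def __init__(self, in_dim=1024, hidden_dim=512, num_classes=1000):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)        # 1024-d associated feature vector -> 512-d
        self.blocks = nn.Sequential(*[ConvGLUBlock(hidden_dim) for _ in range(3)])
        self.down = nn.Linear(hidden_dim, 256)           # downsampling to 256 dimensions
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, assoc_vec):                        # assoc_vec: (batch, 1024)
        h = self.proj(assoc_vec).unsqueeze(-1)           # (batch, 512, 1) for Conv1d
        h = self.blocks(h).squeeze(-1)                   # (batch, 512)
        return self.classifier(self.down(h))             # raw logits; softmax applied by the loss or at inference
```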
And when the numerical value of the loss function of the neural network meets a preset condition, finishing the training of the neural network to obtain a trained network model for automatically labeling the text description corresponding to the image in the image data set of the pedestrian re-recognition task.
For example, the neural network is trained using a cross-entropy loss function and optimized using a gradient-descent-based back propagation algorithm, where the optimizer may use RMSProp. The value of the cross-entropy loss function may be calculated based on the following formula:

L_ID = -Σ_{k=1}^{K} y_k · log p(k | f; θ)

wherein L_ID is the cross-entropy loss function; K is the total number of classification categories; y_k is the probability of the k-th category; θ denotes the initial parameters to be optimized; and f is the feature extracted by the neural network.
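A hedged sketch of such a training loop (PyTorch assumed; the learning rate, epoch count and data-loader interface are illustrative, and the model is assumed to output raw logits):

```python
import torch
import torch.nn as nn

def train(model, data_loader, epochs=50, lr=1e-4, device="cpu"):
    """Cross-entropy training optimized by gradient-descent-based backpropagation with RMSProp."""
    criterion = nn.CrossEntropyLoss()                        # L_ID; applies log-softmax internally
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        total = 0.0
        for assoc_vec, target in data_loader:                # associated feature vectors and class labels
            assoc_vec, target = assoc_vec.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(assoc_vec), target)
            loss.backward()                                  # backpropagation
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / max(len(data_loader), 1):.4f}")
```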
Optionally, the method for training a network model for generating a caption corresponding to an image further includes:
s15 (not shown) acquiring a target image of a caption to be generated;
s16 (not shown) determining an image coding vector for the target image based on the target image;
s17 (not shown) inputs the image coding vector into the trained network model to obtain the corresponding text description of the target image.
In one embodiment, in step S15, the device 1 acquires a target image to be recognized of the pedestrian re-recognition task, wherein the target image to be recognized has no corresponding text description.
Continuing in this embodiment, in step S16, the device 1 performs image coding on the acquired target image to be recognized to obtain the image coding vector corresponding to the target image. For example, the target image to be recognized is input into the trained neural network model of VGG16 combined with the ReLU activation function for image coding, so as to obtain the image coding vector corresponding to the image; the length of the image coding vector can be set according to actual needs, for example to 512 dimensions. The trained neural network model of VGG16 combined with the ReLU activation function can be obtained by training according to the actual situation, for example by pre-training the VGG16 network combined with the ReLU activation function on the existing ImageNet image dataset, which gives the network parameters a better initialization so that the training process converges faster; a dropout operation can also be added to the network to reduce the overfitting that may occur during training.
If the picture style of the target image to be recognized differs greatly from the picture style of the images in the samples of the image training set, for example in brightness, exposure, color saturation or hue, the trained network model obtained from the image training set (for example, CUHK-PEDES) will generalize poorly when predicting the text description of the target image to be recognized.
Optionally, before the step S16, the method further includes:
s18 (not shown) unifies the picture style of the target image with the picture style in the training set of images.
In order for the trained network model obtained from the image training set to generalize well when predicting the text description of the target image to be recognized, a style transfer technique is used before image coding to unify the picture style of the target image with the picture style of the image training set, so as to eliminate picture style differences between images and improve the generalization ability of the trained network model. For example, the picture style of the target image to be recognized may be migrated to the picture style of the sample images of the image training set using the SPGAN (Similarity Preserving Generative Adversarial Network) method; for instance, taking the images in the Duke-MTMC dataset for the pedestrian re-identification task as the target images, the SPGAN method may be used to migrate their picture style to that of the CUHK-PEDES dataset. Other image style transfer methods, such as CycleGAN or Pix2Pix, may also be adopted, and any other image style transfer method applicable to the present application falls within its scope without limitation.
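As an illustration of how such a style-unification step might be applied at inference time, the sketch below assumes a style-transfer generator (for example one trained with SPGAN or CycleGAN) has already been exported as a TorchScript module; the checkpoint name, image size and normalization are hypothetical.

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical checkpoint: a generator trained to translate Duke-MTMC-style images
# into the CUHK-PEDES picture style (e.g. with SPGAN or CycleGAN).
generator = torch.jit.load("spgan_duke_to_cuhk_generator.pt").eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

def unify_style(image_path):
    """Map a target image into the training set's picture style before image coding."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        styled = generator(image)                   # (1, 3, 256, 128), roughly in [-1, 1]
    return (styled.squeeze(0) + 1) / 2              # back to [0, 1] for the image encoder
```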
Continuing in this embodiment, in step S17, the device 1 inputs the obtained image coding vector corresponding to the target image to be recognized into the trained network model to obtain the caption corresponding to the target image to be recognized, and further saves the obtained caption corresponding to the target image to be recognized as the annotation file corresponding to the target image to be recognized.
Optionally, wherein the step S17 includes:
inputting the image coding vector into the trained network model, and extracting attribute characteristic information of the target image;
generating scattered text description information based on the attribute characteristic information of the target image;
and integrating the scattered text description information to obtain the text description corresponding to the target image.
For example, the images in the Duke-MTMC dataset for the pedestrian re-identification task, after image style transfer, are encoded into image coding vectors and then input into the trained network model. The network model extracts attribute features from the image coding vectors, predicts interpretable, scattered word-level descriptions of the target image according to the attribute features, and then integrates these scattered word-level descriptions into a coherent text description by combining the language style (grammar and rules) of the text descriptions of the sample images in the image training set learned during training.
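A simplified sketch of this inference path is given below; the attribute vocabulary, the interface of the trained model, and the final template-style join are all assumptions standing in for the learned integration described above.

```python
import torch

def generate_description(image_tensor, image_encoder, caption_model, attribute_vocab, threshold=0.5):
    """Encode the target image, predict scattered word-level attributes,
    then integrate them into a single sentence."""
    image_encoder.eval()
    caption_model.eval()
    with torch.no_grad():
        image_vec = image_encoder(image_tensor.unsqueeze(0))      # (1, 512) image coding vector
        scores = torch.sigmoid(caption_model(image_vec))          # attribute scores
    words = [attribute_vocab[i] for i, p in enumerate(scores.squeeze(0)) if p > threshold]
    if not words:
        return "A person."
    return "A person wearing " + ", ".join(words) + "."           # trivial stand-in for the learned integration
```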
Furthermore, by traversing each image of the image dataset in which the target image is located using the above method, the dataset can be automatically annotated with text descriptions efficiently and accurately, yielding an image dataset with text description annotations. For example, traversing each image in the Duke-MTMC dataset in this way efficiently and accurately produces text description annotations for the Duke-MTMC dataset, resulting in a Duke-MTMC dataset with text description annotations.
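A short sketch of this traversal, saving each generated description as an annotation file named after the image (the paths and the describe callable are illustrative):

```python
import os

def annotate_dataset(image_dir, out_dir, describe):
    """Traverse every image in a dataset and save its generated text description as an annotation file."""
    os.makedirs(out_dir, exist_ok=True)
    for filename in sorted(os.listdir(image_dir)):
        if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        description = describe(os.path.join(image_dir, filename))  # e.g. style transfer + encoding + generation
        stem = os.path.splitext(filename)[0]
        with open(os.path.join(out_dir, stem + ".txt"), "w", encoding="utf-8") as f:
            f.write(description)
```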
Furthermore, the resulting image dataset with text description annotations (such as the Duke-MTMC dataset with text descriptions) can be used in a pedestrian re-identification task, providing the existing neural network models that use text descriptions to assist pedestrian re-identification with both visual image features and text descriptions rich in semantic information, so as to obtain better performance on the pedestrian re-identification task.
FIG. 4 is a schematic diagram of a network model training apparatus for generating image-corresponding text descriptions according to another aspect of the present application, wherein the apparatus comprises:
a first device 41, configured to obtain an image training set, where each sample in the image training set includes an image and a caption corresponding to the image;
a second device 42, configured to determine, based on the samples in the image training set, an image coding vector of an image in the samples and a text coding vector of a text description corresponding to the image;
a third device 43, configured to associate the image coding vector with the text coding vector to obtain an associated feature vector of the sample;
and a fourth device 44, configured to input the associated feature vector into a neural network for training, and complete training of the neural network when a loss function of the neural network meets a preset condition, so as to obtain a trained network model.
The first device 41 acquires an image training set, where each sample in the image training set includes an image and a text description corresponding to the image, and the text description is usually saved as an annotation file corresponding to the image. The image training set may be an existing pedestrian re-identification dataset in which each sample includes an image and its corresponding text description, for example the CUHK-PEDES dataset well known in the field of pedestrian re-identification. Alternatively, an image dataset suitable for a pedestrian re-identification task may be collected and then preprocessed, that is, each image in the dataset is traversed and manually annotated with a text description, each image together with its corresponding text description is taken as one sample, and all samples form the image training set.
Wherein the second means 42 processes each sample in the training set of images, wherein the images in the samples are encoded to obtain an image encoding vector for the image.
Optionally, wherein the second device 42 comprises:
a VGG16 network module to determine an image encoding vector for an image based on the image of the sample in the training set of images;
and the RNN module is used for determining a word coding vector of the word description corresponding to the image based on the word description corresponding to the image of the sample in the image training set.
The VGG16 network module comprises a VGG16 module and a ReLU activation function, and the VGG16 network module performs image coding on images in input image training set samples to obtain image coding vectors corresponding to the images. The length of the image coding vector can be set according to actual needs, for example, 512 dimensions.
The RNN module encodes the text descriptions corresponding to the images in the input image training set samples character by character or word by word to obtain the text coding vectors of the text descriptions corresponding to the images. The length of the text coding vector can be set according to actual needs, but should be the same as the length of the image coding vector of the image, for example 512 dimensions.
The third device 43 correlates the image coding vector of the obtained image with the text coding vector of the text description corresponding to the image to obtain a correlation feature vector corresponding to the sample.
Associating the image coding vector of the obtained image with the text coding vector of the text description corresponding to the image may be done in several ways. The two vectors may be spliced, for example splicing a 512-dimensional image coding vector and a 512-dimensional text coding vector to obtain a 1024-dimensional associated feature vector for the sample. Alternatively, corresponding channels of the two vectors may be added, for example adding the 512-dimensional image coding vector and the 512-dimensional text coding vector channel by channel to obtain a 512-dimensional associated feature vector for the sample. The two vectors may also be associated based on a weighted mapping (attention) method, for example associating the 512-dimensional image coding vector and the 512-dimensional text coding vector by weighted mapping to obtain a 512-dimensional associated feature vector for the sample.
The fourth device 44 inputs the associated feature vector into a neural network for training, and when the loss function of the neural network meets a preset condition, completes the training of the neural network to obtain a trained network model. The relevance relation between the images in the sample of the image training set and the text descriptions corresponding to the images is established through the neural network, and the trained network model can be used for automatically labeling the text descriptions corresponding to the images in the image data set of the pedestrian re-recognition task.
Optionally, the apparatus further comprises:
fifth means 45 (not shown) for acquiring a target image for which a text description is to be generated;
sixth means 46 (not shown) for determining an image coding vector for the target image based on the target image;
seventh means 47 (not shown) for inputting the image coding vector into the trained network model to obtain the corresponding text description of the target image.
The fifth device 45 obtains a target image to be recognized of the pedestrian re-recognition task, where the target image to be recognized has no corresponding text description.
The sixth device 46 performs image coding on the obtained target image to be identified, and obtains an image coding vector corresponding to the target image to be identified. The length of the image coding vector can be set according to actual needs, for example, set to 512 dimensions.
The seventh device 47 inputs the image coding vector corresponding to the target image to be recognized into the trained network model to obtain the caption corresponding to the target image to be recognized, and further saves the caption corresponding to the target image to be recognized as the annotation file corresponding to the target image to be recognized.
Furthermore, by traversing each image of the image dataset in which the target image is located, the dataset can be automatically annotated with text descriptions efficiently and accurately, yielding an image dataset with text description annotations. For example, traversing each image in the Duke-MTMC dataset in this way efficiently and accurately produces text description annotations for the Duke-MTMC dataset, resulting in a Duke-MTMC dataset with text description annotations.
Optionally, the apparatus may further comprise:
eighth means 48 (not shown) for unifying the picture style of the target image with the picture style in the training set of images before the target image is encoded.
This method can eliminate picture style differences between images and improve the generalization ability of the trained network model. For example, the picture style of the target image to be recognized may be migrated to the picture style of the sample images of the image training set using the SPGAN (Similarity Preserving Generative Adversarial Network) method; for instance, taking the images in the Duke-MTMC dataset for the pedestrian re-identification task as the target images, the SPGAN method may be used to migrate their picture style to that of the CUHK-PEDES dataset. Other image style transfer methods, such as CycleGAN or Pix2Pix, may also be adopted, and any other image style transfer method applicable to the present application falls within its scope without limitation.
Further, the apparatus may further include:
ninth means 49 (not shown) for using the obtained image dataset with the caption annotations for the pedestrian re-identification task.
For example, the ninth device 49 uses the Duke-MTMC dataset with text description annotations for the pedestrian re-identification task, providing the existing neural network models that use text descriptions to assist pedestrian re-identification with both visual image features and text descriptions rich in semantic information, so as to obtain better performance on the pedestrian re-identification task.
According to yet another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions executable by a processor to implement the foregoing method.
According to yet another aspect of the present application, there is also provided an apparatus for training a network model for generating a text description corresponding to an image, wherein the apparatus comprises:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform operations of the method as previously described.
For example, the computer readable instructions, when executed, cause the one or more processors to: first acquire an image training set, where each sample in the image training set includes an image and a text description corresponding to the image; then determine, based on the samples in the image training set, an image coding vector of the image in the sample and a text coding vector of the text description corresponding to the image; then associate the image coding vector with the text coding vector to obtain an associated feature vector of the sample; and then input the associated feature vector into a neural network for training, completing the training of the neural network when the loss function of the neural network meets a preset condition, so as to obtain a trained network model. Optionally, a target image for which a text description is to be generated may further be acquired; an image coding vector of the target image is determined based on the target image; and the image coding vector is input into the trained network model to obtain the text description corresponding to the target image. The picture style of the target image may also be unified with the picture style of the image training set.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Claims (9)
1. A method for training a network model for generating a caption corresponding to an image, the method comprising:
acquiring an image training set, wherein each sample in the image training set comprises an image and a text description corresponding to the image;
determining an image coding vector of an image in the sample and a text coding vector of a text description corresponding to the image based on the sample in the image training set;
associating the image coding vector with the text coding vector to obtain an associated feature vector of the sample;
and inputting the associated feature vectors into a neural network for training, and finishing the training of the neural network when the loss function of the neural network meets a preset condition so as to obtain a trained network model.
2. The method of claim 1, further comprising:
acquiring a target image for which a text description is to be generated;
determining an image coding vector of the target image based on the target image;
and inputting the image coding vector into the trained network model to obtain the text description corresponding to the target image.
3. The method of claim 2, wherein prior to said determining an image coding vector for the target image based on the target image, the method further comprises:
unifying the picture style of the target image with the picture style in the image training set.
4. The method of claim 2 or 3, wherein the inputting the image coding vector into the trained network model to obtain the corresponding caption of the target image comprises:
inputting the image coding vector into the trained network model, and extracting attribute characteristic information of the target image;
generating scattered text description information based on the attribute characteristic information of the target image;
and integrating the scattered text description information to obtain the text description corresponding to the target image.
5. A network model training apparatus for generating a corresponding caption of an image, the apparatus comprising:
the device comprises a first device, a second device and a third device, wherein the first device is used for acquiring an image training set, and each sample in the image training set comprises an image and a word description corresponding to the image;
the second device is used for determining an image coding vector of an image in the sample and a text coding vector of a text description corresponding to the image based on the sample in the image training set;
the third device is used for associating the image coding vector with the text coding vector to obtain an associated feature vector of the sample;
and the fourth device is used for inputting the associated feature vectors into a neural network for training, and finishing the training of the neural network when the loss function of the neural network meets a preset condition so as to obtain a trained network model.
6. The apparatus of claim 5, wherein the second means comprises:
a VGG16 network module to determine an image encoding vector for an image based on the image of the sample in the training set of images;
and the RNN module is used for determining a word coding vector of the word description corresponding to the image based on the word description corresponding to the image of the sample in the image training set.
7. The apparatus according to claim 5 or 6, characterized in that it further comprises:
the fifth device is used for acquiring a target image of the text description to be generated;
sixth means for determining an image coding vector for the target image based on the target image;
and the seventh device is used for inputting the image coding vector into the trained network model so as to obtain the word description corresponding to the target image.
8. A computer-readable medium comprising, in combination,
stored thereon computer readable instructions executable by a processor to implement the method of any one of claims 1 to 4.
9. An apparatus, characterized in that the apparatus comprises:
one or more processors; and
memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 4.
Priority Application (1)
- Application number: CN202011033394.2A
- Priority/filing date: 2020-09-27
- Title: Network model training method and equipment for generating image corresponding word description

Publication (1)
- Publication number: CN112200031A
- Publication date: 2021-01-08
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination