Object identification method and electronic equipment

Info

Publication number
CN113989476A
CN113989476A (application number CN202111117601.7A)
Authority
CN
China
Prior art keywords
target
text
image
model
commodity
Prior art date
Legal status
Pending
Application number
CN202111117601.7A
Other languages
Chinese (zh)
Inventor
章宦记
孙可嘉
李彤
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111117601.7A
Publication of CN113989476A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose an object identification method and an electronic device. The method comprises: determining a target object group to be identified, wherein the target object group comprises at least two target objects to be identified, each associated with text information and image information; and inputting the text information and image information of each target object into a target model for recognition processing. The target model fuses the text features and image features of each target object and judges, from the resulting image-text cross features, whether the target objects have commonality on the target attribute. The target model comprises a multi-modal feature fusion model that fuses the text features and image features while performing the task of judging whether the text features and image features match. With the embodiments of the present application, whether target objects have commonality on the target attribute can be identified more accurately.

Description

Object identification method and electronic equipment
Technical Field
The present application relates to the field of information recognition technologies, and in particular, to an object recognition method and an electronic device.
Background
In a commodity object information system, many application scenarios require identifying commodities of the same or a similar style. For example, when recommending commodities to a user, commodities similar to those the user has browsed historically must be identified in a commodity library and recommended. Likewise, on pages that display many commodities together (for example, event venue pages), commodities of the same or a similar style may need to be scattered before display so that they do not cluster excessively; this again requires first identifying, within the set of commodities to be displayed, those belonging to the same or a similar style, and then performing the scattering according to the identification result.
When identifying the same or similar commodities, the prior art generally compares similarity on the basis of commodity pictures or texts alone: the images of two commodities are compared to judge whether they are the same or similar, or text information such as their titles is compared. However, an image may not be the main image or may have an indistinct subject, and the text may carry inaccurate attribute information (for example, when a merchant publishes a commodity, the text may be missing, mis-filled, padded with redundant information, or lacking key information). As a result, the judgement of same-style or similar-style commodities can be biased.
Therefore, improving the accuracy of identifying same-style/similar-style commodities has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The present application provides an object identification method and an electronic device, which can more accurately identify whether target objects have commonality on the target attribute and, in a commodity object scenario, can improve the accuracy of identifying same-style/similar-style commodities.
The application provides the following scheme:
an object recognition method, comprising:
determining a target object group to be identified, wherein the target object group comprises at least two target objects to be identified, and the target objects are associated with text information and image information;
inputting the text information and the image information of each target object in the target object group into a target model for recognition processing to obtain an object recognition result; the target model is used for respectively fusing the text features and the image features of each target object, and judging whether the target objects have commonality on target attributes according to the obtained image-text cross features of the target objects;
the target model comprises a multi-modal feature fusion model which is used for fusing the text features and the image features to obtain the image-text cross features in the process of executing the task of judging whether the text features and the image features match.
The target model comprises a feature generation model and a discrimination model, wherein the feature generation model comprises a feature extraction model and a multi-modal feature fusion model;
the feature extraction model is to: respectively extracting text features and image features of each target object;
the multi-modal feature fusion model is used to: fusing the text features and the image features, and outputting image-text cross features of each target object;
and the discrimination model is used for judging whether the target objects have commonality on the target attributes according to the image-text cross characteristics of the target objects.
Wherein the target model is a set of models formed by combining a feature generation model and a discrimination model.
Wherein before training the target model, the method further comprises:
training the feature generation model, and obtaining a parameter learning result in the feature generation model according to a training result;
and taking the parameter learning result as a parameter initial value when the target model is trained.
Wherein the method further comprises:
providing a recognition result about whether the target object has commonality on the target attribute, receiving user feedback information about the accuracy of the recognition result, and performing iterative training on the target model according to the user feedback information.
Wherein the target object comprises a merchandise object;
the determining whether the objects have commonality on the target attribute comprises:
and determining whether the commodity objects in the same commodity object group are same-style or similar-style commodity objects.
Wherein at least two commodity objects in the commodity object group correspond to the same category;
the target model comprises a plurality of different target models corresponding to different categories;
inputting the text information and the image information of each target object in the target object group into a target model for processing, wherein the processing comprises the following steps:
and inputting the text information and the image information of each commodity object into a target model corresponding to the target category for processing according to the target category to which each commodity object in the target commodity object group belongs.
Wherein the determining of the target object group to be recognized includes:
when commodity object recommendation is carried out according to commodity objects browsed by a user history, the commodity objects browsed by the user history and other data objects in a commodity object library form a plurality of target commodity object groups.
Wherein the determining of the target object group to be recognized includes:
after receiving a request for searching for the same-style/similar-style commodity object of the target commodity object submitted by a user, the target commodity object and other data objects in the commodity object library form a plurality of target commodity object groups.
Wherein the determining of the target object group to be recognized includes:
receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated after image acquisition is carried out on a real object corresponding to the object to be recognized;
respectively forming the object to be identified and a plurality of commodity objects in a commodity object library into a target object group; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or the proximity degree of the genuine products;
the method further comprises the following steps:
and providing, according to the label of the target commodity object whose similarity to the object to be identified satisfies a condition, identification result information about whether the object to be identified is genuine or how close it is to the genuine product.
A model processing method, comprising:
constructing an image-text matching model, wherein the image-text matching model is used for identifying whether an image and a text describe the same object;
acquiring a training sample, wherein the training sample comprises a plurality of sample entries, and the sample entries comprise image content, text content and annotation information about whether the image content is matched with the text content;
training the image-text matching model through the training sample;
after training is finished, generating the multi-modal feature fusion model from the feature fusion module of the image-text matching model, so as to perform image feature extraction and text feature extraction on an input target object and fuse the image features and text features to generate image-text cross features.
A model processing method, wherein the model is formed by combining a feature generation model and a discriminant model, and the method comprises the following steps:
training the feature generation model, and determining parameter values in the feature generation model; the feature generation model is used for fusing image features and text features of the target object to generate image-text cross features;
acquiring a training sample, wherein the training sample comprises a plurality of target object groups, each target object group comprises at least two target objects and labeling information about whether the at least two target objects have commonality on target attributes;
and training the target model by using the training sample, wherein the parameter values determined in the training process of the characteristic generation model are used as initial values of the parameters when the target model is trained.
A method of providing recognition results, comprising:
receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated after image acquisition is carried out on a real object corresponding to the object to be recognized;
respectively forming the object to be identified and a plurality of commodity objects in a commodity object library into a target object group; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or the proximity degree of the genuine products;
inputting the text information and the image information of each target object in the target object group into a target model for processing, wherein the target model is used for fusing the text characteristics and the image characteristics of each target object respectively and determining the similarity between the target objects according to the obtained image-text cross characteristics; the target model comprises a multi-modal feature fusion model, and the multi-modal feature fusion model is used for performing fusion processing on text features and image features to obtain the image-text cross features in the process of executing a task of judging whether the text features and the image features are matched or not;
and providing, according to the label of the target commodity object whose similarity to the object to be identified satisfies a condition, identification result information about whether the object to be identified is genuine or how close it is to the genuine product.
An object recognition apparatus comprising:
the target object group determining unit is used for determining a target object group to be recognized, the target object group comprises at least two target objects to be recognized, and the target objects are associated with text information and image information;
the identification processing unit is used for inputting the text information and the image information of each target object in the target object group into a target model for identification processing to obtain an object identification result; the target model is used for respectively fusing the text features and the image features of each target object, and judging whether the target objects have commonality on target attributes according to the obtained image-text cross features of the target objects;
the target model comprises a multi-modal feature fusion model which is used for fusing the text features and the image features to obtain the image-text cross features in the process of executing the task of judging whether the text features and the image features match.
A model processing apparatus comprising:
the model construction unit is used for constructing an image-text matching model which is used for identifying whether an image and a text describe the same object;
the training sample acquisition unit is used for acquiring a training sample, wherein the training sample comprises a plurality of sample entries, and the sample entries comprise image content, text content and annotation information about whether the image content is matched with the text content;
the training unit is used for training the image-text matching model through the training sample;
and the multi-modal feature fusion model generation unit is used for generating, after training is finished, the multi-modal feature fusion model from the feature fusion module of the image-text matching model, so as to extract image features and text features of an input target object and fuse the image features and text features to generate image-text cross features.
A model processing apparatus in which a feature generation model and a discrimination model are combined, comprising:
the first training unit is used for training the feature generation model and determining parameter values in the feature generation model; the feature generation model is used for fusing image features and text features of the target object to generate image-text cross features;
the training sample acquisition unit is used for acquiring a training sample, wherein the training sample comprises a plurality of target object groups, each target object group comprises at least two target objects and labeling information about whether the at least two target objects have commonality on target attributes, and text information and image information associated with each commodity object;
and the second training unit is used for training the target model by using the training sample, wherein the parameter values determined in the process of training the feature generation model are used as initial values of the parameters when the target model is trained.
An apparatus for providing recognition results, comprising:
the information receiving unit is used for receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated by acquiring an image of a real object corresponding to the object to be recognized;
the target object group generating unit is used for respectively forming the object to be identified and a plurality of commodity objects in the commodity object library into a target object group; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or the proximity degree of the genuine products;
the similarity recognition unit is used for inputting the text information and the image information of each target object in the target object group into a target model for processing, the target model is used for fusing the text characteristics and the image characteristics of each target object respectively, and the similarity between the target objects is determined according to the obtained image-text cross characteristics; the target model comprises a multi-modal feature fusion model, and the multi-modal feature fusion model is used for performing fusion processing on text features and image features to obtain the image-text cross features in the process of executing a task of judging whether the text features and the image features are matched or not;
and the identification result providing unit is used for providing, according to the label of the target commodity object whose similarity to the object to be identified satisfies a condition, identification result information about whether the object to be identified is genuine or how close it is to the genuine product.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of any one of the methods described above.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding claims.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
by means of the method and the device, the image-text cross feature of a target object can be obtained by fusing its multi-modal features (including text information, image information, and the like), and whether target objects have commonality on the target attribute can then be identified more accurately on the basis of the image-text cross feature. When fusing the multi-modal features, since the image features and text features belong to different domains, a multi-modal feature fusion model can be constructed and trained through supervised learning, rather than splicing the image features and text features together directly or performing simple mathematical operations on them. To enable supervised training of the multi-modal feature fusion model, its training can be converted into the training of an image-text matching classification model: the multi-modal feature fusion model fuses the text features and image features in the course of performing the task of judging whether the text features and image features match, and the fusion result is taken as the image-text cross feature. In this way, a supervised multi-modal feature fusion model can be generated which produces image-text cross features that express the characteristics of a target object more fully, so that whether target objects have commonality on the target attribute (for example, whether commodity objects belong to the same or a similar style) can be identified more accurately.
The discrimination model can also be obtained through supervised learning. In addition, the feature generation model, which includes the multi-modal feature fusion model, and the discrimination model can be spliced together into one set of target model, so that the two can be trained simultaneously, further improving the recognition accuracy of the model.
In addition, before training the target model formed by combining the feature generation model and the discrimination model, the feature generation model can first be trained alone and the parameter values determined for it retained; the target model is then trained with those parameter values used as the initial values of the corresponding parameters, which can further improve the accuracy of model identification.
Of course, practicing the present application does not require any single product to achieve all of the above advantages at the same time.
Drawings
To illustrate the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flow chart of a first method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a multimodal fusion model training process provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of another model training process provided by embodiments of the present application;
FIG. 5 is a flow chart of a second method provided by embodiments of the present application;
FIG. 6 is a flow chart of a third method provided by embodiments of the present application;
FIG. 7 is a flow chart of a fourth method provided by embodiments of the present application;
FIG. 8 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a third apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a fourth apparatus provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
In the embodiments of the present application, in order to improve the accuracy of identifying commodity objects of the same or a similar style, an identification approach based on multi-modal feature fusion is adopted. A commodity object generally includes both text elements (a title, a brand name, and so on) and image elements (pictures, videos, and so on). Recognition from text features alone or from image features alone may be biased, but if the text features and image features are fused before recognition, each modality can compensate for the weaknesses of the other. Therefore, in the embodiments of the present application, the single-modal text features and image features are fused to generate multi-modal image-text cross features, and recognition is then performed on these multi-modal features, improving the accuracy of identifying same-style/similar-style commodity objects. Two questions then have to be addressed in a concrete implementation: how to fuse the text features and image features, and how to better identify same-style/similar-style commodity objects from the fused image-text cross features.
In the prior art, one solution to multi-modal feature fusion is to directly splice the feature vectors of the different modalities together. For example, if the image feature vector is (1, 2, 3) and the text feature vector is (56, 67, 90, 4), the fused feature is (1, 2, 3, 56, 67, 90, 4). Alternatively, some schemes obtain the fused feature by performing simple mathematical operations on the image features and text features; with the same two vectors, the fused multi-modal feature vector may be the flattened pairwise product (1×56, 1×67, 1×90, 1×4, 2×56, 2×67, 2×90, 2×4, 3×56, 3×67, 3×90, 3×4), and so on.
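A minimal sketch of these two prior-art fusion schemes, using NumPy purely for illustration (the vectors are the hypothetical examples above):

```python
import numpy as np

# Hypothetical single-modal feature vectors taken from the example above.
image_feature = np.array([1.0, 2.0, 3.0])
text_feature = np.array([56.0, 67.0, 90.0, 4.0])

# Prior-art option 1: direct splicing (concatenation).
concatenated = np.concatenate([image_feature, text_feature])
# -> [ 1.  2.  3. 56. 67. 90.  4.]

# Prior-art option 2: a simple mathematical operation, here the pairwise
# (outer) product flattened row by row.
outer_product = np.outer(image_feature, text_feature).ravel()
# -> [1*56, 1*67, 1*90, 1*4, 2*56, ..., 3*90, 3*4]

print(concatenated)
print(outer_product)
```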
However, in the course of implementing the present application, the inventors found that because text features and image features belong to different domains, features obtained by direct splicing or by simple mathematical operations may still be insufficient to express the characteristics of a commodity object. For this reason, the embodiments of the present application first improve the way in which text features and image features are fused.
Specifically, to fuse text features and image features more reasonably, a multi-modal feature fusion model can be established through supervised learning, and the text features and image features of the same commodity object are then fused by this model into a higher-level abstract feature. This abstract feature serves as the image-text cross feature and participates in the subsequent recognition process.
In supervised model training, samples must be labeled in advance (usually manually), so the training target (the information the model is expected to output) must be known before labeling. If the direct output of a model is a feature vector, manual labeling of samples is difficult (which is why similar feature generation models in the prior art are trained in an unsupervised manner). For this reason, in the embodiments of the present application, a model is first constructed for the task of judging whether the text and image of a commodity object match, that is, judging whether an input piece of text and an input image can describe the same commodity object. While learning to judge whether text features and image features match, the model mainly has to learn how to fuse the text features and image features into a higher-level feature vector; in other words, during the matching task the output of one of its layers is the feature vector obtained by fusing the text features and image features. That vector can therefore be taken as the fused image-text cross vector of the embodiments of the present application and used in the subsequent recognition of same-style/similar-style commodity objects.
In other words, in the embodiments of the present application, the training of the multi-modal feature fusion model is converted into the training of an image-text matching classification model. A model for image-text matching classification can therefore be constructed, for example on the basis of the BERT model, and then trained in a supervised manner: a number of <text, image> pairs and corresponding label information (whether the text and the image match) serve as training samples. After training, the model can output a judgement of whether an input text matches an input image. Of course, the embodiments of the present application do not actually need the matching classification itself, only the fusion result of the text features and image features; therefore the final layer of the trained model (the output layer that produces the matching judgement) can be removed, so that after the text and image of a commodity object are input, the model outputs the image-text cross feature vector obtained by fusing the text features and image features.
In this way, text features and image features are fused on the basis of supervised learning rather than by simply splicing two feature vectors, so that a higher-level abstract expression of the commodity object is obtained, and the image-text cross features produced by the fusion express the characteristics of the commodity object more comprehensively and accurately.
After the image-text cross features of commodity objects are obtained, there are various ways to identify same-style/similar-style commodity objects on their basis. In a preferred embodiment of the present application, the discrimination model is also obtained through supervised learning. In addition, the feature generation model and the discrimination model can be spliced together into a target model, so that the two can be trained simultaneously.
Furthermore, to further improve recognition accuracy, before training the target model formed by combining the feature generation model and the discrimination model, the feature generation model can first be trained alone and the parameter values determined for it retained; the target model is then trained with those parameter values used as the initial values of the corresponding parameters, which can further improve the accuracy of model identification.
From the system architecture perspective, the embodiments of the present application provide a function for identifying whether target objects have commonality on a target attribute. This function may be offered to users as a stand-alone application or exist as a functional module of another system. For example, as shown in FIG. 1, an object recognition function module may be provided in the commodity object information system, so that when other function modules need to determine whether certain commodity objects are of the same or a similar style, they can call the object recognition model. For instance, when the recommendation module needs to recommend commodity objects similar to products the user has browsed historically, it can use this module to obtain same-style/similar-style commodity objects from the commodity pool. Likewise, when the commodity objects on a page need to be scattered by style, the module can first identify commodity objects of the same or a similar style before the scattering is performed. Some commodity object information systems may also provide the user with a "find similar" or "find same style" function: while multiple commodity objects are displayed in the resource positions of a commodity object list page, a "find similar" operation control can be provided in a specific resource position, and after the user clicks it, same-style/similar-style identification is performed between that commodity object and the commodity objects in the commodity library and the result is returned. In this case, the "find similar" or "find same style" module determines the specific group of commodity objects to be identified and calls the object recognition function module to identify whether the commodity objects in the group are of the same or a similar style.
The embodiments of the present application can also use this same-style/similar-style identification capability to provide users with other new forms of service. In practice, for example, a user who has purchased a commodity through some channel may need to determine whether it is genuine; or user A and user B may have purchased the same type of cosmetics of the same brand and want to compare the two items to see which is closer to the genuine product. Such services can be provided through related function modules (or light applications, applets, and so on) in the commodity object information system, through a stand-alone application, or through a handheld hardware device. Using the service, the user takes the purchased commodity (a real, physical object) as the object to be identified, captures image information by photographing the commodity, and uploads that image information. The service then performs similarity identification between it and the related commodities in the commodity library, where the commodities in the library have been tagged in advance with labels indicating whether they are genuine or how close they are to the genuine product. If the object to be identified is found to be highly similar to a certain commodity object in the library, the label of that commodity object can be used to return a result about whether the object to be identified is genuine or how close it is to the genuine product, for example, "your merchandise may be genuine". Likewise, when photos of two objects to be identified are uploaded for comparison, similarity identification can be performed between each of them and the related commodities in the library, and the labels of the most similar commodities can be used to report which of the two objects is closer to the genuine product.
When identity or similarity with commodity objects in the commodity library is identified on the basis of image information uploaded by a user, the upload may contain no text content; in that case, higher-level feature abstraction can be performed directly on the image features. Alternatively, text (for example, trademark text) can be recognized from the picture and text features extracted from it, or an input control can be provided so that the user can enter text content for the specific object to be identified, which improves recognition accuracy.
In addition, the solution provided by the embodiments of the present application is not limited to same-style/similar-style commodity object identification scenarios and can be extended to other fields. For example, users can be offered a picture grouping capability (for example, grouping photos of the same type in a mobile phone album), or the solution can be used to distinguish an original image from an image processed with Photoshop or similar software, and so on.
The following describes in detail specific implementations provided in embodiments of the present application.
Example one
The first embodiment provides an object identification method. Referring to FIG. 2, the method may include:
S201: Determining a target object group to be recognized, wherein the target object group comprises at least two target objects to be recognized, and the target objects are associated with text information and image information.
The target object group to be recognized may be determined according to the actual recognition requirement. For example, in a commodity object information system, if same-style/similar-style commodity objects need to be recommended on the basis of the user's browsing history, the historically browsed commodities can be combined with the commodities in the pool of candidates for recommendation to obtain a number of commodity pairs, each of which becomes a commodity object group. Likewise, when "find similar" is performed for a commodity object A currently displayed on a page, commodity object A can be grouped with the commodity objects in the commodity library to form multiple commodity object groups (each containing commodity object A and one other commodity object from the library). In the scenario of judging whether a product is genuine from a picture uploaded by a user, the photographed object and the commodity objects in the commodity library can similarly be grouped and identified group by group.
After the target object group is determined, text information and image information are acquired for each target object. If the target objects are commodity objects already published in the system, the commodity information base stores the text information and image information associated with them: the text information can be obtained from the title, trademark, and similar information of the commodity object, and the image information from its main image. If a target object is an object to be identified specified by the user through uploaded image information, text information, and so on, the text and image information can be extracted directly from the uploaded content. If the user uploads only image information and no text, an alarm function can be provided: while the text, image, and other content of each object to be recognized is read in, a missing part triggers a reminder asking the user whether the information was read incorrectly or whether to continue, or an input control can be provided so that the user can enter the text content associated with the uploaded image, as illustrated in the sketch below.
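The following is a hedged sketch, under assumed data types and names, of how the query object could be paired with candidate commodity objects while warning about missing text or image content; it is purely illustrative and not part of the claimed method:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TargetObject:
    object_id: str
    text: Optional[str]        # title, trademark text, etc.
    image_path: Optional[str]  # main image or user-uploaded photo

def build_target_object_groups(query: TargetObject,
                               candidates: List[TargetObject]) -> List[Tuple[TargetObject, TargetObject]]:
    """Pair the query object with every candidate commodity object, warning
    when text or image content is missing (the alarm described above)."""
    groups = []
    for candidate in candidates:
        for obj in (query, candidate):
            if not obj.text or not obj.image_path:
                print(f"warning: object {obj.object_id} is missing text or image content")
        groups.append((query, candidate))
    return groups
```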
S202: and inputting the text information and the image information of each object in the target object group into a target model for processing to obtain an object identification result.
After the target object group and the text and image content associated with each target object are determined, a pre-trained target model can be used to determine whether the target objects in the group have commonality on the target attribute, for example, whether the commodity objects in the same group are of the same or a similar style, or whether several photos show the same scene.
The target model fuses the text features and image features of each target object and judges, from the resulting image-text cross features, whether the target objects have commonality on the target attribute. In one specific implementation, the target model includes a feature generation model and a discrimination model. The feature generation model extracts the text features and image features of each target object, fuses them, and outputs the image-text cross features of each target object; the discrimination model then judges, from these image-text cross features, whether the target objects have commonality on the target attribute. The feature generation model may in turn consist of a text feature extraction model, an image feature extraction model, and a multi-modal feature fusion model: the text feature extraction model extracts text features from the text content of the target object, the image feature extraction model extracts image features from its image content, yielding a text feature vector and an image feature vector, and the multi-modal feature fusion model fuses these two vectors to obtain the image-text cross feature vector. A structural sketch is given below.
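A hedged PyTorch sketch of this structure; the concrete encoders and the discriminator are placeholders passed in from outside, and all module names are assumptions rather than the patent's own implementation:

```python
import torch
import torch.nn as nn

class FeatureGenerationModel(nn.Module):
    """Feature generation: text/image feature extraction plus multi-modal fusion."""
    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module, fusion_model: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. a BERT-style text feature extractor
        self.image_encoder = image_encoder  # e.g. a ResNet-based image feature extractor
        self.fusion_model = fusion_model    # the multi-modal feature fusion model

    def forward(self, text_tokens: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        text_features = self.text_encoder(text_tokens)
        image_features = self.image_encoder(image)
        return self.fusion_model(text_features, image_features)  # image-text cross feature

class TargetModel(nn.Module):
    """Complete target model: shared feature generation followed by a
    discrimination model that judges commonality on the target attribute."""
    def __init__(self, feature_generation: FeatureGenerationModel, discriminator: nn.Module):
        super().__init__()
        self.feature_generation = feature_generation
        self.discriminator = discriminator

    def forward(self, text_a, image_a, text_b, image_b) -> torch.Tensor:
        cross_a = self.feature_generation(text_a, image_a)  # vector M
        cross_b = self.feature_generation(text_b, image_b)  # vector N
        return self.discriminator(torch.cat([cross_a, cross_b], dim=-1))
```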
As described above, in the embodiments of the present application the multi-modal feature fusion model can be generated through supervised learning. To train it in a supervised manner and thus improve the quality of the fusion result, its training can be converted into the training of a model that classifies whether the image and text of a target object match. That is, the multi-modal feature fusion model performs the fusion of text features and image features while executing the task of judging whether the text features and image features of the target object match, and the fusion result is taken as the image-text cross feature.
Therefore, when training the multi-modal feature fusion model, an image-text matching classification model can be established and trained first; after its training is completed, the output layer of that model is removed, and the remainder serves as the multi-modal feature fusion model.
Specifically, when the image-text matching classification model is trained, a number of training samples and corresponding label information are obtained. Each training sample corresponds to an (image, text) pair, and the label states whether the image and the text match. In a commodity object scenario, for example, the main images and corresponding titles of a number of commodity objects published in the commodity object information system can be extracted in advance as training samples, and whether each image matches its text can be labeled, for example manually. If the image and text come from the same published commodity object, they usually match, but training generally also requires negative samples, in which the image and text do not match. For this purpose, as shown in FIG. 3, a pre-labeled data set D can first be obtained (data set D contains a number of images and corresponding texts, the great majority of which match), and a data set G is then obtained by re-sampling the texts and images in data set D several times. For example, if the original data set D contains (image 1, text 1), (image 2, text 2), (image 3, text 3), and so on, then after several rounds of sampling data set G may contain not only these entries but also (image 1, text 2), (image 1, text 3), (image 2, text 1), (image 2, text 3), (image 3, text 1), (image 3, text 2), and so on. In this way, not only is a larger number of samples constructed, but the samples include both positives and negatives: with n samples in data set G, a sampled pair whose image and text match is labeled 1 and a mismatched pair is labeled 0, as in the sketch below.
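A minimal sketch of this re-sampling step, assuming data set D is simply a list of matching (image, text) pairs; the function name and parameters are illustrative:

```python
import random

def build_dataset_g(dataset_d, negatives_per_entry=2, seed=0):
    """Re-sample the pre-labeled data set D (a list of matching (image, text)
    pairs) into a larger data set G of (image, text, label) triples, where
    label 1 means matched and label 0 means mismatched."""
    rng = random.Random(seed)
    texts = [text for _, text in dataset_d]

    dataset_g = [(image, text, 1) for image, text in dataset_d]   # positive samples
    for image, text in dataset_d:
        for _ in range(negatives_per_entry):
            sampled_text = rng.choice(texts)
            if sampled_text != text:              # a mismatched pair becomes a negative sample
                dataset_g.append((image, sampled_text, 0))
    return dataset_g

# Example with the entries mentioned above:
# build_dataset_g([("image 1", "text 1"), ("image 2", "text 2"), ("image 3", "text 3")])
```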
After data set G is obtained, features are extracted from each sample using text encoding, ResNet, and similar algorithms to obtain a text feature sequence and an image feature sequence. For example, the input text content is segmented into words, stop words are removed, and the text is encoded with a predefined segmentation/encoding algorithm to extract the text feature sequence. The input image is cut into N patches (N can be set as desired, for example 64), and features are extracted from each patch with a predefined algorithm such as ResNet to obtain the image feature sequence. The text feature sequence and the image feature sequence are then fused, for example with a BERT-style network, using ReLU (Rectified Linear Unit) or a similar activation function (that is, the function that performs the fusion calculation, mapping the data in the text and image feature sequences to a fusion result). The fusion result is used to judge whether the text matches the image, and the model parameter W1 is solved by introducing a loss function and applying gradient descent or a similar algorithm. After multiple iterations, when the algorithm converges, the value of W1 is determined and the training of the model is complete.
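A hedged sketch of such an image-text matching classifier in PyTorch; the feature dimensions (2048 for ResNet patch features, 768 for BERT-style token features), the Transformer-based fusion encoder, and all hyper-parameters are assumptions made for illustration, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class ImageTextMatchingClassifier(nn.Module):
    """Image-text matching classification model; its fusion encoder is the part
    later reused as the multi-modal feature fusion model."""
    def __init__(self, d_model=256, num_layers=2):
        super().__init__()
        self.patch_projection = nn.Linear(2048, d_model)   # project image patch features
        self.token_projection = nn.Linear(768, d_model)    # project text token features
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                                   activation="relu", batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output_layer = nn.Linear(d_model, 1)          # matched / not matched logit

    def forward(self, patch_features, token_features):
        sequence = torch.cat([self.patch_projection(patch_features),
                              self.token_projection(token_features)], dim=1)
        fused = self.fusion(sequence)                      # fused image-text sequence
        logit = self.output_layer(fused.mean(dim=1))
        return logit, fused

# Training step sketch (binary labels: 1 = matched, 0 = mismatched):
# model = ImageTextMatchingClassifier()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # gradient descent
# logit, _ = model(patch_features, token_features)
# loss = nn.functional.binary_cross_entropy_with_logits(logit.squeeze(-1), labels)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```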
After the image-text matching classification model is trained, its effect is that, given the image and text of a target object as input, it outputs a judgement of whether they match. The embodiments of the present application, however, need the image-text cross feature obtained by fusing the image features and text features of the target object, so the output layer of the classification model is removed and the remainder is used as the multi-modal feature fusion model. In other words, the multi-modal feature fusion model is obtained from the feature fusion module of the image-text matching model. With the image and text content of a target object as input, its output is then the image-text cross feature obtained by fusing the image features and text features.
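Continuing the sketch above, one hedged way to reuse the trained classifier as a fusion model is to ignore its output layer and pool the fused sequence; this is an illustrative assumption, not the only possible choice:

```python
import torch

def extract_cross_feature(trained_model, patch_features, token_features):
    """After the matching classifier above has been trained, its output layer is
    no longer used; the pooled fused sequence serves as the image-text cross
    feature vector of the input target object."""
    trained_model.eval()
    with torch.no_grad():
        _, fused = trained_model(patch_features, token_features)
    return fused.mean(dim=1)   # one image-text cross feature vector per object
```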
In this way, after the target object group is determined, each target object in the group is input into the feature generation model, its text features and image features are extracted, and its image-text cross features are obtained. These image-text cross features are then used to judge whether the target objects have commonality on the target attribute.
The discrimination model that judges, from the image-text cross features, whether the target objects have commonality on the target attribute can take various forms. In the simplest form, the discrimination model computes the distance between the image-text cross feature vectors of the target objects (for example, the Euclidean distance between the vectors) and then judges, from the computed distance and a preset threshold, whether the two target objects are same-style/similar-style commodity objects, as in the sketch below.
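A minimal sketch of this distance-based discrimination; the threshold value is an arbitrary placeholder:

```python
import torch

def distance_discriminator(cross_feature_a: torch.Tensor,
                           cross_feature_b: torch.Tensor,
                           threshold: float = 1.0) -> bool:
    """Simplest discrimination described above: the Euclidean distance between two
    image-text cross feature vectors is compared with a preset threshold."""
    distance = torch.dist(cross_feature_a, cross_feature_b, p=2)
    return bool(distance < threshold)   # True -> judged same-style/similar-style
```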
To further improve recognition accuracy, in a preferred embodiment of the present application the discrimination model is also generated through supervised learning. For example, a discrimination model can be constructed in advance that takes the image-text cross features of several target objects as input and outputs a judgement of whether they have commonality on the target attribute (for example, whether the commodity objects are of the same or a similar style).
Specifically, when this discrimination model is trained by supervised learning, a number of target object groups (for example, commodity object A and commodity object B forming a commodity object pair) are obtained in advance as training samples; the image-text cross features of each target object in a group are also obtained in advance (for example, with the previously trained feature generation model), and label information is added stating whether the target objects in the same group have commonality on the target attribute. The discrimination model is then trained with these samples, so that once the image-text cross features of a group of target objects are input, it outputs a judgement of whether the group has commonality on the target attribute.
In addition, since the discrimination model takes the image-text cross features of the target objects as input, the extraction and fusion of the text features and image features of each target object to generate the image-text cross features actually precedes the discrimination. In a specific implementation, the feature generation model and the discrimination model can each be trained separately by supervised learning. Alternatively, in a preferred embodiment of the present application, to improve training efficiency and the accuracy of the final model, the feature generation model and the discrimination model can be combined directly (for example, the discrimination model is spliced onto the end of the feature generation model) into a complete target model, which takes the text and image content of several target objects as input and outputs a judgement of whether they have commonality on the target attribute.
When they are combined into a complete target model, the feature generation model and the discrimination model can be trained together. For example, as shown in FIG. 4, a number of commodity object groups (for example, commodity object 1 and commodity object 2 forming a commodity object pair) are obtained directly as training samples, and label information is added stating whether the commodity objects in the same group are of the same or a similar style. The feature generation model then extracts text features and image features for commodity object 1 and commodity object 2 respectively and fuses them into image-text cross feature vectors, vector M and vector N. Vector M and vector N are input into the discrimination function, the output judgement is substituted into the loss function, and the parameters of the model are solved with gradient descent or a similar algorithm, as in the sketch below.
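A hedged sketch of this joint training, assuming the TargetModel structure sketched earlier and a data loader yielding (text_1, image_1, text_2, image_2, label) batches; optimizer choice and hyper-parameters are illustrative:

```python
import torch
import torch.nn as nn

def train_target_model(target_model, pair_loader, epochs=10, learning_rate=1e-4):
    """Joint supervised training of the combined target model (feature generation
    plus discrimination) on labeled commodity object pairs; label = 1 means
    same-style/similar-style."""
    optimizer = torch.optim.Adam(target_model.parameters(), lr=learning_rate)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for text_1, image_1, text_2, image_2, label in pair_loader:
            logit = target_model(text_1, image_1, text_2, image_2)  # vectors M and N fused inside
            loss = loss_fn(logit.squeeze(-1), label.float())
            optimizer.zero_grad()
            loss.backward()   # gradient-descent update of all parameters jointly
            optimizer.step()
    return target_model
```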
To further improve model performance, the feature generation model can first be trained alone in a supervised manner and its parameter learning result obtained from the training; for example, the parameter W1 of the previous example can be determined by training the feature generation model alone. Supervised learning is then carried out again on the complete target model consisting of the feature generation model and the discrimination model, with the parameter values determined during the separate training of the feature generation model used as the initial parameter values of the target model. That is, when the complete target model is trained, the parameters do not start from random initial values but from the values determined while training the feature generation model alone, which can further improve model performance and the final recognition accuracy. A sketch of this warm start is given below.
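A minimal sketch of the warm start, assuming the separately trained feature generation parameters were saved earlier as a state dict; the checkpoint path and function names are hypothetical:

```python
import torch

def warm_start_target_model(target_model, checkpoint_path="feature_generation_pretrained.pt"):
    """Initialise the feature generation part of the target model with the parameter
    values learned while it was trained alone, instead of random initial values."""
    pretrained_state = torch.load(checkpoint_path)
    target_model.feature_generation.load_state_dict(pretrained_state)
    return target_model

# warm_start_target_model(target_model)
# train_target_model(target_model, pair_loader)   # then retrain the complete target model
```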
In summary, through the embodiments of the present application, the image-text cross feature of a target object can be obtained by fusing its multi-modal features (for example, text information and image information), and whether target objects have commonality on the target attribute can then be identified more accurately on the basis of the image-text cross feature. When fusing the multi-modal features, since the image features and text features belong to different domains, a multi-modal feature fusion model can be constructed and trained through supervised learning rather than splicing the image features and text features together directly or performing simple mathematical operations on them. To enable supervised training of the multi-modal feature fusion model, its training can be converted into the training of an image-text matching classification model: the multi-modal feature fusion model fuses the text features and image features in the course of performing the task of judging whether the text features and image features match, and the fusion result is taken as the image-text cross feature. In this way, a supervised multi-modal feature fusion model can be generated which produces image-text cross features that express the characteristics of a target object more fully, so that whether target objects have commonality on the target attribute (for example, whether commodity objects belong to the same or a similar style) can be identified more accurately.
In addition, in practical application, in some scenarios, the recognition result about whether the target objects have commonality in the target attribute may be provided to the user, together with an operation option for giving feedback on the recognition result. For example, in the "find similar" scenario, after the user initiates a "find similar" request for a certain commodity object A, the recognition result may be returned through model calculation; for example, the result may include commodity objects B, C and D, each with a "feedback" operation option. If the user finds that commodity object C is not actually the same style as commodity object A, the user can submit feedback information accordingly. After receiving the user feedback information about the accuracy of the recognition result, the target model can be updated and iteratively trained by using this feedback. That is, the information fed back by the user can also serve as labelled samples and participate in the iterative training of the model.
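A possible way to fold such feedback back into training is shown below as a rough sketch only; the feedback record layout and the function names are assumptions, and train_step refers to the joint-training sketch above.

```python
# Illustrative only: accumulated user feedback becomes extra labelled samples.
feedback_samples = []

def on_user_feedback(t1, i1, t2, i2, corrected_label):
    # e.g. the user reports that commodity object C is in fact not the same style as object A
    feedback_samples.append((t1, i1, t2, i2, corrected_label))

def iterate_on_feedback():
    # Run additional update steps on the fed-back samples (train_step as sketched above).
    while feedback_samples:
        train_step(feedback_samples.pop())
```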
In a specific implementation, in a scenario where the target object is a commodity object, different target models may be trained for a plurality of different commodity object categories, so that more accurate recognition of same-style or similar-style commodity objects can be performed with the target model of a specific category. In this way, the text information and the image information of each commodity object can be input, according to the common target category to which the commodity objects in the target commodity object group belong, into the target model corresponding to that target category for processing. For example, when a same-style or similar-style commodity object needs to be found for a certain commodity object A, the category to which commodity object A belongs may be determined first (the commodity object information base may store the category information of each specific commodity object, so the category information can be obtained by querying that base); a preliminary screening of commodity objects is then performed from the commodity library according to this category information, and whether commodity object A and other commodity objects of the same category are the same style or a similar style is identified by using the target model corresponding to the category, and so on.
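A minimal sketch of such per-category routing is shown below; the names (CategoryRouter, category_models, info_base) and the dictionary-based interfaces are assumptions made for illustration, and each category's model is assumed to follow the TargetModel interface from the earlier sketch.

```python
import torch
from typing import Dict

class CategoryRouter:
    """Routes a commodity-object pair to the target model trained for its category (illustrative)."""

    def __init__(self, category_models: Dict[str, torch.nn.Module], info_base: Dict[str, str]):
        self.category_models = category_models  # category name -> category-specific target model
        self.info_base = info_base              # commodity object id -> category name

    def recognize(self, obj_a: dict, obj_b: dict) -> float:
        # Each object is assumed to carry an id plus pre-extracted text/image feature tensors.
        category = self.info_base[obj_a["id"]]      # query the commodity object information base
        model = self.category_models[category]      # pick the target model for that category
        logit = model(obj_a["text"], obj_a["image"],
                      obj_b["text"], obj_b["image"])
        return torch.sigmoid(logit).item()          # probability of same/similar style
```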
Example two
The second embodiment corresponds to the first embodiment, and provides a model processing method in a process of specifically training a multi-modal feature fusion model, referring to fig. 5, where the method may include:
s501: constructing an image-text matching model, wherein the image-text matching model is used for identifying whether an image and a text describe the same object;
s502: acquiring a training sample, wherein the training sample comprises a plurality of sample entries, and the sample entries comprise image content, text content and annotation information about whether the image content is matched with the text content;
s503: training the image-text matching model through the training sample;
s504: after training is finished, generating the multi-modal feature fusion model from the feature fusion module in the image-text matching model, so that image feature extraction and text feature extraction can be performed on an input target object, and the image features and text features can be fused to generate the image-text cross features.
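The flow of steps S501 to S504 can be outlined as the following rough sketch; it is not the actual implementation of the present application, the encoders are replaced by simple linear layers over pre-extracted features, and the names (ImageTextMatchingModel, fusion, match_head, matching_train_step) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatchingModel(nn.Module):
    """Binary classifier: do this text and this image describe the same object? (illustrative)"""
    def __init__(self, text_dim=256, image_dim=512, fuse_dim=256):
        super().__init__()
        # Feature fusion module: the part kept afterwards as the multi-modal feature fusion model.
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, fuse_dim),
            nn.ReLU(),
            nn.Linear(fuse_dim, fuse_dim),
        )
        # Matching head: only needed for the image-text matching training task itself.
        self.match_head = nn.Linear(fuse_dim, 1)

    def forward(self, text_feat, image_feat):
        cross = self.fusion(torch.cat([text_feat, image_feat], dim=-1))  # image-text cross feature
        return self.match_head(cross).squeeze(-1), cross

matcher = ImageTextMatchingModel()
optimizer = torch.optim.Adam(matcher.parameters(), lr=1e-4)

def matching_train_step(text_feat, image_feat, matched):
    # S502/S503: one update on a (text, image, matched-or-not) sample entry.
    logit, _ = matcher(text_feat, image_feat)
    loss = F.binary_cross_entropy_with_logits(logit, matched.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# S504: after training, keep only the fusion module as the multi-modal feature fusion model.
multimodal_fusion_model = matcher.fusion
```

After training, the retained fusion module can be reused inside the target model so that its output for a given text/image pair serves directly as the image-text cross feature.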
EXAMPLE III
The third embodiment also corresponds to the first embodiment, and relates to a model processing method in the training process of the target model (formed by combining a feature generation model and a discrimination model); referring to fig. 6, the method may include:
s601: training the feature generation model, and determining parameter values in the feature generation model; the feature generation model is used for fusing image features and text features of the target object to generate image-text cross features;
s602: acquiring a training sample, wherein the training sample comprises a plurality of target object groups, each target object group comprises at least two target objects and labeling information about whether the at least two target objects have commonality on target attributes;
s603: and training the target model by using the training sample, and taking the parameter value determined in the training process of the characteristic generation model as the initial value of the parameter when the target model is trained.
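The two-stage training of steps S601 to S603 can be sketched as follows, reusing the illustrative FeatureGenerator and TargetModel classes from the joint-training sketch in the first embodiment; the file name and the elided training loops are placeholders, not part of the present application.

```python
import torch

# Stage 1 (S601): train the feature generation model on its own (for example via the
# image-text matching task sketched in the second embodiment) and save its parameters.
generator = FeatureGenerator()          # illustrative class from the joint-training sketch
# ... supervised training of `generator` goes here ...
torch.save(generator.state_dict(), "feature_generator_pretrained.pt")  # placeholder file name

# Stage 2 (S602/S603): build the whole target model and, instead of random initial values,
# load the stage-1 parameters into its feature generation part before training it end to end.
target_model = TargetModel()            # illustrative class from the joint-training sketch
target_model.generator.load_state_dict(torch.load("feature_generator_pretrained.pt"))
# ... supervised training of `target_model` on labelled target object groups goes here ...
```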
Example four
The fourth embodiment provides a method for providing a recognition result for one specific application scenario, and referring to fig. 7, the method may include:
s701: receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated after image acquisition is carried out on a real object corresponding to the object to be recognized;
s702: respectively forming the object to be identified and a plurality of commodity objects in a commodity object library into target object groups; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or their degree of proximity to genuine products;
s703: inputting the text information and the image information of each target object in the target object group into a target model for processing, wherein the target model is used for fusing the text characteristics and the image characteristics of each target object respectively, and determining the similarity between the target objects according to the obtained image-text cross characteristics; the target model comprises a multi-modal feature fusion model, and the multi-modal feature fusion model is used for performing fusion processing on text features and image features to obtain the image-text cross features in the process of executing a task of judging whether the text features and the image features are matched or not;
s704: providing, according to the label corresponding to a target commodity object whose similarity with the object to be identified satisfies a condition, identification result information about whether the object to be identified is a genuine product or about its degree of proximity to a genuine product.
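Steps S701 to S704 can be outlined as the following rough sketch; the field names, the cosine-similarity measure and the threshold are assumptions introduced for illustration, and fusion_model stands for a trained module that maps a text feature and an image feature to their image-text cross feature, such as the FeatureGenerator sketched earlier.

```python
import torch.nn.functional as F

def identify_genuine(query: dict, library: list, fusion_model, threshold: float = 0.8) -> dict:
    """Compare an object to be identified against a labelled commodity object library
    via image-text cross features and report the label of the closest match (illustrative)."""
    q_cross = fusion_model(query["text"], query["image"])   # cross feature of the query object
    best_sim, best_item = None, None
    for item in library:
        i_cross = fusion_model(item["text"], item["image"])  # cross feature of a library object
        sim = F.cosine_similarity(q_cross, i_cross, dim=-1).item()
        if best_sim is None or sim > best_sim:
            best_sim, best_item = sim, item
    if best_item is not None and best_sim >= threshold:
        # The matched library object's label indicates whether it is a genuine product.
        return {"label": best_item["label"], "similarity": best_sim}
    return {"label": "no sufficiently similar commodity object", "similarity": best_sim}
```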
For the parts of the second to fourth embodiments that are not described in detail, reference may be made to the description of the first embodiment and other parts in this specification, which are not described herein again.
It should be noted that the embodiments of the present application may involve user data. In practical applications, user-specific personal data may be used in the scheme described herein only within the scope permitted by the applicable laws and regulations of the relevant country and under conditions that meet their requirements (for example, with the user's explicit consent, after the user has been informed, and the like).
Corresponding to the first embodiment, an embodiment of the present application further provides an object identification apparatus, and referring to fig. 8, the apparatus may include:
a target object group determining unit 801, configured to determine a target object group to be identified, where the target object group includes at least two target objects to be identified, and the target objects are associated with text information and image information;
the recognition processing unit 802 is configured to input text information and image information of each target object in the target object group into a target model for processing, so as to obtain an object recognition result; the target model is used for respectively fusing the text features and the image features of each target object, and judging whether the target objects have commonality on target attributes according to the obtained image-text cross features of the target objects;
the target model comprises a multi-modal feature fusion model, and the multi-modal feature fusion model is used for performing fusion processing on the text features and the image features to obtain the image-text cross features in the process of executing the task of judging whether the text features and the image features are matched or not.
During specific implementation, the target model comprises a feature generation model and a discrimination model, and the feature generation model comprises a feature extraction model and a multi-modal feature fusion model;
the feature extraction model is to: respectively extracting text features and image features of each target object;
the multi-modal feature fusion model is used to: fusing the text features and the image features, and outputting image-text cross features of each target object;
and the discrimination model is used for judging whether the target objects have commonality on the target attributes according to the image-text cross characteristics of the target objects.
Wherein the target model is a set of models formed by combining a feature generation model and a discrimination model.
In specific implementation, before the target model is trained, the feature generation model can be trained, and a parameter learning result in the feature generation model is obtained according to a training result; then, the parameter learning result is used as a parameter initial value when the target model is trained.
In a specific implementation, the apparatus may further include:
and the result feedback unit is used for providing a recognition result about whether the target object has commonality on the target attribute, receiving user feedback information about the accuracy of the recognition result, and performing iterative training on the target model according to the user feedback information.
Wherein the target object comprises a merchandise object;
the identification processing unit is specifically configured to: and determining whether each commodity object in the same commodity object group belongs to the same-money or similar-money commodity object.
Wherein at least two commodity objects in the commodity object group correspond to the same category;
the target model comprises a plurality of different target models corresponding to different categories;
the identification processing unit may specifically be configured to:
and inputting the text information and the image information of each commodity object into a target model corresponding to the target category for processing according to the target category to which each commodity object in the target commodity object group belongs.
The target object group determining unit may be specifically configured to:
when commodity object recommendation is carried out according to commodity objects historically browsed by a user, the historically browsed commodity objects and other data objects in a commodity object library form a plurality of target commodity object groups.
Alternatively, the target object group determination unit may be specifically configured to:
after receiving a request submitted by a user for searching for same-style/similar-style commodity objects of a target commodity object, form a plurality of target commodity object groups from the target commodity object and other data objects in the commodity object library.
Alternatively, the target object group determination unit may be specifically configured to:
receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated after image acquisition is carried out on a real object corresponding to the object to be recognized;
respectively forming the object to be identified and a plurality of commodity objects in a commodity object library into target object groups; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or their degree of proximity to genuine products;
the apparatus may further include:
and the genuine product identification result providing unit is used for providing, according to the label corresponding to a target commodity object whose similarity with the object to be identified satisfies a condition, identification result information about whether the object to be identified is a genuine product or about its degree of proximity to a genuine product.
Corresponding to the second embodiment, an embodiment of the present application further provides a model processing apparatus, and referring to fig. 9, the apparatus may include:
a model building unit 901, configured to build an image-text matching model, where the image-text matching model is used to identify whether an image and a text describe the same object;
a training sample obtaining unit 902, configured to obtain a training sample, where the training sample includes a plurality of sample entries, and the sample entries include image content and text content, and label information about whether the image content matches the text content;
a training unit 903, configured to train the image-text matching model through the training sample;
and a multi-modal feature fusion model generation unit 904, configured to generate the multi-modal feature fusion model from the feature fusion module in the image-text matching model after the training is completed, so as to perform image feature extraction and text feature extraction on an input target object and fuse the image features and text features to generate the image-text cross features.
Corresponding to the third embodiment, an embodiment of the present application further provides a model processing apparatus, where the target model is formed by combining a feature generation model and a discrimination model; referring to fig. 10, the apparatus includes:
a first training unit 1001, configured to train the feature generation model, and determine parameter values therein; the feature generation model is used for fusing image features and text features of the target object to generate image-text cross features;
a training sample obtaining unit 1002, configured to obtain a training sample, where the training sample includes a plurality of target object groups, each target object group includes at least two target objects and labeling information about whether the at least two target objects have commonality on a target attribute, and each target object in the group is associated with text information and image information;
a second training unit 1003, configured to train the target model by using the training sample, where a parameter value determined in a training process of the feature generation model is used as an initial value of a parameter when the target model is trained.
Corresponding to the fourth embodiment, the embodiment of the present application further provides an apparatus for providing a recognition result, and referring to fig. 11, the apparatus may include:
the information receiving unit 1101 is configured to receive text information and image information of an object to be recognized, which are submitted by a user, where the image information is generated by acquiring an image of a real object corresponding to the object to be recognized;
a target object group generating unit 1102, configured to respectively form the object to be identified and a plurality of commodity objects in a commodity object library into target object groups; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or their degree of proximity to genuine products;
a similarity recognition unit 1103, configured to input text information and image information of each target object in the target object group into a target model for processing, where the target model is configured to perform fusion processing on text features and image features of each target object, and determine similarity between the target objects according to the obtained image-text cross features; the target model comprises a multi-modal feature fusion model, and the multi-modal feature fusion model is used for performing fusion processing on text features and image features to obtain the image-text cross features in the process of executing a task of judging whether the text features and the image features are matched or not;
an identification result providing unit 1104 for providing identification result information on whether the object to be identified is a genuine product or a degree of proximity to the genuine product, according to a tag corresponding to a target merchandise object whose similarity to the object to be identified satisfies a condition.
In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
Fig. 12 illustrates an architecture of an electronic device, which may include, in particular, a processor 1210, a video display adapter 1211, a disk drive 1212, an input/output interface 1213, a network interface 1214, and a memory 1220. The processor 1210, video display adapter 1211, disk drive 1212, input/output interface 1213, network interface 1214, and memory 1220 may be communicatively coupled via a communication bus 1230 as described above.
The processor 1210 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided in the present Application.
The Memory 1220 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1220 may store an operating system 1221 for controlling the operation of the electronic device 1200, a Basic Input Output System (BIOS) for controlling low-level operations of the electronic device 1200. In addition, a web browser 1223, a data storage management system 1224, and an object recognition processing system 1225, and the like may also be stored. The object recognition processing system 1225 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided in the present application is implemented by software or firmware, the relevant program codes are stored in the memory 1220 and called for execution by the processor 1210.
The input/output interface 1213 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1214 is used for connecting a communication module (not shown in the figure), so that the device can communicate and interact with other devices. The communication module can communicate in a wired mode (such as USB or a network cable) or in a wireless mode (such as a mobile network, WIFI or Bluetooth).
Bus 1230 includes a path that transfers information between various components of the device, such as processor 1210, video display adapter 1211, disk drive 1212, input/output interface 1213, network interface 1214, and memory 1220.
It should be noted that although the above-mentioned device only shows the processor 1210, the video display adapter 1211, the disk drive 1212, the input/output interface 1213, the network interface 1214, the memory 1220 and the bus 1230, in a specific implementation the device may also include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the device described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figure.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The object identification method and the electronic device provided by the present application are introduced in detail, and a specific example is applied in the description to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understanding the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific embodiments and the application range may be changed. In view of the above, the description should not be taken as limiting the application.

Claims (14)

1. An object recognition method, comprising:
determining a target object group to be identified, wherein the target object group comprises at least two target objects to be identified, and the target objects are associated with text information and image information;
inputting the text information and the image information of each target object in the target object group into a target model for recognition processing to obtain an object recognition result; the target model is used for respectively fusing the text features and the image features of each target object, and judging whether the target objects have commonality on target attributes according to the obtained image-text cross features of the target objects;
the target model comprises a multi-mode feature fusion model which is used for fusing the text features and the image features to obtain the image-text cross features in the process of executing the task of judging whether the text features and the image features are matched or not.
2. The method of claim 1,
the target model comprises a feature generation model and a discrimination model, wherein the feature generation model comprises a feature extraction model and a multi-modal feature fusion model;
the feature extraction model is to: respectively extracting text features and image features of each target object;
the multi-modal feature fusion model is used to: fusing the text features and the image features, and outputting image-text cross features of each target object;
and the discrimination model is used for judging whether the target objects have commonality on the target attributes according to the image-text cross characteristics of the target objects.
3. The method of claim 2,
the target model is a set of models formed by combining a feature generation model and a discrimination model.
4. The method of claim 3,
before training the target model, further comprising:
training the feature generation model, and obtaining a parameter learning result in the feature generation model according to a training result;
and taking the parameter learning result as a parameter initial value when the target model is trained.
5. The method of claim 1, further comprising:
providing a recognition result about whether the target object has commonality on the target attribute, receiving user feedback information about the accuracy of the recognition result, and performing iterative training on the target model according to the user feedback information.
6. The method according to any one of claims 1 to 5,
the target object comprises a commodity object;
the determining whether the objects have commonality on the target attribute comprises:
and determining whether the commodity objects in the same commodity object group are same-style or similar-style commodity objects.
7. The method of claim 6,
at least two commodity objects in the commodity object group correspond to the same category;
the target model comprises a plurality of different target models corresponding to different categories;
inputting the text information and the image information of each target object in the target object group into a target model for processing, wherein the processing comprises the following steps:
and inputting the text information and the image information of each commodity object into a target model corresponding to the target category for processing according to the target category to which each commodity object in the target commodity object group belongs.
8. The method of claim 6,
the determining of the target object group to be identified includes:
when commodity object recommendation is carried out according to commodity objects historically browsed by a user, the historically browsed commodity objects and other data objects in a commodity object library form a plurality of target commodity object groups.
9. The method of claim 6,
the determining of the target object group to be identified includes:
after receiving a request for searching for the same-style/similar-style commodity object of the target commodity object submitted by a user, the target commodity object and other data objects in the commodity object library form a plurality of target commodity object groups.
10. The method according to any one of claims 1 to 5,
the determining of the target object group to be identified includes:
receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated after image acquisition is carried out on a real object corresponding to the object to be recognized;
respectively forming the object to be identified and a plurality of commodity objects in a commodity object library into target object groups; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or their degree of proximity to genuine products;
the method further comprises the following steps:
and providing, according to the label corresponding to a target commodity object whose similarity with the object to be identified satisfies a condition, identification result information about whether the object to be identified is a genuine product or about its degree of proximity to a genuine product.
11. A method of model processing, comprising:
constructing an image-text matching model, wherein the image-text matching model is used for identifying whether an image and a text describe the same object;
acquiring a training sample, wherein the training sample comprises a plurality of sample entries, and the sample entries comprise image content, text content and annotation information about whether the image content is matched with the text content;
training the image-text matching model through the training sample;
after training is finished, generating the multi-modal feature fusion model from the feature fusion module in the image-text matching model, so that image feature extraction and text feature extraction are performed on an input target object, and the image features and text features are fused to generate the image-text cross features.
12. A method of providing recognition results, comprising:
receiving text information and image information of an object to be recognized submitted by a user, wherein the image information is generated after image acquisition is carried out on a real object corresponding to the object to be recognized;
respectively forming the object to be identified and a plurality of commodity objects in a commodity object library into target object groups; the commodity objects in the commodity object library are also associated with labels, and the labels are used for representing whether the corresponding commodity objects are genuine products or their degree of proximity to genuine products;
inputting the text information and the image information of each target object in the target object group into a target model for processing, wherein the target model is used for fusing the text characteristics and the image characteristics of each target object respectively and determining the similarity between the target objects according to the obtained image-text cross characteristics; the target model comprises a multi-modal feature fusion model, and the multi-modal feature fusion model is used for performing fusion processing on text features and image features to obtain the image-text cross features in the process of executing a task of judging whether the text features and the image features are matched or not;
and providing, according to the label corresponding to a target commodity object whose similarity with the object to be identified satisfies a condition, identification result information about whether the object to be identified is a genuine product or about its degree of proximity to a genuine product.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 12.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 12.
CN202111117601.7A 2021-09-23 2021-09-23 Object identification method and electronic equipment Pending CN113989476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111117601.7A CN113989476A (en) 2021-09-23 2021-09-23 Object identification method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111117601.7A CN113989476A (en) 2021-09-23 2021-09-23 Object identification method and electronic equipment

Publications (1)

Publication Number Publication Date
CN113989476A true CN113989476A (en) 2022-01-28

Family

ID=79736450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111117601.7A Pending CN113989476A (en) 2021-09-23 2021-09-23 Object identification method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113989476A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863230A (en) * 2022-04-06 2022-08-05 阿里巴巴(中国)有限公司 Image processing method, counterfeit recognition method and electronic equipment
CN114863230B (en) * 2022-04-06 2024-05-28 阿里巴巴(中国)有限公司 Image processing method, fake goods identification method and electronic equipment
CN114898015A (en) * 2022-05-14 2022-08-12 云知声智能科技股份有限公司 Image generation method and device, electronic equipment and storage medium
CN114896979A (en) * 2022-05-26 2022-08-12 阿里巴巴(中国)有限公司 Data processing method, device and storage medium
CN115860587A (en) * 2023-03-02 2023-03-28 广州市玄武无线科技股份有限公司 Visit assessment method, device, equipment and storage medium based on image-text matching
CN116661940A (en) * 2023-08-02 2023-08-29 腾讯科技(深圳)有限公司 Component identification method, device, computer equipment and storage medium
CN116661940B (en) * 2023-08-02 2024-01-09 腾讯科技(深圳)有限公司 Component identification method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
US20230005178A1 (en) Method and apparatus for retrieving target
CN113989476A (en) Object identification method and electronic equipment
CN110134931B (en) Medium title generation method, medium title generation device, electronic equipment and readable medium
US11282133B2 (en) Augmented reality product comparison
CN109034203B (en) Method, device, equipment and medium for training expression recommendation model and recommending expression
CN111680219A (en) Content recommendation method, device, equipment and readable storage medium
CN113469298B (en) Model training method and resource recommendation method
CN111523413A (en) Method and device for generating face image
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN111291765A (en) Method and device for determining similar pictures
CN108062377A (en) The foundation of label picture collection, definite method, apparatus, equipment and the medium of label
CN112884542A (en) Commodity recommendation method and device
CN110955659A (en) Method and system for processing data table
CN111639970A (en) Method for determining price of article based on image recognition and related equipment
Zhong et al. Predicting pinterest: Automating a distributed human computation
CN116894711A (en) Commodity recommendation reason generation method and device and electronic equipment
CN114398973A (en) Media content label identification method, device, equipment and storage medium
CN111160410A (en) Object detection method and device
JP2012194691A (en) Re-learning method and program of discriminator, image recognition device
CN113297520A (en) Page design auxiliary processing method and device and electronic equipment
CN118193806A (en) Target retrieval method, target retrieval device, electronic equipment and storage medium
CN117290542A (en) Video question-answering method, computer device and storage medium
CN115905472A (en) Business opportunity service processing method, business opportunity service processing device, business opportunity service processing server and computer readable storage medium
CN113688938A (en) Method for determining object emotion and method and device for training emotion classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination