CN117253245A - Multi-modal target detection method, device, computer equipment and storage medium - Google Patents
Multi-modal target detection method, device, computer equipment and storage medium
- Publication number: CN117253245A (application CN202311315163.4A)
- Authority: CN (China)
- Prior art keywords: model, labeling, training, text, target
- Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/1444—Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The embodiments of the invention disclose a multi-modal target detection method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring an image to be detected and a text instruction; inputting the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result; and outputting the detection result. The target detection model comprises an encoding model, a vector transformation model, a large language model, and a position decoding model, and is formed by training these four models with labeled images and text instructions as the sample set. By implementing the method provided by the embodiments of the invention, an improved multi-modal generative large model can be used for multi-modal target detection, achieving accurate description and accurate localization of equipment defects, environmental hazards, and personnel rule violations.
Description
Technical Field
The present invention relates to processing methods for multi-modal large models, and more particularly to a multi-modal target detection method and apparatus, a computer device, and a storage medium.
Background
In recent years, multi-modal generative large models have attracted extensive attention and research in the field of machine learning, achieved remarkable breakthroughs, and been widely applied across industries. However, current multi-modal large models can only produce text descriptions for different modalities; they lack the ability to localize targets of interest and cannot accurately output a bounding box or mask. Existing localization approaches for multi-modal generative large models come in two forms. In the first, a picture is divided into several sub-pictures, each sub-picture is fed into the multi-modal generative large model for judgment, and the target is coarsely localized from the position of the sub-picture within the original picture; this method has low localization precision, and the lack of whole-picture information can cause the model to misjudge. In the second, the multi-modal generative large model is chained with a conventional target detection framework: the large model first judges whether the target exists, and a target detection model is then invoked for localization. However, introducing a conventional target detection model increases the computational burden, and such models often perform poorly, with frequent false and missed detections, failing to meet the high-standard service requirements of the power industry.
These two localization approaches share the following problems. First, existing localization methods only localize preset categories; if the preset category is apple, a watermelon cannot be accurately localized. Second, existing localization methods are not linked to the user's instruction: a model trained to detect apples will detect all apples in the picture, but if the user's instruction is to detect only the apples on the table, the method fails because it cannot understand the user's refined instruction. In application fields such as power equipment defect operation and maintenance, corridor safety monitoring, and personnel behavior monitoring, accurate localization of equipment defect positions, corridor hazard positions, and violating personnel is required so that staff can quickly focus on the target area, and current large models cannot meet these service requirements.
Therefore, it is necessary to design a new method that performs multi-modal target detection with an improved multi-modal generative large model, providing accurate description and accurate localization of equipment defects, environmental hazards, and personnel rule violations.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-modal target detection method and apparatus, a computer device, and a storage medium.
To achieve the above object, the invention adopts the following technical scheme. A multi-modal target detection method comprises the following steps:
acquiring an image to be detected and a text instruction;
inputting the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result;
outputting the detection result;
wherein the target detection model comprises an encoding model, a vector transformation model, a large language model, and a position decoding model;
and the target detection model is formed by training the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set.
In a further technical scheme, inputting the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result comprises:
encoding the image to be detected with the encoding model to obtain a feature map;
tokenizing the text instruction to obtain a token sequence;
combining the feature map and the token sequence through the vector transformation model, then inputting the result into the trained large language model to answer the text instruction and obtain a text reply result;
tokenizing the text reply result to form a new token sequence;
and inputting the feature map, the token sequence, and the new token sequence into the position decoding model to identify target position information and obtain the detection result.
In a further technical scheme, the target detection model being formed by training the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set comprises:
acquiring initial images, and removing images with identical backgrounds from the initial images to obtain training images;
performing text description labeling and target positioning labeling on the training images to obtain labeling results;
acquiring training-time text instructions;
constructing the encoding model, the vector transformation model, the large language model, and the position decoding model;
performing forward propagation through the encoding model, the vector transformation model, the large language model, and the position decoding model with the labeling results and the training-time text instructions to obtain the targets' type and position information;
constructing a loss function;
and updating the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information, then combining the updated large language model and position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
In a further technical scheme, performing forward propagation through the encoding model, the vector transformation model, the large language model, and the position decoding model with the labeling results and the training-time text instructions to obtain the targets' type and position information comprises:
encoding the labeling result with the encoding model to obtain a relevant feature map;
tokenizing the training-time text instruction to obtain a training-time token sequence;
combining the relevant feature map and the training-time token sequence through the vector transformation model, then inputting the result into the large language model to answer the text instruction and obtain a training-time text reply result;
tokenizing the training-time text reply result to form a training-time new token sequence;
and inputting the relevant feature map, the training-time token sequence, and the training-time new token sequence into the position decoding model to identify target position information and obtain the targets' type and position information.
In a further technical scheme, the loss function comprises a loss between the large language model's output and the text description content in the labeling result, and a loss between the position decoding model's output and the target positioning information in the labeling result.
In a further technical scheme, updating the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information, then combining the updated large language model and position decoding model with the encoding model and the vector transformation model to obtain the target detection model comprises:
computing gradient information from the loss function;
updating the large language model and the position decoding model with the gradient information;
and combining the updated large language model and the position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
The invention also provides a multi-modal target detection apparatus, which comprises:
an acquisition unit, configured to acquire an image to be detected and a text instruction;
a target detection unit, configured to input the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result;
and an output unit, configured to output the detection result.
In a further technical scheme, the apparatus further comprises a target detection model generating unit, configured to train the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set to obtain the target detection model.
The invention also provides a computer device, which comprises a memory and a processor; the memory stores a computer program, and the processor implements the above method when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above method.
Compared with the prior art, the invention has the following beneficial effects. By acquiring the image to be detected and the text instruction during power operation and maintenance and employing a target detection model that combines a large language model with visual localization, the invention accurately localizes the target associated with the text instruction, which is input through the terminal according to actual needs. Not only can target types that appear in the training data be localized, but target types that do not appear can also be localized effectively, achieving multi-modal target detection with an improved multi-modal generative large model and accurate description and accurate localization of equipment defects, environmental hazards, and personnel rule violations.
The invention is further described below with reference to the drawings and specific embodiments.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application scenario of a multi-modal target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a multi-modal target detection method according to an embodiment of the present invention;
FIG. 3 is a schematic sub-flowchart of a multi-modal target detection method according to an embodiment of the present invention;
FIG. 4 is a schematic sub-flowchart of a multi-modal target detection method according to an embodiment of the present invention;
FIG. 5 is a schematic sub-flowchart of a multi-modal target detection method according to an embodiment of the present invention;
FIG. 6 is a schematic sub-flowchart of a multi-modal target detection method according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a multi-modal target detection apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of the target detection unit of the multi-modal target detection apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic view of an application scenario of a multi-modal target detection method according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of the method. The multi-modal target detection method is applied to a server. The server exchanges data with a terminal and a camera: the image to be detected is captured by the camera, and the text instruction is input through the terminal. On the basis of the original multi-modal generative large model, i.e., the large language model, visual localization components such as a position decoding model are introduced to achieve accurate localization of equipment defects, environmental hazards, and violating personnel. Not only can target types that appear in the training data be localized, but target types that do not appear can also be localized effectively. A fusion step linking the generative multi-modal model's description with visual localization is introduced, so the desired target can be localized according to the user's instruction (for example, localizing only persons standing in a dangerous area and not those standing in a safe area), significantly enhancing instruction understanding.
FIG. 2 is a flowchart of a multi-modal target detection method according to an embodiment of the present invention. As shown in FIG. 2, the method includes the following steps S110 to S130.
S110, acquiring an image to be detected and a text instruction.
In this embodiment, the image to be detected comes from image or video information captured by power transmission, transformation, and distribution monitoring cameras or by unmanned aerial vehicles.
The text instruction is an instruction related to the detection target, such as which type of target to detect or which position to examine, input by the user through the terminal according to actual needs.
S120, inputting the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result.
In this embodiment, the detection result refers to the position information, type information, and the like of the detected target.
Specifically, the target detection model comprises an encoding model, a vector transformation model, a large language model, and a position decoding model. The encoding model encodes field images/videos of equipment operation, environmental conditions, and personnel activity, i.e., the images to be detected, into a set of feature vectors that the large language model can understand. Its input is the image/video and its output is the feature vector set. Its structure includes, but is not limited to, the common Vision Transformer architecture, a CNN architecture, or a mix of the two.
The vector transformation model transforms the encoded image/video feature vector set into the input space of the large language model, so that the large language model can understand the content of the image/video feature vectors. Its input is the encoding model's feature vector set together with the user's question, i.e., the text instruction, and its output is a feature vector aligned to the large language model's input space. Its network structure includes, but is not limited to, a simple linear transformation layer or a complex nonlinear transformation (e.g., a multi-layer neural network). Because this middleware establishes a general mapping, it in effect translates the image features so that the large language model can understand the image's semantic information. The large language model is pre-trained on a vast number of samples and has broad understanding of the world, so once it understands the image's semantic information, it can apply its own understanding to data it has never seen.
The large language model serves as the reasoning brain of the whole target detection model: it reasons over the human-input text instruction and the image content, i.e., the feature map, and outputs an answer to the text instruction. Its input is the vector output by the vector transformation model together with the text instruction, and its output is a text answer. Large language model structures include, but are not limited to, the common LLaMA, LLaMA 2, Vicuna, ChatGLM, and the like.
The position decoding model obtains the target's position information, which includes, but is not limited to, the target's center point coordinates, bounding box coordinates, and mask coordinates. Its input is the feature vector set output by the image/video encoder together with the user instruction information, and its output is the target positioning information. Its structure includes, but is not limited to, a CNN/BERT/FPN decoder with cross-attention. Because the user instruction information is encoded as well, the constructed features contain the user's intent, so the position decoder can output positioning information in accordance with the user's intention.
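Purely as illustration, the data flow among these four components might be sketched in PyTorch as follows. The encoding model and the large language model are treated as external black boxes, and every module choice, dimension, and the box parameterization below are assumptions made for the sketch, not the patent's specification:

```python
import torch
import torch.nn as nn

class MultiModalDetectorSketch(nn.Module):
    """Illustrative wiring of the vector transformation model and the
    position decoding model; image features and text embeddings are
    assumed to come from an external encoder and LLM."""

    def __init__(self, img_dim=1024, llm_dim=4096):
        super().__init__()
        # vector transformation model: aligns image features to the LLM input space
        self.projector = nn.Linear(img_dim, llm_dim)
        # position decoding model: cross-attends text tokens to image features
        layer = nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.pos_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(llm_dim, 4)  # center x, center y, width, height

    def forward(self, image_feats, instr_embeds, reply_embeds):
        # image_feats: (B, V, img_dim) from the encoding model;
        # reply_embeds: embedded tokens of the LLM's text answer.
        vis = self.projector(image_feats)                      # (B, V, llm_dim)
        text = torch.cat([instr_embeds, reply_embeds], dim=1)  # (B, T, llm_dim)
        fused = self.pos_decoder(tgt=text, memory=vis)         # instruction-aware fusion
        return self.box_head(fused.mean(dim=1))                # (B, 4) box estimate
```

Because the instruction tokens participate in the cross-attention, the decoded position depends on what the user asked for, which is the intent-awareness described above.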
In one embodiment, referring to FIG. 3, the step S120 may include steps S121 to S125.
S121, encoding the image to be detected with the encoding model to obtain a feature map.
In this embodiment, the feature map refers to information composed of the different features of each object in the image to be detected.
Specifically, the image to be detected is first scaled to a fixed size, mean-subtracted, and variance-normalized, and then processed by the encoding model to form the corresponding feature map.
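A minimal sketch of this preprocessing, with a placeholder input size and placeholder statistics (the patent does not fix these values):

```python
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, size=(448, 448),
               mean: float = 0.5, var: float = 0.25) -> torch.Tensor:
    """Resize a (C, H, W) image, subtract the mean, divide by the variance."""
    image = F.interpolate(image.unsqueeze(0), size=size,
                          mode="bilinear", align_corners=False).squeeze(0)
    return (image - mean) / var
```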
S122, tokenizing the text instruction to obtain a token sequence.
In this embodiment, the token sequence refers to the sequence formed by tokenizing the text instruction.
S123, combining the feature map and the token sequence through the vector transformation model, then inputting the result into the trained large language model to answer the text instruction and obtain a text reply result.
In this embodiment, the text reply result refers to the text reply content related to the token sequence.
Specifically, the feature map and the token sequence are combined and processed by the vector transformation model to form feature vectors that the large language model can understand.
Specifically, when the feature map and the token sequence are input together, both are first uniformly mapped into vector sequences, the vector sequences are then spliced, and the spliced vector sequence is input into the large language model.
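As a minimal sketch of this splicing step, assuming a HuggingFace-transformers-style interface for a hypothetical tokenizer and llm, with projector standing in for the vector transformation model (all four names are assumptions defined elsewhere):

```python
import torch

# Map the token sequence into the LLM's embedding space, splice it with the
# projected image features along the sequence dimension, and feed the whole
# spliced sequence to the large language model.
token_ids = tokenizer(text_instruction, return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(token_ids)    # (1, T, D)
vis_embeds = projector(image_feats)                    # (1, V, D)
spliced = torch.cat([vis_embeds, text_embeds], dim=1)  # (1, V + T, D)
reply = llm(inputs_embeds=spliced)                     # answer to the instruction
```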
S124, tokenizing the text reply result to form a new token sequence.
In this embodiment, the new token sequence refers to the sequence formed by tokenizing the text reply result.
S125, inputting the feature map, the token sequence, and the new token sequence into the position decoding model to identify target position information and obtain the detection result.
In this embodiment, when the feature map, the token sequence, and the new token sequence are input together, the sequences are uniformly mapped into vector sequences, the vector sequences are spliced, and the spliced vector sequence is input into the position decoding model.
The target detection model is formed by training the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set.
In an embodiment, referring to FIG. 4, training the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set to form the target detection model may include steps S120a to S120g.
S120a, acquiring initial images, and removing images with identical backgrounds from the initial images to obtain training images.
In this embodiment, the initial images refer to collected picture or video information captured by on-site power transmission, transformation, and distribution monitoring cameras or by unmanned aerial vehicles;
the training images are obtained by removing image or video data with identical backgrounds from the initial images, yielding unlabeled data with rich and varied backgrounds and target forms.
S120b, performing text description labeling and target positioning labeling on the training images to obtain labeling results.
In this embodiment, a labeling result refers to a training image carrying the text description annotation and the target positioning information.
Specifically, a labeling tool is used to annotate the pictures or videos, covering two aspects. First, a text description of the training image: taking mountain fire smoke as an example, this describes whether mountain fire smoke is present in the picture or video, the trend of its development, and its influence on the power transmission line. Second, the targets in the picture or video are localized and the corresponding information is annotated, in forms including, but not limited to, bounding boxes, masks, and drawn lines. A hypothetical annotation record combining both aspects is sketched below.
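This record format is purely illustrative (the patent does not prescribe a file format); the file name, description text, and coordinates are invented placeholders:

```python
# One hypothetical annotation record for a training image.
sample = {
    "image": "tower_0001.jpg",  # placeholder file name
    "description": "Mountain fire smoke is rising to the left of the "
                   "transmission line and drifting toward the conductors.",
    "targets": [
        {"label": "mountain fire smoke",
         "box": [0.41, 0.22, 0.18, 0.27]},  # normalized cx, cy, w, h
    ],
}
```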
S120c, acquiring training-time text instructions.
In this embodiment, the training-time text instructions take the same form as the text instruction acquired in step S110 and are not described again here.
S120d, constructing the encoding model, the vector transformation model, the large language model, and the position decoding model.
S120e, performing forward propagation through the encoding model, the vector transformation model, the large language model, and the position decoding model with the labeling results and the training-time text instructions to obtain the targets' type and position information.
In one embodiment, referring to FIG. 5, the step S120e may include steps S120e1 to S120e5. S120e1, encoding the labeling result with the encoding model to obtain a relevant feature map;
S120e2, tokenizing the training-time text instruction to obtain a training-time token sequence;
S120e3, combining the relevant feature map and the training-time token sequence through the vector transformation model, then inputting the result into the large language model to answer the text instruction and obtain a training-time text reply result;
S120e4, tokenizing the training-time text reply result to form a training-time new token sequence;
S120e5, inputting the relevant feature map, the training-time token sequence, and the training-time new token sequence into the position decoding model to identify target position information and obtain the targets' type and position information.
In this embodiment, steps S120e1 to S120e5 are consistent with steps S121 to S125 above and are not described in detail here.
S120f, constructing a loss function.
In this embodiment, the loss function comprises a loss between the large language model's output and the text description content in the labeling result, and a loss between the position decoding model's output and the target positioning information in the labeling result.
Specifically, the loss between the large language model's output and the labeled text description includes, but is not limited to, cross-entropy loss, MSE loss, and the like; the loss between the position decoding model's output and the labeled target positioning information includes, but is not limited to, Smooth L1 loss, MSE loss, IoU loss, and the like.
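A minimal sketch of one such two-part loss, choosing cross-entropy for the text term and Smooth L1 for the positioning term (both the loss types and the weight are illustrative choices, since the patent allows several alternatives):

```python
import torch.nn.functional as F

def total_loss(text_logits, text_labels, pred_boxes, gt_boxes, box_weight=1.0):
    """Text loss on the generated description plus a localization loss."""
    text_loss = F.cross_entropy(text_logits.flatten(0, 1),  # (B*T, vocab)
                                text_labels.flatten())      # (B*T,)
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)       # position term
    return text_loss + box_weight * box_loss
```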
S120g, updating the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information, then combining the updated large language model and position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
In an embodiment, referring to FIG. 6, the step S120g may include steps S120g1 to S120g3.
S120g1, computing gradient information from the loss function.
In this embodiment, the gradient information refers to the gradient of each model parameter during training.
S120g2, updating the large language model and the position decoding model with the gradient information.
In this embodiment, updating means repeating step S120e until the loss value of the corresponding model on the test set no longer decreases, i.e., until the model converges.
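A hypothetical training step matching this scheme, in which only the large language model and the position decoding model are handed to the optimizer while the encoding model and the vector transformation model stay fixed (llm, pos_decoder, loader, and forward_and_loss are assumed names for objects built in the preceding steps):

```python
import torch

optimizer = torch.optim.AdamW(
    list(llm.parameters()) + list(pos_decoder.parameters()), lr=1e-5)

for images, instructions, labels in loader:
    loss = forward_and_loss(images, instructions, labels)  # steps S120e + S120f
    optimizer.zero_grad()
    loss.backward()   # back-propagate gradient information (S120g1)
    optimizer.step()  # update the LLM and the position decoder (S120g2)
```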
S120g3, combining the updated large language model and the position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
S130, outputting the detection result.
In this embodiment, the detection result is output to the terminal for display.
According to the above power operation and maintenance detection method, by acquiring the image to be detected and the text instruction during power operation and maintenance and employing a target detection model that combines a large language model with visual localization, the target associated with the text instruction is accurately localized. The text instruction is input through the terminal according to actual needs; not only can target types that appear in the training data be localized, but target types that do not appear can also be localized effectively, achieving multi-modal target detection with an improved multi-modal generative large model and accurate description and accurate localization of equipment defects, environmental hazards, and personnel rule violations.
FIG. 7 is a schematic block diagram of a multi-modal target detection apparatus 300 according to an embodiment of the present invention. As shown in FIG. 7, the present invention further provides a multi-modal target detection apparatus 300 corresponding to the above multi-modal target detection method. The multi-modal target detection apparatus 300 includes units for performing the above method and may be configured in a server. Specifically, referring to FIG. 7, the multi-modal target detection apparatus 300 includes an acquisition unit 301, a target detection unit 302, and an output unit 303.
An acquiring unit 301, configured to acquire an image to be detected and a text instruction; the target detection unit 302 is configured to input the image to be detected and the text instruction into a target detection model to perform target detection, so as to obtain a detection result; an output unit 303, configured to output the detection result.
In an embodiment, the apparatus further includes a target detection model generating unit, configured to train the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set to obtain the target detection model.
In one embodiment, as shown in FIG. 8, the target detection unit 302 includes an encoding subunit 3021, a tokenizing subunit 3022, a reply subunit 3023, a re-tokenizing subunit 3024, and an identification subunit 3025.
The encoding subunit 3021 is configured to encode the image to be detected with the encoding model to obtain a feature map; the tokenizing subunit 3022 is configured to tokenize the text instruction to obtain a token sequence; the reply subunit 3023 is configured to combine the feature map and the token sequence through the vector transformation model and input the result into the trained large language model to answer the text instruction and obtain a text reply result; the re-tokenizing subunit 3024 is configured to tokenize the text reply result to form a new token sequence; and the identification subunit 3025 is configured to input the feature map, the token sequence, and the new token sequence into the position decoding model to identify target position information and obtain the detection result.
In an embodiment, the target detection model generating unit includes a training data generation subunit, a labeling subunit, an instruction acquisition subunit, a model construction subunit, a training subunit, a loss function construction subunit, and an update subunit.
The training data generation subunit is configured to acquire initial images and remove images with identical backgrounds to obtain training images; the labeling subunit is configured to perform text description labeling and target positioning labeling on the training images to obtain labeling results; the instruction acquisition subunit is configured to acquire training-time text instructions; the model construction subunit is configured to construct the encoding model, the vector transformation model, the large language model, and the position decoding model; the training subunit is configured to perform forward propagation through the four models with the labeling results and the training-time text instructions to obtain the targets' type and position information; the loss function construction subunit is configured to construct a loss function; and the update subunit is configured to update the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information, then combine the updated large language model and position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
In one embodiment, the training subunit includes an encoding processing module, a tokenizing module, a reply module, a re-tokenizing module, and an identification module.
The encoding processing module is configured to encode the labeling result with the encoding model to obtain a relevant feature map; the tokenizing module is configured to tokenize the training-time text instruction to obtain a training-time token sequence; the reply module is configured to combine the relevant feature map and the training-time token sequence through the vector transformation model and input the result into the large language model to answer the text instruction and obtain a training-time text reply result; the re-tokenizing module is configured to tokenize the training-time text reply result to form a training-time new token sequence; and the identification module is configured to input the relevant feature map, the training-time token sequence, and the training-time new token sequence into the position decoding model to identify target position information and obtain the targets' type and position information.
In one embodiment, the update subunit includes a gradient information calculation module, a model update module, and a combination module.
The gradient information calculation module is used for calculating gradient information of the loss function; the model updating module is used for updating the large language model and the position decoding model by utilizing the gradient information; and the combining module is used for combining the updated large language model and the position decoding model with the coding model and the vector transformation model to obtain a target detection model.
It should be noted that, as will be clearly understood by those skilled in the art, the specific implementation of the multi-modal target detection apparatus 300 and of each unit may refer to the corresponding descriptions in the foregoing method embodiments and, for convenience and brevity, is not repeated here.
The above multi-modal target detection apparatus 300 may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 9.
Referring to FIG. 9, FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, and the server may be a stand-alone server or a server cluster formed by multiple servers.
With reference to FIG. 9, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform a multi-modal object detection method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a multi-modal object detection method.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in FIG. 9 is merely a block diagram of the portion of the structure relevant to the present application and does not limit the computer device 500 to which the present application is applied; a particular computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to execute a computer program 5032 stored in a memory to implement the steps of:
acquiring an image to be detected and a text instruction; inputting the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result; and outputting the detection result;
wherein the target detection model comprises an encoding model, a vector transformation model, a large language model, and a position decoding model;
and the target detection model is formed by training the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set.
In an embodiment, when implementing the step of inputting the image to be detected and the text instruction into the target detection model for target detection to obtain a detection result, the processor 502 specifically implements the following steps:
encoding the image to be detected with the encoding model to obtain a feature map; tokenizing the text instruction to obtain a token sequence; combining the feature map and the token sequence through the vector transformation model, then inputting the result into the trained large language model to answer the text instruction and obtain a text reply result; tokenizing the text reply result to form a new token sequence; and inputting the feature map, the token sequence, and the new token sequence into the position decoding model to identify target position information and obtain the detection result.
In one embodiment, when implementing the training of the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set to form the target detection model, the processor 502 specifically implements the following steps:
acquiring initial images, and removing images with identical backgrounds from the initial images to obtain training images; performing text description labeling and target positioning labeling on the training images to obtain labeling results; acquiring training-time text instructions; constructing the encoding model, the vector transformation model, the large language model, and the position decoding model; performing forward propagation through the four models with the labeling results and the training-time text instructions to obtain the targets' type and position information; constructing a loss function; and updating the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information, then combining the updated large language model and position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
The loss function comprises a loss between the large language model's output and the text description content in the labeling result, and a loss between the position decoding model's output and the target positioning information in the labeling result.
In one embodiment, when implementing the step of performing forward propagation through the encoding model, the vector transformation model, the large language model, and the position decoding model with the labeling results and the training-time text instructions to obtain the targets' type and position information, the processor 502 specifically implements the following steps:
encoding the labeling result with the encoding model to obtain a relevant feature map; tokenizing the training-time text instruction to obtain a training-time token sequence; combining the relevant feature map and the training-time token sequence through the vector transformation model, then inputting the result into the large language model to answer the text instruction and obtain a training-time text reply result; tokenizing the training-time text reply result to form a training-time new token sequence; and inputting the relevant feature map, the training-time token sequence, and the training-time new token sequence into the position decoding model to identify target position information and obtain the targets' type and position information.
In an embodiment, when implementing the updating of the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information and the combining of the updated models with the encoding model and the vector transformation model to obtain the target detection model, the processor 502 specifically implements the following steps:
computing gradient information from the loss function; updating the large language model and the position decoding model with the gradient information; and combining the updated large language model and the position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
It should be appreciated that, in embodiments of the present application, the processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those skilled in the art will appreciate that all or part of the flows in the methods of the above embodiments may be accomplished by a computer program instructing the relevant hardware. The computer program comprises program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the above method embodiments.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring an image to be detected and a text instruction; inputting the image to be detected and the text instruction into a target detection model for target detection to obtain a detection result; and outputting the detection result;
wherein the target detection model comprises an encoding model, a vector transformation model, a large language model, and a position decoding model;
and the target detection model is formed by training the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set.
In one embodiment, when the processor executes the computer program to implement the step of inputting the image to be detected and the text instruction into the target detection model for target detection, the processor specifically implements the following steps:
encoding the image to be detected with the encoding model to obtain a feature map; tokenizing the text instruction to obtain a token sequence; combining the feature map and the token sequence through the vector transformation model, then inputting the result into the trained large language model to answer the text instruction and obtain a text reply result; tokenizing the text reply result to form a new token sequence; and inputting the feature map, the token sequence, and the new token sequence into the position decoding model to identify target position information and obtain the detection result.
In one embodiment, when the processor executes the computer program to implement the training of the encoding model, the vector transformation model, the large language model, and the position decoding model with labeled images and text instructions as the sample set to form the target detection model, the processor specifically implements the following steps:
acquiring initial images, and removing images with identical backgrounds from the initial images to obtain training images; performing text description labeling and target positioning labeling on the training images to obtain labeling results; acquiring training-time text instructions; constructing the encoding model, the vector transformation model, the large language model, and the position decoding model; performing forward propagation through the four models with the labeling results and the training-time text instructions to obtain the targets' type and position information; constructing a loss function; and updating the large language model and the position decoding model by gradient back-propagation using the loss function and the targets' type and position information, then combining the updated large language model and position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
The loss function comprises a loss between the large language model's output and the text description content in the labeling result, and a loss between the position decoding model's output and the target positioning information in the labeling result.
In one embodiment, when the processor executes the computer program to implement the step of performing forward propagation through the four models with the labeling results and the training-time text instructions, the processor specifically implements the following steps:
encoding the labeling result with the encoding model to obtain a relevant feature map; tokenizing the training-time text instruction to obtain a training-time token sequence; combining the relevant feature map and the training-time token sequence through the vector transformation model, then inputting the result into the large language model to answer the text instruction and obtain a training-time text reply result; tokenizing the training-time text reply result to form a training-time new token sequence; and inputting the relevant feature map, the training-time token sequence, and the training-time new token sequence into the position decoding model to identify target position information and obtain the targets' type and position information.
In one embodiment, when the processor executes the computer program to implement the updating of the large language model and the position decoding model by gradient back-propagation and the combining of the updated models with the encoding model and the vector transformation model to obtain the target detection model, the processor specifically implements the following steps:
computing gradient information from the loss function; updating the large language model and the position decoding model with the gradient information; and combining the updated large language model and the position decoding model with the encoding model and the vector transformation model to obtain the target detection model.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any of various other computer-readable storage media that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other division manners are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the method of the embodiments of the present invention may be reordered, combined and deleted according to actual needs. The units in the apparatus of the embodiments of the present invention may be combined, divided and deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (10)
1. A multi-modal target detection method, characterized by comprising the following steps:
acquiring an image to be detected and a text instruction;
inputting the image to be detected and the text instruction into a target detection model to carry out target detection so as to obtain a detection result;
outputting the detection result;
the target detection model comprises a coding model, a vector transformation model, a large language model and a position decoding model;
the target detection model is formed by training the coding model, the vector transformation model, the large language model and the position decoding model by using annotated images and text instructions as a sample set.
2. The multi-modal target detection method according to claim 1, wherein the inputting the image to be detected and the text instruction into a target detection model to carry out target detection so as to obtain a detection result comprises:
coding the image to be detected through the coding model to obtain a feature map;
tokenizing the text instruction to obtain a token sequence;
combining the feature map and the token sequence through the vector transformation model and inputting the combination into the trained large language model to reply to the text instruction, so as to obtain a text reply result;
tokenizing the text reply result to form a new token sequence;
and inputting the feature map, the token sequence and the new token sequence into the position decoding model to identify target position information, so as to obtain the detection result.
3. The multi-modal target detection method according to claim 1, wherein the target detection model is formed by training the coding model, the vector transformation model, the large language model and the position decoding model by using annotated images and text instructions as a sample set, and the training comprises:
acquiring initial images, and removing images with the same background from the initial images to obtain training images;
performing text description labeling and target positioning information labeling on the training images to obtain labeling results;
acquiring a training-time text instruction;
constructing a coding model, a vector transformation model, a large language model and a position decoding model;
performing forward propagation training on the coding model, the vector transformation model, the large language model and the position decoding model by using the labeling results and the training-time text instruction to obtain the type and position information of the target;
constructing a loss function;
and updating the large language model and the position decoding model by gradient backpropagation using the loss function and the type and position information of the target, and combining the updated large language model and position decoding model with the coding model and the vector transformation model to obtain the target detection model.
4. The multi-modal target detection method according to claim 3, wherein the performing forward propagation training on the coding model, the vector transformation model, the large language model and the position decoding model by using the labeling results and the training-time text instruction to obtain the type and position information of the target comprises:
coding the labeling results through the coding model to obtain a relevant feature map;
tokenizing the training-time text instruction to obtain a training-time token sequence;
combining the relevant feature map and the training-time token sequence through the vector transformation model and inputting the combination into the large language model to reply to the text instruction, so as to obtain a training-time text reply result;
tokenizing the training-time text reply result to form a new training-time token sequence;
and inputting the relevant feature map, the training-time token sequence and the new training-time token sequence into the position decoding model to identify target position information, so as to obtain the type and position information of the target.
5. The multi-modal target detection method according to claim 3, wherein the loss function comprises a loss term between the text description output by the large language model and the text description in the labeling result, and a loss term between the target positioning information output by the position decoding model and the target positioning information in the labeling result.
6. The multi-modal target detection method according to claim 3, wherein the updating the large language model and the position decoding model by gradient backpropagation using the loss function and the type and position information of the target, and combining the updated large language model and position decoding model with the coding model and the vector transformation model to obtain the target detection model comprises:
calculating gradient information of the loss function;
updating the large language model and the position decoding model by using the gradient information;
and combining the updated large language model and the position decoding model with the coding model and the vector transformation model to obtain a target detection model.
7. A multi-modal target detection apparatus, characterized by comprising:
an acquisition unit, used for acquiring an image to be detected and a text instruction;
a target detection unit, used for inputting the image to be detected and the text instruction into a target detection model to carry out target detection so as to obtain a detection result;
and an output unit, used for outputting the detection result.
8. The multi-modal target detection apparatus according to claim 7, further comprising a target detection model generation unit, which trains the coding model, the vector transformation model, the large language model and the position decoding model by using annotated images and text instructions as a sample set to obtain the target detection model.
9. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any one of claims 1 to 6.
10. A storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311315163.4A CN117253245A (en) | 2023-10-11 | 2023-10-11 | Multi-mode target detection method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117253245A (en) | 2023-12-19
Family
ID=89136788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311315163.4A Pending CN117253245A (en) | 2023-10-11 | 2023-10-11 | Multi-mode target detection method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117253245A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118230009A (en) * | 2024-05-13 | 2024-06-21 | 高视科技(苏州)股份有限公司 | Semiconductor chip anomaly detection method and medium based on multi-mode feature matching |
CN118379742A (en) * | 2024-06-21 | 2024-07-23 | 杭州汇萃智能科技有限公司 | OCR recognition method, system and storage medium based on large model |
CN118430025A (en) * | 2024-07-03 | 2024-08-02 | 传申弘安智能(深圳)有限公司 | Intelligent multi-mode detection method and device in safety supervision field and computer equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN117253245A (en) | Multi-mode target detection method, device, computer equipment and storage medium | |
CN110503074B (en) | Information labeling method, device and equipment of video frame and storage medium | |
WO2018202089A1 (en) | Key point detection method and device, storage medium and electronic device | |
US11669977B2 (en) | Processing images to localize novel objects | |
WO2021236939A1 (en) | Recognition method, apparatus, and device, and storage medium | |
CN109544870B (en) | Alarm judgment method for intelligent monitoring system and intelligent monitoring system | |
CN117540221B (en) | Image processing method and device, storage medium and electronic equipment | |
CN114723646A (en) | Image data generation method with label, device, storage medium and electronic equipment | |
US11599974B2 (en) | Joint rolling shutter correction and image deblurring | |
CN115240157A (en) | Road scene data persistence method, device, equipment and computer readable medium | |
CN110191356A (en) | Video reviewing method, device and electronic equipment | |
WO2023225335A1 (en) | Performing computer vision tasks by generating sequences of tokens | |
CN113283347A (en) | Assembly work guidance method, device, system, server and readable storage medium | |
CN116778148A (en) | Target detection method, target detection device, electronic equipment and storage medium | |
CN108509876B (en) | Object detection method, device, apparatus, storage medium, and program for video | |
KR20230063135A (en) | Method and Apparatus for Spatio-temporal Action Localization Based on Hierarchical Structure | |
CN117114306A (en) | Information generation method, apparatus, electronic device and computer readable medium | |
CN116842384A (en) | Multi-mode model training method and device, electronic equipment and readable storage medium | |
US20220058779A1 (en) | Inpainting method and apparatus for human image, and electronic device | |
CN115249361A (en) | Instructional text positioning model training, apparatus, device, and medium | |
CN115908977A (en) | Image data labeling method and device, electronic equipment and storage medium | |
CN115147756A (en) | Video stream processing method and device, electronic equipment and storage medium | |
CN115273224A (en) | High-low resolution bimodal distillation-based video human body behavior identification method | |
CN114510932A (en) | Natural language processing method, electronic device, and storage medium | |
CN112634460B (en) | Outdoor panorama generation method and device based on Haar-like features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||