CN111476162A - Operation command generation method and device, electronic equipment and storage medium - Google Patents
Operation command generation method and device, electronic equipment and storage medium
- Publication number
- CN111476162A
- Authority
- CN
- China
- Prior art keywords
- features
- optical flow
- rgb
- video segment
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Multimedia (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Bioinformatics & Computational Biology (AREA)
- Image Analysis (AREA)
Abstract
The application discloses an operation command generation method and device, an electronic device, and a computer-readable storage medium. The method comprises: obtaining a training set, wherein the training set comprises a plurality of video segments labeled with operation commands, and each operation command comprises the hand of an operator, a subject object, an action, and a recipient object; extracting RGB (red, green, blue) features and optical flow features of each video segment, and fusing the RGB features and the optical flow features to obtain fused features; and training an LSTM network based on the fused features and the labeled operation command corresponding to each video segment, so as to output the operation command corresponding to a target video segment by using the trained LSTM network.
Description
Technical Field
The present application relates to the field of robotics, and more particularly, to an operation command generation method, an operation command generation apparatus, an electronic device, and a computer-readable storage medium.
Background
Learning operations from video is an important way for robots to gain new skills. In the related art, an original video is parsed using a syntax-based parser: the video is first decomposed into atomic commands that recognize the actions, subject objects, and recipient objects in it, and these are combined into an initial command. Then, based on the real environment, whether the left hand or the right hand is used is decided by calculating the minimum actual Euclidean distance between the robot and the subject and recipient objects. Finally, the parser combines the atomic commands into a generic command for the robot according to a predefined command-sequence syntax.
In the above scheme, a plurality of complex networks need to be designed and trained, such as an action recognition network, an object classification network, a subject-object classification network, and a recipient-object classification network. Moreover, the hand (left or right) that forms part of the operation command cannot be learned directly from the information in the video, so efficiency and accuracy are low.
Therefore, how to improve the efficiency and accuracy of generating the operation command is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide an operation command generation method and apparatus, an electronic device, and a computer-readable storage medium that improve the efficiency and accuracy of generating operation commands.
In order to achieve the above object, the present application provides an operation command generating method, including:
acquiring a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
extracting RGB (red, green and blue) features and optical flow features of each video segment, and fusing the RGB features and the optical flow features to obtain fusion features;
and training an LSTM network based on the fusion features and the labeled operation commands corresponding to each video segment, so as to output the operation command corresponding to a target video segment by using the trained LSTM network.
Wherein the extracting the RGB features and the optical flow features of each of the video segments comprises:
extracting RGB images and optical flow images from each video segment by using an OpenCV toolbox;
extracting the RGB features of each of the video segments from each of the RGB images and the optical flow features of each of the video segments from each of the optical flow images using a dual-stream 3D convolutional neural network.
Wherein, fusing the RGB features and the optical flow features to obtain fused features comprises:
and carrying out vector splicing on the RGB features and the optical flow features to obtain the fusion features.
Wherein the trained LSTM network comprises a first LSTM layer, a second LSTM layer, and a softmax layer;
the input of the first LSTM layer comprises the fused features of the target video segment, and the output comprises a hidden encoder vector sequence;
the input of the second LSTM layer comprises the hidden encoder vector sequence, and the output comprises a decoder vector sequence;
the input of the softmax layer comprises the decoder vector sequence, and the output comprises the operation command corresponding to the target video segment.
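For illustration only, the structure described above can be sketched in Python with PyTorch; the framework, the class name CommandLSTM, all dimensions, and the exact wiring of the previous-command one-hot vector into the second LSTM layer are assumptions of this sketch rather than values fixed by the application.

```python
import torch
import torch.nn as nn

class CommandLSTM(nn.Module):
    """Sketch of the encoder-decoder described above:
    first LSTM layer -> hidden encoder vectors,
    second LSTM layer -> decoder vectors,
    softmax layer -> atomic-command probabilities."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=64):
        super().__init__()
        # First LSTM layer: consumes the fused RGB/optical-flow features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Second LSTM layer: consumes the hidden encoder vectors concatenated
        # with a one-hot encoding of the previous atomic command.
        self.decoder = nn.LSTM(hidden_dim + vocab_size, hidden_dim, batch_first=True)
        # Softmax layer over the atomic-command vocabulary
        # (hands, subject objects, actions, recipient objects).
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feats, prev_commands_onehot):
        # fused_feats: (batch, time, feat_dim)
        enc_out, _ = self.encoder(fused_feats)              # hidden encoder vector sequence
        dec_in = torch.cat([enc_out, prev_commands_onehot], dim=-1)
        dec_out, _ = self.decoder(dec_in)                   # decoder vector sequence
        return torch.log_softmax(self.classifier(dec_out), dim=-1)
```

During training the one-hot vectors would come from the labeled operation command shifted by one step; during prediction they would come from the previously generated atomic command.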
To achieve the above object, the present application provides an operation command generating apparatus including:
the acquisition module is used for acquiring a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
the extraction module is used for extracting RGB (red, green and blue) features and optical flow features of each video segment and fusing the RGB features and the optical flow features to obtain fused features;
and the training module is used for training an LSTM network based on the fusion features and the labeled operation commands corresponding to each video segment, so that the trained LSTM network is used for outputting the operation command corresponding to the target video segment.
Wherein the extraction module comprises:
an extraction unit, configured to extract an RGB image and an optical flow image from each of the video segments using an OpenCV toolbox;
an extracting unit, configured to extract the RGB features of each video segment from each RGB image and extract the optical flow features of each video segment from each optical flow image by using a dual-stream 3D convolutional neural network;
and the fusion unit is used for fusing the RGB features and the optical flow features to obtain fusion features.
The fusion unit is specifically a unit that performs vector splicing on the RGB features and the optical flow features to obtain the fusion features.
Wherein the trained LSTM network comprises a first LSTM layer, a second LSTM layer, and a softmax layer;
the input of the first LSTM layer comprises the fused features of the target video segment, and the output comprises a hidden encoder vector sequence;
the input of the second LSTM layer comprises the hidden encoder vector sequence, and the output comprises a decoder vector sequence;
the input of the softmax layer comprises the decoder vector sequence, and the output comprises the operation command corresponding to the target video segment.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
and a processor for implementing the steps of the operation command generation method when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described operation command generating method.
According to the above scheme, the operation command generation method comprises: obtaining a training set, wherein the training set comprises a plurality of video segments labeled with operation commands, and each operation command comprises the hand of an operator, a subject object, an action, and a recipient object; extracting RGB features and optical flow features of each video segment, and fusing the RGB features and the optical flow features to obtain fused features; and training an LSTM network based on the fused features and the labeled operation command corresponding to each video segment, so as to output the operation command corresponding to a target video segment by using the trained LSTM network.
According to the operation command generation method provided by the application, the operation commands labeled in the training set include the hand of the operator, namely the left hand or the right hand, so that the trained LSTM (Long Short-Term Memory) network can directly output the hand in the target video segment, which improves the efficiency and accuracy of hand generation. Meanwhile, in the application, an operation command can be generated using only a feature extraction network and an LSTM network, which reduces the cost of training a plurality of network models.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting it. In the drawings:
FIG. 1 is a flow diagram illustrating a method of operation command generation in accordance with an exemplary embodiment;
FIG. 2 is a block diagram illustrating an LSTM network in accordance with an illustrative embodiment;
FIG. 3 is a flow diagram illustrating another method of operation command generation in accordance with an illustrative embodiment;
FIG. 4 is a block diagram illustrating an operation command generating apparatus according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses an operation command generation method, which improves the efficiency and accuracy of generating operation commands.
Referring to fig. 1, which shows a flowchart of an operation command generating method according to an exemplary embodiment, the method includes:
s101: acquiring a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
in this step, the training set is preprocessed, that is, the video data set is divided into a plurality of video segments according to the motion, and the operation command of each video segment is labeled.
As can be seen, in the present embodiment, drawing on methods from the field of video description (video captioning), the operation command a robot learns from a video is defined as a command sequence composed of four atomic commands (the hand of the operator, the subject object, the action, and the recipient object). This representation is highly concise and has a definite linguistic order, so the temporal relationship between the atomic commands is well defined, making it better suited to robot applications.
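Purely as an illustrative aid, a labeled video segment of the kind described here could be represented as follows in Python; the field names and file path are hypothetical, while the example command (LeftHand, Spatula, Stir, Bowl) matches the example shown in fig. 2.

```python
from dataclasses import dataclass

@dataclass
class LabeledSegment:
    video_path: str          # path to the video segment file (hypothetical)
    hand: str                # "LeftHand" or "RightHand"
    subject_object: str      # e.g. "Spatula"
    action: str              # e.g. "Stir"
    recipient_object: str    # e.g. "Bowl"

sample = LabeledSegment("segments/stir_0001.mp4",
                        "LeftHand", "Spatula", "Stir", "Bowl")
```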
S102: extracting RGB (red, green and blue) features and optical flow features of each video segment, and fusing the RGB features and the optical flow features to obtain fusion features;
the purpose of this step is to preprocess each video segment, i.e. to extract the RGB features and optical flow features of each video segment and fuse them into fused features. In this embodiment, the feature fusion mode may be vector stitching, that is, fusion is performed on the RGB features and the optical flow features to obtain fusion features, including: and carrying out vector splicing on the RGB features and the optical flow features to obtain the fusion features.
As a possible implementation, the step of extracting RGB features and optical flow features of each video segment may include: extracting RGB images and optical flow images from each video segment by using an OpenCV toolbox; extracting the RGB features of each of the video segments from each of the RGB images and the optical flow features of each of the video segments from each of the optical flow images using a dual-stream 3D convolutional neural network.
In a specific implementation, the OpenCV toolbox is used to extract RGB images and optical flow images from each video segment. The optical flow comprises optical flow in the x and y directions: the x-direction optical flow reflects the displacement change of the action in the horizontal direction, and the y-direction optical flow reflects the displacement change of the action in the vertical direction. The optical flow can be calculated using the TV-L1 method.
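An illustrative sketch of this extraction step using the opencv-python bindings is shown below; the TV-L1 optical flow creator lives in the optflow module of opencv-contrib-python, so that package, and the function and variable names, are assumptions of this sketch.

```python
import cv2

def extract_rgb_and_flow(video_path):
    """Read a video segment and return lists of RGB frames and x/y optical flow maps."""
    cap = cv2.VideoCapture(video_path)
    # TV-L1 dense optical flow, as named in the text; requires opencv-contrib-python.
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    rgb_frames, flow_maps = [], []
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None
    while ok:
        rgb_frames.append(cv2.cvtColor(prev, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = tvl1.calc(prev_gray, gray, None)   # flow[..., 0]: x direction, flow[..., 1]: y direction
        flow_maps.append(flow)
        prev, prev_gray = frame, gray
    cap.release()
    return rgb_frames, flow_maps
```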
S103: training the LSTM network based on the fusion features and the labeled operation commands corresponding to each video segment, so as to output the operation command corresponding to the target video segment by using the trained LSTM network.
In this step, an LSTM network is trained based on the fused features and the labeled operation commands corresponding to each video segment, and the trained LSTM network can output the operation command corresponding to the target video segment. The operation command can then be mapped into a robot application program and sent to a Baxter robot for execution.
In the encoding stage, the first LSTM layer takes the fused features of the target video segment as input and outputs a hidden encoder vector sequence He. When training the decoding stage, the labeled operation command is input into the second LSTM layer together with the hidden encoder vector sequence He, and the second LSTM layer converts the hidden encoder vector sequence He into a decoder vector sequence. The labeled operation command is represented as a vector for calculation by using one-hot coding (i.e., W in fig. 2). Finally, the softmax layer generates the final operation command, for example (LeftHand, Spatula, Stir, Bowl) in fig. 2. In the prediction stage, a start marker is fed to the second LSTM layer before decoding begins, each atomic command predicted by the softmax layer is fed back as the input of the next decoding step, and decoding ends when the end marker of the operation command is generated.
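To make the one-hot representation and the softmax training target concrete, here is a small illustrative fragment (PyTorch assumed, as in the earlier sketch; the vocabulary indices and the begin/end markers are placeholders of this sketch, not values defined by the application):

```python
import torch
import torch.nn.functional as F

# Hypothetical vocabulary of atomic commands.
vocab = {"<BOS>": 0, "<EOS>": 1, "LeftHand": 2, "Spatula": 3, "Stir": 4, "Bowl": 5}

# The labeled command (LeftHand, Spatula, Stir, Bowl) as one-hot vectors (W in fig. 2),
# preceded by a start marker on the decoder-input side and shifted by one step as target.
decoder_input = ["<BOS>", "LeftHand", "Spatula", "Stir"]
target        = ["LeftHand", "Spatula", "Stir", "Bowl"]

W = F.one_hot(torch.tensor([vocab[w] for w in decoder_input]),
              num_classes=len(vocab)).float()             # (4, vocab_size)
target_ids = torch.tensor([vocab[w] for w in target])

# Given decoder log-probabilities from the softmax layer, the training loss is the
# negative log-likelihood of the labeled command; random values stand in for the network output.
log_probs = torch.log_softmax(torch.randn(4, len(vocab)), dim=-1)
loss = F.nll_loss(log_probs, target_ids)
```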
In addition, regarding the feature extraction for the video segments, training the network with the optical flow features in addition to the RGB features gives a clear improvement in the accuracy of the operation commands obtained by the method provided in the embodiment of the application.
The embodiment of the application discloses an operation command generation method, and compared with the previous embodiment, the embodiment further explains and optimizes the technical scheme. Specifically, the method comprises the following steps:
referring to fig. 3, a flowchart of another operation command generating method according to an exemplary embodiment is shown, as shown in fig. 3, including:
s201: acquiring a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
s202: extracting RGB images and optical flow images from each video segment by using an opencv tool box;
s203: extracting the RGB features of each of the video segments from each of the RGB images and the optical flow features of each of the video segments from each of the optical flow images using a dual-stream 3D convolutional neural network;
s204: vector splicing is carried out on the RGB features and the optical flow features to obtain the fusion features;
and S205, training L the STM network based on the fusion characteristics and the labeled operation commands corresponding to each video segment, so as to output the operation commands corresponding to the target video segment by using the trained L STM network.
In this embodiment, the 3D convolutional neural network is used to fuse the spatial information features (RGB features) and the dynamic motion features (optical flow features) of the objects in the video segment, and a network with an encoder-decoder structure encodes these features and finally outputs a command that can be used directly by the robot application program. In this way, the robot can learn operation commands from video, which advances robot intelligent systems to a certain extent.
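As an illustration of the dual-stream idea, the sketch below uses two small 3D convolutional streams, one for RGB clips and one for x/y optical flow clips, and splices their outputs; the layer sizes and clip shapes are placeholders, since the application does not fix a particular 3D CNN architecture.

```python
import torch
import torch.nn as nn

class Stream3DCNN(nn.Module):
    """One stream of a dual-stream 3D CNN: a small stack of 3D convolutions
    that turns a clip (channels, frames, height, width) into a feature vector."""
    def __init__(self, in_channels, feat_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, clip):
        x = self.features(clip).flatten(1)
        return self.fc(x)

rgb_stream = Stream3DCNN(in_channels=3)     # RGB clips: 3 channels
flow_stream = Stream3DCNN(in_channels=2)    # optical flow clips: x and y channels

rgb_clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W), placeholder shape
flow_clip = torch.randn(1, 2, 16, 112, 112)

# Fuse the two streams by vector splicing (concatenation), as in step S204.
fused = torch.cat([rgb_stream(rgb_clip), flow_stream(flow_clip)], dim=1)
```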
Therefore, in this embodiment, the robot operation command can be output simply by using the dual-stream 3D convolutional neural network as a feature extractor to extract and fuse the visual features of the video, and by using the LSTM network to build a network model with an encoder-decoder structure that processes these features. This greatly reduces the workload of labeling data and the cost of training multiple models.
In the following, an operation command generating apparatus provided by an embodiment of the present application is introduced, and an operation command generating apparatus described below and an operation command generating method described above may be referred to each other.
Referring to fig. 4, which shows a block diagram of an operation command generating apparatus according to an exemplary embodiment, the apparatus includes:
an obtaining module 401, configured to obtain a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
an extraction module 402, configured to extract RGB features and optical flow features of each video segment, and fuse the RGB features and the optical flow features to obtain a fusion feature;
and a training module 403, configured to train the LSTM network based on the fusion features and the labeled operation command corresponding to each video segment, so as to output the operation command corresponding to the target video segment by using the trained LSTM network.
Meanwhile, in the embodiment of the application, the operation command can be generated only by the feature extraction network and the LSTM network, so that the cost for training a plurality of network models is reduced.
On the basis of the foregoing embodiment, as a preferred implementation, the extracting module 402 includes:
an extraction unit, configured to extract an RGB image and an optical flow image from each of the video segments using an OpenCV toolbox;
an extracting unit, configured to extract the RGB features of each video segment from each RGB image and extract the optical flow features of each video segment from each optical flow image by using a dual-stream 3D convolutional neural network;
and the fusion unit is used for fusing the RGB features and the optical flow features to obtain fusion features.
On the basis of the foregoing embodiment, as a preferred implementation manner, the fusion unit is specifically a unit that performs vector stitching on the RGB features and the optical flow features to obtain the fusion features.
On the basis of the above embodiment, as a preferred implementation, the trained LSTM network comprises a first LSTM layer, a second LSTM layer, and a softmax layer;
the input of the first LSTM layer comprises the fused features of the target video segment, and the output comprises a hidden encoder vector sequence;
the input of the second LSTM layer comprises the hidden encoder vector sequence, and the output comprises a decoder vector sequence;
the input of the softmax layer comprises the decoder vector sequence, and the output comprises the operation command corresponding to the target video segment.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present application further provides an electronic device. Referring to fig. 5, which shows a structural diagram of an electronic device 500 provided in an embodiment of the present application, the electronic device 500 may include a processor 11 and a memory 12. The electronic device 500 may also include one or more of a multimedia component 13, an input/output (I/O) interface 14, and a communication component 15.
The processor 11 is configured to control the overall operation of the electronic device 500 so as to complete all or part of the steps in the above-described operation command generating method. The memory 12 is used to store various types of data to support operation on the electronic device 500, such as instructions for any application or method operating on the electronic device 500 and application-related data such as contact data, messages, pictures, audio, and video. The memory 12 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 13 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 12 or transmitted via the communication component 15. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 14 provides an interface between the processor 11 and other interface modules, such as a keyboard, a mouse, or buttons, which may be virtual or physical. The communication component 15 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them, so the corresponding communication component 15 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the operation command generating method described above.
In another exemplary embodiment, there is also provided a computer-readable storage medium including program instructions which, when executed by a processor, implement the steps of the above-described operation command generation method. For example, the computer readable storage medium may be the memory 12 including the program instructions that are executable by the processor 11 of the electronic device 500 to perform the operation command generating method described above.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Claims (10)
1. An operation command generation method, comprising:
acquiring a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
extracting RGB (red, green and blue) features and optical flow features of each video segment, and fusing the RGB features and the optical flow features to obtain fusion features;
and training an LSTM network based on the fusion features and the labeled operation commands corresponding to each video segment, so as to output the operation command corresponding to the target video segment by using the trained LSTM network.
2. The method according to claim 1, wherein said extracting RGB features and optical flow features of each of said video segments comprises:
extracting RGB images and optical flow images from each video segment by using an OpenCV toolbox;
extracting the RGB features of each of the video segments from each of the RGB images and the optical flow features of each of the video segments from each of the optical flow images using a dual-stream 3D convolutional neural network.
3. The method for generating an operation command according to claim 1, wherein fusing the RGB features and the optical flow features to obtain fused features comprises:
and carrying out vector splicing on the RGB features and the optical flow features to obtain the fusion features.
4. The operation command generation method according to any one of claims 1 to 3, wherein the trained LSTM network includes a first LSTM layer, a second LSTM layer, and a softmax layer;
the input of the first LSTM layer comprises the fused features of the target video segment, and the output comprises a hidden encoder vector sequence;
the input of the second LSTM layer comprises the hidden encoder vector sequence, and the output comprises a decoder vector sequence;
the input of the softmax layer comprises the decoder vector sequence, and the output comprises the operation command corresponding to the target video segment.
5. An operation command generating apparatus, comprising:
the acquisition module is used for acquiring a training set; wherein the training set comprises a plurality of video segments labeled with operation commands, the operation commands comprising hands of an operator, a subject object, an action and a recipient object;
the extraction module is used for extracting RGB (red, green and blue) features and optical flow features of each video segment and fusing the RGB features and the optical flow features to obtain fused features;
and the training module is used for training an LSTM network based on the fusion features and the labeled operation commands corresponding to each video segment, so that the trained LSTM network is used for outputting the operation command corresponding to the target video segment.
6. The operation command generating apparatus according to claim 5, wherein the extracting module comprises:
an extraction unit, configured to extract an RGB image and an optical flow image from each of the video segments using an OpenCV toolbox;
an extracting unit, configured to extract the RGB features of each video segment from each RGB image and extract the optical flow features of each video segment from each optical flow image by using a dual-stream 3D convolutional neural network;
and the fusion unit is used for fusing the RGB features and the optical flow features to obtain fusion features.
7. The device according to claim 6, wherein the fusion means is specifically a means for vector-stitching the RGB features and the optical flow features to obtain the fusion features.
8. The operation command generation apparatus according to any one of claims 5 to 7, wherein the trained LSTM network includes a first LSTM layer, a second LSTM layer, and a softmax layer;
the input of the first LSTM layer comprises the fused features of the target video segment, and the output comprises a hidden encoder vector sequence;
the input of the second LSTM layer comprises the hidden encoder vector sequence, and the output comprises a decoder vector sequence;
the input of the softmax layer comprises the decoder vector sequence, and the output comprises the operation command corresponding to the target video segment.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the operation command generating method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the operation command generating method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265410.4A CN111476162A (en) | 2020-04-07 | 2020-04-07 | Operation command generation method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265410.4A CN111476162A (en) | 2020-04-07 | 2020-04-07 | Operation command generation method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111476162A true CN111476162A (en) | 2020-07-31 |
Family
ID=71750086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010265410.4A Pending CN111476162A (en) | 2020-04-07 | 2020-04-07 | Operation command generation method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111476162A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108648746A (en) * | 2018-05-15 | 2018-10-12 | 南京航空航天大学 | A kind of open field video natural language description generation method based on multi-modal Fusion Features |
WO2018191555A1 (en) * | 2017-04-14 | 2018-10-18 | Drishti Technologies. Inc | Deep learning system for real time analysis of manufacturing operations |
CN108685560A (en) * | 2017-04-12 | 2018-10-23 | 香港生物医学工程有限公司 | Automation steering and method for robotic endoscope |
CN109153123A (en) * | 2016-05-20 | 2019-01-04 | 谷歌有限责任公司 | The related machine learning method of the object of which movement in robot environment and device are predicted with the image based on captures object and based on the parameter for the future robot movement in environment |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109615019A (en) * | 2018-12-25 | 2019-04-12 | 吉林大学 | Anomaly detection method based on space-time autocoder |
CN109740419A (en) * | 2018-11-22 | 2019-05-10 | 东南大学 | A kind of video behavior recognition methods based on Attention-LSTM network |
CN110569773A (en) * | 2019-08-30 | 2019-12-13 | 江南大学 | Double-flow network behavior identification method based on space-time significance behavior attention |
CN110769985A (en) * | 2017-12-05 | 2020-02-07 | 谷歌有限责任公司 | Viewpoint-invariant visual servoing of a robot end effector using a recurrent neural network |
-
2020
- 2020-04-07 CN CN202010265410.4A patent/CN111476162A/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109153123A (en) * | 2016-05-20 | 2019-01-04 | 谷歌有限责任公司 | The related machine learning method of the object of which movement in robot environment and device are predicted with the image based on captures object and based on the parameter for the future robot movement in environment |
CN108685560A (en) * | 2017-04-12 | 2018-10-23 | 香港生物医学工程有限公司 | Automation steering and method for robotic endoscope |
WO2018191555A1 (en) * | 2017-04-14 | 2018-10-18 | Drishti Technologies. Inc | Deep learning system for real time analysis of manufacturing operations |
CN110769985A (en) * | 2017-12-05 | 2020-02-07 | 谷歌有限责任公司 | Viewpoint-invariant visual servoing of a robot end effector using a recurrent neural network |
CN108648746A (en) * | 2018-05-15 | 2018-10-12 | 南京航空航天大学 | A kind of open field video natural language description generation method based on multi-modal Fusion Features |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109740419A (en) * | 2018-11-22 | 2019-05-10 | 东南大学 | A kind of video behavior recognition methods based on Attention-LSTM network |
CN109615019A (en) * | 2018-12-25 | 2019-04-12 | 吉林大学 | Anomaly detection method based on space-time autocoder |
CN110569773A (en) * | 2019-08-30 | 2019-12-13 | 江南大学 | Double-flow network behavior identification method based on space-time significance behavior attention |
Non-Patent Citations (4)
Title |
---|
Feng Wei et al., "A Russian Word Phonetic Transcription System Based on TensorFlow", Journal of Computer Applications *
Huan Ruizhi, "Research and Implementation of Video Action Recognition Based on Long-Term Feature Fusion with an Attention Mechanism", China Master's Theses Full-Text Database, Information Science and Technology Series *
Dai Guoqiang et al., "Sci-Tech Big Data: Changed Because of You", 31 August 2018 *
Mao Zhiqiang et al., "Human Action Recognition Based on Spatio-Temporal Two-Stream Convolution and LSTM", Software *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7432556B2 (en) | Methods, devices, equipment and media for man-machine interaction | |
US11769018B2 (en) | System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system | |
CN115438176B (en) | Method and equipment for generating downstream task model and executing task | |
CN111914076B (en) | User image construction method, system, terminal and storage medium based on man-machine conversation | |
CN112541957B (en) | Animation generation method, device, electronic equipment and computer readable medium | |
CN115129848A (en) | Method, device, equipment and medium for processing visual question-answering task | |
CN114245203B (en) | Video editing method, device, equipment and medium based on script | |
CN109712108B (en) | Visual positioning method for generating network based on diversity discrimination candidate frame | |
KR20180054407A (en) | Apparatus for recognizing user emotion and method thereof, and robot system using the same | |
CN117765132A (en) | Image generation method, device, equipment and storage medium | |
CN112149642A (en) | Text image recognition method and device | |
CN113111812A (en) | Mouth action driving model training method and assembly | |
KR102360840B1 (en) | Method and apparatus for generating speech video of using a text | |
CN116528017A (en) | Digital human video generation method and device, electronic equipment and storage medium | |
CN113780326A (en) | Image processing method and device, storage medium and electronic equipment | |
CN110349577B (en) | Man-machine interaction method and device, storage medium and electronic equipment | |
CN115762484A (en) | Multimodal data fusion method, device, equipment and medium for voice recognition | |
CN117992579A (en) | Man-machine conversation method, conversation network model training method and device | |
KR20230065339A (en) | Model data processing method, device, electronic device and computer readable medium | |
CN111476162A (en) | Operation command generation method and device, electronic equipment and storage medium | |
KR102612625B1 (en) | Method and apparatus for learning key point of based neural network | |
CN116186310B (en) | AR space labeling and displaying method fused with AI general assistant | |
US20230316952A1 (en) | System and method for bidirectional automatic sign language translation and production | |
CN116193052A (en) | Video synthesis method, device, storage medium and electronic equipment | |
CN110555207A (en) | Sentence recognition method, sentence recognition device, machine equipment and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200731 |
RJ01 | Rejection of invention patent application after publication |