CN116311515A - Gesture recognition method, device, system and storage medium

Info

Publication number: CN116311515A
Application number: CN202310243959.7A
Authority: CN
Inventors: 时士柱; 张志伟; 张骁迪; 叶平; 王进
Original assignee: Rainbow Software Co ltd
Current assignee: Rainbow Software Co ltd
Priority date: 2023-03-08
Filing date: 2023-03-08
Publication date: 2023-06-23
Also published as: WO2024183653A1

Abstract

The invention provides a gesture recognition method, a device, a system and a storage medium, wherein the gesture recognition method comprises the steps of matching gesture features to be recognized with gesture features in a gesture feature library, and determining a first matching set; determining target gesture features in the first matching set, matching the target gesture features with gesture features in a gesture feature library, and determining a second matching set; determining a third matching set according to the first matching set and the second matching set; and matching the gesture features to be recognized with the gesture features in the third matching set, and determining a gesture recognition result of the gesture features to be recognized. Through the multiple matching process, missing recognition of gesture features to be recognized is avoided, and gesture recognition accuracy is improved.

Description

Gesture recognition method, device, system and storage medium

Technical Field

The present disclosure relates to image recognition technology, and in particular, to a gesture recognition method, device, system, and storage medium.

Background

With the development of artificial intelligence technology, gesture recognition, as an emerging man-machine interaction technology, has been applied in more and more fields. Gesture recognition frees the user from the constraints of physical input devices. For example, the user can use the photographing function and the screen sliding function of a mobile phone through gestures; the user can control intelligent household appliances such as electric lamps, curtains and air conditioners through gestures; and the user can open and close the windows of an automobile, adjust the playback volume and the like through gestures. By virtue of its distinctive advantages of contactless and direct operation, gesture recognition brings great convenience to users and is changing the traditional mode of man-machine interaction.

The existing gesture recognition technology can collect hand images of users, determine interaction instructions by recognizing gestures in the hand images, and then execute the corresponding operations. However, the gesture recognition results obtained by the related technology have low precision, so that responses are delayed or errors occur.

At present, no effective solution has been proposed for the problem of low accuracy of the gesture recognition results obtained in the related technology.

Disclosure of Invention

The application provides a gesture recognition method, device, system and storage medium, which can improve gesture recognition precision.

The application provides a gesture recognition method, which comprises the following steps:

matching gesture features to be identified with gesture features in a gesture feature library, and determining a first matching set;

determining target gesture features in the first matching set, matching the target gesture features with gesture features in the gesture feature library, and determining a second matching set;

determining a third matching set according to the first matching set and the second matching set;

and matching the gesture features to be recognized with the gesture features in the third matching set, and determining a gesture recognition result of the gesture features to be recognized.

In some of these embodiments, determining a third set of matches from the first set of matches and the second set of matches comprises:

determining an intersection feature quantity according to the intersection of the first matching set and the second matching set;

and determining the third matching set according to the first matching set and the second matching set under the condition that the intersection feature quantity meets a preset set quantity threshold.

In some of these embodiments, determining the third matching set according to the first matching set and the second matching set comprises:

taking the union of the first matching set and the second matching set as the third matching set.

In some embodiments, the first matching set is determined by a similarity between the gesture feature to be recognized and a gesture feature in the gesture feature library and a preset first matching parameter, and/or the second matching set is determined by a similarity between the target gesture feature and a gesture feature in the gesture feature library and a preset second matching parameter.

In some of these embodiments, the first matching parameter comprises a first number parameter and the second matching parameter comprises a second number parameter, the second number parameter being smaller than the first number parameter.

In some embodiments, the similarity between the gesture feature to be identified and the gesture features in the gesture feature library is calculated by a k-reciprocal nearest neighbor algorithm.

In some of these embodiments, in the case where the similarities for the first matching set and the second matching set are each calculated by a k-reciprocal nearest neighbor algorithm, the k value used to calculate the second matching set is smaller than the k value used to calculate the first matching set.

In some of these embodiments, matching the gesture features to be identified with gesture features in the gesture feature library, determining the first matching set includes:

performing initial matching on the gesture features to be recognized and gesture features in the gesture feature library, and adding the gesture features to be recognized to the gesture feature library under the condition that the initial matching is successful to obtain an expanded gesture feature library;

and matching the gesture features to be identified with the gesture features in the expanded gesture feature library again to obtain the first matching set.

In some of these embodiments, the method further comprises:

the gesture feature library comprises multiple types of gesture features, and in the case where the number of gesture features of a target type is larger than a preset storage number, the plurality of gesture features of the target type are screened.

In some of these embodiments, screening the plurality of gesture features of the target type includes:

and screening the gesture features of the target type according to a preset similarity threshold and/or the acquisition time of each gesture feature in the gesture feature library.

In some embodiments, the method for acquiring the gesture feature to be identified includes:

acquiring appearance characteristics and skeleton point characteristics of gestures in an image to be detected;

and jointly determining the gesture feature to be recognized according to the appearance feature and the skeleton point feature.

In some of these embodiments, the skeletal point feature is a 2.5D feature.

In some of these embodiments, the method further comprises:

acquiring registration gesture features in a registration image;

and registering the registration gesture features according to the registration gesture features and gesture functions corresponding to the registration gesture features.

The application provides a gesture recognition device, the device includes first determination module, second determination module, third determination module and recognition module:

the first determining module is used for matching gesture features to be identified with gesture features in the gesture feature library to determine a first matching set;

The second determining module is configured to determine a target gesture feature in the first matching set, match the target gesture feature with a gesture feature in the gesture feature library, and determine a second matching set;

the third determining module is used for determining a third matching set according to the first matching set and the second matching set;

the recognition module is configured to match the gesture feature to be identified with the gesture features in the third matching set, and determine a gesture recognition result of the gesture feature to be identified.

The application provides a gesture recognition system, including image acquisition device and gesture recognition device:

the image acquisition device is used for acquiring gesture characteristics to be identified;

the gesture recognition device is used for matching gesture features to be recognized with gesture features in the gesture feature library to determine a first matching set; determining target gesture features in the first matching set, matching the target gesture features with gesture features in the gesture feature library, and determining a second matching set; determining a third matching set according to the first matching set and the second matching set; and matching the gesture features to be recognized with the gesture features in the third matching set, and determining a gesture recognition result of the gesture features to be recognized.

The present application provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement any of the gesture recognition methods described above.

Compared with the related art, the application comprises the following steps: matching gesture features to be identified with gesture features in a gesture feature library, and determining a first matching set; determining target gesture features in the first matching set, matching the target gesture features with gesture features in a gesture feature library, and determining a second matching set; determining a third matching set according to the first matching set and the second matching set; and matching the gesture features to be recognized with the gesture features in the third matching set, and determining a gesture recognition result of the gesture features to be recognized. Through the multiple matching process, missing recognition of gesture features to be recognized is avoided, and gesture recognition accuracy is improved.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. Other advantages of the present application may be realized and attained by the structure particularly pointed out in the written description and drawings.

Drawings

The accompanying drawings are included to provide an understanding of the technical solutions of the present application and constitute a part of this specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation thereof.

FIG. 1 is a flow chart of a gesture recognition method according to an embodiment of the present application;

FIG. 2 is a flow chart of a method of determining a third set of matches according to an embodiment of the present application;

FIG. 3 is a flowchart of a method of acquiring gesture features to be recognized according to an embodiment of the present application;

FIG. 4 is a flow chart of a gesture registration method according to an embodiment of the present application;

FIG. 5 is a flow chart of a network training method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of augmenting a library of hand features according to a preferred embodiment of the present application;

FIG. 7 is a schematic diagram of a matching process according to a preferred embodiment of the present application;

FIG. 8 is a block diagram of a gesture recognition apparatus according to an embodiment of the present application.

Detailed Description

The present application describes a number of embodiments, but the description is illustrative and not limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or in place of any other feature or element of any other embodiment unless specifically limited.

The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements of the present disclosure may also be combined with any conventional features or elements to form a unique inventive arrangement as defined in the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive arrangements to form another unique inventive arrangement as defined in the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Further, various modifications and changes may be made within the scope of the appended claims.

Furthermore, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other sequences of steps are possible as will be appreciated by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.

The embodiment of the application provides a gesture recognition method, as shown in fig. 1, comprising the following steps:

step S101, matching gesture features to be recognized with gesture features in a gesture feature library, and determining a first matching set.

In this embodiment, the gesture features to be recognized may be obtained directly or detected from an acquired image. The gesture features to be recognized may correspond to various control gestures, for example, gestures used to control a mobile phone, a vehicle or a smart home, or used in scenes such as Augmented Reality (AR) and Virtual Reality (VR).

The gesture feature library can be a preset set of various gesture features, the number of each type of gesture features can be one or more, and the process of matching the gesture features to be recognized with the gesture features in the gesture feature library can be realized through feature point detection and/or similarity calculation.

In order to avoid missing recognition of the gesture features to be recognized during the recognition process, in this embodiment a plurality of gesture features are selected from the gesture feature library to form a set, which serves as the recognition result for the gesture features to be recognized. The selection criterion may be a preset parameter such as similarity or acquisition time.

Step S102, determining target gesture features in the first matching set, matching the target gesture features with gesture features in the gesture feature library, and determining a second matching set.

In order to avoid missing recognition, in this embodiment, one or more gesture features in the first matching set are sequentially used as target gesture features, and are matched with gesture features in the gesture feature library again, and if a plurality of gesture features are used as target gesture features, a plurality of second matching sets corresponding to the target gesture features are obtained.

In this embodiment, the number of target gesture features is not limited, and may be one or several gesture features selected randomly in the first matching set, or one or several gesture features selected according to a preset rule, and preferably, all gesture features in the first matching set may be sequentially used as target gesture features.

Step S103, determining a third matching set according to the first matching set and the second matching set.

After the first matching set and the second matching set are obtained, a third matching set may be determined according to the gesture features in the two sets. For example, the third matching set may be determined according to the union or intersection of the first matching set and the second matching set, or a part of the gesture features may be selected to form the third matching set according to the similarity between the gesture features in the two sets and the gesture features to be recognized.

In the case where there is a single target gesture feature, the third matching set is determined directly from the first matching set and the second matching set.

In the case that there are a plurality of target gesture features, there are a plurality of second matching sets corresponding to each target gesture feature, at this time, the third matching set may be determined directly according to the first matching set and all the second matching sets together, or the second matching sets and the first matching sets may be first compared in a one-to-one manner, so as to obtain a plurality of comparison results, and then the third matching set may be determined according to the plurality of comparison results together.

Step S104, matching the gesture features to be recognized with the gesture features in the third matching set, and determining a gesture recognition result of the gesture features to be recognized.

The third matching set is taken as the final set against which the gesture features to be recognized are matched, so as to obtain the gesture recognition result of the gesture features to be recognized. Specifically, the gesture recognition result in this embodiment is the control function corresponding to the gesture features to be recognized.

It should be noted that, in this embodiment, the implementation of each matching stage is not limited. In terms of matching precision, the process of acquiring the first matching set and the second matching set needs to cover as many similar gestures as possible, so the matching precision can be set lower, and the matching algorithm used can be based on cosine distance, Euclidean distance, Mahalanobis distance and the like. In the process of finally determining the gesture recognition result of the gesture feature to be recognized, the recognition accuracy needs to be improved, so the matching precision can be set higher, and the matching algorithm used can be an algorithm with higher precision, such as one based on Jaccard distance.
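The distance measures named above can be illustrated with a minimal sketch. The function names are hypothetical, and applying the Jaccard distance to sets of neighbor identifiers is an assumption made here for illustration; the embodiment does not specify what the Jaccard distance is computed over. Mahalanobis distance is omitted because it additionally requires a covariance estimate.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on direction, not on magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)  # assumes neither vector is all zeros
    return 1.0 - dot / norm

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to both direction and magnitude."""
    return math.dist(a, b)

def jaccard_distance(ids_a, ids_b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of identifiers (assumed usage)."""
    ids_a, ids_b = set(ids_a), set(ids_b)
    union = ids_a | ids_b
    if not union:
        return 0.0
    return 1.0 - len(ids_a & ids_b) / len(union)

# Two vectors with the same direction but different magnitude.
u, v = [1.0, 2.0], [2.0, 4.0]
same_direction = cosine_distance(u, v)      # approximately 0
different_length = euclidean_distance(u, v)  # clearly non-zero
```

The example pair shows why the choice of measure matters: the cosine distance treats the two vectors as identical, while the Euclidean distance does not.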

Through the steps S101 to S104, the present embodiment improves the recall rate of the gesture recognition process, avoids missing recognition of the gesture feature to be recognized, and improves the accuracy of gesture recognition based on the multiple matching process between the gesture feature to be recognized and the gesture feature in the gesture feature library.
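Steps S101 to S104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: cosine similarity is used for every stage, every feature of the first matching set is used as a target gesture feature, the third matching set is formed as a plain union, and a target feature is excluded from its own second matching. The names `top_k`, `recognize` and the library layout (identifier mapped to a feature vector and a control function) are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.hypot(*a), math.hypot(*b)
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, library, k, exclude=None):
    """Ids of the k library features most similar to `query`."""
    scored = [(cosine(query, feat), gid)
              for gid, (feat, _) in library.items() if gid != exclude]
    scored.sort(key=lambda s: (-s[0], s[1]))  # high similarity first, id breaks ties
    return [gid for _, gid in scored[:k]]

def recognize(query, library, k1=3, k2=2):
    first = set(top_k(query, library, k1))                 # S101: first matching set
    third = set(first)
    for target in sorted(first):                           # S102: each target feature
        second = set(top_k(library[target][0], library, k2, exclude=target))
        third |= second                                    # S103: union variant
    best = max(third, key=lambda gid: (cosine(query, library[gid][0]), gid))
    return library[best][1], third                         # S104: control function

library = {
    "open_1": ([1.0, 0.0], "open"),
    "open_2": ([0.9, 0.1], "open"),
    "vol_1": ([0.0, 1.0], "volume"),
    "vol_2": ([0.1, 0.9], "volume"),
}
function, candidates = recognize([0.95, 0.05], library, k1=1, k2=1)
```

With `k1=1` the first matching set holds only `open_1`; the second matching recovers `open_2`, which a single pass would have missed.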

In some of these embodiments, fig. 2 is a flowchart of a method for determining a third matching set according to an embodiment of the present application, as shown in fig. 2, the method including the steps of:

Step S201, determining the number of intersection features according to the intersection of the first matching set and the second matching set.

The larger the number of intersection features of the first matching set and the second matching set, the more similar the gesture features in the two sets are, which helps improve the accuracy of the final matching process.

Step S202, determining a third matching set according to the first matching set and the second matching set in the case where the number of intersection features meets the preset set number threshold.

In the case where the number of intersection features does not meet the preset set number threshold, this indicates that there are no gesture features that are sufficiently similar to each other; in this case, gesture recognition can be performed directly according to the gesture features in the first matching set.

Under the condition that a plurality of second matching sets exist, comparing each second matching set with the first matching set in sequence, confirming the feature quantity of the intersection, and updating the third matching set according to the comparison result of each second matching set until all the second matching sets are traversed.

Through the step S201 and the step S202, it can be ensured that the gesture features in the third matching set meet a higher similarity requirement while avoiding missing recognition, which is beneficial to improving the accuracy of the gesture recognition process.

Further, the union of the first matching set and the second matching set can be directly used as the third matching set, or the union can be used as the third matching set only in the case where the number of intersection features meets the preset set number threshold, so that missing recognition of gesture features in the matching process is avoided and the recall rate of the gesture features is improved. The preset set number threshold may be set according to the requirements of the scene. Preferably, it may be set according to the number of gesture features in the second matching set, for example as a proportion of that number, and the proportion may be set according to the accuracy requirement, for example 1/2 or 1/3.
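The determination of the third matching set described in steps S201 and S202 can be sketched as below. The function name `build_third_set` and the `ratio` parameter are hypothetical, and "meets the threshold" is assumed here to mean "greater than or equal to".

```python
def build_third_set(first, seconds, ratio=0.5):
    """Merge each second matching set into the result when its overlap with
    the first matching set reaches `ratio` of that second set's size."""
    third = set(first)
    for second in seconds:
        overlap = len(first & second)          # S201: number of intersection features
        threshold = ratio * len(second)        # threshold as a proportion of |second|
        if overlap >= threshold:               # S202: threshold met, take the union
            third |= second
    return third                               # equals `first` if nothing qualified

first = {"a", "b", "c"}
seconds = [{"b", "c", "d"}, {"x", "y"}]
third = build_third_set(first, seconds, ratio=0.5)
```

In this example the first second matching set shares two of its three features with the first matching set and is merged, while the second shares none and is ignored. When no second matching set qualifies, the result falls back to the first matching set, in line with the fallback described above.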

In some of these embodiments, the first and second sets of matches are derived based on similarities between gesture features and matching parameters. Specifically, the first matching set is determined through similarity between gesture features to be recognized and gesture features in the gesture feature library and a preset first matching parameter, and/or the second matching set is determined through similarity between target gesture features and gesture features in the gesture feature library and a preset second matching parameter. The first matching parameter and the second matching parameter are both matching parameters for filtering gesture features, and preferably, the matching parameters may include multiple types of parameters, for example, a preset similarity threshold, a time for acquiring gesture features, the number of gesture features in the set, and so on. During screening, gesture features with high similarity can be selected preferentially, and each gesture feature can be comprehensively judged according to preset weights to determine whether the gesture feature is added into a corresponding set. In the process of determining the first matching set and the second matching set, accuracy of the gesture feature matching result can be improved through similarity and preset matching parameters.

In some of these embodiments, the matching parameters include a number parameter, in particular, the first matching parameter includes a first number parameter, and the second matching parameter includes a second number parameter, which is required to be smaller than the first number parameter during the matching calculation. Based on the limitation, under the condition that the requirement of similarity between gesture features is met, the number of gesture features in the second matching set is smaller than that of gesture features in the first matching set, so that the recall rate is improved, and meanwhile, the accuracy of a gesture feature matching result in the second matching process is improved. Preferably, when the first matching set and the second matching set are formed, gesture features are screened according to the similarity requirement, and when the number of gesture features meeting the similarity threshold is larger than the number parameter, gesture features are screened from high to low according to the similarity, so that the number of gesture features in the first matching set and the second matching set meets the requirement.
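The combination of a similarity threshold and a number parameter can be sketched as follows. The function name `select_matches` and the example values are hypothetical; the similarities are assumed to have been computed beforehand.

```python
def select_matches(similarities, threshold, max_count):
    """Keep features whose similarity reaches `threshold`; if more than
    `max_count` remain, keep those with the highest similarity."""
    passing = [(score, gid) for gid, score in similarities.items()
               if score >= threshold]
    passing.sort(key=lambda p: (-p[0], p[1]))  # high to low similarity
    return [gid for _, gid in passing[:max_count]]

similarities = {"a": 0.95, "b": 0.90, "c": 0.70, "d": 0.85}
first_set = select_matches(similarities, threshold=0.8, max_count=3)
second_set = select_matches(similarities, threshold=0.8, max_count=2)
```

Using a smaller number parameter for the second matching set, as in the last line, yields a smaller and stricter set than the first.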

If two gesture features A and B are similar, B should be within the first k neighbors of A, and conversely A should be within the first k neighbors of B. However, if two gesture features C and D are dissimilar, then even if C is within the first k neighbors of D, D will not be within the first k neighbors of C. Preferably, therefore, the similarity between the gesture features to be recognized and the gesture features in the gesture feature library is calculated by a k-reciprocal nearest neighbor (abbreviated as k-rnn) algorithm. Cosine similarity considers only the direction of a vector and ignores its magnitude, so its matching criterion is single; by contrast, the k-reciprocal nearest neighbor algorithm, through multiple matching calculations, can effectively improve the recall rate of gesture features while maintaining recognition accuracy, thereby improving the recognition of the current gesture features and the accuracy of gesture matching.

Further, in the case where the similarities for the first matching set and the second matching set are each calculated by a k-reciprocal nearest neighbor algorithm, the k value used for calculating the second matching set is smaller than the k value used for calculating the first matching set. For example, if the value of k is 20 when calculating the first matching set, the value of k may be 10 when calculating the second matching set. In the process of calculating the second matching set, the k value is reduced in order to reduce the amount of calculation while ensuring the accuracy of gesture recognition.
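The reciprocity condition can be made concrete with a small sketch. Euclidean distance and the helper names `knn` and `k_reciprocal` are assumptions chosen for illustration; the embodiment does not fix the underlying distance.

```python
import math

def knn(idx, feats, k):
    """Indices of the k nearest neighbors of feats[idx], excluding itself."""
    others = [j for j in range(len(feats)) if j != idx]
    others.sort(key=lambda j: (math.dist(feats[idx], feats[j]), j))
    return set(others[:k])

def k_reciprocal(idx, feats, k):
    """Neighbors j of idx such that idx is also among the k neighbors of j."""
    return {j for j in knn(idx, feats, k) if idx in knn(j, feats, k)}

# Feature 0 is an outlier; features 1 to 3 form a tight cluster.
feats = [(0.0,), (1.0,), (1.1,), (1.2,)]
plain = knn(0, feats, k=2)               # the outlier still gets two "neighbors"
mutual = k_reciprocal(0, feats, k=2)     # none of them considers it a neighbor
cluster = k_reciprocal(1, feats, k=2)    # cluster members confirm each other
```

The outlier at index 0 has nearest neighbors, but it does not appear in their neighbor lists, so its k-reciprocal set is empty. In the embodiment, a larger k (for example 20) would be used for the first matching set and a smaller k (for example 10) for the second.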

In some embodiments, the number of gesture features in the gesture feature library is small, and there may even be only one gesture feature that can be matched with the gesture to be recognized, which makes matching difficult. In this case, the gesture feature library needs to be expanded, and the gesture features are then matched based on the expanded gesture feature library. Specifically, the gesture features to be recognized are initially matched with the gesture features in the gesture feature library, and in the case where the initial matching is successful, the gesture features to be recognized are added to the gesture feature library to obtain the expanded gesture feature library. The initial matching can be realized by a basic similarity calculation method based on Euclidean distance, cosine distance, Mahalanobis distance and the like.

In this embodiment, the initial matching is rough recognition, so as to determine whether the gesture to be recognized can correspond to the gesture feature in the gesture feature library, and if the initial matching is unsuccessful, the subsequent matching process is not performed, and the gesture feature to be recognized is not added into the gesture feature library.

After the extended gesture feature library is obtained, the gesture features to be recognized are matched with the gesture features in the extended gesture feature library again, and a first matching set is obtained. The matching result at this time not only contains the original gesture features in the gesture feature library, but also includes the gesture features to be recognized.

After multiple times of recognition, compared with a gesture feature library obtained during registration, the extended gesture feature library has rich gesture features, and is beneficial to avoiding missing recognition of gesture features to be recognized.

Preferably, the k-reciprocal nearest neighbor algorithm is usually one-to-many, whereas in the initial state the pre-registered gesture feature library contains only one gesture feature corresponding to the gesture feature to be recognized. Therefore, when the k-reciprocal nearest neighbor algorithm is adopted for gesture feature recognition, the initial matching can be realized through Mahalanobis distance calculation to obtain the expanded gesture feature library.
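The initial matching and library expansion can be sketched as below. Several choices are assumptions made for illustration only: Euclidean distance stands in for the initial matching measure, the added feature is stored under the gesture type of its nearest library entry, and the names `expand_and_match` and `max_initial_distance` are hypothetical.

```python
import math

def expand_and_match(query, library, max_initial_distance=0.5, k=3):
    """library: list of (feature, gesture_type) pairs, modified in place.
    Returns the first matching set, or None if the initial matching fails."""
    # Coarse initial matching against the current library.
    nearest_feature, nearest_type = min(
        library, key=lambda entry: math.dist(query, entry[0]))
    if math.dist(query, nearest_feature) > max_initial_distance:
        return None  # no further matching; the library stays unchanged
    # Initial matching succeeded: expand the library with the query feature.
    library.append((list(query), nearest_type))
    # Match again against the expanded library to obtain the first matching set.
    ranked = sorted(library, key=lambda entry: math.dist(query, entry[0]))
    return ranked[:k]

library = [([0.0, 0.0], "open"), ([5.0, 5.0], "volume")]
first_set = expand_and_match([0.1, 0.0], library, k=2)
```

After the call, the library holds three entries and the first matching set contains both the newly added feature and the original "open" feature, which matches the description that the result includes the gesture feature to be recognized itself.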

In some embodiments, the gesture feature library includes multiple types of gesture features. When the number of gesture features of a target type is greater than a preset storage number, the plurality of gesture features of the target type may be screened to reduce the data amount of gesture features of the same type and the amount of calculation in the matching process, so as to improve the accuracy of gesture feature matching. Different types of gesture features correspond to different preset storage numbers, and the preset storage numbers can be set by the user and modified when needed. Specifically, the gesture features in the gesture feature library correspond to various functions, for example, turning a control target on or off, adjusting the volume, taking a photograph or controlling a flash lamp. After the gesture feature library is expanded, there are multiple gesture features of each type, and if there are too many gesture features of one type, they can be screened. In this embodiment, the gesture features of the target type may be gesture features of any type, and the screening conditions for the various types of gesture features are preset values that may be the same or different. For example, for gesture features corresponding to important functions, the stored number may be greater and the similarity requirement may be higher.

Preferably, in the screening process, the plurality of gesture features of the target type can be screened according to a preset similarity threshold. Specifically, gesture features below the similarity threshold among the plurality of gesture features of the target type are deleted. Further, the plurality of gesture features of the target type may be screened according to the acquisition time of each gesture feature in the gesture feature library; specifically, a number of the most recently stored gesture features are retained. When both the similarity threshold and the acquisition time are used for screening, several approaches are possible. Weights can be set for the similarity and the acquisition time respectively, each gesture feature is evaluated according to its similarity, the first weight corresponding to the similarity, its acquisition time and the second weight corresponding to the acquisition time, and the gesture features with the highest evaluations are retained and stored. Alternatively, the gesture features can first be ranked from high to low by similarity, and the top-ranked gesture features are then selected according to acquisition time. Alternatively, the gesture features can first be ranked from most recent to oldest by acquisition time, and a number of gesture features with higher similarity are then selected. It should be noted that the gesture features remaining after screening may also be required to meet a preset similarity threshold and/or acquisition time requirement, so as to ensure the accuracy of the gesture feature matching process.
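The weighted screening variant can be sketched as follows. The record layout (`similarity`, `acquired_at`), the normalization of acquisition time to a recency value between 0 and 1, and the weight values are all assumptions for illustration; the embodiment does not specify how the two quantities are combined.

```python
def screen(features, max_count, w_sim=0.7, w_time=0.3, min_similarity=0.0):
    """Reduce one gesture type's features to at most `max_count` entries.
    Each feature is a dict with 'similarity' in [0, 1] and a numeric
    'acquired_at' timestamp (larger means more recent)."""
    if len(features) <= max_count:
        return list(features)                   # storage number not exceeded
    kept = [f for f in features if f["similarity"] >= min_similarity]
    if not kept:
        return []
    times = [f["acquired_at"] for f in kept]
    t_min = min(times)
    span = (max(times) - t_min) or 1.0          # avoid division by zero

    def score(f):
        recency = (f["acquired_at"] - t_min) / span
        return w_sim * f["similarity"] + w_time * recency

    return sorted(kept, key=score, reverse=True)[:max_count]

features = [
    {"id": "a", "similarity": 0.90, "acquired_at": 100},
    {"id": "b", "similarity": 0.60, "acquired_at": 300},
    {"id": "c", "similarity": 0.85, "acquired_at": 200},
    {"id": "d", "similarity": 0.40, "acquired_at": 400},
]
kept = screen(features, max_count=2, min_similarity=0.5)
```

Feature "d" is removed by the similarity threshold despite being the most recent, and the remaining features are ranked by the weighted evaluation, so the older but highly similar "a" loses out to the more recent "c" and "b" under these example weights.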

In some of these embodiments, fig. 3 is a flowchart of a method for acquiring gesture features to be recognized according to an embodiment of the present application, as shown in fig. 3, the method includes the steps of:

Step S301, acquiring the appearance features and skeleton point features of the gesture in the image to be detected.

Step S302, determining gesture features to be recognized according to the appearance features and the skeleton point features.

In this embodiment, the gesture features to be recognized are obtained from both the appearance features and the skeleton point features in the image to be detected. Specifically, after the image to be detected is obtained, the appearance features and skeleton point features of the gesture are extracted from it through a convolutional neural network, the two are fused, and the fused features are sent into a convolutional neural network for subsequent processing. The input size of the image to be detected is (W, H, C), where W and H represent the width and height of the image area of the human hand and C is the number of channels of the hand image. The input size of the human hand skeleton points is, for example, (1, 21, 3), where 1 is an expanded dimension that can be understood as one hand, 21 is the number of joint points, and 3 corresponds to the (x, y, z) coordinates of each joint point.

Preferably, an image to be detected containing a gesture is input into a first neural network in a gesture recognition network to obtain the appearance features of the gesture; the hand skeleton point data of the image to be detected are input into a second neural network in the gesture recognition network to obtain the skeleton point features; and, after fusion, the fused appearance and skeleton point features are input into a third neural network in the gesture recognition network to obtain the comprehensive gesture features to be recognized. Feature fusion can be performed by a concatenation (splicing) operation on the appearance features and the skeleton point features along the channel dimension. Preferably, the third neural network can also adjust the appearance features and the skeleton point features so that their data sizes are the same, and, to reduce the amount of calculation, it can perform dimension reduction on the gesture features to be recognized before performing feature extraction.
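Concatenation along the channel dimension can be illustrated with a short NumPy sketch. The feature-map shapes (a 7x7 map with 64 appearance channels and 16 skeleton channels) and the function name `fuse_features` are assumptions chosen for the example; the application does not specify them.

```python
import numpy as np


def fuse_features(appearance, skeleton):
    """Concatenate two feature maps along the channel (last) axis."""
    if appearance.shape[:-1] != skeleton.shape[:-1]:
        # The non-channel dimensions must already agree before splicing.
        raise ValueError("spatial dimensions must match before concatenation")
    return np.concatenate([appearance, skeleton], axis=-1)


# Hypothetical outputs of the first and second networks.
appearance = np.zeros((7, 7, 64), dtype=np.float32)
skeleton = np.ones((7, 7, 16), dtype=np.float32)
fused = fuse_features(appearance, skeleton)  # shape (7, 7, 80)
```

The fused map keeps every appearance channel followed by every skeleton channel, which is why the two inputs must first be brought to the same size in the other dimensions.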

Compared with schemes that use only the 2D features and angle information of the skeleton points, steps S301 and S302 consider the skeleton point features and the appearance features of the gesture at the same time, which effectively avoids false recognition during search and matching and improves matching precision.

In the present application, the skeleton point feature may be a 3D feature or a 2.5D feature; to improve the detection accuracy of the skeleton point feature, it is preferably a 2.5D feature. In the (x, y, z) coordinates of a skeleton point, z represents the depth of the skeleton point relative to a selected root node, for example the wrist point, so that the relative depth between skeleton points is reflected and the coordinate information of the skeleton point features is effectively enriched.
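The root-relative depth can be illustrated with a small sketch. It assumes the joints are given as (x, y, z) tuples with an absolute depth z, and that the wrist is stored at index 0; both are conventions chosen for the example, and the function name `to_root_relative_depth` is hypothetical.

```python
def to_root_relative_depth(joints, root_index=0):
    """Return joints whose z is expressed relative to the root joint.

    joints: list of (x, y, z) tuples, e.g. the 21 hand joints.
    root_index: index of the root joint (the wrist at index 0 is assumed).
    """
    root_z = joints[root_index][2]
    # x and y are left unchanged; only the depth becomes root-relative.
    return [(x, y, z - root_z) for (x, y, z) in joints]


# Three illustrative joints; the first is treated as the wrist.
joints = [(10.0, 20.0, 0.5), (12.0, 18.0, 0.75), (15.0, 15.0, 0.25)]
relative = to_root_relative_depth(joints)
```

After the conversion the root has depth 0, and every other joint carries only its depth offset from the root.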

In the prior art, most gestures which can be identified by a gesture identification technology are predefined, and users can only use the predefined gestures, so that the expandability is poor; in addition, the user needs a certain learning cost when using the predefined gestures, especially some gestures with high difficulty, which are not necessarily realized by the user, and the operability is poor.

In order to solve the above technical problem, the embodiment of the present application further provides a gesture registration method, and fig. 4 is a gesture registration method according to an embodiment of the present application, where the method includes the following steps:

Step S401, acquiring registration gesture features in a registration image.

The registered image is an image comprising a gesture of a user, and the user can upload the existing image as the registered image, and can shoot the registered image in real time through an image acquisition device of the gesture recognition system. The gesture can be customized by the user according to the scene requirement.

Step S402, registering the registration gesture features according to the registration gesture features and gesture functions corresponding to the registration gesture features.

After the registration gesture features are acquired, user-defined gesture functions such as switch control and volume adjustment can be acquired. The registration gesture features are then associated with the corresponding gesture functions to complete gesture registration, so that the user-defined gesture can subsequently be recognized and the corresponding function executed directly, which improves the user's experience of gesture recognition.

Through steps S401 and S402, the user may register a user-defined gesture of interest in the preset gesture feature library, which expands the types of gestures that can be recognized; when the user subsequently makes the gesture, it can be matched with the registered gesture and the corresponding function instruction executed.

Specifically, the method for acquiring the registration gesture features comprises the following steps: acquiring appearance characteristics and skeleton point characteristics of gestures in a registered image; inputting the appearance characteristics and the skeleton point characteristics into a trained gesture recognition network to obtain registration gesture characteristics.

Further, when gesture registration is performed, an identity ID can be allocated to the registered gesture feature, so that the registration gesture feature is convenient to search and retrieve.
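A registration store of this kind might look like the following sketch. The class `GestureRegistry`, its method names, the in-memory dictionary layout, and the sequential integer IDs are all hypothetical; the application only states that an ID is allocated and that features are associated with gesture functions.

```python
import itertools


class GestureRegistry:
    """In-memory gesture feature library keyed by gesture ID (illustrative)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        # gesture_id -> {"function": str, "features": [feature vectors]}
        self.entries = {}

    def register(self, feature, function):
        """Store a registration gesture feature with its function; return the ID."""
        gesture_id = next(self._next_id)
        self.entries[gesture_id] = {"function": function, "features": [list(feature)]}
        return gesture_id

    def function_of(self, gesture_id):
        """Look up the function bound to a registered gesture."""
        return self.entries[gesture_id]["function"]


registry = GestureRegistry()
volume_id = registry.register([0.1, 0.9], "volume_up")
light_id = registry.register([0.8, 0.2], "toggle_light")
```

Keeping a list of features per ID leaves room for the library to store several features under the same gesture ID later on.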

Based on the gesture recognition method, the embodiment of the present application further provides a network training method, which is used to implement the gesture recognition process, as shown in fig. 5, and the method includes:

Step S501, obtaining a sample pair, the sample pair being one or more of a positive sample pair and a negative sample pair.

The positive sample pair comprises an anchor sample and a positive sample, the negative sample pair comprises the anchor sample and a negative sample, each sample comprises a gesture picture and skeleton point characteristics corresponding to the gesture picture, and the gesture picture can be acquired through image acquisition equipment with a monocular RGB camera.

Step S502, training the gesture recognition network to be trained with the sample pair.

In this embodiment, the gesture recognition network is trained by using the gesture picture and the hand skeleton point feature corresponding to the gesture in the gesture picture, so that the gesture comprehensive feature obtained through the network synthesizes the appearance feature and the skeleton point feature of the gesture, the information quantity of the obtained gesture feature is increased, and the accuracy of gesture recognition is improved.

In some embodiments, when network training is performed based on a positive sample pair, the anchor sample and the positive sample in the positive sample pair are respectively input into the gesture recognition network, and the gesture recognition network is trained with the training target that the loss function value between the two gesture comprehensive features output by the network for the two samples is smaller than a first preset value; that is, the training target is to pull the anchor sample and the positive sample closer together.

Further, the two samples in the positive sample pair and the two samples in the negative sample pair are input into the gesture recognition network, and the gesture recognition network is trained with a training target defined on the loss function value between the difference of the two gesture comprehensive features output by the network for the positive sample pair and the difference of the two gesture comprehensive features output by the network for the negative sample pair; that is, the training target is to pull the anchor sample and the positive sample closer together while pushing the anchor sample and the negative sample farther apart.

In some of these embodiments, the sample pair is a negative sample pair, and training the gesture recognition network comprises: respectively inputting the two samples in the negative sample pair into the gesture recognition network, and training the gesture recognition network with the training target that the loss function value between the two gesture comprehensive features output by the network for the two samples is larger than a third preset value; that is, the training target is to push the anchor sample and the negative sample farther apart.

Specifically, the loss function may be a soft margin triplet loss function, a modification of the original triplet loss function, whose value is used for back propagation. The purpose of the triplet loss function is to make the intra-class distance smaller and the inter-class distance larger. To achieve this, the triplet loss function constructs a number of triplets (Anchor, Positive, Negative) and optimizes the feature vectors through learning, so that in Euclidean space the Anchor sample is closer to the Positive sample than to the Negative sample.

The formula of the triplet loss function can be expressed as:

$$L = \max\big(d(a, p) - d(a, n) + \mathrm{margin},\ 0\big) \tag{1}$$

In formula (1), L represents the loss function; max(·, 0) is a truncation that sets any value smaller than 0 to 0; a represents the Anchor sample; p represents the Positive sample, which belongs to the same class as a; n represents the Negative sample, which belongs to a different class from a; d(·, ·) is the distance between two feature vectors; and margin is a constant greater than 0. The final optimization objective is to pull a and p closer together and to push a and n farther apart.
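Formula (1) can be written out directly. The sketch below assumes d is the Euclidean distance between feature vectors, in line with the Euclidean space mentioned above; the function names and the margin values are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Formula (1): max(d(a, p) - d(a, n) + margin, 0)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


# Positive far from the anchor, negative close: the triplet is penalized.
bad = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0)
# Positive close, negative far: the truncation sets the loss to zero.
good = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=1.0)
```

The second call shows the truncation at work: once the negative is farther than the positive by at least the margin, the triplet contributes no loss.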

On the basis of the triplet loss function, the soft margin triplet loss function uses the softplus function to replace the truncation function max(·, 0) with a smooth approximation, thereby eliminating the truncation. The formula is as follows:

$$s(f) = \ln\left(1 + e^{f}\right) \tag{2}$$

In formula (2), f represents d(a, p) - d(a, n) + margin from formula (1).

By using the soft margin triplet loss function, this embodiment can enhance the generalization capability of the network: the softplus function, which decays exponentially for negative arguments, smoothly approximates the truncation function in the triplet loss, which improves the training of the network.
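The softplus replacement can be sketched as below. The numerically stable rewriting of ln(1 + e^f) as max(f, 0) + ln(1 + e^(-|f|)) is a standard identity added here to avoid overflow; it is not taken from the application, and the function names are illustrative.

```python
import math


def softplus(f):
    """Numerically stable evaluation of ln(1 + e^f)."""
    return max(f, 0.0) + math.log1p(math.exp(-abs(f)))


def soft_margin_triplet_loss(d_ap, d_an, margin=0.2):
    """Formula (2) applied to f = d(a, p) - d(a, n) + margin."""
    return softplus(d_ap - d_an + margin)


easy = soft_margin_triplet_loss(1.0, 5.0, margin=1.0)  # f = -3
hard = soft_margin_triplet_loss(5.0, 1.0, margin=1.0)  # f = 5
```

Unlike the truncated form, the loss for an easy triplet is small but never exactly zero, so such triplets still contribute a gradient.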

The process of registering and identifying gestures is described below by way of a preferred embodiment.

The gesture registration and recognition process in the preferred embodiment is realized by four modules: an image acquisition module, a target detection tracking module, a human hand skeleton point detection module, and a feature fusion and extraction module. Specifically:

(1) Image acquisition module. This module drives a monocular RGB camera to acquire images, for example acquiring human hand images in real time and passing them to the subsequent modules. An infrared camera or a depth camera may also be employed in other embodiments.

(2) Target detection tracking module. This module detects the human hand target and uses a target tracking algorithm to match the hand between preceding and following frames. In this example, the target detection model is implemented based on deep learning and can output the rectangular area, category, confidence and ID of a human hand. Finally, the target detection tracking module outputs the human hand rectangular frame in the current frame to the next module.

(3) Human hand skeleton point detection module. This module takes the human hand rectangular frame output by the target detection tracking module as input and outputs 2.5D skeleton point coordinates.

(4) Feature fusion and extraction module. This module has two inputs: it receives the rectangular human hand image output by the target detection tracking module and the 2.5D skeleton point features output by the human hand skeleton point detection module, and obtains the comprehensive hand features.

In the training stage, the feature fusion and extraction module acquires the comprehensive hand features of the sample image, and the training of the gesture recognition model is completed through the loss function; in the registration stage, a feature fusion and extraction module acquires registration gesture features in a registration image as comprehensive hand features, and adds the comprehensive hand features into a gesture feature library to finish registration; and in the recognition stage, the feature fusion and extraction module acquires the current gesture features to be recognized in real time, compares the current gesture features with gesture features in a gesture feature library, and outputs a matching result.

The following details the matching process of gesture features:

S1, calculating the Mahalanobis distance between the gesture feature to be recognized and each gesture feature in the gesture feature library, and comparing each Mahalanobis distance with a preset similarity threshold to obtain an initial matching result. The Mahalanobis distance d_M(x, y) may be calculated as:

$$d_M(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)} \tag{3}$$

where x and y respectively represent the human hand feature vector serving as the probe and the vector of a gesture feature in the gesture feature library, and S is the covariance matrix of the multidimensional random variable;

S2, if the gesture feature to be recognized passes the initial matching, it is also stored in the gesture feature library and assigned the same ID as the corresponding gesture feature in the library, so that an expanded gesture feature library is obtained. As shown in fig. 6, the real-time motion is the gesture feature to be recognized and the Gallery is the gesture feature library; after the gesture feature to be recognized is successfully matched with a gesture feature initially registered in the Gallery, it is added to the Gallery to obtain the expanded gesture feature library, namely the updated library;

S3, determining, in the gesture feature library, the k gesture features with the smallest Mahalanobis distance to the gesture feature to be recognized, and calculating the Jaccard distances between these k gesture features and the gesture feature to be recognized; the gesture feature corresponding to the minimum Jaccard distance can then be used as the matching result of the gesture feature to be recognized. The Jaccard distance may be calculated as:

$$d_J(p, g_i) = 1 - \frac{\left|R^{*}(p, k) \cap R^{*}(g_i, k)\right|}{\left|R^{*}(p, k) \cup R^{*}(g_i, k)\right|} \tag{4}$$

where p represents the probe; g_i is a gesture comprehensive feature in the pre-stored library; and R^* is an extension of R, where R represents the set of features that satisfy the k mutual neighbor algorithm and is defined as follows:

$$R(p, k) = \{\, g_i \mid (g_i \in N(p, k)) \land (p \in N(g_i, k)) \,\} \tag{5}$$

In formula (5), N(p, k) is the set of the k nearest-neighbor gesture comprehensive features of the probe p, and is defined as follows:

$$N(p, k) = \{g_1, g_2, \ldots, g_k\}, \quad |N(p, k)| = k \tag{6}$$

where |N(p, k)| represents the number of candidate gesture comprehensive features in N(p, k).

It can be seen that, when the gesture feature to be recognized is regarded as the probe, R(p, k) is the set of gesture features in the gesture feature library that are k mutual neighbors of the probe.
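The k mutual neighbor set R(p, k) can be sketched as follows, assuming a precomputed pairwise distance matrix over all features, with the probe included as one row. Excluding an item from its own neighbor list is an assumption of this sketch, and the function names are illustrative.

```python
def k_nearest(dist, query, k):
    """N(query, k): indices of the k items closest to `query`, excluding itself."""
    others = [j for j in range(len(dist)) if j != query]
    return set(sorted(others, key=lambda j: dist[query][j])[:k])


def k_reciprocal(dist, query, k):
    """R(query, k): items g with g in N(query, k) and query in N(g, k)."""
    return {g for g in k_nearest(dist, query, k) if query in k_nearest(dist, g, k)}


# Four items on a line; item 3 is an outlier far from the others.
positions = [0, 1, 2, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
close_group = k_reciprocal(dist, 0, 2)
outlier_group = k_reciprocal(dist, 3, 2)
```

The outlier still has two nearest neighbors, but neither of them lists the outlier among its own nearest neighbors, so its mutual neighbor set is empty. This is the filtering effect that distinguishes R from N.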

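Further, to avoid positive samples being excluded from the probe's k nearest neighbors due to illumination, occlusion, angle and the like, the k/2 mutual nearest neighbors of each candidate in R(p, k) can be incrementally added, according to the following condition, to a more robust set R^*(p, k), namely:

$$R^{*}(p, k) \leftarrow R(p, k) \cup R\left(g_i, \tfrac{1}{2}k\right) \quad \text{s.t.} \quad \left|R(p, k) \cap R\left(g_i, \tfrac{1}{2}k\right)\right| \ge \tfrac{2}{3}\left|R\left(g_i, \tfrac{1}{2}k\right)\right|, \quad \forall g_i \in R(p, k) \tag{7}$$

Formula (7) means the following. The gesture features g_i in R(p, k) are traversed, where i = 1 to k. Each g_i is taken as a new probe, and the k mutual neighbor algorithm with k' = k/2 is executed in the expanded gesture feature library to obtain the set R(g_i, k/2) corresponding to g_i. If the number of gesture features in the intersection of R(p, k) and R(g_i, k/2) is greater than or equal to 2/3 of the number of gesture features in R(g_i, k/2), the condition is met, and the union of R(p, k) and all R(g_i, k/2) that meet the condition is taken as R^*(p, k).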

The process of generating R^*(p, k) is illustrated in fig. 7. In fig. 7, Q is first taken as the probe and, based on the expanded gesture feature library, the set R(Q, 20) is obtained through the k mutual neighbor algorithm; this is the first matching set, with a k value of 20. It should be noted that although the set R theoretically contains 20 pictures, gesture features are filtered out during sorting, so the actual number of pictures in the set R may be less than 20. Then the picture with ID 2 in R(Q, 20) is taken out separately as a new probe and k is reduced to half of its original value, namely k = 20/2 = 10; based on the expanded gesture feature library, the set R(2, 10) in the second row is obtained through the k mutual neighbor algorithm, namely the second matching set. The C in R(C, 10) in the figure represents the ID of the gesture feature, and C is 2 in this embodiment. The number of gesture features in the intersection of R(Q, 20) and R(2, 10) (i.e., the gesture features they share) is 2, while the number of gesture features in R(2, 10) is 3. Substituting these numbers into the condition of formula (7) gives 2 ≥ (2/3) × 3 = 2, so the condition is satisfied, and the union of R(Q, 20) and R(2, 10) can be taken, as shown in the third row of fig. 7. All gesture features in the first row are traversed in turn, and finally the union of all the qualifying sets is taken as R^*(Q, 20), which gives the third matching set.

Finally, the gesture recognition result can be determined by calculating the similarity between the Q and each gesture feature in the third matching set.
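Steps S1 and S2 of the matching process can be sketched together as follows. This is an illustration under stated assumptions: a smaller Mahalanobis distance is treated as a better match, the threshold is applied to the closest library entry only, and the covariance matrix, threshold value, and function names are invented for the example. `numpy.linalg.solve` is used instead of an explicit matrix inverse.

```python
import numpy as np


def mahalanobis(x, y, cov):
    """sqrt((x - y)^T S^-1 (x - y)) for covariance matrix S = cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve S z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))


def initial_match_and_expand(probe, gallery, cov, threshold):
    """Step S1: find the closest library entry; step S2: expand the library.

    gallery: list of (gesture_id, feature_vector) pairs, modified in place.
    Returns the matched gesture_id, or None when no entry is close enough.
    """
    best_id, best_dist = None, float("inf")
    for gesture_id, feature in gallery:
        d = mahalanobis(probe, feature, cov)
        if d < best_dist:
            best_id, best_dist = gesture_id, d
    if best_id is not None and best_dist <= threshold:
        # Store the probe under the same ID as the matched entry.
        gallery.append((best_id, list(probe)))
        return best_id
    return None


cov = np.array([[4.0, 0.0], [0.0, 1.0]])  # illustrative covariance matrix
gallery = [(1, [0.0, 0.0]), (2, [10.0, 10.0])]
matched = initial_match_and_expand([2.0, 0.0], gallery, cov, threshold=1.5)
```

After the call the library holds three entries, two of them under gesture ID 1, which is the expanded library that the later mutual neighbor search operates on.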

In this embodiment, the Jaccard distance is used as the criterion for measuring the similarity between the gesture feature to be recognized and the gesture features in the gesture feature library. As the above analysis shows, the Jaccard distance considers not only the similarity between two gesture features but also the similarity between the gesture feature to be recognized and a batch of similar gesture features, so it carries richer information and can eliminate interference from some environmental factors. For this reason, recognizing gestures based on the Jaccard distance can significantly improve the gesture recognition effect. The matching process based on the k mutual neighbor algorithm performs two rounds of matching, using the Mahalanobis distance and the Jaccard distance; compared with the single matching algorithm that uses the cosine distance as the basis of similarity in existing gesture recognition technology, it can effectively improve the recognition of gestures.
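The Jaccard distance between two neighbor sets can be sketched in a few lines. Returning 1.0 when both sets are empty is a convention assumed for this example, and the function name is illustrative.

```python
def jaccard_distance(set_a, set_b):
    """1 - |A & B| / |A | B|; taken as 1.0 when both sets are empty (assumed)."""
    union = set_a | set_b
    if not union:
        return 1.0
    return 1.0 - len(set_a & set_b) / len(union)


# Two neighbor sets sharing two of four distinct members.
partial = jaccard_distance({1, 2, 3}, {2, 3, 4})
identical = jaccard_distance({1, 2}, {1, 2})
disjoint = jaccard_distance({1}, {2})
```

Because the distance depends on how many neighbors two features share rather than on the two feature vectors alone, a candidate whose neighborhood overlaps strongly with the probe's neighborhood scores as close even if its own vector is somewhat perturbed.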

The embodiment also provides a gesture recognition device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.

Fig. 8 is a block diagram of a gesture recognition apparatus according to an embodiment of the present application, and as shown in fig. 8, the apparatus includes a first determination module 81, a second determination module 82, a third determination module 83, and a recognition module 84:

the first determination module 81 is configured to match gesture features to be recognized with gesture features in the gesture feature library, and determine a first matching set;

the second determination module 82 is configured to determine a target gesture feature in the first matching set, match the target gesture feature with gesture features in the gesture feature library, and determine a second matching set;

the third determination module 83 is configured to determine a third matching set according to the first matching set and the second matching set;

the recognition module 84 is configured to match the gesture feature to be recognized with the gesture features in the third matching set, and determine a gesture recognition result of the gesture feature to be recognized.

Through the gesture recognition device, the recall rate of the gesture recognition process is improved, missing recognition of the gesture features to be recognized is avoided, and the gesture recognition accuracy is improved based on the multiple matching process between the gesture features to be recognized and the gesture features in the gesture feature library.

Further, determining a third set of matches from the first set of matches and the second set of matches comprises: determining the number of intersection features according to the intersection of the first matching set and the second matching set; and determining a third matching set according to the first matching set and the second matching set under the condition that the intersection characteristic quantity meets the preset set quantity threshold value. Preferably, a union of the first matching set and the second matching set may be regarded as the third matching set.

Further, the first matching set is determined according to the similarity between the gesture feature to be recognized and the gesture feature in the gesture feature library and a preset first matching parameter, and/or the second matching set is determined according to the similarity between the target gesture feature and the gesture feature in the gesture feature library and a preset second matching parameter. The first matching parameters comprise first quantity parameters, the second matching parameters comprise second quantity parameters, and the second quantity parameters are smaller than the first quantity parameters.

Further, the first matching set and/or the second matching set are/is calculated by a k-nearest neighbor algorithm. In the case that the first matching set and the second matching set each calculate the similarity by a k-nearest neighbor algorithm, the k value used for calculating the second matching set is smaller than the k value used for calculating the first matching set.

Further, matching the gesture feature to be identified with the gesture feature in the gesture feature library, and determining the first matching set includes: the gesture features to be recognized are initially matched with the gesture features in the gesture feature library, and are added to the gesture feature library under the condition that the initial matching is successful, so that an expanded gesture feature library is obtained; and matching the gesture features to be recognized with the gesture features in the expanded gesture feature library again to obtain a first matching set.

Further, the gesture feature library comprises multiple types of gesture features, and the multiple gesture features of the target type are screened under the condition that the number of the gesture features of the target type is larger than the preset storage number. Preferably, screening the plurality of gesture features of the target type includes: and screening the gesture features of the target type according to a preset similarity threshold and/or the acquisition time of each gesture feature in the gesture feature library.

Further, the method for acquiring the gesture features to be recognized comprises the following steps: acquiring appearance characteristics and skeleton point characteristics of gestures in an image to be detected; and jointly determining gesture characteristics to be recognized according to the appearance characteristics and the skeleton point characteristics. Among them, the skeletal point feature is preferably a 2.5D feature.

Further, the gesture recognition device also needs to acquire registration gesture features in the registration image; registering the registration gesture features according to the registration gesture features and gesture functions corresponding to the registration gesture features.

The application also provides a gesture recognition system, which comprises an image acquisition device and a gesture recognition device: the image acquisition device is used for acquiring gesture characteristics to be identified; the gesture recognition device is used for matching gesture features to be recognized with gesture features in the gesture feature library to determine a first matching set; determining target gesture features in the first matching set, matching the target gesture features with gesture features in the gesture feature library, and determining a second matching set; determining a third matching set according to the first matching set and the second matching set; and matching the gesture features to be recognized with the gesture features in the third matching set, and determining a gesture recognition result of the gesture features to be recognized. Through the gesture recognition system, the recall rate of the gesture recognition process is improved, missing recognition of the gesture features to be recognized is avoided, and the gesture recognition accuracy is improved based on the multiple matching process between the gesture features to be recognized and the gesture features in the gesture feature library.

Embodiments of the present application also provide a computer readable storage medium storing one or more programs executable by one or more processors to implement the gesture recognition method as in any of the previous embodiments.

Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Claims

1. A method of gesture recognition, the method comprising:

2. The gesture recognition method of claim 1, wherein determining a third set of matches from the first set of matches and the second set of matches comprises:

3. The gesture recognition method of claim 1 or 2, wherein determining a third set of matches from the first set of matches and the second set of matches comprises:

4. Gesture recognition method according to claim 1, characterized in that the first matching set is determined by a similarity between the gesture features to be recognized and gesture features in the gesture feature library and a preset first matching parameter, and/or the second matching set is determined by a similarity between the target gesture features and gesture features in the gesture feature library and a preset second matching parameter.

5. The gesture recognition method of claim 4, wherein the first matching parameter comprises a first number parameter and the second matching parameter comprises a second number parameter, the second number parameter being less than the first number parameter.

6. The gesture recognition method according to claim 4, wherein the similarity between the gesture feature to be recognized and the gesture feature in the gesture feature library is calculated by a k-nearest neighbor algorithm.

7. The gesture recognition method according to claim 6, wherein in a case where the first matching set and the second matching set each calculate a similarity by a k-nearest neighbor algorithm, a k value used for calculating the second matching set is smaller than a k value used for calculating the first matching set.

8. The gesture recognition method of claim 1, wherein matching gesture features to be recognized with gesture features in a gesture feature library, determining a first matching set comprises:

9. The gesture recognition method of claim 1, wherein the method further comprises:

10. The gesture recognition method of claim 9, wherein filtering the plurality of gesture features of the target type comprises:

11. The gesture recognition method according to claim 1, wherein the method for acquiring the gesture feature to be recognized comprises:

12. The method of claim 11, wherein the skeletal point feature is a 2.5D feature.

13. The gesture recognition method of claim 1, wherein the method further comprises:

acquiring registration gesture features in a registration image;

14. A gesture recognition apparatus, wherein the apparatus comprises a first determination module, a second determination module, a third determination module, and a recognition module:

15. A gesture recognition system, comprising an image acquisition device and a gesture recognition device:

16. A computer readable storage medium storing one or more programs executable by one or more processors to implement the gesture recognition method of any one of claims 1 to 13.