CN113223536B - Voiceprint recognition method and device and terminal equipment - Google Patents
- Publication number: CN113223536B
- Application number: CN202010062402.XA
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The application is applicable to the technical field of voice processing, and provides a voiceprint recognition method, a voiceprint recognition device and terminal equipment. The voiceprint recognition method comprises the following steps: acquiring an audio feature vector of the audio to be identified; inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network comprises a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension; and comparing the target voiceprint feature vector with registered voiceprint feature vectors to determine the target user corresponding to the audio to be identified. The embodiment of the application can improve the accuracy of voiceprint recognition while maintaining recognition efficiency.
Description
Technical Field
The application belongs to the technical field of voice processing, and particularly relates to a voiceprint recognition method, a voiceprint recognition device and terminal equipment.
Background
Voiceprint recognition (VPR), also known as speaker recognition, is one of the biometric technologies and has long received widespread attention in academia and industry. Traditional voiceprint recognition is typified by the i-vector approach, but its accuracy is limited. For this reason, Google proposed the GE2E (Generalized End-to-End) network, which achieves higher recognition accuracy than i-vectors in voiceprint recognition. However, because the GE2E neural network structure is complex, the model occupies too much space and recognizes slowly, which hinders its application in real production environments.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a voiceprint recognition method, a voiceprint recognition device and a terminal device, so as to solve the prior-art problem of how to improve the accuracy of voiceprint recognition while maintaining recognition efficiency.
A first aspect of an embodiment of the present application provides a voiceprint recognition method, including:
Acquiring an audio feature vector of audio to be identified, wherein the audio feature vector comprises a time dimension and a frequency spectrum feature dimension, and one unit time in the time dimension corresponds to one group of frequency spectrum feature information in the frequency spectrum feature dimension;
Inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network comprises a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension;
And comparing the target voiceprint feature vector with registered voiceprint feature vectors to determine a target user corresponding to the audio to be identified.
A second aspect of an embodiment of the present application provides a voiceprint recognition apparatus, including:
an audio feature vector obtaining unit, configured to obtain an audio feature vector of an audio to be identified, where the audio feature vector includes a time dimension and a spectrum feature dimension, and a unit time in the time dimension corresponds to a set of spectrum feature information in the spectrum feature dimension;
The target neural network unit is used for inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network comprises a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension;
and the determining unit is used for comparing the target voiceprint feature vector with registered voiceprint feature vectors and determining a target user corresponding to the audio to be identified.
A third aspect of the embodiments of the present application provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the voiceprint recognition method when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the voiceprint recognition method described above.
A fifth aspect of an embodiment of the present application provides a computer program product for causing a terminal device to perform the above-described voiceprint recognition method when the computer program product is run on the terminal device.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: in the embodiments of the application, feature extraction is performed on the audio feature vector of the audio to be identified by a target neural network consisting of a SENet module, a dilated convolution network and a fully connected layer to obtain a target voiceprint feature vector, which is compared with registered voiceprint feature vectors to determine the target user corresponding to the audio to be identified. Because the SENet module strengthens the extraction of feature information across channels and the dilated convolution network extracts context information in the time dimension of the audio to be identified, the feature information contained in the finally extracted target voiceprint feature vector is more accurate and comprehensive, so that voiceprint identification is more accurate. Meanwhile, the target neural network has a simpler structure than the GE2E network, which reduces the complexity of extracting voiceprint feature information, improves the efficiency of voiceprint feature extraction, and thus improves the efficiency of voiceprint recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic implementation flow diagram of a first voiceprint recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a target neural network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an implementation flow of a second voiceprint recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a voiceprint recognition device according to an embodiment of the present application;
fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to illustrate the technical scheme of the application, the following description is made by specific examples.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
In addition, in the description of the present application, the terms "first," "second," "third," etc. are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
Embodiment one:
fig. 1 shows a flowchart of a first voiceprint recognition method according to an embodiment of the present application, which is described in detail below:
In S101, an audio feature vector of the audio to be identified is obtained, where the audio feature vector includes a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a set of spectral feature information in the spectral feature dimension.
The voiceprint recognition method in the embodiment of the application is specifically a text-independent voiceprint recognition method, that is, the user does not need to pronounce specified content, and the audio to be identified in the embodiment of the application is audio of any speaking content uttered by the user. The audio to be identified is obtained through a sound collection device or a storage unit storing the audio to be identified, and the audio feature vector of the audio to be identified is extracted through time-domain and frequency-domain transformation and analysis; the audio feature vector may be stored in the npy file format. The audio feature vector includes a time dimension and a spectral feature dimension and may be represented as a × b, where "×" is a multiplication sign, a is the length of the audio feature vector in the time dimension, and b is its length in the spectral feature dimension. One unit time in the time dimension corresponds to one group of spectral feature information in the spectral feature dimension, that is, the audio feature vector contains a units of audio feature information, and the audio feature information of each unit time can be represented by a group of spectral feature information of length b.
Specifically, the step S101 specifically includes:
acquiring audio to be identified, and filtering silence of the audio to be identified to obtain an effective audio segment; intercepting the effective audio segment according to the target duration to obtain target audio;
And extracting Mel-frequency cepstral coefficient (MFCC) features of the target audio to obtain an audio feature vector.
The audio to be identified is obtained, silence is filtered out of the audio to be identified to obtain an effective audio segment, and the effective audio segment is truncated according to a target duration (for example, 3 seconds) to obtain the target audio. Optionally, if the duration of the effective audio segment of the audio to be identified is less than an effective duration threshold (for example, 1 second), the audio to be identified is discarded and re-acquired. Optionally, if the duration of the effective audio segment is greater than or equal to the effective duration threshold and less than or equal to the target duration, the audio to be identified is not truncated and the effective audio segment is directly taken as the target audio, so that the duration of the target audio in the embodiment of the application lies between the effective duration threshold and the target duration. Specifically, the target duration or the effective duration threshold is determined according to the identification accuracy of the target neural network. Optionally, the audio to be identified in the embodiment of the present application is short audio (for example, audio with a duration of 1-3 seconds); the target neural network of the embodiment of the application can perform voiceprint identification accurately from the limited feature information of such short audio. That is, because the target neural network has high identification accuracy and requires little input information, the duration of the acquired audio to be identified can be shortened and the data volume processed by the target neural network reduced, so that the efficiency of voiceprint identification is improved.
Mel-frequency cepstral coefficient (MFCC) feature extraction is performed on the target audio according to preset parameters such as the sampling rate, frame length and first step size, to obtain an audio feature vector containing a time dimension and a spectral feature dimension. The size of the audio feature vector is a specified size, that is, its length in the time dimension and its length in the spectral feature dimension are the target lengths determined by the parameter settings of the MFCC feature extraction. By way of example and not limitation, the preset sampling rate is 1.6k, the frame length is 25 ms, the first step size is 32 ms, and the size of the audio feature vector is 96 × 64. Optionally, if the duration of the target audio is less than 3 seconds, the size of the first audio feature vector obtained after MFCC feature extraction is less than the specified size; in this case, the portion of the first audio feature vector that falls short of the specified size is filled with "0", and the second audio feature vector of the specified size obtained after filling is used as the audio feature vector finally input to the target neural network. For example, assuming the duration of the target audio is 2 seconds, the size of the first audio feature vector obtained after MFCC feature extraction is 63 × 64, which is smaller than the specified size of 96 × 64; the 33 × 64 portion by which the first audio feature vector falls short of the specified size is therefore filled with "0", and the resulting second audio feature vector of size 96 × 64 is used as the final audio feature vector.
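As an illustration of the preprocessing described above, the following is a minimal sketch using librosa. The silence threshold, the 16 kHz sampling rate used here and the function name extract_audio_feature_vector are assumptions for illustration, not values fixed by the patent.

```python
import numpy as np
import librosa

def extract_audio_feature_vector(wav_path, target_seconds=3.0, min_seconds=1.0,
                                 sr=16000, frame_ms=25, hop_ms=32,
                                 n_mfcc=64, target_shape=(96, 64)):
    """Silence filtering -> truncation to the target duration -> MFCC -> zero padding to 96 x 64."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Keep only the non-silent segments (the effective audio segment).
    intervals = librosa.effects.split(y, top_db=30)          # threshold is an assumption
    voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y

    if len(voiced) < int(min_seconds * sr):
        raise ValueError("valid audio shorter than the effective duration threshold")

    # Truncate to the target duration (3 seconds); shorter clips are kept as-is.
    voiced = voiced[: int(target_seconds * sr)]

    # MFCC features: one row per frame (time dimension), one column per coefficient.
    mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(sr * frame_ms / 1000),
                                hop_length=int(sr * hop_ms / 1000)).T   # (frames, 64)

    # Zero-pad the time dimension up to 96 frames, as described for short audio.
    padded = np.zeros(target_shape, dtype=np.float32)
    padded[: min(len(mfcc), target_shape[0])] = mfcc[: target_shape[0]]
    return padded
```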
In S102, the audio feature vector is input into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, where the target neural network is composed of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network includes a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension.
The audio feature vector of the specified size is input into the target neural network through a single channel (that is, the number of input channels is 1), and further feature extraction is performed on it to obtain the corresponding target voiceprint feature vector. The target neural network consists of a Squeeze-and-Excitation Networks (SENet) module, a dilated convolution network and a fully connected layer. The SENet module models the correlation among the channels in the target neural network; during training, SENet determines a weight parameter for each channel from the sample data, so that the trained SENet module can accurately extract the feature information of each channel according to these weights, improving the accuracy of voiceprint recognition. The dilated convolution network extracts context information of the audio feature vector in the time dimension. In the embodiment of the application, the context information is specifically feature information that fuses the spectral feature information corresponding to several different unit times in the audio feature vector. Specifically, the dilated convolution network comprises a plurality of dilated convolution layers, where each dilated convolution layer contains a convolution kernel of size n × 1, n is a positive integer greater than 1, and "×" is a multiplication sign. An n × 1 convolution kernel aggregates the context information of the audio feature vector over n positions in the time dimension, that is, it strengthens the contextual relationship during audio feature extraction, thereby improving the accuracy of voiceprint recognition. Moreover, using a dilated convolution network widens the receptive field of the convolution kernels without losing data accuracy.
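For illustration, the following is a minimal PyTorch sketch of the per-channel weighting that a SENet module typically performs; the reduction ratio and the use of linear layers are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: learn one weight per channel and rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 8):   # reduction ratio is an assumption
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                # global average pool per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                     # per-channel weights in (0, 1)
        )

    def forward(self, x):                                     # x: (batch, channels, time, freq)
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                          # weight the information of each channel
```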
Specifically, the target neural network comprises a first convolution layer, a SENet module, a first reconstruction layer, a first fully connected layer, a second reconstruction layer, a dilated convolution network, a third reconstruction layer, an average pooling layer and a second fully connected layer, and the step S102 specifically comprises:
S10201: inputting the audio feature vector into a target neural network, and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a frequency spectrum feature dimension and a channel dimension;
s10202: the first eigenvector weights the information of each channel through the SENet module to obtain a second eigenvector;
S10203: the second feature vector sequentially passes through the first reconstruction layer, the first full-connection layer and the second reconstruction layer to obtain a third feature vector;
s10204: the third feature vector sequentially passes through a plurality of layers of cavity convolution layers through the cavity convolution network, and extracts context information of the third feature vector in different time dimensions to obtain a fourth feature vector, wherein each layer of cavity convolution layer comprises a convolution kernel with the size of n 1, n is a positive integer greater than 1, and 'x' is a multiplier;
s10205: and the fourth feature vector sequentially passes through the third reconstruction layer, the average pooling layer and the second full connection layer to obtain a target voiceprint feature vector with a target size.
As shown in fig. 2, the target neural network of the embodiment of the present application is composed of a first convolution layer Conv1-Relu, a SENet module, a first reconstruction layer Reshape1, a first fully connected layer Fc1, a second reconstruction layer Reshape2, a dilated convolution network Dilated-Conv-Net, a third reconstruction layer Reshape3, an average pooling layer Avg-pool, and a second fully connected layer Fc2.
In S10201, the audio feature vector is input into the target neural network through a single channel (i.e. the number of input channels is 1) and passes through the first convolution layer Conv1-Relu to obtain the first feature vector. The first convolution layer contains a first number of channels, the convolution kernel of each channel is 3 × 3, and the stride is a second step size; correspondingly, the first feature vector output by the first convolution layer includes a channel dimension in addition to the time dimension and the spectral feature dimension, and its length in the channel dimension equals the first number. For example, the audio feature vector has a size of 96 × 64 and the number of input channels is 1, which can be regarded as input data of size 96 × 64 × 1; the first convolution layer has 32 channels, the convolution kernel size of each channel is 3 × 3, and the stride is 2, so a first feature vector of size 48 × 32 × 32 is obtained after the first convolution layer, where "48" is the length in the time dimension, the first "32" is the length in the spectral feature dimension, and the second "32" is the length in the channel dimension.
In S10202, the first feature vector output by the first convolution layer passes through the SENet module, and the information of each channel of the first feature vector is weighted according to the weight parameter of each channel of SENet to obtain a second feature vector. The dimensions and size of the second feature vector are identical to those of the first feature vector.
In S10203, the second feature vector obtained after channel weighting passes through the first reconstruction layer Reshape1, the first fully connected layer Fc1, and the second reconstruction layer Reshape2 to obtain a third feature vector whose length in the channel dimension is 1. For example, the second feature vector has a size of 48 × 32 × 32, and a feature vector of size 48 × (32 × 32) = 48 × 1024 is obtained after the first reconstruction layer Reshape1; this is then mapped to a feature vector of size 48 × 256 by the first fully connected layer Fc1, which consists of a fully connected layer and a tanh activation function; finally, the channel dimension is expanded by the second reconstruction layer Reshape2 to obtain a third feature vector of size 48 × 256 × 1. The data of the second feature vector is thus reshaped by the first reconstruction layer, the first fully connected layer and the second reconstruction layer into a single-channel third feature vector (channel dimension of length 1), which matches the input format required by the dilated convolution network.
S10204: the third feature vector sequentially passes through a plurality of layers of cavity convolution layers through the cavity convolution network, and the context information of the third feature vector in the time dimension is extracted to obtain a fourth feature vector;
The third feature vector is processed by the multiple dilated convolution layers of the dilated convolution network, each containing an n × 1 convolution kernel, to extract context information in the time dimension: the spectral feature information corresponding to each unit time is associated with the spectral feature information corresponding to the adjacent unit times, producing a fourth feature vector that contains context information in the time dimension and whose size is the same as that of the third feature vector. Specifically, the dilated convolution network comprises a plurality of dilated convolution layers, where each dilated convolution layer contains a convolution kernel of size n × 1, n is a positive integer greater than 1, and "×" is a multiplication sign. An n × 1 convolution kernel aggregates the context information of the audio feature vector over n positions in the time dimension, that is, it strengthens the contextual relationship during audio feature extraction, thereby improving the accuracy of voiceprint recognition. Illustratively, the dilated convolution network is composed of five dilated convolution layers, Dilated-Conv1, Dilated-Conv2, Dilated-Conv3, Dilated-Conv4 and Dilated-Conv5, all single-channel with a stride of 1; their convolution kernel sizes are 5 × 1, 9 × 1, 15 × 1, 24 × 1 and 24 × 1 in sequence, and the corresponding dilation rates are 1, 2, 3, 1 and 1. Assuming the size of the third feature vector is 48 × 256 × 1, the size of the fourth feature vector processed by the dilated convolution network is also 48 × 256 × 1.
In S10205, the fourth feature vector first has its data layout adjusted by the third reconstruction layer Reshape3; the average pooling layer Avg-pool then averages the spectral feature information of the different unit times in the time dimension, giving a feature vector whose length in the time dimension is 1 (that is, the time dimension collapses to a single value and can be omitted from the representation); finally, the second fully connected layer, consisting of a fully connected layer and a tanh activation function, produces the target voiceprint feature vector of the target size. Illustratively, a fourth feature vector of size 48 × 256 × 1 becomes a feature vector of size 48 × 256 after the third reconstruction layer; the average pooling layer sums and averages the spectral feature information of the 48 different unit times to give a feature vector of size 256 (i.e. a feature vector of length 256 in the spectral feature dimension, with the time and channel dimensions collapsed); the second fully connected layer then maps this feature vector of size 256 to a target voiceprint feature vector of the target size 512, that is, the target voiceprint feature vector contains 512 units of feature information.
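Putting the pieces together, the following sketch assembles the layer sequence of fig. 2 with the example sizes used in this walkthrough (a 32-channel 3 × 3 stride-2 first convolution, the SEBlock sketched earlier with the same imports, Fc1 mapping 48 × 1024 to 48 × 256, five single-channel dilated n × 1 convolutions with kernels 5, 9, 15, 24, 24 and dilation rates 1, 2, 3, 1, 1, average pooling over time, and Fc2 mapping to 512). Hyperparameters not stated in the text, such as padding, are assumptions.

```python
class TargetVoiceprintNet(nn.Module):
    """Sketch of the fig. 2 pipeline: Conv1-Relu -> SENet -> Reshape1/Fc1/Reshape2
    -> Dilated-Conv-Net -> Reshape3/Avg-pool -> Fc2."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.se = SEBlock(32)
        self.fc1 = nn.Sequential(nn.Linear(32 * 32, 256), nn.Tanh())   # 48 x 1024 -> 48 x 256
        kernels, dilations = [5, 9, 15, 24, 24], [1, 2, 3, 1, 1]
        self.dilated = nn.Sequential(*[
            nn.Conv2d(1, 1, kernel_size=(k, 1), dilation=(d, 1), padding="same")
            for k, d in zip(kernels, dilations)
        ])
        self.fc2 = nn.Sequential(nn.Linear(256, 512), nn.Tanh())

    def forward(self, x):                       # x: (batch, 1, 96, 64) audio feature vector
        x = self.se(self.conv1(x))              # (batch, 32, 48, 32)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # Reshape1: (batch, 48, 1024)
        x = self.fc1(x).unsqueeze(1)            # Fc1 + Reshape2: (batch, 1, 48, 256)
        x = self.dilated(x)                     # context over the time dimension, same size
        x = x.squeeze(1).mean(dim=1)            # Reshape3 + Avg-pool over time: (batch, 256)
        return self.fc2(x)                      # target voiceprint feature vector: (batch, 512)
```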
In S103, the target voiceprint feature vector is compared with registered voiceprint feature vectors, and a target user corresponding to the audio to be identified is determined.
In the embodiment of the application, registered voiceprint feature vectors and the corresponding user identification information are stored in advance, where the identification information may be information such as the user's name or number. The target voiceprint feature vector obtained in step S102 is compared with the registered voiceprint feature vectors, the registered voiceprint feature vector with the highest similarity to the target voiceprint feature vector is found, the user corresponding to that registered voiceprint feature vector is determined to be the target user corresponding to the audio to be identified, and the identification information of the target user is output. Specifically, the registered voiceprint feature vector with the highest similarity to the target voiceprint feature vector is found by computing the cosine similarity between the target voiceprint feature vector and each pre-stored registered voiceprint feature vector.
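A minimal sketch of this comparison step is given below. The acceptance threshold of 0.7 and the function names are assumptions for illustration; the patent itself only requires selecting the registered vector with the highest cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(target_vec: np.ndarray, registered: dict, threshold: float = 0.7):
    """Return the user id of the most similar registered voiceprint, or None if no match.
    `registered` maps user identification information to enrolled 512-d vectors;
    the 0.7 acceptance threshold is an assumption, not a value from the patent."""
    best_user, best_sim = None, -1.0
    for user_id, reg_vec in registered.items():
        sim = cosine_similarity(target_vec, reg_vec)
        if sim > best_sim:
            best_user, best_sim = user_id, sim
    return best_user if best_sim >= threshold else None
```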
Optionally, before step S101, the method includes:
receiving a registration instruction, and acquiring identification information of a user to be registered and corresponding audio to be registered;
obtaining a voiceprint feature vector of the audio to be registered through a target neural network;
and storing the voiceprint feature vector of the audio to be registered and the corresponding user identification information into a target database to obtain the registered voiceprint feature vector and the corresponding user identification information.
Preferably, during registration, a plurality of pieces of audio to be registered from the same user to be registered are acquired and passed through the target neural network simultaneously or successively, yielding a plurality of voiceprint feature vectors for that user; the mean of these voiceprint feature vectors is computed and used as the final voiceprint feature vector registered for the user, which further improves the accuracy of the registration data and thus the accuracy of subsequent voiceprint recognition.
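A sketch of this enrollment-by-averaging step, assuming a hypothetical embed helper that wraps the preprocessing and the target neural network:

```python
import numpy as np

def enroll(user_id: str, audio_paths: list, embed, registered: dict) -> None:
    """`embed` maps an audio file to its 512-d voiceprint vector via the target network."""
    vecs = np.stack([embed(p) for p in audio_paths])   # one vector per utterance
    registered[user_id] = vecs.mean(axis=0)            # the mean vector is stored for the user
```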
Optionally, after the step S103, the method further includes:
and if the registered voiceprint feature vector matched with the target voiceprint feature vector is not found, indicating the current user to register the target voiceprint feature vector.
If the registered voiceprint feature vector matched with the target voiceprint feature vector is not found, the user information corresponding to the current audio to be identified is not registered, so that the user is instructed to input the identification information of the current user, the identification information of the user and the target voiceprint feature vector are correspondingly stored in a target database, and the registration of the target voiceprint feature vector is completed.
In the embodiment of the application, feature extraction is performed on the audio feature vector of the audio to be identified by a target neural network consisting of a SENet module, a dilated convolution network and a fully connected layer to obtain a target voiceprint feature vector, which is compared with registered voiceprint feature vectors to determine the target user corresponding to the audio to be identified. Because the SENet module strengthens the extraction of feature information across channels and the dilated convolution network extracts context information in the time dimension of the audio to be identified, the feature information contained in the finally extracted target voiceprint feature vector is more accurate and comprehensive, so that voiceprint identification is more accurate. Meanwhile, the target neural network has a simpler structure than the GE2E network, which reduces the complexity of extracting voiceprint feature information, improves the efficiency of voiceprint feature extraction, and thus improves the efficiency of voiceprint recognition.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Embodiment two:
Fig. 3 is a schematic flow chart of a second voiceprint recognition method according to an embodiment of the present application, which is described in detail below:
in S301, sample data is acquired, wherein the sample data is audio data from different users.
Audio data of different users is acquired and preprocessed to obtain the audio feature vectors of the different users as sample data. Alternatively, the training sample data is obtained by reading audio feature vectors of different users' audio data that are pre-stored in npy files. Specifically, in the sample data there are two or more audio feature vectors for each user.
In S302, the sample data is input into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity meet a preset condition, giving a trained target neural network; the intra-class audio similarity is the similarity between voiceprint feature vectors corresponding to different audio data belonging to the same user, and the inter-class audio similarity is the similarity between voiceprint feature vectors corresponding to audio data belonging to different users.
And inputting the sample data into a target neural network for training, and adjusting the learning parameters of each network layer until the intra-class audio similarity and the inter-class audio similarity meet preset conditions in the voiceprint feature vector obtained according to the sample data, so that the intra-class audio similarity is as large as possible, and the inter-class audio similarity is as small as possible. The audio similarity in the class refers to the similarity between voiceprint feature vectors belonging to the same user, and the audio similarity between the classes refers to the similarity between voiceprint feature vectors belonging to different users. Specifically, the similarity between voiceprint feature vectors can be represented by cosine similarity. Specifically, the preset condition may be that the intra-class audio similarity is greater than a first preset threshold and the inter-class audio similarity is less than a second preset threshold, or the preset condition may be that: the difference between the audio similarity in the class and the audio similarity between the classes is larger than a preset difference, so that the audio similarity in the class is as large as possible, and the audio similarity between the classes is as small as possible.
Optionally, the step S302 includes:
inputting preset batches of sample data into the target neural network in sequence for training until the value of the objective function meets the preset condition, so as to obtain the trained target neural network, wherein the objective function of the target neural network is the quantity Sc defined as follows:
Sc is the value of the objective function and represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch, M is the number of sample data corresponding to each user, v_i denotes the voiceprint feature vector obtained by passing any sample of the current batch through the target neural network model, P denotes the user corresponding to v_i, v_j is a voiceprint feature vector belonging to the same user as v_i, v_k is a voiceprint feature vector not belonging to the same user as v_i, sim{(v_i, v_j) | i ≠ j, v_i ∈ P, v_j ∈ P} denotes the cosine similarity of v_i and a voiceprint feature vector v_j belonging to P, and sim{(v_i, v_k) | v_i ∈ P, v_k ∉ P} denotes the cosine similarity of v_i and a voiceprint feature vector v_k of another user.
A preset batch of sample data is taken from the data set each time and input into the target neural network for training, where the sample data of each batch comes from a preset number of users with a corresponding preset number of utterances per user. For example, the preset batch size is set to 64, that is, 64 sample data are input as one batch at a time to train the target neural network; the 64 sample data come from 16 users, and each user corresponds to 4 utterances, i.e. each user contributes 4 sample data.
The value Sc of the objective function represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch, M is the number of sample data corresponding to each user, and NM denotes N multiplied by M, which equals the preset batch size. v_i denotes the voiceprint feature vector obtained by passing any sample of the current batch through the target neural network model, P denotes the user corresponding to v_i, v_j is a voiceprint feature vector belonging to the same user as v_i, v_k is a voiceprint feature vector not belonging to the same user as v_i, sim{(v_i, v_j) | i ≠ j, v_i ∈ P, v_j ∈ P} denotes the cosine similarity of v_i and a voiceprint feature vector v_j belonging to P, and sim{(v_i, v_k) | v_i ∈ P, v_k ∉ P} denotes the cosine similarity of v_i and a voiceprint feature vector v_k of another user. During training, the negative of Sc is taken and gradient descent is used until the descending gradient of the value (-Sc) is smaller than a preset value and the accuracy of the target neural network is higher than an accuracy threshold, giving the trained target neural network. That is, the preset condition of the embodiment of the present application may be that the descending gradient of the value (-Sc) is smaller than the preset value and the accuracy of the target neural network is higher than the accuracy threshold; at that point, for audio processed by the target neural network, the intra-class audio similarity is large and the inter-class audio similarity is small. Preferably, when (-Sc) reaches its minimum, i.e. Sc reaches its maximum, the cosine similarity between the voiceprint feature vectors of intra-class sample data (i.e. sample data belonging to the same user) is as large as possible and the cosine similarity between the voiceprint feature vectors of inter-class sample data (i.e. sample data belonging to different users) is as small as possible, so that the recognition accuracy of the corresponding target neural network is highest.
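The formula for Sc itself does not survive in the text above. The sketch below implements one plausible reading of the description, namely the mean intra-class cosine similarity minus the mean inter-class cosine similarity over an N × M batch; the exact normalisation is an assumption rather than the patent's formula.

```python
import torch
import torch.nn.functional as F

def sc_objective(embeddings: torch.Tensor, n_users: int, m_per_user: int) -> torch.Tensor:
    """Plausible form of Sc: average intra-class cosine similarity minus average
    inter-class cosine similarity over a batch of N*M voiceprint vectors.
    (The exact normalisation in the patent's formula is an assumption.)"""
    v = F.normalize(embeddings, dim=1)                       # (N*M, D) unit vectors
    sim = v @ v.t()                                          # pairwise cosine similarities
    labels = torch.arange(n_users).repeat_interleave(m_per_user)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # same-user mask
    eye = torch.eye(len(labels), dtype=torch.bool)
    intra = sim[same & ~eye].mean()                          # i != j, same user P
    inter = sim[~same].mean()                                # pairs from different users
    return intra - inter                                     # training minimises -Sc

# training step sketch: loss = -sc_objective(model(batch), N, M); loss.backward(); optimizer.step()
```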
Optionally, the learning rate of the target neural network during training is dynamically adjusted according to a preset target learning rate and the current training step number.
Specifically, the learning rate of the target neural network during training is dynamically adjusted, according to a preset target learning rate and the current training step number, in a way that combines warm-up with learning-rate decay. Specifically, the learning rate lr during training is dynamically adjusted by the following learning-rate formula:
lr = flr × 10^0.5 × min(step × 10^-1.5, step^-0.5)
wherein flr is a preset target learning rate, and step is the current training step number.
Under this learning-rate formula, the learning rate is gradually warmed up in the initial stage of training, rising to the preset target learning rate, which speeds up training convergence; in the later stage of training, after the learning rate has reached the target learning rate, it gradually decays, so that the target neural network converges more precisely. This dynamic adjustment improves both the training speed and the accuracy of the target neural network.
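Under this reading of the formula (superscripts flattened in the extraction above), the schedule can be sketched as follows; the handling of step 0 is an assumption.

```python
def dynamic_lr(flr: float, step: int) -> float:
    """Warm-up followed by decay: lr = flr * 10**0.5 * min(step * 10**-1.5, step**-0.5).
    Peaks at the target learning rate flr around step 10 under this reading of the formula."""
    step = max(step, 1)                                  # avoid division by zero at step 0
    return flr * 10 ** 0.5 * min(step * 10 ** -1.5, step ** -0.5)
```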
Optionally, the voiceprint recognition method is specifically applied to a far-field recording scene, and the sample data includes far-field recording data carrying background noise and a preset amount of noiseless audio data.
The voiceprint recognition method of the embodiment of the application is particularly suited to far-field recording scenarios, such as the far-field recording scenario of a smart television. Audio in a far-field recording scenario contains a certain amount of background noise, and correspondingly the sample data for training the target neural network also includes far-field recording data containing background noise. In addition, because far-field recording data may be so noisy that the target neural network is difficult to converge, the sample data of the embodiment of the application includes a preset amount of noiseless audio data in addition to the far-field recording data containing background noise. Training the target neural network on the combination of far-field recording data containing background noise and a preset amount of noiseless audio data ensures that the trained target neural network accurately fits voiceprint recognition in far-field recording scenarios while improving its convergence speed. Illustratively, the sample data set of the embodiment of the present application includes 16074 far-field recordings from 5512 users (each far-field recording stored as an npy file) and 255763 noiseless audio recordings from 2500 users (each noiseless audio recording stored as an npy file).
In S303, an audio feature vector of the audio to be identified is obtained, where the audio feature vector includes a time dimension and a frequency spectrum feature dimension, and one unit time in the time dimension corresponds to a set of frequency spectrum feature information in the frequency spectrum feature dimension.
In S304, the audio feature vector is input into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, where the target neural network is composed of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network includes a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension.
In S305, the target voiceprint feature vector is compared with registered voiceprint feature vectors, and a target user corresponding to the audio to be identified is determined.
S303 to S305 of the embodiment of the present application are the same as S101 to S103 of the previous embodiment, respectively, and refer to the related descriptions of S101 to S103 of the previous embodiment, which are not repeated here.
According to the embodiment of the application, the target neural network is trained until the intra-class audio similarity and the inter-class audio similarity meet the preset conditions, so that the finally trained target neural network makes the cosine similarity among the voiceprint feature vectors of intra-class sample data (i.e. sample data belonging to the same user) as large as possible and the cosine similarity among the voiceprint feature vectors of inter-class sample data (i.e. sample data belonging to different users) as small as possible, which improves the identification accuracy of the target neural network and thereby the accuracy of the voiceprint recognition method.
By way of example, and not limitation, the following provides a test verification process and results of the voiceprint recognition method of an embodiment of the present application:
accuracy test
A1: voice data from 6 users, 2 utterances per person, none of which belong to the sample data set, are obtained; preprocessing and MFCC feature extraction are performed to obtain 12 audio feature vectors, each carrying the corresponding user identification information;
A2: inputting all the 12 audio feature vectors in the step A1 into a target neural network for feature extraction to obtain corresponding 12 voiceprint feature vectors;
a3: one of the 12 voiceprint feature vectors is taken in turn, the similarity between each of the other voiceprint feature vectors and the current voiceprint feature vector is calculated, and if the voiceprint feature vector with the highest similarity belongs to the same user as the current one, the model's identification is judged correct, otherwise it is judged incorrect; this is repeated until all 12 voiceprint feature vectors have been traversed;
a4: and D, counting the identification result in the step A3 to obtain the final accuracy.
The verification shows that the accuracy of this voiceprint recognition method is higher than that of voiceprint recognition through GE2E. For example, in one test result, the accuracy of the voiceprint recognition method using the GE2E network was 0.704, while the accuracy of voiceprint recognition using the target neural network of the embodiment of the present application was 0.805.
(II) Calculation speed test
B1: sample data of 2 sentences of voice of each user from 6 users, namely 6 multiplied by 2=12 sample data (12 npy files can be used) are obtained from a data set and input into a target neural network as a batch to be tested, and the time consumed by the operation of the target neural network is recorded;
B2: and (3) repeating the step B1 for 100 times to obtain 100 time-consuming data, removing the maximum value and the minimum value of the 100 time-consuming data to obtain the remaining 98 time-consuming data, averaging the 98 time-consuming data to obtain final time consumption, and comparing the final time consumption with the operation time consumption of the GE 2E-based voiceprint recognition method.
By comparison, the final time consumption of voiceprint recognition through the target neural network is lower than the running time of the GE2E-based voiceprint recognition method. For example, in one test result, the running time of the voiceprint recognition method using the GE2E network was 0.656 seconds, while the running time of voiceprint recognition using the target neural network of the embodiment of the present application was 0.040 seconds.
Embodiment III:
fig. 4 is a schematic structural diagram of a voiceprint recognition device according to an embodiment of the present application, and for convenience of explanation, only a portion related to the embodiment of the present application is shown:
the voiceprint recognition device includes: an audio feature vector acquisition unit 41, a target neural network unit 42, a determination unit 43. Wherein:
an audio feature vector obtaining unit 41, configured to obtain an audio feature vector of an audio to be identified, where the audio feature vector includes a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a set of spectral feature information in the spectral feature dimension.
Optionally, the audio feature vector obtaining unit 41 includes an audio obtaining module to be identified and an MFCC feature extracting module:
The audio to be identified acquisition module is used for acquiring the audio to be identified, filtering the silence of the audio to be identified and obtaining an effective audio segment; intercepting the effective audio segment according to the target duration to obtain target audio;
And the MFCC feature extraction module is used for extracting Mel-frequency cepstral coefficient (MFCC) features of the target audio to obtain an audio feature vector.
The target neural network unit 42 is configured to input the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, where the target neural network is composed of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network includes a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension.
Optionally, the target neural network unit includes a training module for acquiring sample data, wherein the sample data is audio data from different users, and inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity meet preset conditions, to obtain a trained target neural network; the intra-class audio similarity is the similarity between voiceprint feature vectors belonging to the same user, and the inter-class audio similarity is the similarity between voiceprint feature vectors belonging to different users.
Optionally, the training module is specifically configured to sequentially input preset batches of sample data into the target neural network for training until the value of the objective function meets the preset condition, so as to obtain the trained target neural network, wherein the objective function of the target neural network is the quantity Sc defined as follows:
Sc is the value of the objective function and represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch, M is the number of sample data corresponding to each user, v_i denotes the voiceprint feature vector obtained by passing any sample of the current batch through the target neural network model, P denotes the user corresponding to v_i, v_j is a voiceprint feature vector belonging to the same user as v_i, v_k is a voiceprint feature vector not belonging to the same user as v_i, sim{(v_i, v_j) | i ≠ j, v_i ∈ P, v_j ∈ P} denotes the cosine similarity of v_i and a voiceprint feature vector v_j belonging to P, and sim{(v_i, v_k) | v_i ∈ P, v_k ∉ P} denotes the cosine similarity of v_i and a voiceprint feature vector v_k of another user.
Optionally, the training module includes a learning rate adjustment module, configured to dynamically adjust a learning rate of the target neural network during training according to a preset target learning rate and a current training step number.
Optionally, the voiceprint recognition device is applied to a far-field recording scene, and the sample data comprises far-field recording data carrying background noise and a preset amount of noiseless audio data.
Optionally, the target neural network specifically includes a first convolution layer, a SENet module, a first reconstruction layer, a first fully connected layer, a second reconstruction layer, a dilated convolution network, a third reconstruction layer, an average pooling layer, and a second fully connected layer, and the target neural network unit 42 is specifically configured to:
Inputting the audio feature vector into a target neural network, and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a frequency spectrum feature dimension and a channel dimension;
the first eigenvector weights the information of each channel through the SENet module to obtain a second eigenvector;
the second feature vector sequentially passes through the first reconstruction layer, the first fully connected layer and the second reconstruction layer to obtain a third feature vector;
the third feature vector passes through the dilated convolution network, that is, sequentially through the multiple dilated convolution layers, which extract the context information of the third feature vector in the time dimension to obtain a fourth feature vector, wherein each dilated convolution layer contains a convolution kernel of size n × 1, n is a positive integer greater than 1, and "×" is a multiplication sign;
and the fourth feature vector sequentially passes through the third reconstruction layer, the average pooling layer and the second fully connected layer to obtain a target voiceprint feature vector of a target size.
A determining unit 43, configured to compare the target voiceprint feature vector with registered voiceprint feature vectors, and determine a target user corresponding to the audio to be identified.
Optionally, the determining unit 43 further includes:
and the indicating module is used for indicating the user to register the target voiceprint feature vector if the registered voiceprint feature vector matched with the target voiceprint feature vector is not found.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Embodiment four:
Fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: a processor 50, a memory 51 and a computer program 52, such as a voiceprint recognition program, stored in the memory 51 and executable on the processor 50. The processor 50, when executing the computer program 52, implements the steps of the respective voiceprint recognition method embodiments described above, such as steps S101 to S103 shown in fig. 1. Alternatively, the processor 50, when executing the computer program 52, performs the functions of the modules/units of the device embodiments described above, e.g. the functions of the units 41 to 43 shown in fig. 4.
By way of example, the computer program 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 52 in the terminal device 5. For example, the computer program 52 may be divided into an audio feature vector acquisition unit, a target neural network unit, and a determination unit, each unit specifically functioning as follows:
The audio feature vector acquisition unit is used for acquiring an audio feature vector of the audio to be identified, wherein the audio feature vector comprises a time dimension and a frequency spectrum feature dimension, and one unit time in the time dimension corresponds to one group of frequency spectrum feature information in the frequency spectrum feature dimension.
The target neural network unit is used for inputting the audio feature vector into the target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network comprises a SENet module, a hole convolution network and a full-connection layer, and the hole convolution network comprises a plurality of layers of hole convolution layers for extracting context information of the audio feature vector in a time dimension.
And the determining unit is used for comparing the target voiceprint feature vector with registered voiceprint feature vectors and determining a target user corresponding to the audio to be identified.
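The front end behind the audio feature vector acquisition unit (spelled out in claim 2 below as silence filtering, truncation to a target duration and MFCC extraction) can be sketched as follows with librosa; the 16 kHz sample rate, the top_db silence threshold, the 2-second target duration and the 40 MFCC coefficients are assumed values, not taken from the patent:

```python
import numpy as np
import librosa

def extract_audio_feature(path, sr=16000, target_sec=2.0, n_mfcc=40):
    """Silence filtering + fixed-length crop + MFCC; thresholds and sizes
    are assumptions for illustration only."""
    y, sr = librosa.load(path, sr=sr)
    # keep only non-silent intervals (energy-based silence filtering)
    intervals = librosa.effects.split(y, top_db=30)
    voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y
    # truncate or zero-pad to the target duration
    target_len = int(target_sec * sr)
    voiced = np.pad(voiced, (0, max(0, target_len - len(voiced))))[:target_len]
    # MFCC: one group of spectral features per unit time (frame)
    mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                               # shape: (time, spectral features)
```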
The terminal device 5 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor 50 and a memory 51. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the terminal device 5 and does not constitute a limitation of the terminal device 5, which may include more or fewer components than illustrated, or combine certain components, or have different components; for example, the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 50 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing the computer program as well as other programs and data required by the terminal device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be adjusted appropriately according to the requirements of legislation and patent practice in each jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (9)
1. A method of voiceprint recognition comprising:
Acquiring an audio feature vector of audio to be identified, wherein the audio feature vector comprises a time dimension and a frequency spectrum feature dimension, and one unit time in the time dimension corresponds to one group of frequency spectrum feature information in the frequency spectrum feature dimension;
Inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a hole convolution network and a full-connection layer, and the hole convolution network comprises a plurality of hole convolution layers for extracting context information of the audio feature vector in a time dimension;
Comparing the target voiceprint feature vector with registered voiceprint feature vectors to determine a target user corresponding to the audio to be identified;
The target neural network specifically comprises a first convolution layer, a SENet module, a first reconstruction layer, a first full-connection layer, a second reconstruction layer, a hole convolution network, a third reconstruction layer, an average pooling layer and a second full-connection layer, and the step of inputting the audio feature vector into the target neural network to obtain the target voiceprint feature vector corresponding to the audio feature vector comprises:
Inputting the audio feature vector into a target neural network, and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a frequency spectrum feature dimension and a channel dimension;
the information of each channel of the first feature vector is weighted through the SENet module to obtain a second feature vector;
the second feature vector sequentially passes through the first reconstruction layer, the first full-connection layer and the second reconstruction layer to obtain a third feature vector;
The third feature vector sequentially passes through the plurality of hole convolution layers of the hole convolution network, and the context information of the third feature vector in the time dimension is extracted to obtain a fourth feature vector, wherein each hole convolution layer comprises a convolution kernel with a size of n×1, n is a positive integer greater than 1, and '×' is a multiplication sign;
And the fourth feature vector sequentially passes through the third reconstruction layer, the average pooling layer and the second full connection layer to obtain a target voiceprint feature vector with a target size.
2. The voiceprint recognition method of claim 1, wherein the acquiring an audio feature vector of the audio to be recognized comprises:
acquiring audio to be identified, and filtering silence of the audio to be identified to obtain an effective audio segment; intercepting the effective audio segment according to the target duration to obtain target audio;
And extracting mel-frequency cepstral coefficient (MFCC) features of the target audio to obtain the audio feature vector.
3. The voiceprint recognition method of claim 1, further comprising, prior to said obtaining an audio feature vector for audio to be recognized:
obtaining sample data, wherein the sample data is from audio data of different users;
inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity meet preset conditions, so as to obtain a trained target neural network; wherein the intra-class audio similarity is the similarity between voiceprint feature vectors belonging to the same user, and the inter-class audio similarity is the similarity between voiceprint feature vectors belonging to different users.
4. The voiceprint recognition method of claim 3, wherein inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity satisfy a predetermined condition, and obtaining the trained target neural network comprises:
inputting preset sample data into the target neural network in sequence for training until the value of the objective function meets the preset condition, so as to obtain the trained target neural network, wherein the objective function of the target neural network is as follows:
Sc is the value of the objective function and represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data in the current batch; M is the number of sample data corresponding to each user; v_i denotes the voiceprint feature vector obtained by the target neural network model for any sample of the current batch; P denotes the user corresponding to v_i; v_j is a voiceprint feature vector belonging to the same user as v_i; v_k is a voiceprint feature vector not belonging to the same user as v_i; sim{(v_i, v_j) | i ≠ j, v_i ∈ P, v_j ∈ P} denotes the cosine similarity between v_i and a voiceprint feature vector v_j belonging to P, and sim{(v_i, v_k) | v_i ∈ P, v_k ∉ P} denotes the cosine similarity between v_i and a voiceprint feature vector v_k of another user.
5. The voiceprint recognition method of claim 3, wherein the learning rate of the target neural network during training is dynamically adjusted according to a preset target learning rate and a current number of training steps.
6. The voiceprint recognition method according to any one of claims 1 to 5, comprising, after said comparing said target voiceprint feature vector with registered voiceprint feature vectors:
and if no registered voiceprint feature vector matching the target voiceprint feature vector is found, prompting the current user to register the target voiceprint feature vector.
7. A voiceprint recognition apparatus, comprising:
an audio feature vector obtaining unit, configured to obtain an audio feature vector of an audio to be identified, where the audio feature vector includes a time dimension and a spectrum feature dimension, and a unit time in the time dimension corresponds to a set of spectrum feature information in the spectrum feature dimension;
The target neural network unit is used for inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a hole convolution network and a full-connection layer, and the hole convolution network comprises a plurality of hole convolution layers for extracting context information of the audio feature vector in a time dimension;
the determining unit is used for comparing the target voiceprint feature vector with registered voiceprint feature vectors and determining a target user corresponding to the audio to be identified;
The target neural network specifically comprises a first convolution layer, a SENet module, a first reconstruction layer, a first full-connection layer, a second reconstruction layer, a hole convolution network, a third reconstruction layer, an average pooling layer and a second full-connection layer, and the target neural network unit is specifically configured to:
Inputting the audio feature vector into a target neural network, and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a frequency spectrum feature dimension and a channel dimension;
the information of each channel of the first feature vector is weighted through the SENet module to obtain a second feature vector;
the second feature vector sequentially passes through the first reconstruction layer, the first full-connection layer and the second reconstruction layer to obtain a third feature vector;
The third feature vector sequentially passes through the plurality of hole convolution layers of the hole convolution network, and the context information of the third feature vector in the time dimension is extracted to obtain a fourth feature vector, wherein each hole convolution layer comprises a convolution kernel with a size of n×1, n is a positive integer greater than 1, and '×' is a multiplication sign;
And the fourth feature vector sequentially passes through the third reconstruction layer, the average pooling layer and the second full connection layer to obtain a target voiceprint feature vector with a target size.
8. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010062402.XA CN113223536B (en) | 2020-01-19 | 2020-01-19 | Voiceprint recognition method and device and terminal equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113223536A (en) | 2021-08-06
CN113223536B (en) | 2024-04-19
Family
ID=77085012
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010062402.XA Active CN113223536B (en) | 2020-01-19 | 2020-01-19 | Voiceprint recognition method and device and terminal equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113223536B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113697321A (en) * | 2021-09-16 | 2021-11-26 | 安徽世绿环保科技有限公司 | Garbage bag coding system for garbage classification station |
CN113716246A (en) * | 2021-09-16 | 2021-11-30 | 安徽世绿环保科技有限公司 | Resident rubbish throwing traceability system |
CN114780787A (en) * | 2022-04-01 | 2022-07-22 | 杭州半云科技有限公司 | Voiceprint retrieval method, identity verification method, identity registration method and device |
CN116844553B (en) * | 2023-06-02 | 2024-07-09 | 支付宝(杭州)信息技术有限公司 | Data processing method, device and equipment |
CN116741182B (en) * | 2023-08-15 | 2023-10-20 | 中国电信股份有限公司 | Voiceprint recognition method and voiceprint recognition device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107492382A (en) * | 2016-06-13 | 2017-12-19 | 阿里巴巴集团控股有限公司 | Voiceprint extracting method and device based on neutral net |
CN110010133A (en) * | 2019-03-06 | 2019-07-12 | 平安科技(深圳)有限公司 | Vocal print detection method, device, equipment and storage medium based on short text |
CN110309880A (en) * | 2019-07-01 | 2019-10-08 | 天津工业大学 | A kind of 5 days and 9 days hatching egg embryo's image classification methods based on attention mechanism CNN |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2019076527A (en) * | 2017-10-25 | 2019-05-23 | テルモ株式会社 | Treatment method |
Non-Patent Citations (3)
Title |
---|
Cheng-I Lai et al., "ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks," arXiv:1904.01120v1, pp. 1-5 *
Taejun Kim et al., "Comparison and Analysis of SampleCNN Architectures for Audio Classification," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 285-297 *
Lei Fan et al., "Semantic Segmentation With Global Encoding and Dilated Decoder in Street Scenes," IEEE Access, vol. 6, pp. 50333-50343 *
Also Published As
Publication number | Publication date |
---|---|
CN113223536A (en) | 2021-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN107610707B (en) | A kind of method for recognizing sound-groove and device | |
CN109584884B (en) | Voice identity feature extractor, classifier training method and related equipment | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN113327626A (en) | Voice noise reduction method, device, equipment and storage medium | |
CN110767239A (en) | Voiceprint recognition method, device and equipment based on deep learning | |
WO2020034628A1 (en) | Accent identification method and device, computer device, and storage medium | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
CN110880329A (en) | Audio identification method and equipment and storage medium | |
CN108154371A (en) | Electronic device, the method for authentication and storage medium | |
CN110556126A (en) | Voice recognition method and device and computer equipment | |
WO2019232826A1 (en) | I-vector extraction method, speaker recognition method and apparatus, device, and medium | |
CN108776795A (en) | Method for identifying ID, device and terminal device | |
CN111816185A (en) | Method and device for identifying speaker in mixed voice | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN111161713A (en) | Voice gender identification method and device and computing equipment | |
CN106971724A (en) | A kind of anti-tampering method for recognizing sound-groove and system | |
CN111666996A (en) | High-precision equipment source identification method based on attention mechanism | |
CN113539243A (en) | Training method of voice classification model, voice classification method and related device | |
CN112489678B (en) | Scene recognition method and device based on channel characteristics | |
CN112309404B (en) | Machine voice authentication method, device, equipment and storage medium | |
CN113948089B (en) | Voiceprint model training and voiceprint recognition methods, devices, equipment and media | |
CN111326161B (en) | Voiceprint determining method and device | |
CN114937454A (en) | Method, device and storage medium for preventing voice synthesis attack by voiceprint recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 516006 TCL science and technology building, No. 17, Huifeng Third Road, Zhongkai high tech Zone, Huizhou City, Guangdong Province; Applicant after: TCL Technology Group Co.,Ltd.; Address before: 516006 Guangdong province Huizhou Zhongkai hi tech Development Zone No. nineteen District; Applicant before: TCL Corp. |
GR01 | Patent grant | |