CN112331207B - Service content monitoring method, device, electronic equipment and storage medium - Google Patents

Service content monitoring method, device, electronic equipment and storage medium

Info

Publication number
CN112331207B
Authority
CN
China
Prior art keywords
service
voice
word
pronunciation
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011060127.4A
Other languages
Chinese (zh)
Other versions
CN112331207A (en)
Inventor
廖光朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Original Assignee
Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd filed Critical Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Priority to CN202011060127.4A
Publication of CN112331207A
Application granted
Publication of CN112331207B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a service content monitoring method, an apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a service phrase set and a voice to be recognized, the service phrase set comprising a plurality of service phrases; recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network; determining the service phrase matched with the voice to be recognized by performing character matching between the index network and the service phrase set; extracting a target keyword from the matched service phrase; and determining the service content according to the target keyword. Because the method monitors service content based on voice, which occupies far less storage than video, it can save the storage space consumed by video-based monitoring of service content.

Description

Service content monitoring method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of home care technologies, and in particular, to a service content monitoring method, a device, an electronic apparatus, and a storage medium.
Background
With the growing population of elderly people, home care services have expanded. In home care service, professionally trained service personnel provide elderly people living at home with care services of an agreed duration.
When service personnel provide home care service for an elderly person at home, the home care service manager first needs to confirm whether the service personnel have provided the preset service content agreed upon with the elderly person. At present, the behavior of service personnel is continuously monitored with video monitoring equipment, and the service content delivered by the service personnel is determined from the monitoring video. However, because monitoring video occupies a large amount of storage, determining service content from monitoring video requires a large storage space.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a service content monitoring method, apparatus, electronic device, and storage medium that can save storage space.
A method of service content monitoring, the method comprising:
acquiring a service phrase set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
Recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
Determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set;
extracting target keywords from service expressions matched with the voice to be recognized;
and determining service content according to the target keywords.
In one embodiment, the training step of the speech recognition model comprises:
Acquiring a sample text corresponding to the sample voice and a pronunciation dictionary; the sample text comprises at least one word to be annotated;
performing pronunciation marking on the word segmentation to be marked according to the pronunciation dictionary to obtain a tag sequence;
And training a voice recognition model based on the sample voice and the corresponding label sequence.
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation tags; the step of performing pronunciation annotation on the word to be annotated according to the pronunciation dictionary comprises the following steps:
performing word segmentation matching on the word to be marked and the pronunciation dictionary, and judging whether pronunciation word segmentation matched with the word to be marked exists in the pronunciation dictionary or not based on a word segmentation matching result;
when the pronunciation dictionary has pronunciation word segments matched with the to-be-annotated word segments, annotating the to-be-annotated word segments according to pronunciation labels corresponding to the matched pronunciation word segments;
When the pronunciation dictionary does not have pronunciation word segmentation matched with the word segmentation to be marked, segmenting the word segmentation to be marked based on a preset rule to obtain word segmentation fragments;
And taking the word segmentation segment as a word to be marked, and returning to the step of word segmentation matching of the word to be marked and the pronunciation dictionary until the pronunciation dictionary has pronunciation word segmentation matched with the word to be marked.
In one embodiment, the speech recognition model includes a speech separation enhancement model and a target recognition model; the training step of the voice recognition model comprises the following steps:
Acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model;
back propagation is carried out based on the second loss function so as to train an intermediate model bridged between the voice separation enhancement model and the target recognition model, and a robust representation model is obtained;
fusing the first loss function and the second loss function to obtain a target loss function;
and carrying out joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and ending training when a preset convergence condition is met.
In one embodiment, the identifying the speech to be identified based on the pre-trained speech recognition model, obtaining the index network includes:
extracting voice characteristics of the voice to be recognized, and determining pinyin of each word in the voice to be recognized based on the voice characteristics; the pinyin is composed of one to a plurality of sound units;
determining a mapping relation between a sound unit in the pinyin and a corresponding fuzzy sound;
Determining candidate character sequences according to the pronunciation dictionary and the mapping relation;
And generating an index network based on the candidate text sequence.
In one embodiment, the determining the service phrase matching the voice to be recognized by performing character matching on the index network and the service phrase set includes:
determining service expression matched with each candidate character sequence in the index network by carrying out character matching on the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to the matched service term;
Screening a target text sequence from the candidate text sequences based on the offset distance;
and judging the service expression matched with the target text sequence as the service expression matched with the voice to be recognized.
In one embodiment, the method further comprises:
determining all extracted target keywords;
determining the generation time of each target keyword;
and generating a nursing report based on the generation time and the target keyword.
A service content monitoring apparatus, the apparatus comprising:
The index network generation module is used for acquiring a service phrase set and voice to be recognized; the service expression set comprises a plurality of service expressions; recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
the target keyword extraction module is used for determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set; and extracting target keywords from the service expressions matched with the voice to be recognized.
And the service content determining module is used for determining the service content according to the target keywords.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a service phrase set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
Recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
Determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set;
extracting target keywords from service expressions matched with the voice to be recognized;
and determining service content according to the target keywords.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
in the home-based care service process, a service phrase set and a voice to be recognized are obtained; the service expression set comprises a plurality of service expressions;
Recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
Determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set;
extracting target keywords from service expressions matched with the voice to be recognized;
and determining the service content of the home-based care service according to the target keywords.
According to the service content monitoring method, the device, the electronic equipment and the storage medium, the voice to be recognized can be recognized based on the pre-trained voice recognition model by acquiring the service expression set and the voice to be recognized, so that an index network containing a plurality of candidate recognition results is obtained; through character matching of the index network and the service term set, service terms which can represent the voice to be recognized most can be screened out from the service term set, so that the target keywords extracted based on the service terms which can represent the voice to be recognized most are more accurate; by extracting the target keywords, the service content can be determined according to the word meanings of the target keywords, so that the service content of service personnel can be effectively monitored. Because the application monitors the service content based on the voice with smaller storage space, compared with the traditional content monitoring based on video, the application can effectively save the storage space consumed in the monitoring of the service content.
Drawings
FIG. 1 is an application environment diagram of a method of service content monitoring in one embodiment;
FIG. 2 is a flow diagram of a method for monitoring service content in one embodiment;
FIG. 3 is a schematic diagram of an indexing network in one embodiment;
FIG. 4 is a flow chart of a method of training a speech recognition model in one embodiment;
FIG. 5 is a block diagram of a service content monitoring device in one embodiment;
Fig. 6 is an internal structural diagram of an electronic device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The service content monitoring method provided by the application can be applied to the application environment shown in figure 1, in which a microphone box 102 communicates with a host box 104 via a network. While a service person provides home care service for an elderly person at home, the microphone box 102 is worn by the service person and is used to capture the service person's voice and recognize it to obtain a target keyword; when the target keyword is consistent with the service item currently being executed by the service person, it is determined that the service person has executed that service item. The microphone box 102 sends the target keyword to the host box 104, which stores the target keyword.
In one embodiment, as shown in fig. 2, a service content monitoring method is provided, and the method is applied to the microphone box in fig. 1 for illustration, and includes the following steps:
s202, acquiring a service phrase set and voice to be recognized.
The service phrase set refers to a set containing at least one service phrase. A service phrase is a standard expression, conveying respect and friendliness, used by service personnel when communicating with the served object during home care service; for example, a service phrase may be "Is the water temperature suitable?" or "I will now wash your hair for you". The voice to be recognized refers to audio information collected by the microphone box in real time.
Specifically, when it is determined that service personnel start to provide home-based service for home old people, the microphone box acquires a service phrase set and acquires audio information in real time to obtain voice to be recognized.
In one embodiment, the host box has pre-stored therein correspondence between service items and subsets of service terms, different subsets of service terms corresponding to different service items. The service items refer to service contents which are provided by service personnel in the whole home care service process, for example, the service items can be hair washing, massage and the like. It is readily understood that service personnel can provide a variety of different service items for served objects throughout home care service. The service expression subset refers to a canonical expression associated with a service item that should be used by a service person in executing the service item.
Before providing home care services for the served object, the service personnel can agree with service items which should be provided by the served object and generate order data based on the agreed care service items, so that the microphone box can acquire a corresponding service phrase subset based on the care service items in the order data and generate a service phrase set based on the acquired service phrase subset.
In one embodiment, the order data includes service items and service times for each service item. The host box obtains order data and obtains a corresponding service expression set based on service items in the order data. The host box determines service content which is needed to be carried out by service personnel at the current moment according to the service time in the order data, and correspondingly displays service expression associated with the service content in a local screen, so that the service personnel can carry out language communication with the served object by using the standard expression according to the screen prompt information.
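To illustrate how order data can drive the construction of the service phrase set and the determination of the currently executed service item, the following Python sketch uses assumed field names (item, start, end) and an assumed phrase-subset mapping; neither is specified by the application.

```python
from datetime import datetime

# Hypothetical order data: each service item with its agreed time window.
# Field names and the time format are illustrative assumptions.
order_data = [
    {"item": "hair washing", "start": "09:00", "end": "09:30"},
    {"item": "massage",      "start": "09:30", "end": "10:15"},
]

# Hypothetical mapping from service items to their service phrase subsets.
service_phrase_subsets = {
    "hair washing": ["shall we start hair washing now", "is the water temperature suitable"],
    "massage":      ["shall we start the massage now", "is this pressure comfortable"],
}

def build_service_phrase_set(order):
    """Collect the phrase subsets of all ordered service items into one set."""
    phrases = []
    for entry in order:
        phrases.extend(service_phrase_subsets.get(entry["item"], []))
    return phrases

def current_service_item(order, now):
    """Return the service item whose agreed time window contains `now`."""
    t = now.strftime("%H:%M")
    for entry in order:
        if entry["start"] <= t < entry["end"]:
            return entry["item"]
    return None

print(build_service_phrase_set(order_data))
print(current_service_item(order_data, datetime.strptime("09:40", "%H:%M")))
```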
S204, recognizing the voice to be recognized based on the pre-trained voice recognition model to obtain an index network.
Wherein, the voice recognition is to convert the input voice signal into the corresponding text. The speech recognition model refers to a machine learning model with speech feature extraction capabilities. A speech feature is data that reflects an audio feature. The voice features may be one or more of tone color, pronunciation, frequency spectrum, etc.
Specifically, a voice recognition model is preset in the microphone box. The speech recognition model includes an endpoint detection sub-model, an acoustic sub-model, and a language sub-model. Wherein the endpoint detection submodel is used to separate speech signals from non-speech signals. The endpoint detection sub-model carries out framing processing on the voice to be recognized, extracts characteristic parameters of the voice frame, and determines a voice segment and a non-voice segment based on the characteristic parameters of the voice frame. More specifically, the speech segments and the non-speech segments may be determined based on short-time energy and zero-crossing rate, entropy of information, short-time energy frequency values, template matching, and the like.
The acoustic submodel is a model describing the relationship between speech features and speech modeling units, and is an important part of the speech recognition system. Conventional speech recognition models commonly employ a GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) acoustic model, in which the GMM models the distribution of speech acoustic features and the HMM models the timing of the speech signal. The GMM is essentially a shallow model with limited ability to describe the distribution of acoustic feature states, and its recognition accuracy is low when the amount of training speech data is huge. The application adopts a CNN-HMM (Convolutional Neural Network - Hidden Markov Model) for acoustic modeling. The CNN is a deep model that can adaptively fit the distribution of any data by adjusting its parameters, so higher recognition accuracy can be achieved.
After the voice fragment is obtained, the acoustic submodel performs feature extraction on the voice fragment, and recognizes the voice based on the extracted feature information to obtain a pinyin sequence corresponding to the voice to be recognized. For example, when the voice to be recognized is "washing hair is good", the pinyin sequence obtained through the acoustic submodel is "xi ge tou hao ma".
The language submodel is used for predicting the occurrence probability of the candidate character sequence corresponding to the pinyin sequence and generating an index network based on the occurrence probability. Due to the existence of homophones, when the Pinyin sequence is obtained, the language submodel determines N-1 characters through the Pinyin sequence and predicts the occurrence probability of the next character based on the N-1 characters, so that one or more candidate character sequences corresponding to the Pinyin sequence are obtained, and an index network is generated based on the obtained candidate character sequences. For example, when the pinyin sequence is "xi ge tou hao ma", the text characters corresponding to "xi" may be "wash" and "xi", the text characters predicted based on "wash" and "ge tou" may be "single" and the text characters predicted based on "west" and "ge tou" may be "follow", and the text characters predicted based on "wash", "single", "hao ma", and "west", "follow", "hao ma" are all "good", the generated index network is as shown in fig. 3. FIG. 3 is a schematic diagram of an index network, in one embodiment. The candidate character sequence takes a starting node as a starting point, takes an ending node as an ending point, and is formed by connecting the nodes and line segments, for example, the candidate character sequence is obtained by washing hair.
S206, determining the service expression matched with the voice to be recognized by performing character matching on the index network and the service expression set.
Specifically, the microphone box determines the service phrase matched with each candidate character sequence in the index network by performing character matching between the index network and the service phrase set, and calculates the offset distance of each candidate character sequence relative to its matched service phrase. The offset distance refers to the ratio of the number of characters of the candidate sequence that are not present in the matched service phrase to the number of characters that are present in it, with marker symbols not counted. For example, when the candidate character sequence is "wash a head good" and the matched service phrase is "wash head good", one character of the candidate sequence is not present in the matched phrase and four characters are, so the offset distance is 1/4. The microphone box takes the candidate character sequence with the smallest offset distance as the target character sequence, and determines the service phrase matched with the target character sequence to be the service phrase matched with the voice to be recognized.
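The offset-distance calculation and the selection of the target character sequence can be sketched as follows; the helper names and toy strings are illustrative, and marker symbols are assumed to have been stripped before the comparison.

```python
def offset_distance(candidate, phrase):
    """Ratio of candidate characters absent from the matched service phrase
    to candidate characters present in it."""
    present = sum(1 for ch in candidate if ch in phrase)
    absent = len(candidate) - present
    if present == 0:
        return float("inf")  # no overlap at all
    return absent / present

def pick_target_sequence(candidates_with_matches):
    """Given (candidate_sequence, matched_phrase) pairs, return the pair
    with the smallest offset distance."""
    return min(candidates_with_matches,
               key=lambda pair: offset_distance(pair[0], pair[1]))

# Toy illustration of the 1 absent / 4 present = 0.25 case described above.
print(offset_distance("ABCDE", "ABDE"))  # -> 0.25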
In one embodiment, the microphone box analyzes the order data to obtain all the service items that the current home care service should perform, and determines the specific service time period of each service item. When an index network corresponding to the voice to be recognized is generated, the microphone box determines the acquisition time for acquiring the voice to be recognized, and determines the service item currently being executed by the service personnel based on the acquisition time and the specific service time period of each service item. The microphone box screens candidate service expressions associated with the currently executing service item from the service expression set, and performs character matching on the candidate character sequences and the candidate service expressions to obtain the service expressions matched with each candidate character sequence. By determining the collection time for collecting the voice to be recognized and the specific service time period of each service item, candidate service words associated with the currently executed service item can be screened out from the service word set, so that the microphone box only needs to perform character matching on the screened candidate service words, and does not need to perform character matching on the whole service set, and the matching efficiency is greatly improved.
S208, extracting target keywords from the service expressions matched with the voice to be recognized.
S210, determining service contents according to the target keywords.
The target keywords refer to keywords capable of representing service items, for example, when the service content is hair washing and massaging, the corresponding target keywords can be hair washing and massaging.
Specifically, the service manager marks the target keyword in each service phrase in advance, so that the target keyword can be extracted from the service phrase matched with the voice to be recognized based on the marking result. For example, the target keyword "hair washing" can be marked in advance with "<s>" and "</s>" to obtain the service phrase "start <s>hair washing</s> good"; the microphone box then only needs to identify "<s>" and "</s>" to extract the target keyword from the matched service phrase, and determines the service content of the pension service based on the target keyword.
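A minimal sketch of extracting keywords marked with "<s>" and "</s>"; the regular-expression approach is one possible implementation, not the one prescribed by the application.

```python
import re

# Target keywords are assumed to be wrapped in <s>...</s> inside each service phrase.
KEYWORD_PATTERN = re.compile(r"<s>(.*?)</s>")

def extract_target_keywords(marked_phrase):
    """Return every keyword annotated with <s>...</s> in a service phrase."""
    return KEYWORD_PATTERN.findall(marked_phrase)

print(extract_target_keywords("start <s>hair washing</s> good"))  # ['hair washing']
```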
In one embodiment, when determining a specific service period of each pension service item based on the order data and determining a collection time of the voice to be recognized, the microphone box judges the service item that should be performed by the current service person based on the collection time and the specific service period, compares the service item that should be performed with the target keyword, and when the service item that should be performed is consistent with the target keyword, it may be determined that the service person is providing the corresponding pension service according to the order data.
In one embodiment, when determining a target keyword in the speech to be recognized, the microphone box sends the target keyword to the host box, which stores the target keyword correspondingly. And then, the microphone box correspondingly deletes the voice to be recognized so as to protect the privacy of service personnel and the serviced object.
In the service content monitoring method in the home care service process, the voice to be recognized can be recognized based on the pre-trained voice recognition model by acquiring the service phrase set and the voice to be recognized, so that an index network containing a plurality of candidate recognition results is obtained; through character matching of the index network and the service term set, the service term which can represent the voice to be recognized most can be screened out from the service term set, so that service contents of service personnel can be monitored effectively. Because the application monitors the service content based on the voice with smaller storage space, compared with the traditional content monitoring based on video, the application can effectively save the storage space consumed in the monitoring of the service content.
In one embodiment, the training step of the speech recognition model comprises: acquiring a sample text corresponding to the sample voice and a pronunciation dictionary; the sample text comprises at least one word to be annotated; performing pronunciation marking on the word to be marked according to the pronunciation dictionary to obtain a label sequence; the speech recognition model is trained based on the sample speech and the corresponding tag sequence.
Here, sample speech refers to speech data used to train the speech recognition model. Sample text refers to the text data obtained after the sample speech has been transcribed. The sample text includes positive samples, which are text data containing the target keywords, and negative samples, which are text data not containing the target keywords. The pronunciation dictionary refers to a dictionary that maps word segments to their pronunciations, expressed as initials, finals and tones. The pronunciation dictionary contains the pronunciation of every word and word segment in the sample data.
Specifically, the model trainer obtains as many sample speech recordings as possible, manually transcribes the sample speech to obtain the corresponding sample texts, and then inputs the sample texts corresponding to the sample speech, together with the pronunciation dictionary, into the speech recognition model. The speech recognition model performs word segmentation on the sample text to obtain a plurality of word segments to be annotated, queries the pronunciation labels corresponding to the word segments to be annotated in the pronunciation dictionary, and performs pronunciation annotation on the word segments based on these pronunciation labels. The microphone box then combines the pronunciation labels corresponding to the word segments to be annotated to obtain a label sequence. For example, the labeling format of each word in the pronunciation dictionary is initial, final and tone, where 1-4 correspond to the four tones and 5 is the neutral tone, so the pronunciation label corresponding to the word segment "Arjiu" to be annotated may be "aa a1 j iu3".
Further, the voice recognition model carries out model training on the acoustic submodel and the language submodel based on the sample voice and the corresponding tag sequence until the trained model parameters meet the preset requirements.
In one embodiment, the speech recognition may be performed on the sample speech to obtain a corresponding sample text, and word segmentation may be performed on the sample text. Because the recognition accuracy of the long keywords is lower than that of the short keywords, in order to improve the recognition accuracy of the keywords, the long keywords can be split into the short keywords, for example, "hair washing service" can be split into "hair washing/service", wherein "/" is a word segmentation symbol.
In the embodiment, the pronunciation dictionary is used for automatically carrying out pronunciation marking processing on the segmented words to be marked, so that compared with the traditional manual pronunciation marking, the method and the device can improve marking efficiency and save manpower resources consumed during manual pronunciation marking.
In one embodiment, labeling the word to be labeled according to the pronunciation dictionary includes: performing word segmentation matching on the word segmentation to be marked and the pronunciation dictionary, and judging whether pronunciation word segmentation matched with the word segmentation to be marked exists in the pronunciation dictionary or not based on the word segmentation matching result; when the pronunciation dictionary has pronunciation word segmentation matched with the word segmentation to be marked, marking the word segmentation to be marked according to the pronunciation label corresponding to the matched pronunciation word segmentation; when the pronunciation dictionary does not have pronunciation word segmentation matched with the word segmentation to be marked, segmenting the word segmentation to be marked based on a preset rule to obtain word segmentation fragments; and taking the word segmentation segment as a word to be marked, and returning to the step of matching the word to be marked with the pronunciation dictionary until the pronunciation dictionary has pronunciation word matched with the word to be marked.
The pronunciation dictionary comprises pronunciation word segmentation and corresponding pronunciation labels. The pronunciation word segmentation refers to single word segmentation or characters, and the pronunciation label refers to label information obtained by labeling the pronunciation word segmentation in a labeling format of initials, finals and tones.
Specifically, the microphone box performs word segmentation matching on the word to be marked and each pronunciation word in the pronunciation dictionary, and judges whether pronunciation words matched with the word to be marked exist in the pronunciation dictionary according to a matching result. When the pronunciation dictionary has the same pronunciation word with the word to be marked, the microphone box takes the pronunciation label corresponding to the same pronunciation word as the marking result of the word to be marked. When the pronunciation dictionary does not have the same pronunciation word segmentation as the word to be marked, the microphone box segments the word to be marked based on a preset rule to obtain word segmentation fragments. For example, the preset rule may be to divide the word to be marked by using the intermediate character as a dividing point, so that when the word to be marked is "litchi radix field", the word to be marked may be divided into "litchi radix field" and "field" based on the preset rule.
Further, the microphone box takes each word segment as a word to be marked, and returns to the step of matching the word to be marked with the pronunciation dictionary until the pronunciation dictionary has pronunciation word matched with the word to be marked. For example, when the pronunciation dictionary does not have pronunciation word segmentation matched with the 'litchi aster', the microphone box further divides the 'litchi aster' to obtain the 'litchi' and the 'aster', and marks the 'litchi' and the 'aster' based on the pronunciation dictionary respectively.
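The look-up-then-split annotation procedure can be sketched as follows. The dictionary entries are invented placeholders (single letters standing in for Chinese characters), and "split at the middle character" follows the example rule mentioned above.

```python
# Toy pronunciation dictionary: a known two-character word plus single-character entries.
pronunciation_dict = {
    "AB": "aa a1 b o2",
    "C":  "c iy3",
    "D":  "d e5",
}

def annotate(word):
    """Look the word up in the pronunciation dictionary; if it is absent,
    split it at the middle character and annotate the fragments recursively."""
    if word in pronunciation_dict:
        return pronunciation_dict[word]
    if len(word) <= 1:
        return "<unk>"  # single character not in the dictionary: give up
    mid = len(word) // 2
    return annotate(word[:mid]) + " " + annotate(word[mid:])

print(annotate("ABCD"))  # -> 'aa a1 b o2 c iy3 d e5'
```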
In this embodiment, because the word segment to be annotated can be repeatedly split based on the pronunciation dictionary, pronunciation labeling can still be performed even when the word segment to be annotated is a rare word.
In one embodiment, as shown in FIG. 4, the training step of the speech recognition model includes:
S402, acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model.
S404, back propagation is performed based on the second loss function so as to train an intermediate model bridged between the voice separation enhancement model and the target recognition model, and a robust representation model is obtained.
S406, fusing the first loss function and the second loss function to obtain a target loss function.
S408, performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and ending training when the preset convergence condition is met.
The voice recognition model comprises a voice separation enhancement model and a target recognition model; the object recognition model includes an acoustic submodel and a language submodel. The speech separation enhancement model is a model with speech separation and/or enhancement capability after training, and specifically may be a model obtained by taking sample speech as training data and performing learning training to separate target speech from background interference in the sample speech. It will be appreciated that the speech separation enhancement model may also have the capability to pre-process speech signals for speech activity detection (Voice Activity Detection, VAD), echo cancellation, reverberation cancellation, or sound source localization, without limitation. The target recognition model is an acoustic model with voice recognition capability after training, and specifically can be a model for carrying out phoneme recognition on sample voice obtained by learning and training by taking sample voice and a tag sequence as training data. The speech separation enhancement model and the target recognition model may each be pre-trained. The pre-trained speech separation enhancement model and the speech recognition model each have a fixed model structure and model parameters.
Specifically, in order to further improve the recognition accuracy of the voice model, a voice separation enhancement model may be added to the voice model, and the voice model may be further trained based on the voice separation enhancement model. When the joint model training is needed, the microphone box acquires a pre-trained voice separation enhancement model and a target recognition model, and a first loss function adopted when the voice separation enhancement model is pre-trained and a second loss function adopted when the target recognition model is pre-trained. The loss function (loss function) is typically associated with an optimization problem as a learning criterion, i.e., solving and evaluating the model by minimizing the loss function. The first loss function adopted by the pre-training voice separation enhancement model and the second loss function adopted by the pre-training voice recognition model can be mean square error, average absolute value error, log-Cosh loss, quantile loss, ideal quantile loss and the like.
The traditional approach mainly divides the speech processing task into two completely independent subtasks: a speech separation task and a target recognition task. In this way, the speech separation enhancement model and the target recognition model are trained separately and modularly in the training stage, while in the production and test stage the enhanced speech to be recognized output by the speech separation enhancement model is input into the target recognition model for recognition. It is easy to see that this approach does not solve well the problem of the difference between the two characterization categories. In practical application scenarios such as home care service, the voice to be recognized is commonly affected by background music or interference from multiple speakers. The speech separation enhancement model therefore introduces relatively serious distortion during front-end speech processing, and this distortion is not considered during the training stage of the target recognition model, so directly cascading the independent front-end speech separation enhancement model and the back-end target recognition model seriously degrades the final speech recognition performance.
To overcome the difference between the two characterization categories, embodiments of the present application bridge the intermediate model to be trained between the speech separation enhancement model and the target recognition model. The trained intermediate model may be referred to as a robust characterization model. More specifically, the microphone box determines the local gradient of descent of the second loss function generated during each iteration according to a preset deep learning optimization algorithm. And the microphone box reversely propagates the local descending gradient to the middle model so as to update model parameters corresponding to the middle model, and the training is ended when the preset training stopping condition is met.
The microphone box obtains a target loss function by performing a preset logical operation on the first loss function and the second loss function. Taking weighted summation as an example, assuming that the weighting factor is λSS, the target loss function is L = L2 + λSS·L1, where L1 is the first loss function and L2 is the second loss function. The weighting factor may be a numerical value set empirically or experimentally, such as 0.1. It is readily seen that the importance of the speech separation enhancement model in multimodal joint training can be adjusted by adjusting the weighting factor. The microphone box determines the global descent gradient generated by the target loss function according to a preset deep learning optimization algorithm. The deep learning optimization algorithm for determining the local descent gradient may be the same as or different from the deep learning optimization algorithm for determining the global descent gradient. The global descent gradient generated by the target loss function is back-propagated in turn from the target recognition model to each network layer of the robust characterization model and the speech separation enhancement model, and in this process the model parameters corresponding to the speech separation enhancement model, the robust characterization model and the target recognition model are iteratively updated, until training ends when the preset training stop condition is met.
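A minimal PyTorch sketch of one joint-training step under the fused loss L = L2 + λSS·L1. The tiny linear layers stand in for the speech separation enhancement model, the bridged robust characterization model and the target recognition model; the concrete losses (MSE for separation, cross-entropy for recognition) and the toy data are assumptions for illustration only.

```python
import torch
import torch.nn as nn

separation_model  = nn.Linear(16, 16)   # front-end speech separation/enhancement (toy)
robust_model      = nn.Linear(16, 16)   # bridged intermediate (robust) model (toy)
recognition_model = nn.Linear(16, 8)    # back-end target recognition model (toy)

params = (list(separation_model.parameters())
          + list(robust_model.parameters())
          + list(recognition_model.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

lambda_ss = 0.1                          # weighting factor for the separation loss
mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()

noisy  = torch.randn(4, 16)              # toy noisy speech features
clean  = torch.randn(4, 16)              # toy clean reference for the separation loss
labels = torch.randint(0, 8, (4,))       # toy recognition labels

optimizer.zero_grad()
enhanced = separation_model(noisy)
logits = recognition_model(robust_model(enhanced))

l1 = mse(enhanced, clean)                # first loss: separation/enhancement
l2 = ce(logits, labels)                  # second loss: target recognition
loss = l2 + lambda_ss * l1               # fused target loss L = L2 + λSS·L1
loss.backward()                          # gradients flow through all three models
optimizer.step()
```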
In this embodiment, the intermediate model performs training by means of the second loss function back-propagation of the rear-end target recognition model, and the speech separation enhancement model and the target recognition model may be pre-trained, so that convergence can be achieved after a small number of iterative training times. In addition, the combination of the front end model and the rear end model which correspond to the loss function is used for carrying out joint training on the end-to-end network model, so that each independent model in the network architecture can comprehensively learn the interference characteristics from the voice signals in the complex acoustic environment, the performance of the overall voice processing task can be ensured, and the voice recognition accuracy is improved.
In one embodiment, identifying speech to be identified based on a pre-trained speech recognition model, obtaining an index network includes: extracting voice characteristics of voice to be recognized, and determining pinyin of each word in the voice to be recognized based on the voice characteristics; pinyin consists of one to a plurality of sound units; determining a mapping relation between a sound unit in pinyin and a corresponding fuzzy sound; determining candidate character sequences according to the pronunciation dictionary and the fuzzy sounds; an indexing network is generated based on the candidate word sequences.
Wherein, the fuzzy sound in the pinyin can be a sound unit close to the pinyin pronunciation. The ambiguous sounds may be generated due to the same semantic meaning pronouncing differently in different dialects. The sound unit refers to the initial consonant or final sound composing the pinyin.
Specifically, when the acoustic submodel obtains the voice to be recognized output by the voice separation enhancement model, voice features in the voice to be recognized can be extracted based on a preset convolution kernel. For example, pronunciation characteristics in the speech to be recognized are extracted. Meanwhile, the acoustic submodel inputs the voice characteristics into the language submodel, and the language submodel determines the pinyin corresponding to each word in the voice to be recognized according to the voice characteristics. The language sub-model obtains a fuzzy sound table, and queries all sound units in the pinyin of each word by using the fuzzy sound table to obtain the sound units with fuzzy sound, so that the mapping relation between the sound units with fuzzy sound and the fuzzy sound is established. For example, when the sound unit is "g", the ambiguous sound determined based on the ambiguous sound table is "j".
Further, the language sub-model combines the sound unit and the fuzzy sound based on the mapping relation to obtain one or more candidate pinyin corresponding to each word. For example, when the sound units are "g" and "ai", and the ambiguous sounds determined based on the ambiguous sound table are "j" and "ei", the candidate pinyin obtained by combining are "gai", "jei", "gei" and "jai". The language sub-model queries pronunciation word segments corresponding to the candidate pinyin in a pronunciation dictionary, generates candidate character sequences based on the pronunciation word segments corresponding to each word segment in the speech to be recognized, and generates an index network according to the candidate character sequences.
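Expanding sound units with their fuzzy variants and enumerating candidate character sequences can be sketched as follows; the fuzzy-sound table and the toy dictionary entries are illustrative assumptions.

```python
from itertools import product

# Toy fuzzy-sound table: each sound unit maps to itself plus its fuzzy variants.
fuzzy_table = {"g": ["g", "j"], "ai": ["ai", "ei"]}

def candidate_pinyins(units):
    """Combine each sound unit with its fuzzy variants into candidate pinyin."""
    variants = [fuzzy_table.get(u, [u]) for u in units]
    return ["".join(combo) for combo in product(*variants)]

# Looking candidate pinyin up in a (toy) pronunciation dictionary and taking the
# cross product over word positions yields the candidate character sequences.
toy_dict = {"gai": ["盖", "改"], "gei": ["给"]}

def index_network(per_word_units):
    per_word_chars = []
    for units in per_word_units:
        chars = []
        for py in candidate_pinyins(units):
            chars.extend(toy_dict.get(py, []))
        per_word_chars.append(chars or ["<unk>"])
    return ["".join(seq) for seq in product(*per_word_chars)]

print(candidate_pinyins(["g", "ai"]))    # -> ['gai', 'gei', 'jai', 'jei']
print(index_network([["g", "ai"]]))      # -> ['盖', '改', '给']
```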
In this embodiment, the recognition results of the same word by the speech recognition model may be different due to the influence of the dialect, and multiple candidate recognition results can be obtained by the method provided by this embodiment, so that the target recognition result can be determined from the multiple candidate recognition results, and thus, the influence of the dialect on the recognition result can be effectively overcome.
In one embodiment, determining the service phrase that matches the speech to be recognized by character matching the indexing network with the set of service phrases includes: through carrying out character matching on the index network and the service term set, determining the service term matched with each candidate character sequence in the index network; calculating the offset distance of each candidate character sequence relative to the matched service term; screening a target text sequence from the candidate text sequences based on the offset distance; and judging the service expression matched with the target text sequence as the service expression matched with the voice to be recognized.
Specifically, the microphone box traverses each candidate character sequence in the index network and performs character matching between each candidate character sequence and each service phrase in the service phrase set, until the service phrase matched with every candidate character sequence in the index network has been determined. More specifically, the microphone box determines the candidate character sequence of the current traversal order and takes the service phrase having the largest number of repeated characters with that candidate character sequence as the service phrase matching it. For example, when the candidate character sequence of the current traversal order is "wash a head good" and the service phrases in the set are "wash head good" and "hair washing service is started", the service phrase having the largest number of repeated characters with "wash a head good" is "wash head good".
Further, the microphone box calculates the offset distance of each candidate word sequence relative to the matched service words, takes one candidate word sequence with the smallest offset distance as a target word sequence, and judges the service words matched with the target word sequence as the service words matched with the voice to be recognized.
In this embodiment, since the service term with the smallest offset distance is determined as the service term corresponding to the voice to be recognized, the service term screened based on the offset distance is the language text that can represent the voice to be recognized most, so that the target keyword determined based on the language text that can represent the voice to be recognized most accurately.
In one embodiment, the service content monitoring method further includes: determining all extracted target keywords; determining the generation time of each target keyword; a care report is generated based on the generation time and the target keyword.
Specifically, when it is determined that the home care service is completed, the microphone box acquires all target keywords extracted in the home care process, determines generation time of each target keyword, generates a care report according to each target keyword and the generation time of each target keyword, and then sends the generated care report to the served object.
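Assembling the care report from the extracted target keywords and their generation times might look like the following sketch; the record format and report layout are assumptions.

```python
from datetime import datetime

def generate_care_report(keyword_records):
    """keyword_records: list of (generation_time, target_keyword) tuples."""
    lines = ["Care report"]
    for ts, keyword in sorted(keyword_records):
        lines.append(f"{ts:%H:%M} - {keyword}")
    return "\n".join(lines)

records = [
    (datetime(2020, 9, 30, 9, 5), "hair washing"),
    (datetime(2020, 9, 30, 9, 40), "massage"),
]
print(generate_care_report(records))
```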
In this embodiment, by generating the nursing report, the family of the served object can know the specific service items provided by the service personnel in the home care service process according to the nursing report.
It should be understood that, although the steps in the flowcharts of fig. 2 and 4 are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2,4 may include steps or stages that are not necessarily performed at the same time, but may be performed at different times, or the order in which the steps or stages are performed is not necessarily sequential, but may be performed in rotation or alternatively with at least some of the other steps or stages.
In one embodiment, as shown in fig. 5, there is provided a service content monitoring apparatus 500, comprising: an index network generation module 502, a target keyword extraction module 504, and a service content determination module 506, wherein:
the index network generation module 502 is configured to obtain a service phrase set and a voice to be recognized; the service expression set comprises a plurality of service expressions; and recognizing the voice to be recognized based on the pre-trained voice recognition model to obtain an index network.
The target keyword extraction module 504 is configured to determine a service phrase matched with the voice to be recognized by performing character matching on the index network and the service phrase set; and extracting the target keywords from the service expressions matched with the voice to be recognized.
The service content determining module 506 is configured to determine service content of the service according to the target keyword.
In one embodiment, the index network generation module 502 further includes a model training module 5021 for obtaining a sample text corresponding to the sample speech and a pronunciation dictionary; the sample text comprises at least one word to be annotated; performing pronunciation marking on the word to be marked according to the pronunciation dictionary to obtain a label sequence; the speech recognition model is trained based on the sample speech and the corresponding tag sequence.
In one embodiment, the model training module 5021 is further configured to match the word to be annotated with the pronunciation dictionary, and determine whether a pronunciation word matched with the word to be annotated exists in the pronunciation dictionary based on the word matching result; when the pronunciation dictionary has pronunciation word segmentation matched with the word segmentation to be marked, marking the word segmentation to be marked according to the pronunciation label corresponding to the matched pronunciation word segmentation; when the pronunciation dictionary does not have pronunciation word segmentation matched with the word segmentation to be marked, segmenting the word segmentation to be marked based on a preset rule to obtain word segmentation fragments; and taking the word segmentation segment as a word to be marked, and returning to the step of matching the word to be marked with the pronunciation dictionary until the pronunciation dictionary has pronunciation word matched with the word to be marked.
In one embodiment, the model training module 5021 is further configured to obtain a first loss function of the speech separation enhancement model and a second loss function of the target recognition model; performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model; fusing the first loss function and the second loss function to obtain a target loss function; and carrying out joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and ending training when the preset convergence condition is met.
In one embodiment, the index network generation module 502 further includes a candidate text sequence determination module 5022, configured to extract a voice feature of the voice to be recognized, and determine pinyin of each word in the voice to be recognized based on the voice feature; pinyin consists of one to a plurality of sound units; determining a mapping relation between a sound unit in pinyin and a corresponding fuzzy sound; determining candidate character sequences according to the pronunciation dictionary and the mapping relation; an indexing network is generated based on the candidate word sequences.
In one embodiment, the target keyword extraction module 504 further includes an offset distance determination module 5041 for determining a service phrase to which each candidate word sequence in the index network matches by character matching the index network with the service phrase set; calculating the offset distance of each candidate character sequence relative to the matched service term; screening a target text sequence from the candidate text sequences based on the offset distance; and judging the service expression matched with the target text sequence as the service expression matched with the voice to be recognized.
In one embodiment, the service content monitoring apparatus 500 is further configured to determine all the extracted target keywords; determining the generation time of each target keyword; a care report is generated based on the generation time and the target keyword.
The specific limitation of the service content monitoring device may be referred to the limitation of the service content monitoring method hereinabove, and will not be described herein. The respective modules in the above-described service content monitoring apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, an electronic device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 6. The electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a service content monitoring method. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of a portion of the structure relevant to the present solution and does not limit the electronic device to which the present solution is applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, an electronic device is provided, which includes a memory and a processor. The memory stores a computer program, and the processor, when executing the computer program, implements the following steps (illustrated by the sketch after these steps):
acquiring a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized; and
determining the service content according to the target keyword.
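For orientation only, the sketch below strings the five steps together; every helper it defines is a trivial stub standing in for the corresponding step and is not an API described in this disclosure.

```python
# End-to-end sketch of the monitoring pipeline; all helpers are placeholders.
def recognize_to_index_network(audio: bytes) -> list[str]:
    return ["帮忙打扫厨房"]                      # candidate character sequences (stubbed)

def match_service_expression(candidates: list[str], expressions: list[str]) -> str:
    return next((e for e in expressions for c in candidates if e in c), "")

def extract_target_keyword(expression: str) -> str:
    return expression.split("打扫")[-1] if "打扫" in expression else expression

SERVICE_CONTENT = {"厨房": "kitchen cleaning"}   # keyword -> service content (stubbed)

def monitor_service_content(audio: bytes, expressions: list[str]) -> str:
    candidates = recognize_to_index_network(audio)             # index network
    expression = match_service_expression(candidates, expressions)
    keyword = extract_target_keyword(expression)               # target keyword
    return SERVICE_CONTENT.get(keyword, "unknown")             # service content

print(monitor_service_content(b"...", ["打扫厨房"]))
```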
In one embodiment, the processor, when executing the computer program, further implements the following steps (see the sketch after these steps):
acquiring a sample text corresponding to a sample voice, and a pronunciation dictionary, the sample text comprising at least one word segment to be annotated;
performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary to obtain a label sequence; and
training the speech recognition model based on the sample voice and the corresponding label sequence.
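A minimal sketch of producing a label sequence for one sample text is shown below; the dictionary entries and label format are invented for illustration, and the resulting (sample voice, label sequence) pairs would then feed the model training.

```python
# Build a pronunciation label sequence for a segmented sample text.
PRONUNCIATION_DICT = {"打扫": ["da3", "sao3"], "厨房": ["chu2", "fang2"]}

def label_sequence(sample_words: list[str]) -> list[str]:
    labels: list[str] = []
    for word in sample_words:
        labels.extend(PRONUNCIATION_DICT.get(word, ["<unk>"]))
    return labels

# ["打扫", "厨房"] -> ["da3", "sao3", "chu2", "fang2"]
print(label_sequence(["打扫", "厨房"]))
```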
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation labels; the processor, when executing the computer program, further implements the following steps (see the sketch after these steps):
performing word-segment matching between the word segment to be annotated and the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary;
when a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, annotating the word segment to be annotated according to the pronunciation label corresponding to the matching pronunciation word segment;
when no pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, splitting the word segment to be annotated based on a preset rule to obtain word-segment fragments; and
taking each word-segment fragment as a new word segment to be annotated and returning to the word-segment matching step, until a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary.
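The sketch below illustrates the annotation loop just described, assuming character-by-character splitting as the preset rule; the dictionary contents are invented.

```python
# Recursive annotation: dictionary hit returns labels directly, otherwise split
# the segment into fragments (here, single characters) and annotate each fragment.
PRONUNCIATION_DICT = {"打扫": ["da3", "sao3"], "卫": ["wei4"], "生": ["sheng1"], "间": ["jian1"]}

def annotate(segment: str) -> list[str]:
    if segment in PRONUNCIATION_DICT:             # matching pronunciation segment found
        return PRONUNCIATION_DICT[segment]
    if len(segment) == 1:                         # cannot split further
        return ["<unk>"]
    labels: list[str] = []
    for fragment in segment:                      # assumed preset rule: split into characters
        labels.extend(annotate(fragment))         # treat each fragment as a new segment
    return labels

print(annotate("打扫"))      # found directly in the dictionary
print(annotate("卫生间"))    # annotated via its fragments
```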
In one embodiment, the processor, when executing the computer program, further implements the following steps (see the sketch after these steps):
acquiring a first loss function of a speech separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model, thereby obtaining a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function; and
jointly training the speech separation enhancement model, the robust representation model, and the target recognition model based on the target loss function, and ending the training when a preset convergence condition is met.
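The following sketch (PyTorch assumed) illustrates only the fused-loss joint-training step; the three linear layers stand in for the actual sub-models, and the 0.5/0.5 fusion weights are assumptions.

```python
# Joint training sketch: fuse a separation loss and a recognition loss, then
# optimize all three sub-models together until a stopping condition is met.
import torch
from torch import nn

separation = nn.Linear(16, 16)      # stands in for the speech separation enhancement model
robust_repr = nn.Linear(16, 16)     # stands in for the robust representation (intermediate) model
recognizer = nn.Linear(16, 4)       # stands in for the target recognition model

params = list(separation.parameters()) + list(robust_repr.parameters()) + list(recognizer.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 16)              # toy noisy input batch
clean = torch.randn(8, 16)          # toy separation target
labels = torch.randint(0, 4, (8,))  # toy recognition target

for _ in range(10):                 # placeholder for "until a preset convergence condition"
    enhanced = separation(x)
    logits = recognizer(robust_repr(enhanced))
    first_loss = nn.functional.mse_loss(enhanced, clean)        # separation loss
    second_loss = nn.functional.cross_entropy(logits, labels)   # recognition loss
    target_loss = 0.5 * first_loss + 0.5 * second_loss          # fused target loss (assumed weights)
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
```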
In one embodiment, the processor, when executing the computer program, further implements the following steps:
extracting a voice feature of the voice to be recognized, and determining the pinyin of each word in the voice to be recognized based on the voice feature, where each pinyin consists of one or more sound units;
determining a mapping relation between a sound unit in the pinyin and its corresponding fuzzy sound;
determining candidate character sequences according to the pronunciation dictionary and the mapping relation; and
generating an index network based on the candidate character sequences.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
determining the service expression matched with each candidate character sequence in the index network by performing character matching between the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening a target character sequence from the candidate character sequences based on the offset distances; and
judging the service expression matched with the target character sequence to be the service expression matched with the voice to be recognized.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
determining all of the extracted target keywords;
determining the generation time of each target keyword; and
generating a care report based on the generation times and the target keywords.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the following steps:
acquiring a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized; and
determining the service content according to the target keyword.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
acquiring a sample text corresponding to a sample voice, and a pronunciation dictionary, the sample text comprising at least one word segment to be annotated;
performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary to obtain a label sequence; and
training the speech recognition model based on the sample voice and the corresponding label sequence.
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation labels; the computer program, when executed by the processor, further implements the following steps:
performing word-segment matching between the word segment to be annotated and the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary;
when a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, annotating the word segment to be annotated according to the pronunciation label corresponding to the matching pronunciation word segment;
when no pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, splitting the word segment to be annotated based on a preset rule to obtain word-segment fragments; and
taking each word-segment fragment as a new word segment to be annotated and returning to the word-segment matching step, until a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
acquiring a first loss function of a speech separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model, thereby obtaining a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function; and
jointly training the speech separation enhancement model, the robust representation model, and the target recognition model based on the target loss function, and ending the training when a preset convergence condition is met.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
extracting a voice feature of the voice to be recognized, and determining the pinyin of each word in the voice to be recognized based on the voice feature, where each pinyin consists of one or more sound units;
determining a mapping relation between a sound unit in the pinyin and its corresponding fuzzy sound;
determining candidate character sequences according to the pronunciation dictionary and the mapping relation; and
generating an index network based on the candidate character sequences.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
determining the service expression matched with each candidate character sequence in the index network by performing character matching between the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening a target character sequence from the candidate character sequences based on the offset distances; and
judging the service expression matched with the target character sequence to be the service expression matched with the voice to be recognized.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
determining all of the extracted target keywords;
determining the generation time of each target keyword; and
generating a care report based on the generation times and the target keywords.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; the program, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above embodiments illustrate only several implementations of the present application; although they are described in detail, they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of service content monitoring, the method comprising:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set; wherein the matched service expression is obtained by character matching between the index network and candidate service expressions, the candidate service expressions are associated with a currently executing service item and are screened from the service expression set, the currently executing service item is determined based on the collection time of the voice to be recognized and the service time period of each service item, and the service time period of each service item is obtained by analyzing order data;
extracting a target keyword from the service expression matched with the voice to be recognized;
determining service content according to the target keyword;
wherein the recognizing the voice to be recognized based on the pre-trained speech recognition model to obtain the index network comprises:
extracting a voice feature of the voice to be recognized, and determining the pinyin of each word in the voice to be recognized based on the voice feature;
acquiring a fuzzy sound table, and querying all sound units in the pinyin of each word using the fuzzy sound table to obtain a mapping relation between a sound unit having a fuzzy sound and the fuzzy sound;
combining the sound units and the fuzzy sounds based on the mapping relation to obtain one or more candidate pinyins corresponding to each word;
querying a pronunciation dictionary for the pronunciation word segments corresponding to the candidate pinyins, generating candidate character sequences based on the pronunciation word segments corresponding to the word segments in the voice to be recognized, and generating an index network according to the candidate character sequences; wherein a candidate character sequence comprises N-1 characters and the character following the N-1 characters; the N-1 characters are determined through a pinyin sequence, and the following character is predicted based on the N-1 characters.
2. The method of claim 1, wherein the training step of the speech recognition model comprises:
acquiring a sample text corresponding to a sample voice, and a pronunciation dictionary; the sample text comprises at least one word segment to be annotated;
performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary to obtain a label sequence; and
training the speech recognition model based on the sample voice and the corresponding label sequence.
3. The method of claim 2, wherein the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation labels; and the performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary comprises:
performing word-segment matching between the word segment to be annotated and the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary;
when a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, annotating the word segment to be annotated according to the pronunciation label corresponding to the matching pronunciation word segment;
when no pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, splitting the word segment to be annotated based on a preset rule to obtain word-segment fragments; and
taking each word-segment fragment as a new word segment to be annotated and returning to the step of word-segment matching between the word segment to be annotated and the pronunciation dictionary, until a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary.
4. The method of claim 1, wherein the speech recognition model comprises a speech separation enhancement model and a target recognition model, and the training step of the speech recognition model comprises:
acquiring a first loss function of the speech separation enhancement model and a second loss function of the target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model, thereby obtaining a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function; and
jointly training the speech separation enhancement model, the robust representation model, and the target recognition model based on the target loss function, and ending the training when a preset convergence condition is met.
5. The method of claim 1, wherein the speech recognition converts an input speech signal into corresponding text.
6. The method of claim 1, wherein the determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set comprises:
determining the service expression matched with each candidate character sequence in the index network by performing character matching between the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening a target character sequence from the candidate character sequences based on the offset distances; and
judging the service expression matched with the target character sequence to be the service expression matched with the voice to be recognized.
7. The method according to claim 1, further comprising:
determining all of the extracted target keywords;
determining the generation time of each target keyword; and
generating a care report based on the generation times and the target keywords.
8. A service content monitoring apparatus, the apparatus comprising:
an index network generation module, configured to acquire a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions, and to recognize the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
a target keyword extraction module, configured to determine the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set, and to extract a target keyword from the service expression matched with the voice to be recognized; wherein the matched service expression is obtained by character matching between the index network and candidate service expressions, the candidate service expressions are associated with a currently executing service item and are screened from the service expression set, the currently executing service item is determined based on the collection time of the voice to be recognized and the service time period of each service item, and the service time period of each service item is obtained by analyzing order data; and
a service content determination module, configured to determine service content according to the target keyword;
wherein the index network generation module is further configured to: extract a voice feature of the voice to be recognized and determine the pinyin of each word in the voice to be recognized based on the voice feature; acquire a fuzzy sound table, and query all sound units in the pinyin of each word using the fuzzy sound table to obtain a mapping relation between a sound unit having a fuzzy sound and the fuzzy sound; combine the sound units and the fuzzy sounds based on the mapping relation to obtain one or more candidate pinyins corresponding to each word; and query a pronunciation dictionary for the pronunciation word segments corresponding to the candidate pinyins, generate candidate character sequences based on the pronunciation word segments corresponding to the word segments in the voice to be recognized, and generate an index network according to the candidate character sequences; wherein a candidate character sequence comprises N-1 characters and the character following the N-1 characters; the N-1 characters are determined through a pinyin sequence, and the following character is predicted based on the N-1 characters.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202011060127.4A 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium Active CN112331207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060127.4A CN112331207B (en) 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060127.4A CN112331207B (en) 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112331207A CN112331207A (en) 2021-02-05
CN112331207B true CN112331207B (en) 2024-08-30

Family

ID=74313342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060127.4A Active CN112331207B (en) 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112331207B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530414B (en) * 2021-02-08 2021-05-25 数据堂(北京)科技股份有限公司 Iterative large-scale pronunciation dictionary construction method and device
CN113380231B (en) * 2021-06-15 2023-01-24 北京一起教育科技有限责任公司 Voice conversion method and device and electronic equipment
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103794211A (en) * 2012-11-02 2014-05-14 北京百度网讯科技有限公司 Voice recognition method and system
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111652775A (en) * 2020-05-07 2020-09-11 上海奥珩企业管理有限公司 Method for constructing household service process management system model

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5875426A (en) * 1996-06-12 1999-02-23 International Business Machines Corporation Recognizing speech having word liaisons by adding a phoneme to reference word models
CN1787070B (en) * 2005-12-09 2011-03-16 北京凌声芯语音科技有限公司 On-chip system for language learner
CN102867512A (en) * 2011-07-04 2013-01-09 余喆 Method and device for recognizing natural speech
US9536049B2 (en) * 2012-09-07 2017-01-03 Next It Corporation Conversational virtual healthcare assistant
CN103458056B (en) * 2013-09-24 2017-04-26 世纪恒通科技股份有限公司 Speech intention judging system based on automatic classification technology for automatic outbound system
CN103578464B (en) * 2013-10-18 2017-01-11 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN103700369B (en) * 2013-11-26 2016-08-31 科大讯飞股份有限公司 Phonetic navigation method and system
CN105869640B (en) * 2015-01-21 2019-12-31 上海墨百意信息科技有限公司 Method and device for recognizing voice control instruction aiming at entity in current page
US9583097B2 (en) * 2015-01-30 2017-02-28 Google Inc. Dynamic inference of voice command for software operation from help information
KR20180115976A (en) * 2017-04-14 2018-10-24 아주대학교산학협력단 Method of operating server in nursing home system to share recipient’s information among chief of nursing home, care provider and guardian
CN107170444A (en) * 2017-06-15 2017-09-15 上海航空电器有限公司 Aviation cockpit environment self-adaption phonetic feature model training method
CN108288468B (en) * 2017-06-29 2019-07-19 腾讯科技(深圳)有限公司 Audio recognition method and device
US10552546B2 (en) * 2017-10-09 2020-02-04 Ricoh Company, Ltd. Speech-to-text conversion for interactive whiteboard appliances in multi-language electronic meetings
KR102097118B1 (en) * 2018-08-28 2020-04-10 충남대학교산학협력단 METHOD AND APPARATUS FOR TOPIC DETECTION IN DATA STREAM OF Social Network Service
CN108986790A (en) * 2018-09-29 2018-12-11 百度在线网络技术(北京)有限公司 The method and apparatus of voice recognition of contact
CN109961792B (en) * 2019-03-04 2022-01-11 阿波罗智联(北京)科技有限公司 Method and apparatus for recognizing speech
CN110310631A (en) * 2019-06-28 2019-10-08 北京百度网讯科技有限公司 Audio recognition method, device, server and storage medium
CN111341305B (en) * 2020-03-05 2023-09-26 苏宁云计算有限公司 Audio data labeling method, device and system
CN111429912B (en) * 2020-03-17 2023-02-10 厦门快商通科技股份有限公司 Keyword detection method, system, mobile terminal and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103794211A (en) * 2012-11-02 2014-05-14 北京百度网讯科技有限公司 Voice recognition method and system
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111652775A (en) * 2020-05-07 2020-09-11 上海奥珩企业管理有限公司 Method for constructing household service process management system model

Also Published As

Publication number Publication date
CN112331207A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
WO2017076211A1 (en) Voice-based role separation method and device
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN112331207B (en) Service content monitoring method, device, electronic equipment and storage medium
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
CN110992959A (en) Voice recognition method and system
CN113327575A (en) Speech synthesis method, device, computer equipment and storage medium
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN112309398B (en) Method and device for monitoring working time, electronic equipment and storage medium
CN113611286B (en) Cross-language speech emotion recognition method and system based on common feature extraction
CN113823265A (en) Voice recognition method and device and computer equipment
Wazir et al. Deep learning-based detection of inappropriate speech content for film censorship
CN117711376A (en) Language identification method, system, equipment and storage medium
CN117542358A (en) End-to-end-based human-robot voice interaction system
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN112309374B (en) Service report generation method, device and computer equipment
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition
CN117223052A (en) Keyword detection method based on neural network
CN113920987A (en) Voice recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant