CN112331207B - Service content monitoring method, device, electronic equipment and storage medium - Google Patents

Service content monitoring method, device, electronic equipment and storage medium

Info

Publication number
CN112331207B
Authority
CN
China
Prior art keywords
service
voice
word
pronunciation
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011060127.4A
Other languages
Chinese (zh)
Other versions
CN112331207A (en)
Inventor
廖光朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Original Assignee
Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd filed Critical Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Priority to CN202011060127.4A
Publication of CN112331207A
Application granted
Publication of CN112331207B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a service content monitoring method, an apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a service phrase set and a voice to be recognized, the service phrase set comprising a plurality of service phrases; recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network; determining the service phrase matched with the voice to be recognized by performing character matching between the index network and the service phrase set; extracting a target keyword from the matched service phrase; and determining the service content according to the target keyword. Because the method monitors service content based on voice, which occupies far less storage than video, it can save the storage space consumed by video-based monitoring of service content.

Description

Service content monitoring method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of home care technologies, and in particular, to a service content monitoring method, a device, an electronic apparatus, and a storage medium.
Background
With the growing population of elderly people, home care services have expanded. In home care service, professionally trained service personnel provide elderly people living at home with care services of an agreed duration.
When service personnel provide home care service for an elderly person at home, the home care service manager first needs to confirm whether the service personnel have provided the preset service content agreed upon with the elderly person. At present, the behavior of service personnel is continuously monitored with video monitoring equipment, and the service content delivered by the service personnel is determined from the monitoring video. However, because monitoring video occupies a large amount of storage, determining service content from monitoring video requires a large storage space.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a service content monitoring method, apparatus, electronic device, and storage medium that can save storage space.
A method of service content monitoring, the method comprising:
acquiring a service phrase set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
Recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
Determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set;
extracting target keywords from service expressions matched with the voice to be recognized;
and determining service content according to the target keywords.
In one embodiment, the training step of the speech recognition model comprises:
Acquiring a sample text corresponding to the sample voice and a pronunciation dictionary; the sample text comprises at least one word to be annotated;
performing pronunciation marking on the word segmentation to be marked according to the pronunciation dictionary to obtain a tag sequence;
And training a voice recognition model based on the sample voice and the corresponding label sequence.
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation tags; the step of performing pronunciation annotation on the word to be annotated according to the pronunciation dictionary comprises the following steps:
performing word segmentation matching on the word to be marked and the pronunciation dictionary, and judging whether pronunciation word segmentation matched with the word to be marked exists in the pronunciation dictionary or not based on a word segmentation matching result;
when the pronunciation dictionary has pronunciation word segments matched with the to-be-annotated word segments, annotating the to-be-annotated word segments according to pronunciation labels corresponding to the matched pronunciation word segments;
When the pronunciation dictionary does not have pronunciation word segmentation matched with the word segmentation to be marked, segmenting the word segmentation to be marked based on a preset rule to obtain word segmentation fragments;
And taking the word segmentation segment as a word to be marked, and returning to the step of word segmentation matching of the word to be marked and the pronunciation dictionary until the pronunciation dictionary has pronunciation word segmentation matched with the word to be marked.
In one embodiment, the speech recognition model includes a speech separation enhancement model and a target recognition model; the training step of the voice recognition model comprises the following steps:
Acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model;
back propagation is carried out based on the second loss function so as to train an intermediate model bridged between the voice separation enhancement model and the target recognition model, and a robust representation model is obtained;
fusing the first loss function and the second loss function to obtain a target loss function;
and carrying out joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and ending training when a preset convergence condition is met.
In one embodiment, the identifying the speech to be identified based on the pre-trained speech recognition model, obtaining the index network includes:
extracting voice characteristics of the voice to be recognized, and determining pinyin of each word in the voice to be recognized based on the voice characteristics; the pinyin is composed of one to a plurality of sound units;
determining a mapping relation between a sound unit in the pinyin and a corresponding fuzzy sound;
Determining candidate character sequences according to the pronunciation dictionary and the mapping relation;
And generating an index network based on the candidate text sequence.
In one embodiment, the determining the service phrase matching the voice to be recognized by performing character matching on the index network and the service phrase set includes:
determining service expression matched with each candidate character sequence in the index network by carrying out character matching on the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to the matched service term;
Screening a target text sequence from the candidate text sequences based on the offset distance;
and judging the service expression matched with the target text sequence as the service expression matched with the voice to be recognized.
In one embodiment, the method further comprises:
determining all extracted target keywords;
determining the generation time of each target keyword;
and generating a nursing report based on the generation time and the target keyword.
A service content monitoring apparatus, the apparatus comprising:
The index network generation module is used for acquiring a service phrase set and voice to be recognized; the service expression set comprises a plurality of service expressions; recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
the target keyword extraction module is used for determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set; and extracting target keywords from the service expressions matched with the voice to be recognized.
And the service content determining module is used for determining the service content according to the target keywords.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a service phrase set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
Recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
Determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set;
extracting target keywords from service expressions matched with the voice to be recognized;
and determining service content according to the target keywords.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
in the home-based care service process, a service phrase set and a voice to be recognized are obtained; the service expression set comprises a plurality of service expressions;
Recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
Determining the service term matched with the voice to be recognized by performing character matching on the index network and the service term set;
extracting target keywords from service expressions matched with the voice to be recognized;
and determining the service content of the home-based care service according to the target keywords.
According to the service content monitoring method, the device, the electronic equipment and the storage medium, the voice to be recognized can be recognized based on the pre-trained voice recognition model by acquiring the service expression set and the voice to be recognized, so that an index network containing a plurality of candidate recognition results is obtained; through character matching of the index network and the service term set, service terms which can represent the voice to be recognized most can be screened out from the service term set, so that the target keywords extracted based on the service terms which can represent the voice to be recognized most are more accurate; by extracting the target keywords, the service content can be determined according to the word meanings of the target keywords, so that the service content of service personnel can be effectively monitored. Because the application monitors the service content based on the voice with smaller storage space, compared with the traditional content monitoring based on video, the application can effectively save the storage space consumed in the monitoring of the service content.
Drawings
FIG. 1 is an application environment diagram of a method of service content monitoring in one embodiment;
FIG. 2 is a flow diagram of a method for monitoring service content in one embodiment;
FIG. 3 is a schematic diagram of an indexing network in one embodiment;
FIG. 4 is a flow chart of a method of training a speech recognition model in one embodiment;
FIG. 5 is a block diagram of a service content monitoring device in one embodiment;
Fig. 6 is an internal structural diagram of an electronic device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The service content monitoring method provided by the application can be applied to the application environment shown in figure 1, in which a microphone box 102 communicates with a host box 104 via a network. While a service person provides home care service for an elderly person at home, the microphone box 102 is worn by the service person and is used to capture the service person's voice and recognize it to obtain a target keyword; when the target keyword is consistent with the service item currently being executed by the service person, it is determined that the service person has executed that service item. The microphone box 102 sends the target keyword to the host box 104, which stores the target keyword.
In one embodiment, as shown in fig. 2, a service content monitoring method is provided, and the method is applied to the microphone box in fig. 1 for illustration, and includes the following steps:
s202, acquiring a service phrase set and voice to be recognized.
The service phrase set refers to a set containing at least one service phrase. A service phrase is a standard expression, conveying respect and friendliness, used by service personnel when communicating with the served object during home care service; for example, a service phrase may be "Is the water temperature suitable?" or "I will now wash your hair for you". The voice to be recognized refers to audio information collected by the microphone box in real time.
Specifically, when it is determined that service personnel start to provide home-based service for home old people, the microphone box acquires a service phrase set and acquires audio information in real time to obtain voice to be recognized.
In one embodiment, the host box has pre-stored therein correspondence between service items and subsets of service terms, different subsets of service terms corresponding to different service items. The service items refer to service contents which are provided by service personnel in the whole home care service process, for example, the service items can be hair washing, massage and the like. It is readily understood that service personnel can provide a variety of different service items for served objects throughout home care service. The service expression subset refers to a canonical expression associated with a service item that should be used by a service person in executing the service item.
Before providing home care services for the served object, the service personnel can agree with service items which should be provided by the served object and generate order data based on the agreed care service items, so that the microphone box can acquire a corresponding service phrase subset based on the care service items in the order data and generate a service phrase set based on the acquired service phrase subset.
In one embodiment, the order data includes service items and service times for each service item. The host box obtains order data and obtains a corresponding service expression set based on service items in the order data. The host box determines service content which is needed to be carried out by service personnel at the current moment according to the service time in the order data, and correspondingly displays service expression associated with the service content in a local screen, so that the service personnel can carry out language communication with the served object by using the standard expression according to the screen prompt information.
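To illustrate how order data can drive the construction of the service phrase set and the determination of the currently executed service item, the following Python sketch uses assumed field names (item, start, end) and an assumed phrase-subset mapping; neither is specified by the application.

```python
from datetime import datetime

# Hypothetical order data: each service item with its agreed time window.
# Field names and the time format are illustrative assumptions.
order_data = [
    {"item": "hair washing", "start": "09:00", "end": "09:30"},
    {"item": "massage",      "start": "09:30", "end": "10:15"},
]

# Hypothetical mapping from service items to their service phrase subsets.
service_phrase_subsets = {
    "hair washing": ["shall we start hair washing now", "is the water temperature suitable"],
    "massage":      ["shall we start the massage now", "is this pressure comfortable"],
}

def build_service_phrase_set(order):
    """Collect the phrase subsets of all ordered service items into one set."""
    phrases = []
    for entry in order:
        phrases.extend(service_phrase_subsets.get(entry["item"], []))
    return phrases

def current_service_item(order, now):
    """Return the service item whose agreed time window contains `now`."""
    t = now.strftime("%H:%M")
    for entry in order:
        if entry["start"] <= t < entry["end"]:
            return entry["item"]
    return None

print(build_service_phrase_set(order_data))
print(current_service_item(order_data, datetime.strptime("09:40", "%H:%M")))
```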
S204, recognizing the voice to be recognized based on the pre-trained voice recognition model to obtain an index network.
Wherein, the voice recognition is to convert the input voice signal into the corresponding text. The speech recognition model refers to a machine learning model with speech feature extraction capabilities. A speech feature is data that reflects an audio feature. The voice features may be one or more of tone color, pronunciation, frequency spectrum, etc.
Specifically, a voice recognition model is preset in the microphone box. The speech recognition model includes an endpoint detection sub-model, an acoustic sub-model, and a language sub-model. Wherein the endpoint detection submodel is used to separate speech signals from non-speech signals. The endpoint detection sub-model carries out framing processing on the voice to be recognized, extracts characteristic parameters of the voice frame, and determines a voice segment and a non-voice segment based on the characteristic parameters of the voice frame. More specifically, the speech segments and the non-speech segments may be determined based on short-time energy and zero-crossing rate, entropy of information, short-time energy frequency values, template matching, and the like.
The acoustic submodel is a model describing the relationship between speech features and speech modeling units, and is an important part of the speech recognition system. Conventional speech recognition models commonly employ a GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) acoustic model, in which the GMM models the distribution of speech acoustic features and the HMM models the timing of the speech signal. The GMM is essentially a shallow model with limited ability to describe the distribution of acoustic feature states, and its recognition accuracy is low when the amount of training speech data is huge. The application adopts a CNN-HMM (Convolutional Neural Network - Hidden Markov Model) for acoustic modeling. The CNN is a deep model that can adaptively fit the distribution of any data by adjusting its parameters, so higher recognition accuracy can be achieved.
After the voice fragment is obtained, the acoustic submodel performs feature extraction on the voice fragment, and recognizes the voice based on the extracted feature information to obtain a pinyin sequence corresponding to the voice to be recognized. For example, when the voice to be recognized is "washing hair is good", the pinyin sequence obtained through the acoustic submodel is "xi ge tou hao ma".
The language submodel is used for predicting the occurrence probability of the candidate character sequence corresponding to the pinyin sequence and generating an index network based on the occurrence probability. Due to the existence of homophones, when the Pinyin sequence is obtained, the language submodel determines N-1 characters through the Pinyin sequence and predicts the occurrence probability of the next character based on the N-1 characters, so that one or more candidate character sequences corresponding to the Pinyin sequence are obtained, and an index network is generated based on the obtained candidate character sequences. For example, when the pinyin sequence is "xi ge tou hao ma", the text characters corresponding to "xi" may be "wash" and "xi", the text characters predicted based on "wash" and "ge tou" may be "single" and the text characters predicted based on "west" and "ge tou" may be "follow", and the text characters predicted based on "wash", "single", "hao ma", and "west", "follow", "hao ma" are all "good", the generated index network is as shown in fig. 3. FIG. 3 is a schematic diagram of an index network, in one embodiment. The candidate character sequence takes a starting node as a starting point, takes an ending node as an ending point, and is formed by connecting the nodes and line segments, for example, the candidate character sequence is obtained by washing hair.
S206, determining the service expression matched with the voice to be recognized by performing character matching on the index network and the service expression set.
Specifically, the microphone box determines the service phrase matched with each candidate character sequence in the index network by performing character matching between the index network and the service phrase set, and calculates the offset distance of each candidate character sequence relative to its matched service phrase. The offset distance refers to the ratio of the number of characters of the candidate sequence that are not present in the matched service phrase to the number of characters that are present in it, with marker symbols not counted. For example, when the candidate character sequence is "wash a head good" and the matched service phrase is "wash head good", one character of the candidate sequence is not present in the matched phrase and four characters are, so the offset distance is 1/4. The microphone box takes the candidate character sequence with the smallest offset distance as the target character sequence, and determines the service phrase matched with the target character sequence to be the service phrase matched with the voice to be recognized.
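The offset-distance calculation and the selection of the target character sequence can be sketched as follows; the helper names and toy strings are illustrative, and marker symbols are assumed to have been stripped before the comparison.

```python
def offset_distance(candidate, phrase):
    """Ratio of candidate characters absent from the matched service phrase
    to candidate characters present in it."""
    present = sum(1 for ch in candidate if ch in phrase)
    absent = len(candidate) - present
    if present == 0:
        return float("inf")  # no overlap at all
    return absent / present

def pick_target_sequence(candidates_with_matches):
    """Given (candidate_sequence, matched_phrase) pairs, return the pair
    with the smallest offset distance."""
    return min(candidates_with_matches,
               key=lambda pair: offset_distance(pair[0], pair[1]))

# Toy illustration of the 1 absent / 4 present = 0.25 case described above.
print(offset_distance("ABCDE", "ABDE"))  # -> 0.25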
In one embodiment, the microphone box analyzes the order data to obtain all the service items that the current home care service should perform, and determines the specific service time period of each service item. When an index network corresponding to the voice to be recognized is generated, the microphone box determines the acquisition time for acquiring the voice to be recognized, and determines the service item currently being executed by the service personnel based on the acquisition time and the specific service time period of each service item. The microphone box screens candidate service expressions associated with the currently executing service item from the service expression set, and performs character matching on the candidate character sequences and the candidate service expressions to obtain the service expressions matched with each candidate character sequence. By determining the collection time for collecting the voice to be recognized and the specific service time period of each service item, candidate service words associated with the currently executed service item can be screened out from the service word set, so that the microphone box only needs to perform character matching on the screened candidate service words, and does not need to perform character matching on the whole service set, and the matching efficiency is greatly improved.
S208, extracting target keywords from the service expressions matched with the voice to be recognized.
S210, determining service contents according to the target keywords.
The target keywords refer to keywords capable of representing service items, for example, when the service content is hair washing and massaging, the corresponding target keywords can be hair washing and massaging.
Specifically, the service manager marks the target keyword in each service phrase in advance, so that the target keyword can be extracted from the service phrase matched with the voice to be recognized based on the marking result. For example, the target keyword "hair washing" can be marked in advance with "<s>" and "</s>" to obtain the service phrase "start <s>hair washing</s> good"; the microphone box then only needs to identify "<s>" and "</s>" to extract the target keyword from the matched service phrase, and determines the service content of the pension service based on the target keyword.
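A minimal sketch of extracting keywords marked with "<s>" and "</s>"; the regular-expression approach is one possible implementation, not the one prescribed by the application.

```python
import re

# Target keywords are assumed to be wrapped in <s>...</s> inside each service phrase.
KEYWORD_PATTERN = re.compile(r"<s>(.*?)</s>")

def extract_target_keywords(marked_phrase):
    """Return every keyword annotated with <s>...</s> in a service phrase."""
    return KEYWORD_PATTERN.findall(marked_phrase)

print(extract_target_keywords("start <s>hair washing</s> good"))  # ['hair washing']
```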
In one embodiment, when determining a specific service period of each pension service item based on the order data and determining a collection time of the voice to be recognized, the microphone box judges the service item that should be performed by the current service person based on the collection time and the specific service period, compares the service item that should be performed with the target keyword, and when the service item that should be performed is consistent with the target keyword, it may be determined that the service person is providing the corresponding pension service according to the order data.
In one embodiment, when determining a target keyword in the speech to be recognized, the microphone box sends the target keyword to the host box, which stores the target keyword correspondingly. And then, the microphone box correspondingly deletes the voice to be recognized so as to protect the privacy of service personnel and the serviced object.
In the service content monitoring method in the home care service process, the voice to be recognized can be recognized based on the pre-trained voice recognition model by acquiring the service phrase set and the voice to be recognized, so that an index network containing a plurality of candidate recognition results is obtained; through character matching of the index network and the service term set, the service term which can represent the voice to be recognized most can be screened out from the service term set, so that service contents of service personnel can be monitored effectively. Because the application monitors the service content based on the voice with smaller storage space, compared with the traditional content monitoring based on video, the application can effectively save the storage space consumed in the monitoring of the service content.
In one embodiment, the training step of the speech recognition model comprises: acquiring a sample text corresponding to the sample voice and a pronunciation dictionary; the sample text comprises at least one word to be annotated; performing pronunciation marking on the word to be marked according to the pronunciation dictionary to obtain a label sequence; the speech recognition model is trained based on the sample speech and the corresponding tag sequence.
Here, sample speech refers to speech data used to train the speech recognition model. Sample text refers to the text data obtained after the sample speech has been transcribed. The sample text includes positive samples, which are text data containing the target keywords, and negative samples, which are text data not containing the target keywords. The pronunciation dictionary refers to a dictionary that maps word segments to their pronunciations, expressed as initials, finals and tones. The pronunciation dictionary contains the pronunciation of every word and word segment in the sample data.
Specifically, the model trainer obtains as many sample speech recordings as possible, manually transcribes the sample speech to obtain the corresponding sample texts, and then inputs the sample texts corresponding to the sample speech, together with the pronunciation dictionary, into the speech recognition model. The speech recognition model performs word segmentation on the sample text to obtain a plurality of word segments to be annotated, queries the pronunciation labels corresponding to the word segments to be annotated in the pronunciation dictionary, and performs pronunciation annotation on the word segments based on these pronunciation labels. The microphone box then combines the pronunciation labels corresponding to the word segments to be annotated to obtain a label sequence. For example, the labeling format of each word in the pronunciation dictionary is initial, final and tone, where 1-4 correspond to the four tones and 5 is the neutral tone, so the pronunciation label corresponding to the word segment "Arjiu" to be annotated may be "aa a1 j iu3".
Further, the voice recognition model carries out model training on the acoustic submodel and the language submodel based on the sample voice and the corresponding tag sequence until the trained model parameters meet the preset requirements.
In one embodiment, the speech recognition may be performed on the sample speech to obtain a corresponding sample text, and word segmentation may be performed on the sample text. Because the recognition accuracy of the long keywords is lower than that of the short keywords, in order to improve the recognition accuracy of the keywords, the long keywords can be split into the short keywords, for example, "hair washing service" can be split into "hair washing/service", wherein "/" is a word segmentation symbol.
In the embodiment, the pronunciation dictionary is used for automatically carrying out pronunciation marking processing on the segmented words to be marked, so that compared with the traditional manual pronunciation marking, the method and the device can improve marking efficiency and save manpower resources consumed during manual pronunciation marking.
In one embodiment, labeling the word to be labeled according to the pronunciation dictionary includes: performing word segmentation matching on the word segmentation to be marked and the pronunciation dictionary, and judging whether pronunciation word segmentation matched with the word segmentation to be marked exists in the pronunciation dictionary or not based on the word segmentation matching result; when the pronunciation dictionary has pronunciation word segmentation matched with the word segmentation to be marked, marking the word segmentation to be marked according to the pronunciation label corresponding to the matched pronunciation word segmentation; when the pronunciation dictionary does not have pronunciation word segmentation matched with the word segmentation to be marked, segmenting the word segmentation to be marked based on a preset rule to obtain word segmentation fragments; and taking the word segmentation segment as a word to be marked, and returning to the step of matching the word to be marked with the pronunciation dictionary until the pronunciation dictionary has pronunciation word matched with the word to be marked.
The pronunciation dictionary comprises pronunciation word segmentation and corresponding pronunciation labels. The pronunciation word segmentation refers to single word segmentation or characters, and the pronunciation label refers to label information obtained by labeling the pronunciation word segmentation in a labeling format of initials, finals and tones.
Specifically, the microphone box performs word segmentation matching on the word to be marked and each pronunciation word in the pronunciation dictionary, and judges whether pronunciation words matched with the word to be marked exist in the pronunciation dictionary according to a matching result. When the pronunciation dictionary has the same pronunciation word with the word to be marked, the microphone box takes the pronunciation label corresponding to the same pronunciation word as the marking result of the word to be marked. When the pronunciation dictionary does not have the same pronunciation word segmentation as the word to be marked, the microphone box segments the word to be marked based on a preset rule to obtain word segmentation fragments. For example, the preset rule may be to divide the word to be marked by using the intermediate character as a dividing point, so that when the word to be marked is "litchi radix field", the word to be marked may be divided into "litchi radix field" and "field" based on the preset rule.
Further, the microphone box takes each word segment as a word to be marked, and returns to the step of matching the word to be marked with the pronunciation dictionary until the pronunciation dictionary has pronunciation word matched with the word to be marked. For example, when the pronunciation dictionary does not have pronunciation word segmentation matched with the 'litchi aster', the microphone box further divides the 'litchi aster' to obtain the 'litchi' and the 'aster', and marks the 'litchi' and the 'aster' based on the pronunciation dictionary respectively.
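The look-up-then-split annotation procedure can be sketched as follows. The dictionary entries are invented placeholders (single letters standing in for Chinese characters), and "split at the middle character" follows the example rule mentioned above.

```python
# Toy pronunciation dictionary: a known two-character word plus single-character entries.
pronunciation_dict = {
    "AB": "aa a1 b o2",
    "C":  "c iy3",
    "D":  "d e5",
}

def annotate(word):
    """Look the word up in the pronunciation dictionary; if it is absent,
    split it at the middle character and annotate the fragments recursively."""
    if word in pronunciation_dict:
        return pronunciation_dict[word]
    if len(word) <= 1:
        return "<unk>"  # single character not in the dictionary: give up
    mid = len(word) // 2
    return annotate(word[:mid]) + " " + annotate(word[mid:])

print(annotate("ABCD"))  # -> 'aa a1 b o2 c iy3 d e5'
```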
In this embodiment, because the word segment to be annotated can be repeatedly split based on the pronunciation dictionary, pronunciation labeling can still be performed even when the word segment to be annotated is a rare word.
In one embodiment, as shown in FIG. 4, the training step of the speech recognition model includes:
S402, acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model.
S404, back propagation is performed based on the second loss function so as to train an intermediate model bridged between the voice separation enhancement model and the target recognition model, and a robust representation model is obtained.
S406, fusing the first loss function and the second loss function to obtain a target loss function.
S408, performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and ending training when the preset convergence condition is met.
The voice recognition model comprises a voice separation enhancement model and a target recognition model; the object recognition model includes an acoustic submodel and a language submodel. The speech separation enhancement model is a model with speech separation and/or enhancement capability after training, and specifically may be a model obtained by taking sample speech as training data and performing learning training to separate target speech from background interference in the sample speech. It will be appreciated that the speech separation enhancement model may also have the capability to pre-process speech signals for speech activity detection (Voice Activity Detection, VAD), echo cancellation, reverberation cancellation, or sound source localization, without limitation. The target recognition model is an acoustic model with voice recognition capability after training, and specifically can be a model for carrying out phoneme recognition on sample voice obtained by learning and training by taking sample voice and a tag sequence as training data. The speech separation enhancement model and the target recognition model may each be pre-trained. The pre-trained speech separation enhancement model and the speech recognition model each have a fixed model structure and model parameters.
Specifically, in order to further improve the recognition accuracy of the voice model, a voice separation enhancement model may be added to the voice model, and the voice model may be further trained based on the voice separation enhancement model. When the joint model training is needed, the microphone box acquires a pre-trained voice separation enhancement model and a target recognition model, and a first loss function adopted when the voice separation enhancement model is pre-trained and a second loss function adopted when the target recognition model is pre-trained. The loss function (loss function) is typically associated with an optimization problem as a learning criterion, i.e., solving and evaluating the model by minimizing the loss function. The first loss function adopted by the pre-training voice separation enhancement model and the second loss function adopted by the pre-training voice recognition model can be mean square error, average absolute value error, log-Cosh loss, quantile loss, ideal quantile loss and the like.
The traditional approach mainly divides the speech processing task into two completely independent subtasks: a speech separation task and a target recognition task. In this way, the speech separation enhancement model and the target recognition model are trained separately and modularly in the training stage, while in the production and test stage the enhanced speech to be recognized output by the speech separation enhancement model is input into the target recognition model for recognition. It is easy to see that this approach does not solve well the problem of the difference between the two characterization categories. In practical application scenarios such as home care service, the voice to be recognized is commonly affected by background music or interference from multiple speakers. The speech separation enhancement model therefore introduces relatively serious distortion during front-end speech processing, and this distortion is not considered during the training stage of the target recognition model, so directly cascading the independent front-end speech separation enhancement model and the back-end target recognition model seriously degrades the final speech recognition performance.
To overcome the difference between the two characterization categories, embodiments of the present application bridge the intermediate model to be trained between the speech separation enhancement model and the target recognition model. The trained intermediate model may be referred to as a robust characterization model. More specifically, the microphone box determines the local gradient of descent of the second loss function generated during each iteration according to a preset deep learning optimization algorithm. And the microphone box reversely propagates the local descending gradient to the middle model so as to update model parameters corresponding to the middle model, and the training is ended when the preset training stopping condition is met.
The microphone box obtains a target loss function by performing a preset logical operation on the first loss function and the second loss function. Taking weighted summation as an example, assuming that the weighting factor is λSS, the target loss function is L = L2 + λSS·L1, where L1 is the first loss function and L2 is the second loss function. The weighting factor may be a numerical value set empirically or experimentally, such as 0.1. It is readily seen that the importance of the speech separation enhancement model in multimodal joint training can be adjusted by adjusting the weighting factor. The microphone box determines the global descent gradient generated by the target loss function according to a preset deep learning optimization algorithm. The deep learning optimization algorithm for determining the local descent gradient may be the same as or different from the deep learning optimization algorithm for determining the global descent gradient. The global descent gradient generated by the target loss function is back-propagated in turn from the target recognition model to each network layer of the robust characterization model and the speech separation enhancement model, and in this process the model parameters corresponding to the speech separation enhancement model, the robust characterization model and the target recognition model are iteratively updated, until training ends when the preset training stop condition is met.
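A minimal PyTorch sketch of one joint-training step under the fused loss L = L2 + λSS·L1. The tiny linear layers stand in for the speech separation enhancement model, the bridged robust characterization model and the target recognition model; the concrete losses (MSE for separation, cross-entropy for recognition) and the toy data are assumptions for illustration only.

```python
import torch
import torch.nn as nn

separation_model  = nn.Linear(16, 16)   # front-end speech separation/enhancement (toy)
robust_model      = nn.Linear(16, 16)   # bridged intermediate (robust) model (toy)
recognition_model = nn.Linear(16, 8)    # back-end target recognition model (toy)

params = (list(separation_model.parameters())
          + list(robust_model.parameters())
          + list(recognition_model.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

lambda_ss = 0.1                          # weighting factor for the separation loss
mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()

noisy  = torch.randn(4, 16)              # toy noisy speech features
clean  = torch.randn(4, 16)              # toy clean reference for the separation loss
labels = torch.randint(0, 8, (4,))       # toy recognition labels

optimizer.zero_grad()
enhanced = separation_model(noisy)
logits = recognition_model(robust_model(enhanced))

l1 = mse(enhanced, clean)                # first loss: separation/enhancement
l2 = ce(logits, labels)                  # second loss: target recognition
loss = l2 + lambda_ss * l1               # fused target loss L = L2 + λSS·L1
loss.backward()                          # gradients flow through all three models
optimizer.step()
```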
In this embodiment, the intermediate model performs training by means of the second loss function back-propagation of the rear-end target recognition model, and the speech separation enhancement model and the target recognition model may be pre-trained, so that convergence can be achieved after a small number of iterative training times. In addition, the combination of the front end model and the rear end model which correspond to the loss function is used for carrying out joint training on the end-to-end network model, so that each independent model in the network architecture can comprehensively learn the interference characteristics from the voice signals in the complex acoustic environment, the performance of the overall voice processing task can be ensured, and the voice recognition accuracy is improved.
In one embodiment, identifying speech to be identified based on a pre-trained speech recognition model, obtaining an index network includes: extracting voice characteristics of voice to be recognized, and determining pinyin of each word in the voice to be recognized based on the voice characteristics; pinyin consists of one to a plurality of sound units; determining a mapping relation between a sound unit in pinyin and a corresponding fuzzy sound; determining candidate character sequences according to the pronunciation dictionary and the fuzzy sounds; an indexing network is generated based on the candidate word sequences.
Wherein, the fuzzy sound in the pinyin can be a sound unit close to the pinyin pronunciation. The ambiguous sounds may be generated due to the same semantic meaning pronouncing differently in different dialects. The sound unit refers to the initial consonant or final sound composing the pinyin.
Specifically, when the acoustic submodel obtains the voice to be recognized output by the voice separation enhancement model, voice features in the voice to be recognized can be extracted based on a preset convolution kernel. For example, pronunciation characteristics in the speech to be recognized are extracted. Meanwhile, the acoustic submodel inputs the voice characteristics into the language submodel, and the language submodel determines the pinyin corresponding to each word in the voice to be recognized according to the voice characteristics. The language sub-model obtains a fuzzy sound table, and queries all sound units in the pinyin of each word by using the fuzzy sound table to obtain the sound units with fuzzy sound, so that the mapping relation between the sound units with fuzzy sound and the fuzzy sound is established. For example, when the sound unit is "g", the ambiguous sound determined based on the ambiguous sound table is "j".
Further, the language sub-model combines the sound unit and the fuzzy sound based on the mapping relation to obtain one or more candidate pinyin corresponding to each word. For example, when the sound units are "g" and "ai", and the ambiguous sounds determined based on the ambiguous sound table are "j" and "ei", the candidate pinyin obtained by combining are "gai", "jei", "gei" and "jai". The language sub-model queries pronunciation word segments corresponding to the candidate pinyin in a pronunciation dictionary, generates candidate character sequences based on the pronunciation word segments corresponding to each word segment in the speech to be recognized, and generates an index network according to the candidate character sequences.
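Expanding sound units with their fuzzy variants and enumerating candidate character sequences can be sketched as follows; the fuzzy-sound table and the toy dictionary entries are illustrative assumptions.

```python
from itertools import product

# Toy fuzzy-sound table: each sound unit maps to itself plus its fuzzy variants.
fuzzy_table = {"g": ["g", "j"], "ai": ["ai", "ei"]}

def candidate_pinyins(units):
    """Combine each sound unit with its fuzzy variants into candidate pinyin."""
    variants = [fuzzy_table.get(u, [u]) for u in units]
    return ["".join(combo) for combo in product(*variants)]

# Looking candidate pinyin up in a (toy) pronunciation dictionary and taking the
# cross product over word positions yields the candidate character sequences.
toy_dict = {"gai": ["盖", "改"], "gei": ["给"]}

def index_network(per_word_units):
    per_word_chars = []
    for units in per_word_units:
        chars = []
        for py in candidate_pinyins(units):
            chars.extend(toy_dict.get(py, []))
        per_word_chars.append(chars or ["<unk>"])
    return ["".join(seq) for seq in product(*per_word_chars)]

print(candidate_pinyins(["g", "ai"]))    # -> ['gai', 'gei', 'jai', 'jei']
print(index_network([["g", "ai"]]))      # -> ['盖', '改', '给']
```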
In this embodiment, the recognition results of the same word by the speech recognition model may be different due to the influence of the dialect, and multiple candidate recognition results can be obtained by the method provided by this embodiment, so that the target recognition result can be determined from the multiple candidate recognition results, and thus, the influence of the dialect on the recognition result can be effectively overcome.
In one embodiment, determining the service phrase that matches the speech to be recognized by character matching the indexing network with the set of service phrases includes: through carrying out character matching on the index network and the service term set, determining the service term matched with each candidate character sequence in the index network; calculating the offset distance of each candidate character sequence relative to the matched service term; screening a target text sequence from the candidate text sequences based on the offset distance; and judging the service expression matched with the target text sequence as the service expression matched with the voice to be recognized.
Specifically, the microphone box traverses each candidate character sequence in the index network and performs character matching between each candidate character sequence and each service phrase in the service phrase set, until the service phrase matched with every candidate character sequence in the index network has been determined. More specifically, the microphone box determines the candidate character sequence of the current traversal order and takes the service phrase having the largest number of repeated characters with that candidate character sequence as the service phrase matching it. For example, when the candidate character sequence of the current traversal order is "wash a head good" and the service phrases in the set are "wash head good" and "hair washing service is started", the service phrase having the largest number of repeated characters with "wash a head good" is "wash head good".
Further, the microphone box calculates the offset distance of each candidate word sequence relative to the matched service words, takes one candidate word sequence with the smallest offset distance as a target word sequence, and judges the service words matched with the target word sequence as the service words matched with the voice to be recognized.
In this embodiment, since the service term with the smallest offset distance is determined as the service term corresponding to the voice to be recognized, the service term screened based on the offset distance is the language text that can represent the voice to be recognized most, so that the target keyword determined based on the language text that can represent the voice to be recognized most accurately.
In one embodiment, the service content monitoring method further includes: determining all extracted target keywords; determining the generation time of each target keyword; a care report is generated based on the generation time and the target keyword.
Specifically, when it is determined that the home care service is completed, the microphone box acquires all target keywords extracted in the home care process, determines generation time of each target keyword, generates a care report according to each target keyword and the generation time of each target keyword, and then sends the generated care report to the served object.
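Assembling the care report from the extracted target keywords and their generation times might look like the following sketch; the record format and report layout are assumptions.

```python
from datetime import datetime

def generate_care_report(keyword_records):
    """keyword_records: list of (generation_time, target_keyword) tuples."""
    lines = ["Care report"]
    for ts, keyword in sorted(keyword_records):
        lines.append(f"{ts:%H:%M} - {keyword}")
    return "\n".join(lines)

records = [
    (datetime(2020, 9, 30, 9, 5), "hair washing"),
    (datetime(2020, 9, 30, 9, 40), "massage"),
]
print(generate_care_report(records))
```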
In this embodiment, by generating the nursing report, the family of the served object can know the specific service items provided by the service personnel in the home care service process according to the nursing report.
It should be understood that, although the steps in the flowcharts of fig. 2 and 4 are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2,4 may include steps or stages that are not necessarily performed at the same time, but may be performed at different times, or the order in which the steps or stages are performed is not necessarily sequential, but may be performed in rotation or alternatively with at least some of the other steps or stages.
In one embodiment, as shown in fig. 5, there is provided a service content monitoring apparatus 500, comprising: an index network generation module 502, a target keyword extraction module 504, and a service content determination module 506, wherein:
the index network generation module 502 is configured to obtain a service phrase set and a voice to be recognized; the service expression set comprises a plurality of service expressions; and recognizing the voice to be recognized based on the pre-trained voice recognition model to obtain an index network.
The target keyword extraction module 504 is configured to determine a service phrase matched with the voice to be recognized by performing character matching on the index network and the service phrase set; and extracting the target keywords from the service expressions matched with the voice to be recognized.
The service content determining module 506 is configured to determine service content of the service according to the target keyword.
In one embodiment, the index network generation module 502 further includes a model training module 5021 for obtaining a sample text corresponding to the sample speech and a pronunciation dictionary; the sample text comprises at least one word to be annotated; performing pronunciation marking on the word to be marked according to the pronunciation dictionary to obtain a label sequence; the speech recognition model is trained based on the sample speech and the corresponding tag sequence.
In one embodiment, the model training module 5021 is further configured to match the word to be annotated with the pronunciation dictionary, and determine whether a pronunciation word matched with the word to be annotated exists in the pronunciation dictionary based on the word matching result; when the pronunciation dictionary has pronunciation word segmentation matched with the word segmentation to be marked, marking the word segmentation to be marked according to the pronunciation label corresponding to the matched pronunciation word segmentation; when the pronunciation dictionary does not have pronunciation word segmentation matched with the word segmentation to be marked, segmenting the word segmentation to be marked based on a preset rule to obtain word segmentation fragments; and taking the word segmentation segment as a word to be marked, and returning to the step of matching the word to be marked with the pronunciation dictionary until the pronunciation dictionary has pronunciation word matched with the word to be marked.
In one embodiment, the model training module 5021 is further configured to obtain a first loss function of the speech separation enhancement model and a second loss function of the target recognition model; performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model; fusing the first loss function and the second loss function to obtain a target loss function; and carrying out joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and ending training when the preset convergence condition is met.
In one embodiment, the index network generation module 502 further includes a candidate text sequence determination module 5022, configured to extract a voice feature of the voice to be recognized, and determine pinyin of each word in the voice to be recognized based on the voice feature; pinyin consists of one to a plurality of sound units; determining a mapping relation between a sound unit in pinyin and a corresponding fuzzy sound; determining candidate character sequences according to the pronunciation dictionary and the mapping relation; an indexing network is generated based on the candidate word sequences.
In one embodiment, the target keyword extraction module 504 further includes an offset distance determination module 5041 for determining a service phrase to which each candidate word sequence in the index network matches by character matching the index network with the service phrase set; calculating the offset distance of each candidate character sequence relative to the matched service term; screening a target text sequence from the candidate text sequences based on the offset distance; and judging the service expression matched with the target text sequence as the service expression matched with the voice to be recognized.
In one embodiment, the service content monitoring apparatus 500 is further configured to determine all the extracted target keywords; determining the generation time of each target keyword; a care report is generated based on the generation time and the target keyword.
The specific limitation of the service content monitoring device may be referred to the limitation of the service content monitoring method hereinabove, and will not be described herein. The respective modules in the above-described service content monitoring apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, an electronic device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 6. The electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a service content monitoring method. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of a portion of the structure relevant to the present solution and does not limit the electronic device to which the present solution is applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, an electronic device is provided, which includes a memory and a processor. The memory stores a computer program, and the processor, when executing the computer program, implements the following steps (illustrated by the sketch after these steps):
acquiring a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized; and
determining the service content according to the target keyword.
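For orientation only, the sketch below strings the five steps together; every helper it defines is a trivial stub standing in for the corresponding step and is not an API described in this disclosure.

```python
# End-to-end sketch of the monitoring pipeline; all helpers are placeholders.
def recognize_to_index_network(audio: bytes) -> list[str]:
    return ["帮忙打扫厨房"]                      # candidate character sequences (stubbed)

def match_service_expression(candidates: list[str], expressions: list[str]) -> str:
    return next((e for e in expressions for c in candidates if e in c), "")

def extract_target_keyword(expression: str) -> str:
    return expression.split("打扫")[-1] if "打扫" in expression else expression

SERVICE_CONTENT = {"厨房": "kitchen cleaning"}   # keyword -> service content (stubbed)

def monitor_service_content(audio: bytes, expressions: list[str]) -> str:
    candidates = recognize_to_index_network(audio)             # index network
    expression = match_service_expression(candidates, expressions)
    keyword = extract_target_keyword(expression)               # target keyword
    return SERVICE_CONTENT.get(keyword, "unknown")             # service content

print(monitor_service_content(b"...", ["打扫厨房"]))
```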
In one embodiment, the processor, when executing the computer program, further implements the following steps (see the sketch after these steps):
acquiring a sample text corresponding to a sample voice, and a pronunciation dictionary, the sample text comprising at least one word segment to be annotated;
performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary to obtain a label sequence; and
training the speech recognition model based on the sample voice and the corresponding label sequence.
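A minimal sketch of producing a label sequence for one sample text is shown below; the dictionary entries and label format are invented for illustration, and the resulting (sample voice, label sequence) pairs would then feed the model training.

```python
# Build a pronunciation label sequence for a segmented sample text.
PRONUNCIATION_DICT = {"打扫": ["da3", "sao3"], "厨房": ["chu2", "fang2"]}

def label_sequence(sample_words: list[str]) -> list[str]:
    labels: list[str] = []
    for word in sample_words:
        labels.extend(PRONUNCIATION_DICT.get(word, ["<unk>"]))
    return labels

# ["打扫", "厨房"] -> ["da3", "sao3", "chu2", "fang2"]
print(label_sequence(["打扫", "厨房"]))
```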
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation labels; the processor, when executing the computer program, further implements the following steps (see the sketch after these steps):
performing word-segment matching between the word segment to be annotated and the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary;
when a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, annotating the word segment to be annotated according to the pronunciation label corresponding to the matching pronunciation word segment;
when no pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, splitting the word segment to be annotated based on a preset rule to obtain word-segment fragments; and
taking each word-segment fragment as a new word segment to be annotated and returning to the word-segment matching step, until a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary.
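The sketch below illustrates the annotation loop just described, assuming character-by-character splitting as the preset rule; the dictionary contents are invented.

```python
# Recursive annotation: dictionary hit returns labels directly, otherwise split
# the segment into fragments (here, single characters) and annotate each fragment.
PRONUNCIATION_DICT = {"打扫": ["da3", "sao3"], "卫": ["wei4"], "生": ["sheng1"], "间": ["jian1"]}

def annotate(segment: str) -> list[str]:
    if segment in PRONUNCIATION_DICT:             # matching pronunciation segment found
        return PRONUNCIATION_DICT[segment]
    if len(segment) == 1:                         # cannot split further
        return ["<unk>"]
    labels: list[str] = []
    for fragment in segment:                      # assumed preset rule: split into characters
        labels.extend(annotate(fragment))         # treat each fragment as a new segment
    return labels

print(annotate("打扫"))      # found directly in the dictionary
print(annotate("卫生间"))    # annotated via its fragments
```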
In one embodiment, the processor, when executing the computer program, further implements the following steps (see the sketch after these steps):
acquiring a first loss function of a speech separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model, thereby obtaining a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function; and
jointly training the speech separation enhancement model, the robust representation model, and the target recognition model based on the target loss function, and ending the training when a preset convergence condition is met.
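The following sketch (PyTorch assumed) illustrates only the fused-loss joint-training step; the three linear layers stand in for the actual sub-models, and the 0.5/0.5 fusion weights are assumptions.

```python
# Joint training sketch: fuse a separation loss and a recognition loss, then
# optimize all three sub-models together until a stopping condition is met.
import torch
from torch import nn

separation = nn.Linear(16, 16)      # stands in for the speech separation enhancement model
robust_repr = nn.Linear(16, 16)     # stands in for the robust representation (intermediate) model
recognizer = nn.Linear(16, 4)       # stands in for the target recognition model

params = list(separation.parameters()) + list(robust_repr.parameters()) + list(recognizer.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 16)              # toy noisy input batch
clean = torch.randn(8, 16)          # toy separation target
labels = torch.randint(0, 4, (8,))  # toy recognition target

for _ in range(10):                 # placeholder for "until a preset convergence condition"
    enhanced = separation(x)
    logits = recognizer(robust_repr(enhanced))
    first_loss = nn.functional.mse_loss(enhanced, clean)        # separation loss
    second_loss = nn.functional.cross_entropy(logits, labels)   # recognition loss
    target_loss = 0.5 * first_loss + 0.5 * second_loss          # fused target loss (assumed weights)
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
```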
In one embodiment, the processor, when executing the computer program, further implements the following steps:
extracting a voice feature of the voice to be recognized, and determining the pinyin of each word in the voice to be recognized based on the voice feature, where each pinyin consists of one or more sound units;
determining a mapping relation between a sound unit in the pinyin and its corresponding fuzzy sound;
determining candidate character sequences according to the pronunciation dictionary and the mapping relation; and
generating an index network based on the candidate character sequences.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
determining the service expression matched with each candidate character sequence in the index network by performing character matching between the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening a target character sequence from the candidate character sequences based on the offset distances; and
judging the service expression matched with the target character sequence to be the service expression matched with the voice to be recognized.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
determining all of the extracted target keywords;
determining the generation time of each target keyword; and
generating a care report based on the generation times and the target keywords.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the following steps:
acquiring a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized; and
determining the service content according to the target keyword.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
acquiring a sample text corresponding to a sample voice, and a pronunciation dictionary, the sample text comprising at least one word segment to be annotated;
performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary to obtain a label sequence; and
training the speech recognition model based on the sample voice and the corresponding label sequence.
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation labels; the computer program, when executed by the processor, further implements the following steps:
performing word-segment matching between the word segment to be annotated and the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary;
when a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, annotating the word segment to be annotated according to the pronunciation label corresponding to the matching pronunciation word segment;
when no pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, splitting the word segment to be annotated based on a preset rule to obtain word-segment fragments; and
taking each word-segment fragment as a new word segment to be annotated and returning to the word-segment matching step, until a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
acquiring a first loss function of a speech separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model, thereby obtaining a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function; and
jointly training the speech separation enhancement model, the robust representation model, and the target recognition model based on the target loss function, and ending the training when a preset convergence condition is met.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
extracting a voice feature of the voice to be recognized, and determining the pinyin of each word in the voice to be recognized based on the voice feature, where each pinyin consists of one or more sound units;
determining a mapping relation between a sound unit in the pinyin and its corresponding fuzzy sound;
determining candidate character sequences according to the pronunciation dictionary and the mapping relation; and
generating an index network based on the candidate character sequences.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
determining the service expression matched with each candidate character sequence in the index network by performing character matching between the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening a target character sequence from the candidate character sequences based on the offset distances; and
judging the service expression matched with the target character sequence to be the service expression matched with the voice to be recognized.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
determining all of the extracted target keywords;
determining the generation time of each target keyword; and
generating a care report based on the generation times and the target keywords.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; the program, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above embodiments illustrate only several implementations of the present application; although they are described in detail, they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of service content monitoring, the method comprising:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set; wherein the matched service expression is obtained by character matching between the index network and candidate service expressions, the candidate service expressions are associated with a currently executing service item and are screened from the service expression set, the currently executing service item is determined based on the collection time of the voice to be recognized and the service time period of each service item, and the service time period of each service item is obtained by analyzing order data;
extracting a target keyword from the service expression matched with the voice to be recognized;
determining service content according to the target keyword;
wherein the recognizing the voice to be recognized based on the pre-trained speech recognition model to obtain the index network comprises:
extracting a voice feature of the voice to be recognized, and determining the pinyin of each word in the voice to be recognized based on the voice feature;
acquiring a fuzzy sound table, and querying all sound units in the pinyin of each word using the fuzzy sound table to obtain a mapping relation between a sound unit having a fuzzy sound and the fuzzy sound;
combining the sound units and the fuzzy sounds based on the mapping relation to obtain one or more candidate pinyins corresponding to each word;
querying a pronunciation dictionary for the pronunciation word segments corresponding to the candidate pinyins, generating candidate character sequences based on the pronunciation word segments corresponding to the word segments in the voice to be recognized, and generating an index network according to the candidate character sequences; wherein a candidate character sequence comprises N-1 characters and the character following the N-1 characters; the N-1 characters are determined through a pinyin sequence, and the following character is predicted based on the N-1 characters.
2. The method of claim 1, wherein the training step of the speech recognition model comprises:
acquiring a sample text corresponding to a sample voice, and a pronunciation dictionary; the sample text comprises at least one word segment to be annotated;
performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary to obtain a label sequence; and
training the speech recognition model based on the sample voice and the corresponding label sequence.
3. The method of claim 2, wherein the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation labels; and the performing pronunciation annotation on the word segment to be annotated according to the pronunciation dictionary comprises:
performing word-segment matching between the word segment to be annotated and the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary;
when a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, annotating the word segment to be annotated according to the pronunciation label corresponding to the matching pronunciation word segment;
when no pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary, splitting the word segment to be annotated based on a preset rule to obtain word-segment fragments; and
taking each word-segment fragment as a new word segment to be annotated and returning to the step of word-segment matching between the word segment to be annotated and the pronunciation dictionary, until a pronunciation word segment matching the word segment to be annotated exists in the pronunciation dictionary.
4. The method of claim 1, wherein the speech recognition model comprises a speech separation enhancement model and a target recognition model, and the training step of the speech recognition model comprises:
acquiring a first loss function of the speech separation enhancement model and a second loss function of the target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model, thereby obtaining a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function; and
jointly training the speech separation enhancement model, the robust representation model, and the target recognition model based on the target loss function, and ending the training when a preset convergence condition is met.
5. The method of claim 1, wherein the speech recognition converts an input speech signal into corresponding text.
6. The method of claim 1, wherein the determining the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set comprises:
determining the service expression matched with each candidate character sequence in the index network by performing character matching between the index network and the service expression set;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening a target character sequence from the candidate character sequences based on the offset distances; and
judging the service expression matched with the target character sequence to be the service expression matched with the voice to be recognized.
7. The method according to claim 1, further comprising:
determining all of the extracted target keywords;
determining the generation time of each target keyword; and
generating a care report based on the generation times and the target keywords.
8. A service content monitoring apparatus, the apparatus comprising:
an index network generation module, configured to acquire a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions, and to recognize the voice to be recognized based on a pre-trained speech recognition model to obtain an index network;
a target keyword extraction module, configured to determine the service expression matched with the voice to be recognized by performing character matching between the index network and the service expression set, and to extract a target keyword from the service expression matched with the voice to be recognized; wherein the matched service expression is obtained by character matching between the index network and candidate service expressions, the candidate service expressions are associated with a currently executing service item and are screened from the service expression set, the currently executing service item is determined based on the collection time of the voice to be recognized and the service time period of each service item, and the service time period of each service item is obtained by analyzing order data; and
a service content determination module, configured to determine service content according to the target keyword;
wherein the index network generation module is further configured to: extract a voice feature of the voice to be recognized and determine the pinyin of each word in the voice to be recognized based on the voice feature; acquire a fuzzy sound table, and query all sound units in the pinyin of each word using the fuzzy sound table to obtain a mapping relation between a sound unit having a fuzzy sound and the fuzzy sound; combine the sound units and the fuzzy sounds based on the mapping relation to obtain one or more candidate pinyins corresponding to each word; and query a pronunciation dictionary for the pronunciation word segments corresponding to the candidate pinyins, generate candidate character sequences based on the pronunciation word segments corresponding to the word segments in the voice to be recognized, and generate an index network according to the candidate character sequences; wherein a candidate character sequence comprises N-1 characters and the character following the N-1 characters; the N-1 characters are determined through a pinyin sequence, and the following character is predicted based on the N-1 characters.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202011060127.4A 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium Active CN112331207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060127.4A CN112331207B (en) 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060127.4A CN112331207B (en) 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112331207A CN112331207A (en) 2021-02-05
CN112331207B true CN112331207B (en) 2024-08-30

Family

ID=74313342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060127.4A Active CN112331207B (en) 2020-09-30 2020-09-30 Service content monitoring method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112331207B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530414B (en) * 2021-02-08 2021-05-25 数据堂(北京)科技股份有限公司 Iterative large-scale pronunciation dictionary construction method and device
CN113380231B (en) * 2021-06-15 2023-01-24 北京一起教育科技有限责任公司 Voice conversion method and device and electronic equipment
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103794211A (en) * 2012-11-02 2014-05-14 北京百度网讯科技有限公司 Voice recognition method and system
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111652775A (en) * 2020-05-07 2020-09-11 上海奥珩企业管理有限公司 Method for constructing household service process management system model

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5875426A (en) * 1996-06-12 1999-02-23 International Business Machines Corporation Recognizing speech having word liaisons by adding a phoneme to reference word models
CN1787070B (en) * 2005-12-09 2011-03-16 北京凌声芯语音科技有限公司 On-chip system for language learner
CN102867512A (en) * 2011-07-04 2013-01-09 余喆 Method and device for recognizing natural speech
US9536049B2 (en) * 2012-09-07 2017-01-03 Next It Corporation Conversational virtual healthcare assistant
CN103458056B (en) * 2013-09-24 2017-04-26 世纪恒通科技股份有限公司 Speech intention judging system based on automatic classification technology for automatic outbound system
CN103578464B (en) * 2013-10-18 2017-01-11 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN103700369B (en) * 2013-11-26 2016-08-31 科大讯飞股份有限公司 Phonetic navigation method and system
CN105869640B (en) * 2015-01-21 2019-12-31 上海墨百意信息科技有限公司 Method and device for recognizing voice control instruction aiming at entity in current page
US9583097B2 (en) * 2015-01-30 2017-02-28 Google Inc. Dynamic inference of voice command for software operation from help information
KR20180115976A (en) * 2017-04-14 2018-10-24 아주대학교산학협력단 Method of operating server in nursing home system to share recipient’s information among chief of nursing home, care provider and guardian
CN107170444A (en) * 2017-06-15 2017-09-15 上海航空电器有限公司 Aviation cockpit environment self-adaption phonetic feature model training method
CN108288468B (en) * 2017-06-29 2019-07-19 腾讯科技(深圳)有限公司 Audio recognition method and device
US10552546B2 (en) * 2017-10-09 2020-02-04 Ricoh Company, Ltd. Speech-to-text conversion for interactive whiteboard appliances in multi-language electronic meetings
KR102097118B1 (en) * 2018-08-28 2020-04-10 충남대학교산학협력단 METHOD AND APPARATUS FOR TOPIC DETECTION IN DATA STREAM OF Social Network Service
CN108986790A (en) * 2018-09-29 2018-12-11 百度在线网络技术(北京)有限公司 The method and apparatus of voice recognition of contact
CN109961792B (en) * 2019-03-04 2022-01-11 阿波罗智联(北京)科技有限公司 Method and apparatus for recognizing speech
CN110310631A (en) * 2019-06-28 2019-10-08 北京百度网讯科技有限公司 Audio recognition method, device, server and storage medium
CN111341305B (en) * 2020-03-05 2023-09-26 苏宁云计算有限公司 Audio data labeling method, device and system
CN111429912B (en) * 2020-03-17 2023-02-10 厦门快商通科技股份有限公司 Keyword detection method, system, mobile terminal and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103794211A (en) * 2012-11-02 2014-05-14 北京百度网讯科技有限公司 Voice recognition method and system
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111652775A (en) * 2020-05-07 2020-09-11 上海奥珩企业管理有限公司 Method for constructing household service process management system model

Also Published As

Publication number Publication date
CN112331207A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
WO2017076211A1 (en) Voice-based role separation method and device
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN112331207B (en) Service content monitoring method, device, electronic equipment and storage medium
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
CN110992959A (en) Voice recognition method and system
CN113327575A (en) Speech synthesis method, device, computer equipment and storage medium
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN112309398B (en) Method and device for monitoring working time, electronic equipment and storage medium
CN113611286B (en) Cross-language speech emotion recognition method and system based on common feature extraction
CN113823265A (en) Voice recognition method and device and computer equipment
Wazir et al. Deep learning-based detection of inappropriate speech content for film censorship
CN117711376A (en) Language identification method, system, equipment and storage medium
CN117542358A (en) End-to-end-based human-robot voice interaction system
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN112309374B (en) Service report generation method, device and computer equipment
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition
CN117223052A (en) Keyword detection method based on neural network
CN113920987A (en) Voice recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant