CN110085221A - Speech emotion interaction method, computer device and computer-readable storage medium - Google Patents
Speech emotion interaction method, computer device and computer-readable storage medium
- Publication number
- CN110085221A (application CN201810079429.2A)
- Authority
- CN
- China
- Prior art keywords
- emotion identification
- identification result
- audio
- mood
- emotional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000002996 emotional effect Effects 0.000 title claims abstract description 119
- 238000000034 method Methods 0.000 title claims abstract description 97
- 238000003860 storage Methods 0.000 title claims abstract description 10
- 230000008451 emotion Effects 0.000 claims abstract description 256
- 230000036651 mood Effects 0.000 claims abstract description 184
- 230000002452 interceptive effect Effects 0.000 claims abstract description 46
- 238000004458 analytical method Methods 0.000 claims abstract description 8
- 238000004590 computer program Methods 0.000 claims description 13
- 230000001755 vocal effect Effects 0.000 claims description 8
- 230000009471 action Effects 0.000 claims description 2
- 230000003993 interaction Effects 0.000 abstract description 26
- 230000008569 process Effects 0.000 description 33
- 238000010586 diagram Methods 0.000 description 15
- 238000004364 calculation method Methods 0.000 description 14
- 208000019901 Anxiety disease Diseases 0.000 description 11
- 230000036506 anxiety Effects 0.000 description 11
- 239000002609 medium Substances 0.000 description 11
- 238000001228 spectrum Methods 0.000 description 9
- 238000004422 calculation algorithm Methods 0.000 description 8
- 230000006870 function Effects 0.000 description 8
- 230000008859 change Effects 0.000 description 5
- 239000000284 extract Substances 0.000 description 5
- 238000013461 design Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 4
- 238000012545 processing Methods 0.000 description 4
- 238000013019 agitation Methods 0.000 description 3
- 238000005311 autocorrelation function Methods 0.000 description 3
- 238000001514 detection method Methods 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 2
- 230000036541 health Effects 0.000 description 2
- 238000012880 independent component analysis Methods 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 230000005236 sound signal Effects 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 208000024891 symptom Diseases 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 206010027940 Mood altered Diseases 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000037007 arousal Effects 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000002612 cardiopulmonary effect Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000008921 facial expression Effects 0.000 description 1
- 238000009432 framing Methods 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 206010025482 malaise Diseases 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000011430 maximum method Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 239000012120 mounting media Substances 0.000 description 1
- 210000005036 nerve Anatomy 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 238000007670 refining Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000029058 respiratory gaseous exchange Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000003595 spectral effect Effects 0.000 description 1
- 230000002459 sustained effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/227—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Embodiments of the invention provide a speech emotion interaction method, a computer device and a computer-readable storage medium, addressing the problem that prior-art intelligent interaction cannot analyze the deeper intention behind a user message or provide a more humane interactive experience. The speech emotion interaction method includes: obtaining an emotion recognition result from a user speech message, where the emotion recognition result includes at least an audio emotion recognition result, or at least an audio emotion recognition result and a text emotion recognition result; performing intention analysis on the text content of the user speech message to obtain corresponding basic intention information; determining corresponding emotion intention information from the emotion recognition result and the basic intention information; and determining a corresponding interaction instruction from the emotion intention information, or from the emotion intention information together with the basic intention information.
Description
Technical field
The present invention relates to the technical field of intelligent interaction, and in particular to a speech emotion interaction method, a computer device and a computer-readable storage medium.
Background art
With the continuous development of artificial-intelligence technology and users' rising expectations for interactive experiences, intelligent interaction is gradually replacing traditional human-computer interaction and has become a research hotspot. However, existing intelligent interaction can only roughly analyze the semantic content of a user message; it cannot identify the user's current emotional state, and therefore can neither analyze the deeper emotional need the message actually expresses nor provide a more humane interactive experience accordingly. For example, a user rushing to catch a flight whose emotional state is anxious and a user just starting to plan a trip whose emotional state is calm naturally expect different kinds of replies when they ask about flight times; yet under an existing, purely semantic interaction scheme both users receive the same reply, such as simply reading out the corresponding flight schedule.
Summary of the invention
In view of this, embodiments of the invention provide a speech emotion interaction method, a computer device and a computer-readable storage medium, solving the problem that prior-art intelligent interaction cannot analyze the deeper intention of a user message or provide a more humane interactive experience.
A speech emotion interaction method provided by an embodiment of the invention includes:
obtaining an emotion recognition result from a user message, the user message including at least a user speech message;
performing intention analysis on the text content of the user speech message to obtain corresponding basic intention information; and
determining a corresponding interaction instruction from the emotion recognition result and the basic intention information.
An intelligent interaction apparatus provided by an embodiment of the invention includes:
an emotion recognition module, configured to obtain an emotion recognition result from a user message, the user message including at least a user speech message;
a basic intention recognition module, configured to perform intention analysis on the text content of the user speech message to obtain corresponding basic intention information; and
an interaction instruction determination module, configured to determine a corresponding interaction instruction from the emotion recognition result and the basic intention information.
A computer device provided by an embodiment of the invention includes a memory, a processor, and a computer program stored in the memory and executed by the processor, the processor implementing the steps of the method described above when executing the computer program.
A computer-readable storage medium provided by an embodiment of the invention stores a computer program that, when executed by a processor, implements the steps of the method described above.
The speech emotion interaction method, computer device and computer-readable storage medium provided by embodiments of the invention combine, on top of an understanding of the user's basic intention information, an emotion recognition result obtained from the user speech message, further determine corresponding emotion intention information from the basic intention information and the emotion recognition result, and determine a corresponding emotion-bearing interaction instruction, thereby solving the problem that prior-art intelligent interaction cannot analyze the deeper intention of a user message or provide a more humane interactive experience.
Brief description of the drawings
Fig. 1 is a flow diagram of a speech emotion interaction method provided by an embodiment of the invention.
Fig. 2 is a flow diagram of determining an emotion recognition result in a speech emotion interaction method provided by an embodiment of the invention.
Fig. 3 is a flow diagram of determining an emotion recognition result in a speech emotion interaction method provided by another embodiment of the invention.
Fig. 4 is a flow diagram of obtaining an audio emotion recognition result from the audio data of a user speech message in a speech emotion interaction method provided by an embodiment of the invention.
Fig. 5 is a flow diagram of building emotion feature models in a speech emotion interaction method provided by an embodiment of the invention.
Fig. 6 is a flow diagram of extracting a user speech message in a speech emotion recognition method provided by an embodiment of the invention.
Fig. 7 is a flow diagram of determining a speech start frame and a speech end frame in a speech emotion interaction method provided by an embodiment of the invention.
Fig. 8 is a flow diagram of detecting voiced and unvoiced frames in a speech emotion interaction method provided by an embodiment of the invention.
Fig. 9 is a flow diagram of obtaining basic intention information from a user speech message in a speech emotion interaction method provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Fig. 1 is a flow diagram of a speech emotion interaction method provided by an embodiment of the invention. As shown in Fig. 1, the speech emotion interaction method includes the following steps:
Step 101: obtain an emotion recognition result from a user speech message, where the emotion recognition result includes at least an audio emotion recognition result, or includes at least an audio emotion recognition result and a text emotion recognition result.
A user speech message is voice input by the user during the interaction, or acquired voice information relevant to the user's interaction intention and needs. For example, in a call-center customer-service scenario, the user message may specifically be a user speech message uttered by the user, and the user here may be either the customer or the service side. In an intelligent-robot interaction scenario, the user message may include information entered by the user through the robot's input module (such as text or speech) or user information collected by the robot's acquisition module (such as facial expressions or actions). The present invention does not limit the specific source or form of the user speech message.
Because the audio data of user speech messages uttered in different emotional states carry different audio features, an audio emotion recognition result can be obtained from the audio data of the user speech message, and the emotion recognition result can then be determined from the audio emotion recognition result.
The emotion recognition result obtained from the user message is subsequently combined with the basic intention information to infer the user's emotional intention.
Step 102: perform intention analysis on the user message to obtain corresponding basic intention information.
The basic intention information corresponds to the intention the user message directly reflects, but it cannot reflect the user's true emotional need in the current state; it therefore has to be combined with the emotion recognition result to determine the deeper intention and emotional need the user message actually expresses. For example, for a user rushing to catch a flight whose emotional state is anxious and a user just starting to plan a trip whose emotional state is calm, when both utter a speech message asking for flight information, the resulting basic intention information is the same - a flight-information query - yet the emotional needs of the two users are clearly different.
It should be understood that the specific content of the basic intention information, and the way it is obtained, may differ with the specific form of the user message. For example, when the user message includes a user speech message, the basic intention information can be obtained by performing intention analysis on the text content of the user speech message; it is the intention reflected by the text content at the semantic level and carries no emotional component.
In an embodiment of the invention, to further improve the accuracy of the basic intention information, the intention analysis may be performed on the current user speech message in combination with previous user speech messages and/or subsequent user speech messages. For example, the intention of the current user speech message may lack certain keywords or slot values, but these can be recovered from previous and/or subsequent user speech messages, as sketched below. For instance, if the current user speech message is "What specialty is there?", the subject slot is missing; by combining it with the previous user speech message "How is the weather in Changzhou?", "Changzhou" can be extracted as the subject, so the basic intention information of the current user speech message becomes "What specialty does Changzhou have?".
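A minimal sketch of this slot completion from dialogue history. The Intent structure, slot names and extract-style helpers are hypothetical stand-ins, not the patent's actual intent-analysis component.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Intent:
    action: str                              # e.g. "query_specialty"
    slots: dict = field(default_factory=dict)


def complete_intent(current: Intent, history: List[Intent]) -> Intent:
    """Fill slots missing from the current intent with the most recent
    value seen in earlier turns of the same dialogue."""
    for name, value in current.slots.items():
        if value is not None:
            continue
        for past in reversed(history):       # most recent turn first
            if past.slots.get(name):
                current.slots[name] = past.slots[name]
                break
    return current


# "How is the weather in Changzhou?" -> {"location": "Changzhou"}
# "What specialty is there?"         -> {"location": None}
history = [Intent("query_weather", {"location": "Changzhou"})]
current = Intent("query_specialty", {"location": None})
print(complete_intent(current, history).slots)   # {'location': 'Changzhou'}
```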
Step 103: determine corresponding emotion intention information from the emotion recognition result and the basic intention information.
Step 104: determine a corresponding interaction instruction from the emotion intention information, or from the emotion intention information and the basic intention information. The emotion intention information here can have specific content.
The correspondence between, on the one hand, the emotion recognition result and the basic intention information and, on the other, the interaction instruction can be established through a pre-learning process. In an embodiment of the invention, the content and form of the interaction instruction include one or more of the following emotion presentation modes: text output, music playback, speech, image, and mechanical action. It should be understood, however, that the specific emotion presentation mode of the interaction instruction can be adjusted to the needs of the interaction scenario; the present invention does not limit the specific content or form of the interaction instruction.
Specifically, the content of the emotion intention information is intention information that carries emotion; it can reflect the emotional need of the user message while also reflecting the basic intention. The correspondence between the emotion intention information on one side and the emotion recognition result and basic intention information on the other can be established in advance through a pre-learning process. In an embodiment of the invention, the emotion intention information may include affective-need information corresponding to the emotion recognition result, or may include both that affective-need information and the association between the emotion recognition result and the basic intention information. The association between the emotion recognition result and the basic intention information may be preset (for example by rules or logical judgment). For example, when the emotion recognition result is "anxiety" and the basic intention information is "report a lost credit card", the determined emotion intention information may include the association "reporting a lost credit card; the user is very anxious; the credit card may be lost or stolen", while the determined affective-need information may be "comfort". The association between the emotion recognition result and the basic intention information may also be a model obtained through a specific training process (for example an end-to-end model that directly outputs the emotion intention when given the emotion recognition result and the basic intention information as input). Such a model may be a fixed deep network model (for example one incorporating preset rules), or it may be updated continuously through online learning (for example a reinforcement-learning model with an objective function and a reward function, which keeps evolving as the number of human-computer interactions grows).
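A hedged sketch of the rule-based variant just described: a lookup from (emotion class, basic intent) to an emotion-intent record. The keys, wording and fallback behaviour are illustrative assumptions, not the patent's actual rule set.

```python
# Illustrative rule table: (emotion, basic intent) -> emotion intention record.
EMOTION_INTENT_RULES = {
    ("anxiety", "report_lost_credit_card"): {
        "association": "reporting a lost credit card; the user is anxious, "
                       "the card may be lost or stolen",
        "affective_need": "comfort",
    },
    ("calm", "query_flight_times"): {
        "association": "routine flight-time enquiry",
        "affective_need": "none",
    },
}


def infer_emotion_intent(emotion: str, basic_intent: str) -> dict:
    # Fall back to the basic intent alone when no rule matches.
    return EMOTION_INTENT_RULES.get(
        (emotion, basic_intent),
        {"association": basic_intent, "affective_need": "none"},
    )


print(infer_emotion_intent("anxiety", "report_lost_credit_card"))
```

An end-to-end learned model, as also mentioned above, would replace this table with a trained mapping; the table form simply makes the association explicit and auditable.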
It should be understood, however, that the emotion intention information may also exist only as an identifier in a mapping. The correspondence between the emotion intention information and the interaction instruction, and between the emotion intention information plus the basic intention information and the interaction instruction, can likewise be established in advance through a pre-learning process.
It should be understood that in some application scenarios the feedback content of the emotion intention information needs to be displayed. For example, in some customer-service scenarios, the emotion intention information analyzed from the customer's speech must be presented to the service agent as a prompt; the corresponding emotion intention information must then be determined and its feedback content displayed. In other application scenarios, a corresponding interaction instruction is issued directly and the feedback content of the emotion intention information need not be displayed; in that case the corresponding interaction instruction can be determined directly from the emotion recognition result and the basic intention information, without generating emotion intention information.
In an embodiment of the invention, to further improve the accuracy of the emotion intention information, the corresponding emotion intention information may be determined from the emotion recognition result and basic intention information of the current user speech message together with the emotion recognition results and basic intention information of previous and/or subsequent user speech messages. This requires recording the emotion recognition result and basic intention information of the current user speech message in real time, so that they can serve as a reference when emotion intention information is determined for other user speech messages. For example, if the current user speech message is "How can I withdraw cash without a bank card?" and the emotion recognition result is "anxiety", the reason for the anxiety cannot be judged from the current message alone. Previous and/or subsequent user speech messages can then be traced; if a previous user speech message turns out to be "How do I report a lost bank card?", it can be inferred that the emotion intention information is "losing the bank card has caused anxiety; the user wants to know how to report the loss or how to withdraw cash without the card". An interaction instruction can then be generated for this emotion intention information, for example playing the comforting message "For card-less withdrawal please follow these steps, and please don't worry: a lost bank card can also be handled as follows ...".
In an embodiment of the invention, to further improve the accuracy of the interaction instruction, the corresponding interaction instruction may be determined from the emotion intention information and basic intention information of the current user speech message together with the emotion intention information and basic intention information of previous and/or subsequent user speech messages. This again requires recording the emotion recognition result and basic intention information of the current user speech message in real time, so that they can serve as a reference when interaction instructions are determined for other user speech messages.
It can be seen that the speech emotion interaction method provided by embodiments of the invention, on the basis of understanding the user's basic intention information, combines the emotion recognition result obtained from the user message and further infers the user's emotional intention, or directly issues an emotion-bearing interaction instruction from the basic intention information and the emotion recognition result, thereby solving the problem that prior-art intelligent interaction cannot analyze the deeper intention and emotional need of a user message or provide a more humane interactive experience.
In an embodiment of the invention, an audio emotion recognition result is obtained from the audio data of the user speech message, and the emotion recognition result is determined from that audio emotion recognition result. In another embodiment of the invention, when the user message includes a user speech message, the emotion recognition result can be determined jointly from the audio emotion recognition result and the text emotion recognition result. Specifically, an audio emotion recognition result is obtained from the audio data of the user speech message, a text emotion recognition result is obtained from the text content of the user speech message, and the two are then combined to determine the emotion recognition result. As noted above, however, the final emotion recognition result may also be determined from the audio emotion recognition result alone; the present invention is not limited in this respect.
Further, the audio emotion recognition result includes one or more of a plurality of emotion classifications, or corresponds to a coordinate point in a multidimensional emotion space;
or, the audio emotion recognition result and the text emotion recognition result each include one or more of a plurality of emotion classifications, or each correspond to a coordinate point in a multidimensional emotion space;
where each dimension of the multidimensional emotion space corresponds to a psychologically defined emotional factor, and each emotion classification includes a plurality of emotional intensity levels.
It should be understood that the audio emotion recognition result and the text emotion recognition result can be characterized in several ways. In an embodiment of the invention, discrete emotion classifications are used, in which case the audio and text emotion recognition results each include one or more of a plurality of emotion classifications. For example, in a customer-service interaction scenario the emotion classifications may include satisfaction, calm and irritation, corresponding to the emotional states a user is likely to show, or satisfaction, calm, irritation and anger, corresponding to the emotional states a service agent is likely to show. It should be understood, however, that the type and number of emotion classifications can be adjusted to the actual application scenario; the present invention places no strict limit on them. In a further embodiment, each emotion classification may also include a plurality of emotional intensity levels. Specifically, emotion classification and emotional intensity level can be regarded as two dimensional parameters: they may be independent of one another (for example, each emotion classification has N corresponding intensity levels such as mild, moderate and severe), or they may have a preset correspondence (for example, the "irritation" classification includes three intensity levels - mild, moderate and severe - while the "satisfaction" classification includes only two, moderate and severe). The emotional intensity level can thus be regarded as an attribute of the emotion classification: once an emotion classification has been determined through the emotion recognition process, its emotional intensity level is also determined, as illustrated by the sketch below.
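A small sketch of such a discrete representation, pairing an emotion class with an intensity level where each class may allow a different set of levels. Class names, level names and the confidence field are illustrative assumptions.

```python
# Allowed intensity levels per emotion class (illustrative preset correspondence).
ALLOWED_LEVELS = {
    "irritation": ("mild", "moderate", "severe"),
    "satisfaction": ("moderate", "severe"),
    "calm": ("moderate",),
}


class EmotionResult:
    def __init__(self, emotion: str, intensity: str, confidence: float):
        if intensity not in ALLOWED_LEVELS[emotion]:
            raise ValueError(f"{intensity!r} is not defined for {emotion!r}")
        self.emotion, self.intensity, self.confidence = emotion, intensity, confidence

    def __repr__(self):
        return f"EmotionResult({self.emotion}, {self.intensity}, {self.confidence:.2f})"


print(EmotionResult("irritation", "moderate", 0.82))
```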
In another embodiment of the invention, a non-discrete dimensional emotion model can be used to characterize the emotion recognition result. In that case the audio and text emotion recognition results each correspond to a coordinate point in a multidimensional emotion space, where each dimension corresponds to a psychologically defined emotional factor. For example, the PAD (Pleasure-Arousal-Dominance) three-dimensional emotion model may be used. This model holds that emotion has the three dimensions of pleasure, arousal and dominance, and that every emotion can be characterized by the emotional factors corresponding to these three dimensions. P is the pleasure dimension, indicating whether the individual's emotional state is positive or negative; A is the arousal dimension, indicating the individual's level of neurophysiological activation; D is the dominance dimension, indicating the individual's degree of control over the situation and over others. It should be understood that the audio and text emotion recognition results can also be characterized in other ways; the present invention does not limit the specific characterization.
Fig. 2 is a flow diagram of determining the emotion recognition result in a speech emotion interaction method provided by an embodiment of the invention. In this embodiment the user message includes at least a user speech message, the emotion recognition result is determined jointly from the audio emotion recognition result and the text emotion recognition result, and the audio and text emotion recognition results each include one or more of a plurality of emotion classifications. The method of determining the emotion recognition result then includes the following steps:
Step 201: if the audio emotion recognition result and the text emotion recognition result include the same emotion classification, take that emotion classification as the emotion recognition result.
Step 202: if the audio emotion recognition result and the text emotion recognition result include no emotion classification in common, take the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
It should be understood that although step 202 specifies taking both results together as the emotion recognition result when they share no emotion classification, other embodiments of the invention may adopt a more conservative interaction strategy, for example directly generating an error message or outputting no emotion recognition result, so as not to mislead the interaction process; the present invention places no strict limit on how the case of no common emotion classification is handled.
Fig. 3 is a flow diagram of determining the emotion recognition result in a speech emotion interaction method provided by another embodiment of the invention. In this embodiment the user message again includes at least a user speech message, the emotion recognition result is again determined jointly from the audio and text emotion recognition results, and the audio and text emotion recognition results each include one or more of a plurality of emotion classifications. The method of determining the emotion recognition result then includes the following steps:
Step 301: compute the confidence of each emotion classification in the audio emotion recognition result and the confidence of each emotion classification in the text emotion recognition result.
In statistics, confidence is also called reliability, confidence level or confidence coefficient. Because samples are random, conclusions drawn when estimating a population parameter from a sample are always uncertain. Interval estimation in mathematical statistics can therefore be used to estimate the probability that the error between an estimate and the population parameter stays within a given range; this probability is the confidence. For example, suppose a preset emotion classification is associated with a variable that characterizes it, i.e. different values of the variable map to different emotion classifications. To obtain the confidence of a speech emotion recognition result, multiple measurements of the variable are first obtained through multiple audio/text emotion recognition passes, and the mean of these measurements is taken as an estimate. Interval estimation is then used to estimate the probability that the error between this estimate and the true value of the variable lies within a certain range; the larger this probability, the more accurate the estimate, i.e. the higher the confidence of the current emotion classification.
Step 302: judge whether the highest-confidence emotion classification of the audio emotion recognition result is the same as the highest-confidence emotion classification of the text emotion recognition result. If so, execute step 303; otherwise execute step 304.
Step 303: take the highest-confidence emotion classification of the audio emotion recognition result (equivalently, of the text emotion recognition result) as the emotion recognition result.
For example, if the audio emotion recognition result includes satisfaction (confidence a1) and calm (confidence a2), the text emotion recognition result includes only satisfaction (confidence b1), and a1 > a2, then satisfaction is taken as the final emotion recognition result.
Step 304: compare the confidence of the highest-confidence emotion classification in the audio emotion recognition result with the confidence of the highest-confidence emotion classification in the text emotion recognition result.
In an embodiment of the invention, given the limitations of the specific emotion recognition algorithm and of the type and content of the user speech message in real application scenarios, one of the audio and text emotion recognition results can be selected as the primary output and the other as the auxiliary output, with confidence, emotional intensity level and other factors then used to determine the final emotion recognition result. It should be understood that which of the two is chosen as the primary output depends on the actual scenario; the present invention does not limit this choice.
In an embodiment of the invention, the audio emotion recognition result is taken as the primary output and the text emotion recognition result as the auxiliary output. Then, if the confidence of the highest-confidence emotion classification in the audio emotion recognition result is greater than that of the highest-confidence emotion classification in the text emotion recognition result, step 305 is executed; if it is smaller, step 306 is executed; if the two are equal, step 309 is executed.
Step 305: take the highest-confidence emotion classification of the audio emotion recognition result as the emotion recognition result.
Since the audio emotion recognition result has been selected as the primary output, its emotion classifications should be considered first; and since the confidence of its highest-confidence emotion classification is moreover greater than that of the text result, the highest-confidence emotion classification of the primary audio emotion recognition result can be taken as the emotion recognition result. For example, if the audio emotion recognition result includes satisfaction (confidence a1) and calm (confidence a2), the text emotion recognition result includes only calm (confidence b1), a1 > a2 and a1 > b1, then satisfaction is taken as the final emotion recognition result.
Step 306: judge whether the audio emotion recognition result includes the highest-confidence emotion classification of the text emotion recognition result. If so, execute step 307; otherwise execute step 309.
For example, if the audio emotion recognition result includes satisfaction (confidence a1) and calm (confidence a2), the text emotion recognition result includes only calm (confidence b1), a1 > a2 and a1 < b1, then it must be judged whether the audio emotion recognition result includes the highest-confidence classification of the text result, namely calm.
Step 307: further judge whether the emotional intensity level, within the audio emotion recognition result, of the text result's highest-confidence emotion classification is greater than a first intensity threshold. If so, execute step 308; otherwise execute step 309.
Step 308: take the highest-confidence emotion classification of the text emotion recognition result as the emotion recognition result.
Reaching step 308 means that the highest-confidence emotion classification of the text emotion recognition result not only has high confidence but also shows a clear emotional tendency, so it can be taken as the emotion recognition result.
Step 309: take the highest-confidence emotion classification of the audio emotion recognition result as the emotion recognition result, or take the highest-confidence emotion classifications of the audio and text emotion recognition results together as the emotion recognition result.
When the confidence of the audio result's highest-confidence classification equals that of the text result's highest-confidence classification, or the audio result does not include the text result's highest-confidence classification, or it does include it but that classification's emotional intensity level is not high enough, no single unified emotion classification can be produced from the audio and text emotion recognition results as the final emotion recognition result. In that case, in an embodiment of the invention, since the audio emotion recognition result has been selected as the primary output, its highest-confidence emotion classification is taken directly as the emotion recognition result. In another embodiment of the invention, the audio and text emotion recognition results may be taken together as the emotion recognition result, and the emotion recognition results and basic intention information of previous and/or subsequent user speech messages are then combined in the subsequent process to determine the corresponding emotion intention information.
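A hedged sketch of the Fig. 3 arbitration with audio as the primary channel. Each result is assumed to map emotion class to a (confidence, intensity level) pair; the numeric intensity scale and the threshold value are assumptions for illustration.

```python
FIRST_INTENSITY_THRESHOLD = 2            # e.g. 1 = mild, 2 = moderate, 3 = severe


def arbitrate(audio: dict, text: dict) -> set:
    """audio/text: {emotion_class: (confidence, intensity_level)}."""
    a_top = max(audio, key=lambda c: audio[c][0])
    t_top = max(text, key=lambda c: text[c][0])
    if a_top == t_top:                                   # step 303
        return {a_top}
    a_conf, t_conf = audio[a_top][0], text[t_top][0]
    if a_conf > t_conf:                                  # step 305
        return {a_top}
    if a_conf < t_conf and t_top in audio:               # steps 306-308
        if audio[t_top][1] > FIRST_INTENSITY_THRESHOLD:
            return {t_top}
    return {a_top, t_top}                                # step 309 (or {a_top} alone)


audio = {"satisfaction": (0.7, 3), "calm": (0.5, 2)}
text = {"calm": (0.8, 2)}
print(arbitrate(audio, text))
```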
In an embodiment of the invention, the audio and text emotion recognition results each correspond to a coordinate point in a multidimensional emotion space. In that case the coordinate values of the two points can be combined by weighted averaging, and the point obtained after the weighted averaging taken as the emotion recognition result. For example, with the PAD three-dimensional emotion model, if the audio emotion recognition result is characterized as (p1, a1, d1) and the text emotion recognition result as (p2, a2, d2), the final emotion recognition result may be characterized as ((p1+p2)/2, (a1+1.3*a2)/2, (d1+0.8*d2)/2), where 1.3 and 0.8 are weight coefficients. Using a non-discrete dimensional emotion model makes it easier to compute the final emotion recognition result in a quantitative way. It should be understood, however, that the combination is not limited to the weighted averaging above; the present invention does not limit the specific way the emotion recognition result is determined when the audio and text emotion recognition results each correspond to a coordinate point in the multidimensional emotion space.
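A small sketch of this weighted-average fusion on PAD coordinates; the weights 1.3 and 0.8 are simply the example values quoted above, and the input coordinates are arbitrary.

```python
def fuse_pad(audio_pad, text_pad, a_weight=1.3, d_weight=0.8):
    """((p1+p2)/2, (a1 + a_weight*a2)/2, (d1 + d_weight*d2)/2)."""
    p1, a1, d1 = audio_pad
    p2, a2, d2 = text_pad
    return ((p1 + p2) / 2,
            (a1 + a_weight * a2) / 2,
            (d1 + d_weight * d2) / 2)


print(fuse_pad((0.4, 0.2, -0.1), (0.6, 0.3, 0.0)))
```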
Fig. 4 is a flow diagram of obtaining an audio emotion recognition result from the audio data of a user speech message in a speech emotion interaction method provided by an embodiment of the invention. As shown in Fig. 4, the process of obtaining the audio emotion recognition result from the audio data of the user speech message includes:
Step 401: extract the audio feature vector of the user speech message in the audio stream to be recognized, where the user speech message corresponds to one utterance segment of the audio stream to be recognized.
The audio feature vector includes the values of one or more audio features along one or more vector directions. In effect, a multidimensional vector space is used to characterize all the audio features: in this space, the direction and magnitude of the audio feature vector can be regarded as the sum of the values of the individual audio features along their different vector directions, and the value of each audio feature along one vector direction is one component of the audio feature vector. User speech messages carrying different emotions necessarily have different audio features, and the present invention uses the correspondence between emotions and audio features to recognize the emotion of a user speech message. Specifically, the audio features may include one or more of the following: an energy feature, a voiced-frame-count feature, a fundamental-frequency feature, a formant feature, a harmonics-to-noise-ratio feature and a mel-frequency cepstral coefficient feature. In an embodiment of the invention, the following vector directions may be defined in the vector space: ratio, mean, maximum, median and standard deviation.
The energy feature is a power-spectrum feature of the user speech message and can be obtained by summing the power spectrum: E(k) = Σ_{j=1..N} P(k, j), where E is the energy value, k is the frame index, j is the frequency-bin index, N is the frame length and P is the power-spectrum value. In an embodiment of the invention, the energy feature may include the short-time-energy first-order difference and/or the proportion of energy below a preset frequency. The short-time-energy first-order difference may be computed as:
VE(k) = (-2*E(k-2) - E(k-1) + E(k+1) + 2*E(k+2)) / 3.
The proportion of energy below a preset frequency can be measured as a ratio; for example, the ratio of the band energy below 500 Hz to the total energy is:
Σ_{k=k1..k2} Σ_{j=1..j500} P(k, j) / Σ_{k=k1..k2} Σ_{j=1..N} P(k, j),
where j500 is the frequency-bin index corresponding to 500 Hz, k1 is the index of the speech start frame of the user speech message to be recognized, and k2 is the index of its speech end frame.
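A sketch of these energy features computed from a per-frame power spectrum P of shape (num_frames, num_bins), where the frames are assumed to already span the start frame k1 to the end frame k2; the FFT size and sample rate are placeholders.

```python
import numpy as np


def frame_energy(P):
    """E(k) = sum over j of P(k, j)."""
    return P.sum(axis=1)


def energy_first_order_diff(E):
    """VE(k) = (-2E(k-2) - E(k-1) + E(k+1) + 2E(k+2)) / 3, as given in the text.
    The first and last two frames are left at zero."""
    VE = np.zeros_like(E)
    for k in range(2, len(E) - 2):
        VE[k] = (-2 * E[k - 2] - E[k - 1] + E[k + 1] + 2 * E[k + 2]) / 3
    return VE


def low_band_energy_ratio(P, sample_rate, n_fft, cutoff_hz=500.0):
    """Energy below cutoff_hz divided by total energy across all frames."""
    j500 = int(cutoff_hz * n_fft / sample_rate)      # bin index of 500 Hz
    return P[:, :j500 + 1].sum() / P.sum()


P = np.abs(np.random.randn(100, 257)) ** 2           # fake power spectrum (n_fft = 512)
E = frame_energy(P)
print(energy_first_order_diff(E)[:5], low_band_energy_ratio(P, 16000, 512))
```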
The voiced-frame-count feature describes the overall number of voiced frames in the user speech message and can be measured as a ratio. For example, if the numbers of voiced frames and silent frames in the user speech message are n1 and n2 respectively, then the ratio of voiced frames to silent frames is p2 = n1/n2, and the ratio of voiced frames to total frames is p3 = n1/(n1+n2).
The fundamental-frequency feature can be extracted with an algorithm based on the autocorrelation function of the linear-prediction (LPC) error signal, and may include the fundamental frequency and/or its first-order difference. The fundamental-frequency algorithm may proceed as follows: first, compute the linear-prediction coefficients of a voiced frame x(k) and the linear-prediction estimate of the frame; second, compute the autocorrelation function c1 of the error signal (the difference between the frame and its linear-prediction estimate); then, within the offset range corresponding to fundamental frequencies of 80-500 Hz, find the maximum of the autocorrelation function and record the corresponding offset Δh. The fundamental frequency is then F0 = Fs/Δh, where Fs is the sampling frequency.
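A hedged sketch of this pitch estimate: LPC inverse filtering to obtain the prediction error, then an autocorrelation peak search restricted to lags corresponding to 80-500 Hz. The LPC order, frame length and the use of librosa for the LPC step are assumptions for illustration, not the patent's prescribed implementation.

```python
import numpy as np
import librosa
from scipy.signal import lfilter, sawtooth


def estimate_f0(frame, fs, lpc_order=12, fmin=80.0, fmax=500.0):
    a = librosa.lpc(frame, order=lpc_order)       # LPC coefficients, a[0] == 1
    error = lfilter(a, [1.0], frame)              # LPC residual (prediction error)
    c = np.correlate(error, error, mode="full")[len(error) - 1:]   # lags 0..N-1
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)              # 80-500 Hz range
    lag = lag_min + int(np.argmax(c[lag_min:lag_max + 1]))         # offset delta_h
    return fs / lag                                # F0 = Fs / delta_h


fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = sawtooth(2 * np.pi * 200 * t) + 0.01 * np.random.default_rng(0).standard_normal(t.size)
print(round(estimate_f0(frame, fs), 1))            # close to 200.0 for this test tone
```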
The formant feature can be extracted with an algorithm based on root-finding of the linear-prediction polynomial, and may include the first, second and third formants together with the first-order differences of the three formants. The harmonics-to-noise-ratio (HNR) feature can be extracted with an algorithm based on independent component analysis (ICA). The mel-frequency cepstral coefficient (MFCC) feature may include the 1st-12th mel-frequency cepstral coefficients, obtained with the usual MFCC computation procedure, which is not repeated here.
It should be understood that which audio features are extracted for the audio feature vector depends on the needs of the actual scenario; the present invention does not limit the type or number of audio features, or the vector directions, corresponding to the extracted audio feature vector. In an embodiment of the invention, however, to obtain the best emotion recognition performance, the six audio features above can all be extracted: the energy feature, the voiced-frame-count feature, the fundamental-frequency feature, the formant feature, the harmonics-to-noise-ratio feature and the MFCC feature. For example, when all six audio features are extracted, the audio feature vector may include the 173 components shown in Table 1 below; using this audio feature vector with a Gaussian mixture model (GMM) as the emotion feature model, the accuracy of speech emotion recognition on the CASIA Chinese emotion corpus can reach 74% to 80%.
Table 1
In an embodiment of the invention, the audio stream to be recognized may be a customer-service interaction audio stream, and the user speech message corresponds to either one user voice segment or one agent voice segment of that audio stream. Because customer-service interaction usually takes a question-and-answer form, a user voice segment may correspond to one question or answer from the user in an interaction, and an agent voice segment to one question or answer from the service agent. Since a single question or answer from the user or the agent is generally assumed to express a complete emotion, taking one user voice segment or one agent voice segment as the unit of emotion recognition both preserves the integrity of the recognized emotion and keeps emotion recognition real-time during the customer-service interaction.
Step 402: match the audio feature vector of the user speech message against a plurality of emotion feature models, where the emotion feature models correspond to different ones of the plurality of emotion classifications.
These emotion feature models can be built in advance by pre-learning the audio feature vectors of a plurality of preset user speech messages carrying emotion classification labels; this establishes the correspondence between emotion feature models and emotion classifications, each emotion feature model corresponding to one emotion classification. As shown in Fig. 5, the pre-learning process for building the emotion feature models may include: first, clustering the audio feature vectors of the preset user speech messages carrying the emotion classification labels to obtain clustering results for the preset emotion classifications (S51); then, according to the clustering results, training the audio feature vectors of the preset user speech messages in each cluster into one emotion feature model (S52). Based on these emotion feature models, the emotion feature model matching the current user speech message can be obtained through a matching process over audio feature vectors, and the corresponding emotion classification obtained in turn.
In an embodiment of the present invention, these emotion feature models may be Gaussian mixture models (GMMs), for example with 5 mixture components. The emotion feature vectors of the speech samples belonging to the same emotion classification can first be clustered with the K-means algorithm, and the initial values of the GMM parameters are computed from the clustering result (the number of iterations may be 50). The GMM corresponding to each emotion classification is then trained with the E-M algorithm (the number of iterations may be 200). When these GMMs are used in the emotion classification matching process, the likelihood between the audio feature vector of the current user speech message and each of the multiple emotion feature models can be computed, and the matched emotion feature model is then determined by comparing these likelihoods, for example by taking the emotion feature model whose likelihood exceeds a preset threshold and is the largest as the matched emotion feature model.
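A minimal sketch of this training-and-matching scheme is shown below, assuming scikit-learn's GaussianMixture as the GMM implementation (which performs K-means initialization when init_params="kmeans"). The function names, data shapes, and threshold are illustrative assumptions rather than the patent's code.

```python
# Sketch: one 5-component GMM per emotion class, matched by log-likelihood.
# Feature matrices are (num_samples, feature_dim) NumPy arrays.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_emotion_models(features_by_emotion, n_components=5):
    """features_by_emotion: dict mapping emotion label -> feature matrix."""
    models = {}
    for emotion, feats in features_by_emotion.items():
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type="diag",
            init_params="kmeans",   # K-means supplies the initial parameter values
            max_iter=200,           # E-M iterations, as suggested in the text
            random_state=0,
        )
        gmm.fit(feats)
        models[emotion] = gmm
    return models

def match_emotion(models, feature_vector, threshold=-1e9):
    """Return the emotion whose GMM gives the largest log-likelihood above threshold."""
    x = np.asarray(feature_vector).reshape(1, -1)
    scores = {emotion: float(gmm.score(x)) for emotion, gmm in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```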
It should be appreciated that although the description above takes the emotion feature model to be a Gaussian mixture model, the emotion feature model may in practice also be realized in other forms, such as a support vector machine (SVM) model, a K-nearest-neighbor (KNN) classification model, a hidden Markov model (HMM), or an artificial neural network (ANN) model.
In an embodiment of the present invention, the multiple emotion classifications may include: a satisfied classification, a calm classification, and an irritated classification, corresponding to the emotional states a user may exhibit in a customer-service interaction scenario. In another embodiment, the multiple emotion classifications may include: a satisfied classification, a calm classification, an irritated classification, and an angry classification, corresponding to the emotional states a customer-service agent may exhibit in a customer-service interaction scenario. That is, when the audio stream to be identified is a user-agent interaction audio stream in a customer-service scenario, if the current user speech message corresponds to an agent input speech segment, the multiple emotion classifications may include the satisfied, calm, and irritated classifications; if the current user speech message corresponds to a user input speech segment, the multiple emotion classifications may include the satisfied, calm, irritated, and angry classifications. Classifying user and agent emotions in this way is compact enough for a call-center system, reducing the amount of computation while still meeting the emotion recognition needs of the call-center system. It should be appreciated, however, that the type and number of these emotion classifications can be adjusted according to the actual application scenario.
Step 403: take the emotion classification corresponding to the matched emotion feature model, as determined by the matching result, as the emotion classification of the user speech message.
As described above, because there is a correspondence between emotion feature models and emotion classifications, once the matched emotion feature model has been determined by the matching process of step 402, the emotion classification corresponding to that matched emotion feature model is the identified emotion classification. For example, when the emotion feature models are Gaussian mixture models, the matching process can be realized by computing the likelihood between the audio feature vector of the current user speech message and each of the multiple emotion feature models, and then taking the emotion classification corresponding to the emotion feature model whose likelihood exceeds a preset threshold and is the largest as the emotion classification of the user speech message.
It can be seen that the speech emotion recognition method provided by this embodiment of the present invention extracts the audio feature vector of the user speech message in the audio stream to be identified and matches the extracted audio feature vector against pre-established emotion feature models, thereby realizing real-time emotion recognition of the user speech message.
It should also be understood that the emotion classification identified by the speech emotion recognition method of the embodiment of the present invention can be further combined with specific scenario requirements to realize more flexible secondary applications. In an embodiment of the present invention, the emotion classification of the currently identified user speech message can be displayed in real time, and the specific real-time display mode can be adjusted according to the actual scenario. For example, different emotion classifications can be represented by different signal-light colors, so that changes in the signal-light color remind the agent and the quality-inspection staff in real time of the emotional state of the current call. In another embodiment, the emotion classifications of the user speech messages identified within a preset time period can be aggregated: for example, the audio of a call recording is numbered, the timestamps of the start and end points of each user speech message are recorded together with the emotion recognition results, an emotion recognition database is ultimately formed, and the number and probability of occurrences of each emotion within a time period are counted and plotted as a curve or table, which the enterprise can use as a reference for judging the service quality of its agents over that period. In another embodiment, an emotion response message corresponding to the emotion classification of the identified user speech message can also be sent in real time, which is applicable to unattended machine customer-service scenarios. For example, when the user in the current call is identified in real time as being in an "angry" state, a soothing reply corresponding to the "angry" state is automatically returned to the user so as to calm the user's mood and keep the communication going. The correspondence between emotion classifications and emotion response messages can be established in advance through a pre-learning process.
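The statistics described above could be aggregated along the lines of the following sketch; the record format (audio id, start/end timestamps, emotion label) is assumed purely for illustration.

```python
# Sketch: count emotion occurrences and probabilities within a time window.
from collections import Counter

def emotion_statistics(records, window_start, window_end):
    """records: iterable of dicts like
    {"audio_id": "rec001", "start": 12.5, "end": 17.0, "emotion": "irritated"}."""
    in_window = [r for r in records
                 if r["start"] >= window_start and r["end"] <= window_end]
    counts = Counter(r["emotion"] for r in in_window)
    total = sum(counts.values()) or 1
    return {emotion: {"count": c, "probability": c / total}
            for emotion, c in counts.items()}
```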
In an embodiment of the present invention, before extracting the audio feature vector of the user speech message in the audio stream to be identified, the user speech message needs to be extracted from the audio stream to be identified first, so that subsequent emotion recognition can be performed with the user speech message as the unit. This extraction process can be performed in real time.
Fig. 6 shows a flow diagram of extracting the user speech message in the speech emotion recognition method provided by an embodiment of the present invention. As shown in Fig. 6, the method of extracting the user speech message includes:
Step 601: determine the speech start frame and the speech end frame in the audio stream to be identified.
The speech start frame is the start frame of a user speech message, and the speech end frame is the end frame of a user speech message. Once the speech start frame and the speech end frame have been determined, the portion between them is the user speech message to be extracted.
Step 602: extract the portion of the audio stream between the speech start frame and the speech end frame as the user speech message.
In an embodiment of the present invention, as shown in Fig. 7, the speech start frame and the speech end frame in the audio stream to be identified can be determined by the following steps:
Step 801: judge whether a speech frame in the audio stream to be identified is a voiced frame or a non-voiced frame.
In an embodiment of the present invention, the voiced/non-voiced decision can be based on a voice activity detection (VAD) decision parameter and the power-spectrum mean, as shown in Fig. 8, specifically as follows:
Step 8011: pre-process the audio stream to be identified by framing, windowing, pre-emphasis, and the like. A Hamming window can be used as the window function, and the pre-emphasis coefficient can be 0.97. Denote the pre-processed k-th frame signal as x(k) = [x(k·N), x(k·N+1), ..., x(k·N+N-1)], where N is the frame length, e.g., 256. It should be appreciated, however, that whether pre-processing is needed, and which pre-processing steps are needed, depend on the actual scenario; the present invention does not limit this.
Step 8012: apply the discrete Fourier transform (DFT) to the pre-processed k-th frame signal x(k) and compute its power spectrum, with the DFT length taken equal to the frame length:
P(k, j) = |FFT(x(k))|², j = 0, 1, ..., N-1;
where j is the index of the frequency bin.
Step 8013: compute the a posteriori SNR γ and the a priori SNR ξ:
ξ(k, j) = α·ξ(k-1, j) + (1-α)·max(γ(k, j) - 1, 0);
where the coefficient α = 0.98; λ is the background-noise power spectrum, whose initial value can be taken as the arithmetic mean of the power spectra of the first 5 to 10 frames; min(·) and max(·) are the minimum and maximum functions, respectively; and the a priori SNR ξ(k, j) can be initialized to 0.98.
Step 8014: compute the likelihood-ratio parameter η.
Step 8015: compute the VAD decision parameter Γ and the power-spectrum mean ρ. The VAD decision parameter can be initialized to 1.
Step 8016: judge whether the VAD decision parameter Γ(k) of the k-th frame signal is greater than or equal to a first preset VAD threshold, and whether ρ(k) is greater than or equal to a preset power-mean threshold. In an embodiment of the present invention, the first preset VAD threshold can be 5 and the preset power-mean threshold can be 0.01.
Step 8017: if both judgments in step 8016 are yes, the k-th frame of the audio signal is determined to be a voiced frame.
Step 8018: if at least one of the two judgments in step 8016 is no, the k-th frame of the audio signal is determined to be a silent (non-voiced) frame, and step 8019 is executed.
Step 8019: update the noise power spectrum λ by the following formula:
λ(k+1, j) = β·λ(k, j) + (1-β)·P(k, j);
where the coefficient β is a smoothing factor and can take the value 0.98.
It can be seen that, by continuously repeating the method steps shown in Fig. 8, voiced frames and non-voiced frames in the audio stream to be identified can be detected in real time. The recognition results for these voiced and non-voiced frames are the basis for the subsequent identification of the speech start frame and the speech end frame.
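Because the formulas for the a posteriori SNR γ, the likelihood-ratio parameter η, and the decision statistics Γ and ρ are not reproduced in the text above, the sketch below fills those gaps with assumed, commonly used forms from statistical-model VAD: γ(k, j) = P(k, j)/λ(j), a per-bin likelihood ratio η, Γ(k) as the geometric mean of the per-bin likelihood ratios, and ρ(k) as the mean of the frame's power spectrum. The coefficients follow the example values given in steps 8011 to 8019; everything else, including the noise-initialization scheme, is illustrative.

```python
# Sketch of the voiced / non-voiced frame decision (steps 8011-8019), with
# assumed formulas for the parts whose equations are not reproduced in the text.
import numpy as np

def vad_label_frames(signal, frame_len=256, alpha=0.98, beta=0.98,
                     vad_threshold=5.0, power_threshold=0.01, noise_init_frames=8):
    signal = np.asarray(signal, dtype=float)
    # Step 8011: pre-emphasis, framing, Hamming window.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = len(emphasized) // frame_len
    window = np.hamming(frame_len)

    xi = np.full(frame_len, 0.98)        # a priori SNR, initialized to 0.98
    noise_psd = None                     # background-noise power spectrum lambda
    labels = []

    for k in range(n_frames):
        frame = emphasized[k * frame_len:(k + 1) * frame_len] * window
        # Step 8012: power spectrum via a DFT of length equal to the frame length.
        power = np.abs(np.fft.fft(frame, n=frame_len)) ** 2

        if noise_psd is None:
            noise_psd = power.copy()     # initialize lambda from the first frames
        if k < noise_init_frames:
            noise_psd = (noise_psd * (k + 1) + power) / (k + 2)
            labels.append(False)
            continue

        # Step 8013: a posteriori SNR gamma = P / lambda (assumed definition)
        # and decision-directed a priori SNR xi.
        gamma = power / np.maximum(noise_psd, 1e-12)
        xi = alpha * xi + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)

        # Steps 8014-8015: per-bin likelihood ratio eta, decision statistic Gamma
        # (geometric mean of eta), and power-spectrum mean rho (assumed forms).
        eta = np.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
        vad_stat = np.exp(np.mean(np.log(np.maximum(eta, 1e-12))))
        rho = float(np.mean(power))

        # Steps 8016-8018: voiced only if both statistics exceed their thresholds.
        voiced = vad_stat >= vad_threshold and rho >= power_threshold
        labels.append(voiced)

        # Step 8019: update the noise power spectrum on non-voiced frames.
        if not voiced:
            noise_psd = beta * noise_psd + (1 - beta) * power

    return labels
```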
Step 802: after the speech end frame of the preceding user speech message has been determined, or when the current user speech message is the first user speech message of the audio stream to be identified, if a first preset number of speech frames are consecutively judged to be voiced frames, take the first speech frame among that first preset number of speech frames as the speech start frame of the current user speech message.
In an embodiment of the present invention, two end-point flags, flag_start and flag_end, can be set first, representing the detection-state variables of the speech start frame and the speech end frame respectively; true and false represent having occurred and not having occurred. When flag_end = true, the end frame of one user speech message has been determined, and detection of the start frame of the next user speech message begins. When the VAD decision parameters of 30 consecutive frames are all greater than or equal to the second preset threshold, those 30 frames have entered a user speech message; the first speech frame of those 30 frames is then taken as the speech start frame and flag_start = true; otherwise flag_start = false.
Step 803: after the speech start frame of the current user speech message has been determined, if a second preset number of speech frames are consecutively judged to be non-voiced frames, those frames no longer belong to the user speech message; take the first speech frame among that second preset number of speech frames as the speech end frame of the current user speech message.
Specifically, continuing the example above, when flag_start = true, a user speech message has been entered and its speech start frame has been determined, so checking for the end frame of the current user speech message begins. When the VAD decision parameters of 30 consecutive frames are all less than the third preset threshold, the end of the current user speech message is determined: flag_end = true, and the first frame of those 30 frames is the speech end frame; otherwise flag_end = false.
In an embodiment of the present invention, in order to further improve the accuracy of the speech start frame and speech end frame decisions and to avoid false judgments, the second preset threshold and the third preset threshold may both be made larger than the first preset threshold used in the voiced/non-voiced frame identification process described above; for example, the second preset threshold can be 40 and the third preset threshold can be 20.
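The start-frame/end-frame logic of steps 802 and 803 can be sketched as a small state machine over the per-frame VAD decision parameters, for example as below; the function assumes those per-frame values are already available and uses the example counts and thresholds given in the text (30 consecutive frames, second threshold 40, third threshold 20).

```python
# Sketch of the speech start/end frame detection of steps 802-803, operating on
# a sequence of per-frame VAD decision parameters.
def find_speech_segments(vad_params, consecutive=30,
                         start_threshold=40.0, end_threshold=20.0):
    segments = []
    flag_start = False          # a speech start frame has been found
    run = 0                     # length of the current qualifying run of frames
    seg_start = None

    for k, gamma_k in enumerate(vad_params):
        if not flag_start:
            # Step 802: look for `consecutive` frames with Gamma >= start_threshold.
            run = run + 1 if gamma_k >= start_threshold else 0
            if run >= consecutive:
                seg_start = k - consecutive + 1   # first frame of the run
                flag_start, run = True, 0
        else:
            # Step 803: look for `consecutive` frames with Gamma < end_threshold.
            run = run + 1 if gamma_k < end_threshold else 0
            if run >= consecutive:
                seg_end = k - consecutive + 1     # first frame of the run
                segments.append((seg_start, seg_end))
                flag_start, run = False, 0

    return segments
```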
It can be seen that, through the method steps shown in Fig. 7, the speech start frame and the speech end frame in the audio stream to be identified can be determined, and the user speech message between them can be extracted for emotion recognition.
It should be appreciated that although some calculation coefficients, parameter initial values, and decision thresholds are introduced in the description of the embodiments of Figs. 7 and 8 above, the values of these coefficients, initial values, and thresholds can be adjusted according to the actual application scenario; the present invention does not limit their sizes.
Fig. 9 shows a flow diagram of obtaining the basic intent information from the user speech message in the speech emotion interaction method provided by an embodiment of the present invention. As shown in Fig. 9, the process of obtaining the basic intent information may include the following steps:
Step 901: match the text content of the user speech message against multiple preset semantic templates in a semantic knowledge base to determine the matched semantic template, where the correspondence between semantic templates and basic intent information is pre-established in the semantic knowledge base, and the same intent information corresponds to one or more semantic templates.
It should be appreciated that matching semantics through semantic templates (such as the semantic templates of standard questions and extended questions) is one kind of implementation; the speech text input by the user can also be fed through a network that extracts character, word, and sentence vector features (possibly with an attention mechanism) and matched or classified directly.
Step 902: obtain the basic intent information corresponding to the matched semantic template.
In an embodiment of the present invention, the text content of the user speech message can correspond to a "standard question" in the semantic knowledge base. A "standard question" is text used to represent a certain knowledge point; its main goal is clear expression and ease of maintenance. Here, "question" should not be interpreted narrowly as "inquiry" but broadly as an "input" that has a corresponding "output". When a user provides input to the intelligent interaction machine, the ideal case is that the user uses a standard question, in which case the machine's intelligent semantic recognition system immediately understands the user's meaning.
In practice, however, users often do not use standard questions but rather deformed versions of them, namely extended questions. Intelligent semantic recognition therefore requires the extended questions of the standard questions in the knowledge base: an extended question differs slightly in expression from its standard question but expresses the same meaning. Accordingly, in a further embodiment of the present invention, a semantic template is a set of one or more semantic expressions representing a certain semantic content, generated by the developer according to predetermined rules in combination with that semantic content; that is, one semantic template can describe sentences with a variety of different expressions of the corresponding semantic content, so as to cope with the possible variations of the text content of the user speech message. Matching the text content of the user message against preset semantic templates thus avoids the limitation of recognizing user messages using only a "standard question" that can describe a single way of expression.
For example, abstract semantics can be used to further abstract the generic attributes of an ontology. The abstract semantics of a category describe the different expressions of a class of abstract semantics through a set of abstract semantic expressions, which are expanded on their components so as to express more abstract semantics.
It should be appreciated that the specific content and part of speech of the semantic component words, the specific content and part of speech of the semantic rule words, and the definition and collocation of the semantic symbols can all be preset by the developer according to the specific interaction service scenario to which the speech emotion interaction method is applied; the present invention does not limit this.
In an embodiment of the present invention, the process of determining the matched semantic template from the text content of the user speech message can be realized by a similarity calculation. Specifically, multiple text similarities between the text content of the user speech message and the multiple preset semantic templates are calculated, and the semantic template with the highest text similarity is taken as the matched semantic template. One or more of the following similarity calculation methods can be used: the edit-distance method, the n-gram method, the Jaro-Winkler method, and the Soundex method. In a further embodiment, once the semantic component words and semantic rule words in the text content of the user speech message have been identified, the semantic component words and semantic rule words contained in the user speech message and in the semantic templates can also be converted into simplified text strings to improve the efficiency of the semantic similarity calculation.
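A minimal sketch of the edit-distance variant of this similarity matching is given below; the template store and the normalization are assumptions for illustration, and in practice n-gram, Jaro-Winkler, or Soundex similarities could be substituted or combined as the text suggests.

```python
# Sketch: pick the semantic template with the highest edit-distance similarity.
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def match_template(user_text, templates):
    """templates: dict mapping template string -> basic intent information."""
    def similarity(t):
        longest = max(len(user_text), len(t)) or 1
        return 1.0 - edit_distance(user_text, t) / longest
    best = max(templates, key=similarity)
    return best, templates[best]
```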
In an embodiment of the present invention, as mentioned above, a semantic template can be composed of semantic component words and semantic rule words, and these words are in turn related to their parts of speech in the semantic template and to the grammatical relations between the words. The similarity calculation process can therefore be as follows: first identify the words in the text of the user speech message, their parts of speech, and the grammatical relations between them; then identify the semantic component words and semantic rule words among them according to the parts of speech and grammatical relations; and then introduce the identified semantic component words and semantic rule words into a vector space model to calculate the multiple similarities between the text content of the user speech message and the multiple preset semantic templates. In an embodiment of the present invention, the words in the text content of the user speech message, their parts of speech, and the grammatical relations between the words can be identified by one or more of the following word segmentation methods: the hidden Markov model method, the forward maximum matching method, the reverse maximum matching method, and the named-entity recognition method.
In an embodiment of the present invention, as mentioned above, a semantic template can be a set of multiple semantic expressions representing a certain semantic content; one semantic template can then describe sentences with a variety of different expressions of that semantic content, corresponding to the multiple extended questions of the same standard question. Therefore, when calculating the semantic similarity between the text content of the user speech message and the preset semantic templates, the similarity between the text content of the user speech message and at least one extended question expanded from each of the multiple preset semantic templates needs to be calculated, and the semantic template corresponding to the extended question with the highest similarity is then taken as the matched semantic template. These expanded extended questions can be obtained according to the semantic component words and/or semantic rule words and/or semantic symbols contained in the semantic template.
Of course, the method of obtaining the basic intent information is not limited to this: the speech text input by the user can also be fed directly through a network that extracts character, word, and sentence vector features (possibly with an attention mechanism) and is matched or classified directly into the basic intent information.
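The direct vector-feature alternative mentioned here could, for example, look like the small classifier sketched below, which uses a simple character-bigram bag and cosine similarity as a stand-in for learned word/sentence embeddings; the representation and the intent examples are placeholders, not the networks actually contemplated by the patent.

```python
# Sketch: matching user text to intents directly by vector features, using a
# character-bigram bag as a stand-in for learned embeddings (with or without attention).
import numpy as np
from collections import Counter

def char_bigram_vector(text, vocab):
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return np.array([counts[b] for b in vocab], dtype=float)

def classify_intent(user_text, intent_examples):
    """intent_examples: dict mapping intent label -> example sentence."""
    vocab = sorted({s[i:i + 2]
                    for s in list(intent_examples.values()) + [user_text]
                    for i in range(len(s) - 1)})
    u = char_bigram_vector(user_text, vocab)
    best, best_sim = None, -1.0
    for intent, example in intent_examples.items():
        v = char_bigram_vector(example, vocab)
        denom = (np.linalg.norm(u) * np.linalg.norm(v)) or 1.0
        sim = float(u @ v) / denom          # cosine similarity
        if sim > best_sim:
            best, best_sim = intent, sim
    return best, best_sim
```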
It can be seen that the speech emotion interaction method provided by the embodiments of the present invention can realize an intelligent interaction mode that provides different answering services according to the user's emotional state, thus greatly improving the intelligent interaction experience. For example, when the speech emotion interaction method provided by an embodiment of the present invention is applied to a physical robot in the bank customer-service field, the user says to the physical customer-service robot by voice: "What should I do to report my credit card lost?". The physical customer-service robot receives the user speech message through its microphone, obtains the audio emotion recognition result "anxiety" by analyzing the audio data of the user speech message, and takes the audio emotion recognition result as the final emotion recognition result. The user speech message is converted into text, and the basic intent information of the customer is obtained as "report the credit card lost" (this step may also involve combining past or subsequent user speech messages with the semantic knowledge base of the banking field). Then, the emotion recognition result "anxiety" and the basic intent information "report the credit card lost" are linked together to obtain the emotion intent information "report the credit card lost; the user is very anxious, and the credit card may have been lost or stolen" (this step may also involve combining past or subsequent user speech messages with the semantic knowledge base of the banking field). The corresponding interactive instruction is then determined: the screen outputs the steps for reporting the credit card lost, while the emotion classification "comfort" is presented by voice broadcast with a high emotional intensity level; the voice broadcast to the user that satisfies this emotion instruction may use a light tone and medium speech rate: "The steps for reporting the credit card lost are shown on the screen; please do not worry. If the credit card has been lost or stolen, it will be frozen immediately after being reported lost and will not cause any damage to your property or credit...".
In an embodiment of the present invention, some application scenarios (such as bank customer service) may also take the privacy of the interaction content into account and avoid the voice-broadcast operation, instead realizing the interactive instruction as plain text or animation. This modality selection for the interactive instruction can be adjusted according to the application scenario.
It should be appreciated that the presentation of the emotion classification and the emotional intensity level in the interactive instruction can be realized by adjusting the speech rate, intonation, and the like of the voice broadcast; the present invention does not limit this.
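One way to read the bank example above is as a lookup from (emotion recognition result, basic intent) to an emotion intent and then to an interactive instruction. The sketch below makes that mapping explicit; the table contents are hypothetical illustrations, not the patent's actual rule base.

```python
# Sketch: mapping (emotion, basic intent) to an interactive instruction.
# The table contents are hypothetical illustrations of the bank example.
EMOTION_INTENT_RULES = {
    ("anxiety", "report credit card lost"):
        "report credit card lost; user is anxious, card may be lost or stolen",
}

INTERACTION_RULES = {
    "report credit card lost; user is anxious, card may be lost or stolen": {
        "screen": "show card-loss reporting steps",
        "voice_emotion": "comfort",
        "emotion_intensity": "high",
        "tts_style": {"tone": "light", "rate": "medium"},
    },
}

def decide_instruction(emotion, basic_intent):
    emotion_intent = EMOTION_INTENT_RULES.get((emotion, basic_intent))
    return INTERACTION_RULES.get(emotion_intent) if emotion_intent else None
```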
As another example, when the speech emotion interaction method provided by an embodiment of the present invention is applied in a virtual intelligent personal assistant application on an intelligent terminal, the user says to the intelligent terminal by voice: "What is the fastest route from home to the airport?". The virtual intelligent personal assistant application receives the user speech message through the microphone of the intelligent terminal and obtains the audio emotion recognition result "excitement" by analyzing the audio data of the user speech message; at the same time, the user speech message is converted into text, and the text emotion recognition result "anxiety" is obtained by analyzing the text content of the user speech message. Through logical judgment, the two emotion classifications "excitement" and "anxiety" are both taken as the emotion recognition result. By combining past or subsequent user speech messages with the semantic knowledge base of this field, the basic intent information of the customer is obtained as "obtain for the user the fastest route navigation from home to the airport". The emotion intent information obtained by linking "anxiety" with the basic intent information "obtain for the user the fastest route navigation from home to the airport" is "obtain for the user the fastest route navigation from home to the airport; the user is very anxious and may be worried about missing the flight"; the emotion intent information obtained by linking "excitement" with the basic intent information is "obtain for the user the fastest route navigation from home to the airport; the user is very excited and may be about to travel". Two kinds of emotion intent information can therefore be generated here. By further combining past or subsequent user speech messages, it is found that the user previously mentioned "My flight takes off at 11 o'clock; when do I need to leave?", so the emotion recognition result of the user is judged to be "anxiety", and the emotion intent information is "obtain for the user the fastest route navigation from home to the airport; the user is very anxious and may be worried about missing the flight". The corresponding interactive instruction is then determined: the screen outputs the navigation information, while the emotion classifications "comfort" and "warning" are presented by voice broadcast, each with a high emotional intensity level; the voice broadcast to the user that satisfies this emotion instruction may use a smooth tone and medium speech rate: "The fastest route from your home address to the airport has been planned; please follow the navigation on the screen. Under normal driving conditions you are expected to arrive at the airport within one hour, so please do not worry. Please also plan your time, drive carefully, and do not exceed the speed limit."
As yet another example, when the speech emotion interaction method provided by an embodiment of the present invention is applied in an intelligent wearable device, the user says to the device by voice while exercising: "What is the state of my heartbeat right now?". The intelligent wearable device receives the user speech message through its microphone; the audio emotion recognition result obtained by analyzing the audio data of the user speech message is the PAD three-dimensional emotion model vector (p1, a1, d1), and the text emotion recognition result obtained by analyzing the text content of the user speech message is the PAD three-dimensional emotion model vector (p2, a2, d2). Combining the audio emotion recognition result and the text emotion recognition result gives the final emotion recognition result (p3, a3, d3), which characterizes a combination of "worry" and "anxiety". At the same time, the intelligent wearable device obtains the basic intent information of the customer as "obtain the user's heartbeat data" by combining the semantic knowledge base of the medical and health field. Then, the emotion recognition result (p3, a3, d3) and the basic intent "obtain the user's heartbeat data" are linked together, and the obtained emotion intent information is "obtain the user's heartbeat data; the user is worried and may currently have uncomfortable symptoms such as a rapid heartbeat". The interactive instruction is determined according to the correspondence between emotion intent information and interactive instructions: output the heartbeat data while presenting the emotion (p6, a6, d6), i.e., a combination of "comfort" and "encouragement", each with a high emotional intensity level; at the same time start a real-time heartbeat monitoring program lasting 10 minutes, and broadcast by voice with a light tone and slow speech rate: "Your current heartbeat is 150 beats per minute; please do not worry, this is still within the normal heartbeat range. If you feel discomfort such as a rapid heartbeat, please relax and take deep breaths to adjust. Your previous health data show that your heart is functioning well, and you can enhance your cardio-pulmonary function by keeping up regular exercise." The emotional state of the user then continues to be monitored. If after 5 minutes the user says "I am not feeling well", and the emotion recognition process obtains the emotion recognition result as the three-dimensional emotion model vector (p7, a7, d7), characterizing "pain", the interactive instruction is updated again: the screen outputs the heartbeat data while the emotion (p8, a8, d8), i.e., "warning", is presented by voice broadcast with a high emotional intensity level, an alarm sound is output, and the voice broadcast uses a steady tone and slow speech rate: "Your current heartbeat is 170 beats per minute, which exceeds the normal range; please stop exercising and adjust your breathing. If you need help, please seek it via the screen."
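The combination of the audio PAD vector (p1, a1, d1) and the text PAD vector (p2, a2, d2) into a final result (p3, a3, d3) is not spelled out above; a weighted average is one plausible reading and is sketched below purely as an assumption.

```python
# Sketch: fusing audio and text PAD (pleasure-arousal-dominance) vectors by a
# weighted average; the weight is an assumption, not specified by the text.
import numpy as np

def fuse_pad(audio_pad, text_pad, audio_weight=0.5):
    audio_pad = np.asarray(audio_pad, dtype=float)
    text_pad = np.asarray(text_pad, dtype=float)
    return tuple(audio_weight * audio_pad + (1.0 - audio_weight) * text_pad)

# Example: fuse_pad((0.2, 0.7, 0.4), (0.1, 0.8, 0.3)) -> (p3, a3, d3)
```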
An embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executed by the processor, characterized in that the processor, when executing the computer program, realizes the speech emotion interaction method of any of the preceding embodiments.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, realizes the speech emotion interaction method of any of the preceding embodiments. The computer storage medium can be any tangible medium, such as a floppy disk, a CD-ROM, a DVD, a hard disk drive, or even a network medium.
It should be appreciated that although one form of implementation of the embodiments of the present invention described above may be a computer program product, the methods or apparatuses of the embodiments of the present invention can be realized in software, hardware, or a combination of software and hardware. The hardware part can be realized with dedicated logic; the software part can be stored in a memory and executed by an appropriate instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will understand that the above methods and devices can be realized using computer-executable instructions and/or processor control code, such code being provided, for example, on a carrier medium such as a disk, CD or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The methods and apparatuses of the present invention can be realized by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; they can also be realized by software executed by various types of processors, or by a combination of the above hardware circuits and software, such as firmware.
It should be noted that although several modules or units of the device are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to exemplary embodiments of the present invention, the features and functions of two or more of the modules/units described above can be realized in one module/unit; conversely, the features and functions of one module/unit described above can be further divided and realized by multiple modules/units. In addition, certain modules/units described above can be omitted in certain application scenarios.
It should be appreciated that the determiners "first", "second", "third", and the like used in the description of the embodiments of the present invention are only for a clearer elaboration of the technical solution and cannot be used to limit the scope of protection of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.
Claims (10)
1. A speech emotion interaction method, characterized by comprising:
obtaining an emotion recognition result according to a user speech message, wherein the emotion recognition result at least includes an audio emotion recognition result, or the emotion recognition result at least includes the audio emotion recognition result and a text emotion recognition result;
performing intent analysis according to the text content of the user speech message to obtain corresponding basic intent information;
determining corresponding emotion intent information according to the emotion recognition result and the basic intent information;
and determining a corresponding interactive instruction according to the emotion intent information, or determining the corresponding interactive instruction according to the emotion intent information and the basic intent information.
2. The speech emotion interaction method according to claim 1, characterized in that the interactive instruction includes one or more of the following emotion presentation modalities: a text-output emotion presentation modality, a melody-playing emotion presentation modality, a speech emotion presentation modality, an image emotion presentation modality, and a mechanical-action emotion presentation modality.
3. The speech emotion interaction method according to claim 1, characterized in that the emotion intent information includes emotion need information corresponding to the emotion recognition result; or,
the emotion intent information includes the emotion need information corresponding to the emotion recognition result and an association relation between the emotion recognition result and the basic intent information.
4. The speech emotion interaction method according to claim 1, characterized in that obtaining the emotion recognition result according to the user speech message includes:
obtaining an audio emotion recognition result according to the audio data of the user speech message, and determining the emotion recognition result according to the audio emotion recognition result;
or,
obtaining the audio emotion recognition result according to the audio data of the user speech message and obtaining a text emotion recognition result according to the text content of the user speech message, and determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result.
5. The speech emotion interaction method according to claim 4, characterized in that the audio emotion recognition result includes one or more of multiple emotion classifications; or, the audio emotion recognition result corresponds to a coordinate point in a multidimensional emotion space;
or, the audio emotion recognition result and the text emotion recognition result each include one or more of multiple emotion classifications; or, the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in the multidimensional emotion space;
wherein each dimension of the multidimensional emotion space corresponds to a psychologically defined emotion factor, and each emotion classification includes multiple emotional intensity levels.
6. The speech emotion interaction method according to claim 4, characterized in that obtaining the audio emotion recognition result according to the audio data of the user speech message includes:
extracting an audio feature vector of the user speech message, wherein the user speech message corresponds to a segment of speech in the audio stream to be identified;
matching the audio feature vector of the user speech message against multiple emotion feature models, wherein the multiple emotion feature models each correspond to one of multiple emotion classifications; and
taking the emotion classification corresponding to the matched emotion feature model, as determined by the matching result, as the emotion classification of the user speech message.
7. The speech emotion interaction method according to claim 1, characterized in that obtaining the audio emotion recognition result according to the audio data of the user speech message further includes:
determining a speech start frame and a speech end frame in the audio stream to be identified; and
extracting the portion of the audio stream between the speech start frame and the speech end frame as the user speech message.
8. The speech emotion interaction method according to claim 7, characterized in that determining the speech start frame and the speech end frame in the audio stream to be identified includes:
judging whether a speech frame in the audio stream to be identified is a voiced frame or a non-voiced frame;
after the speech end frame of the preceding speech segment, or when no first speech segment has yet been identified, when a first preset number of speech frames are consecutively judged to be voiced frames, taking the first speech frame of the first preset number of speech frames as the speech start frame of the current speech segment; and
after the speech start frame of the current speech segment, when a second preset number of speech frames are consecutively judged to be non-voiced frames, taking the first speech frame of the second preset number of speech frames as the speech end frame of the current speech segment.
9. A computer device, including a memory, a processor, and a computer program stored on the memory and executed by the processor, characterized in that the processor, when executing the computer program, realizes the steps of the method of any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, realizes the steps of the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810079429.2A CN110085221A (en) | 2018-01-26 | 2018-01-26 | Speech emotional exchange method, computer equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110085221A true CN110085221A (en) | 2019-08-02 |
Family
ID=67412786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810079429.2A Pending CN110085221A (en) | 2018-01-26 | 2018-01-26 | Speech emotional exchange method, computer equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085221A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103489453A (en) * | 2013-06-28 | 2014-01-01 | 陆蔚华 | Product emotion qualification method based on acoustic parameters |
CN103531198A (en) * | 2013-11-01 | 2014-01-22 | 东南大学 | Speech emotion feature normalization method based on pseudo speaker clustering |
CN103593054A (en) * | 2013-11-25 | 2014-02-19 | 北京光年无限科技有限公司 | Question-answering system combining emotion recognition and output |
CN105681546A (en) * | 2015-12-30 | 2016-06-15 | 宇龙计算机通信科技(深圳)有限公司 | Voice processing method, device and terminal |
CN106537294A (en) * | 2016-06-29 | 2017-03-22 | 深圳狗尾草智能科技有限公司 | Method, system and robot for generating interactive content of robot |
CN106531162A (en) * | 2016-10-28 | 2017-03-22 | 北京光年无限科技有限公司 | Man-machine interaction method and device used for intelligent robot |
CN106570496A (en) * | 2016-11-22 | 2017-04-19 | 上海智臻智能网络科技股份有限公司 | Emotion recognition method and device and intelligent interaction method and device |
CN106776936A (en) * | 2016-12-01 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | intelligent interactive method and system |
CN107516511A (en) * | 2016-06-13 | 2017-12-26 | 微软技术许可有限责任公司 | The Text To Speech learning system of intention assessment and mood |
CN107562816A (en) * | 2017-08-16 | 2018-01-09 | 深圳狗尾草智能科技有限公司 | User view automatic identifying method and device |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110931006A (en) * | 2019-11-26 | 2020-03-27 | 深圳壹账通智能科技有限公司 | Intelligent question-answering method based on emotion analysis and related equipment |
CN111026843A (en) * | 2019-12-02 | 2020-04-17 | 北京智乐瑟维科技有限公司 | Artificial intelligent voice outbound method, system and storage medium |
CN111026843B (en) * | 2019-12-02 | 2023-03-14 | 北京智乐瑟维科技有限公司 | Artificial intelligent voice outbound method, system and storage medium |
CN113035181A (en) * | 2019-12-09 | 2021-06-25 | 斑马智行网络(香港)有限公司 | Voice data processing method, device and system |
CN111179903A (en) * | 2019-12-30 | 2020-05-19 | 珠海格力电器股份有限公司 | Voice recognition method and device, storage medium and electric appliance |
CN111833907A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
CN113409778A (en) * | 2020-03-16 | 2021-09-17 | 阿里巴巴集团控股有限公司 | Voice interaction method, system and terminal |
CN111832317A (en) * | 2020-07-09 | 2020-10-27 | 平安普惠企业管理有限公司 | Intelligent information diversion method and device, computer equipment and readable storage medium |
CN111832317B (en) * | 2020-07-09 | 2023-08-18 | 广州市炎华网络科技有限公司 | Intelligent information flow guiding method and device, computer equipment and readable storage medium |
CN111563386A (en) * | 2020-07-14 | 2020-08-21 | 北京每日优鲜电子商务有限公司 | Semantic processing method and system based on artificial intelligence |
CN112002329A (en) * | 2020-09-03 | 2020-11-27 | 深圳Tcl新技术有限公司 | Physical and mental health monitoring method and device and computer readable storage medium |
CN112002329B (en) * | 2020-09-03 | 2024-04-02 | 深圳Tcl新技术有限公司 | Physical and mental health monitoring method, equipment and computer readable storage medium |
CN112233699A (en) * | 2020-10-13 | 2021-01-15 | 中移(杭州)信息技术有限公司 | Voice broadcasting method, intelligent voice device and computer readable storage medium |
CN112233699B (en) * | 2020-10-13 | 2023-04-28 | 中移(杭州)信息技术有限公司 | Voice broadcasting method, intelligent voice equipment and computer readable storage medium |
CN112767969A (en) * | 2021-01-29 | 2021-05-07 | 苏州思必驰信息科技有限公司 | Method and system for determining emotion tendentiousness of voice information |
CN113743126B (en) * | 2021-11-08 | 2022-06-14 | 北京博瑞彤芸科技股份有限公司 | Intelligent interaction method and device based on user emotion |
CN113743126A (en) * | 2021-11-08 | 2021-12-03 | 北京博瑞彤芸科技股份有限公司 | Intelligent interaction method and device based on user emotion |
CN114138960A (en) * | 2021-11-30 | 2022-03-04 | 中国平安人寿保险股份有限公司 | User intention identification method, device, equipment and medium |
CN115047824A (en) * | 2022-05-30 | 2022-09-13 | 青岛海尔科技有限公司 | Digital twin multimodal device control method, storage medium, and electronic apparatus |
CN115101074B (en) * | 2022-08-24 | 2022-11-11 | 深圳通联金融网络科技服务有限公司 | Voice recognition method, device, medium and equipment based on user speaking emotion |
CN115101074A (en) * | 2022-08-24 | 2022-09-23 | 深圳通联金融网络科技服务有限公司 | Voice recognition method, device, medium and equipment based on user speaking emotion |
CN115460317A (en) * | 2022-09-05 | 2022-12-09 | 西安万像电子科技有限公司 | Emotion recognition and voice feedback method, device, medium and electronic equipment |
CN116030811A (en) * | 2023-03-22 | 2023-04-28 | 广州小鹏汽车科技有限公司 | Voice interaction method, vehicle and computer readable storage medium |
CN117079673A (en) * | 2023-10-17 | 2023-11-17 | 青岛铭威软创信息技术有限公司 | Intelligent emotion recognition method based on multi-mode artificial intelligence |
CN117079673B (en) * | 2023-10-17 | 2023-12-19 | 青岛铭威软创信息技术有限公司 | Intelligent emotion recognition method based on multi-mode artificial intelligence |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085221A (en) | Speech emotional exchange method, computer equipment and computer readable storage medium | |
CN110085262A (en) | Voice mood exchange method, computer equipment and computer readable storage medium | |
US11373641B2 (en) | Intelligent interactive method and apparatus, computer device and computer readable storage medium | |
Hema et al. | Emotional speech recognition using cnn and deep learning techniques | |
CN110085220A (en) | Intelligent interaction device | |
Narendra et al. | Glottal source information for pathological voice detection | |
CN110085211A (en) | Speech recognition exchange method, device, computer equipment and storage medium | |
Bone et al. | Robust unsupervised arousal rating: A rule-based framework withknowledge-inspired vocal features | |
Jing et al. | Prominence features: Effective emotional features for speech emotion recognition | |
Koolagudi et al. | Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition | |
Mower et al. | Interpreting ambiguous emotional expressions | |
Gharavian et al. | Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network | |
Al-Dujaili et al. | Speech emotion recognition: a comprehensive survey | |
Origlia et al. | Continuous emotion recognition with phonetic syllables | |
Sethu et al. | Speech based emotion recognition | |
Levitan et al. | Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection. | |
CN117352000A (en) | Speech classification method, device, electronic equipment and computer readable medium | |
CN113853651B (en) | Apparatus and method for speech-emotion recognition with quantized emotion state | |
Dhar et al. | A system to predict emotion from Bengali speech | |
Kalatzantonakis-Jullien et al. | Investigation and ordinal modelling of vocal features for stress detection in speech | |
Shanmugam et al. | Understanding the Use of Acoustic Measurement and Mel Frequency Cepstral Coefficient (MFCC) Features for the Classification of Depression Speech | |
Chang | Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring | |
Singh | High level speaker specific features as an efficiency enhancing parameters in speaker recognition system | |
Bertero et al. | Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets | |
Ignatius et al. | A survey on paralinguistics in tamil speech processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190802 |
RJ01 | Rejection of invention patent application after publication |