CN111209751A - Chinese word segmentation method, device and storage medium - Google Patents

Chinese word segmentation method, device and storage medium

Info

Publication number
CN111209751A
CN111209751A
Authority
CN
China
Prior art keywords
word
vector
text
word vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010095159.1A
Other languages
Chinese (zh)
Other versions
CN111209751B (en)
Inventor
宋博川
张强
柴博
贾全烨
戴铁潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Global Energy Interconnection Research Institute
Original Assignee
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Global Energy Interconnection Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Zhejiang Electric Power Co Ltd, and Global Energy Interconnection Research Institute
Priority to CN202010095159.1A
Publication of CN111209751A
Application granted
Publication of CN111209751B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a Chinese word segmentation method, device and storage medium. The method comprises the following steps: acquiring the word vector of each character of the text; inputting each word vector into the projection layer of a long short-term memory network model to obtain the initial probability, output by the projection layer, that the word vector belongs to each target class; acquiring the text vector corresponding to a target-domain lexicon; inputting the initial probability that the word vector belongs to each target class, together with the text vector, into the conditional random field layer of the long short-term memory network model; adjusting the initial probability that the word vector belongs to each target class according to the text vector to obtain a label sequence; and obtaining the word segmentation sequence of the text according to the label sequence. By computing and adjusting the initial probabilities with the long short-term memory network model and the target-domain lexicon, the method obtains the Chinese word segmentation sequence and improves the accuracy of the segmentation result.

Description

Chinese word segmentation method, device and storage medium
Technical Field
The invention relates to the field of natural language processing, and in particular to a Chinese word segmentation method, a Chinese word segmentation device, and a storage medium.
Background
Chinese word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. In English, spaces serve as natural delimiters between words; in Chinese, explicit delimiters mark only characters, sentences, and paragraphs, and there is no formal delimiter for words.
In the related art, word segmentation is based on traditional statistical learning methods, which require manually designed rule templates and suffer from severe data sparsity, resulting in low accuracy of the segmentation results.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the low accuracy of word segmentation results in the prior art, and to provide a Chinese word segmentation method, apparatus, and storage medium.
According to a first aspect, an embodiment of the present invention provides a Chinese word segmentation method, comprising the following steps:
acquiring the word vector of each character of the text; inputting each word vector into the projection layer of a long short-term memory network model to obtain the initial probability, output by the projection layer, that the word vector belongs to each target class; acquiring the text vector corresponding to a target-domain lexicon; inputting the initial probability that the word vector belongs to each target class, together with the text vector, into the conditional random field layer of the long short-term memory network model; adjusting the initial probability that the word vector belongs to each target class according to the text vector to obtain a label sequence; and obtaining the word segmentation sequence of the text according to the label sequence.
With reference to the first aspect, in a first implementation manner of the first aspect, obtaining the word vector of each character of the text comprises: inputting the text into a first coding layer of the long short-term memory network model to obtain an initial word vector for each character of the text; and inputting each initial word vector into a second coding layer to obtain word vectors representing context, which are used as the word vectors of the characters of the text.
With reference to the first aspect, in a second implementation manner of the first aspect, the target classes comprise the head position of a multi-character word, the middle position of a multi-character word, the tail position of a multi-character word, and a single-character word.
With reference to the first aspect, in a third implementation manner of the first aspect, adjusting the initial probability that the word vector belongs to each target class according to the text vector to obtain the label of the word vector comprises: obtaining a transition probability matrix; and adjusting the initial probability that the word vector belongs to each target class according to the transition probability matrix to obtain the label of the word vector.
According to a second aspect, an embodiment of the present invention provides a Chinese word segmentation apparatus, comprising: a word vector obtaining module, configured to obtain the word vector of each character of the text; an initial probability obtaining module, configured to input each word vector into the projection layer of a long short-term memory network model to obtain the initial probability, output by the projection layer, that the word vector belongs to each target class; a text vector obtaining module, configured to obtain the text vector corresponding to a target-domain lexicon; a conditional random field layer input module, configured to input the initial probability that the word vector belongs to each target class, together with the text vector, into the conditional random field layer of the long short-term memory network model; a label obtaining module, configured to adjust the initial probability that the word vector belongs to each target class according to the text vector to obtain a label sequence; and a word segmentation sequence obtaining module, configured to obtain the word segmentation sequence of the text according to the label sequence.
With reference to the second aspect, in a first implementation manner of the second aspect, the word vector obtaining module comprises: an initial word vector obtaining module, configured to input the text into a first coding layer of the long short-term memory network model to obtain an initial word vector for each character of the text; and a word vector obtaining submodule, configured to input each initial word vector into a second coding layer to obtain word vectors representing context, which are used as the word vectors of the characters of the text.
With reference to the second aspect, in a second implementation manner of the second aspect, the target classes comprise the head position of a multi-character word, the middle position of a multi-character word, the tail position of a multi-character word, and a single-character word.
With reference to the second aspect, in a third implementation manner of the second aspect, the label obtaining module comprises: a transition probability matrix obtaining module, configured to obtain a transition probability matrix; and a label obtaining submodule, configured to adjust the initial probability that the word vector belongs to each target class according to the transition probability matrix to obtain the label of the word vector.
According to a third aspect, an embodiment of the present invention provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the Chinese word segmentation method according to the first aspect or any implementation manner of the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a storage medium on which computer instructions are stored, and the instructions, when executed by a processor, implement the steps of the Chinese word segmentation method according to the first aspect or any implementation manner of the first aspect.
The technical scheme of the invention has the following advantages:
1. The invention provides a Chinese word segmentation method/apparatus that computes and adjusts, via a long short-term memory network model and an externally introduced target-domain lexicon, the initial probability that each word vector of the input text belongs to each target class, thereby obtaining the Chinese word segmentation sequence of the text and improving the accuracy of the segmentation result.
2. The invention provides a Chinese word segmentation method/apparatus that obtains hidden-layer vectors containing context information by inputting the word vectors into a second coding layer, so that the initial probabilities subsequently computed for each target class are more accurate, further improving the accuracy of the segmentation result.
3. The invention provides a Chinese word segmentation method/apparatus that constrains the initial probabilities through a transition probability matrix and adjusts them under these constraints, thereby adjusting the word-vector labels and further improving the accuracy of Chinese word segmentation.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart illustrating a specific example of a Chinese segmentation method according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a specific example of a Chinese word segmentation apparatus according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a specific example of an electronic device in the embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; as a mechanical or electrical connection; as a direct connection or an indirect connection through an intermediate medium, or internal communication between two elements; and as a wireless or wired connection. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
An embodiment of the present application provides a Chinese word segmentation method, as shown in FIG. 1, comprising the following steps:
s110, obtaining a word vector of each word corresponding to the text.
For example, the word vector of each character of the text may be obtained by first encoding the characters, each of which has a unique id value, and then selecting the corresponding row of a preset word-embedding matrix according to the id value, yielding a word vector that the neural network can process. Assuming the id value of the character "一" ("one") is 239, the vector v239 numbered 239 in the word-embedding matrix is selected as its word vector. Alternatively, the text can be converted into word vectors by a word-embedding method: for example, the skip-gram and continuous-bag-of-words models of the word2vec tool can be trained so that they convert the text into the word vector of each character. This embodiment does not limit how the word vector of each character is obtained; those skilled in the art can determine it as needed.
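The id-to-vector lookup described above can be sketched as follows. This is a minimal illustration: the vocabulary, id values, embedding dimension, and random matrix are assumptions, not the patent's actual data; in practice the embedding matrix would come from training, e.g. word2vec.

```python
import numpy as np

# Hypothetical vocabulary mapping characters to unique ids, and a placeholder
# embedding matrix standing in for a trained one.
char_to_id = {"一": 239, "南": 0, "京": 1}
vocab_size, embedding_dim = 300, 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

def char_vector(ch: str) -> np.ndarray:
    """Select the embedding row for a character by its unique id."""
    return embedding_matrix[char_to_id[ch]]

v = char_vector("一")  # the vector numbered 239 in the embedding matrix
assert v.shape == (embedding_dim,)
```

The lookup is a plain row indexing of the matrix, which is exactly what embedding layers in neural-network toolkits do internally.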
S120, inputting each word vector into the projection layer of the long short-term memory network model to obtain the initial probability, output by the projection layer, that the word vector belongs to each target class.
Illustratively, a long short-term memory network model for word segmentation is obtained in advance from a large number of training samples; it comprises a projection layer and a conditional random field layer, where the projection layer computes the initial probability that each word vector belongs to each target class. The initial probability s may be computed from the word vector v of each character by a linear transformation: s = Wv + b, where W is a (4, h) matrix, h is the dimension of each word vector, and b is a bias term. The initial probabilities form an initial probability matrix of shape (t, 4), where t is the number of characters of the input sentence, i.e., the word vectors correspond one-to-one to the characters of the sentence. The target classes may be the head position of a multi-character word, the middle position of a multi-character word, the tail position of a multi-character word, and a single-character word. Taking the input text "南京市长江大桥建成通车" ("the Nanjing Yangtze River Bridge is completed and open to traffic") as an example, for each of the word vectors corresponding to "南", "京", "市", "长", "江", "大", "桥", "建", "成", "通" and "车", the probabilities of belonging to the head, middle and tail of a multi-character word and to a single-character word are computed, and the resulting probabilities are taken as the initial probabilities. For example, the probabilities computed for the word vector of "长" at the head, middle and tail of a multi-character word and as a single-character word are 0.3, 0.1, 0.4 and 0.2 respectively.
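The projection-layer computation s = Wv + b can be sketched as below. The shapes follow the description (W is (4, h), one score per target class); the numeric values are random placeholders, not trained weights.

```python
import numpy as np

# One word vector per character; projecting all t of them at once yields the
# (t, 4) initial-probability (score) matrix described above.
h, t = 8, 11                       # vector dimension; characters in the sentence
rng = np.random.default_rng(1)
W = rng.normal(size=(4, h))        # projection matrix, one row per target class
b = rng.normal(size=4)             # bias term
vectors = rng.normal(size=(t, h))  # word vectors of the t characters

scores = vectors @ W.T + b         # s = Wv + b applied to every character
assert scores.shape == (t, 4)      # initial probability matrix of shape (t, 4)
```

Stacking the per-character products into one matrix multiplication is the usual way such a projection layer is implemented.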
The target category is not limited in this embodiment, and can be determined by those skilled in the art as needed.
And S130, acquiring a text vector corresponding to the target field lexicon.
Illustratively, the target-domain lexicon is a lexicon for the domain of the text content, and may include common words, professional terms, newly coined words, and other word combinations of that domain. The text vector corresponding to the target-domain lexicon may be obtained in the same manner as the word vector of each character described in step S110, which is not repeated here. This embodiment does not limit how the text vector corresponding to the target-domain lexicon is obtained; those skilled in the art can determine it as needed.
S140, inputting the initial probability that the word vector belongs to each target class, together with the text vector, into the conditional random field layer of the long short-term memory network model.
Illustratively, the conditional random field layer is a probabilistic undirected graph model that adjusts the output probability that each word vector belongs to each target class, based on the input initial probabilities and the text vector.
And S150, adjusting the initial probability that the word vector belongs to each target class according to the text vector to obtain a label sequence.
Illustratively, the label of a word vector may be "B", "M", "E", or "S", where "B" denotes the first character of a multi-character word, "M" denotes a character of a multi-character word other than the first and last, "E" denotes the last character of a multi-character word, and "S" denotes a single-character word; in this embodiment, these labels correspond to the target classes of step S120. The initial probabilities are adjusted by comparing the text vectors in the target-domain lexicon with the word vector of each character of the text. When a word vector together with its adjacent word vectors matches a text vector in the target-domain lexicon, the initial probability of the word vector for the corresponding target class is increased, or a weight bias is added to it; when no such match exists, the initial probability for the corresponding target class is decreased, or the weight bias is reduced. This yields the adjusted final probability of each word vector for each target class, from which the final label sequence of the segmentation result can be computed, for example with the Viterbi algorithm. The weight bias may be a preset value that does not change during the whole computation, or the average of the hidden-vector matrix of the current input sentence, or the average of all elements of that matrix with every negative element replaced by 0.
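The lexicon-driven adjustment can be sketched as follows. This is a simplified version that only raises the "B" score of a matched word's first character and the "E" score of its last character; the lexicon, sentence, and bias value are illustrative assumptions, not the patent's data.

```python
BMES = ["B", "M", "E", "S"]

def apply_lexicon_bias(sentence, scores, lexicon, a):
    """scores: one [B, M, E, S] score list per character (mutated in place)."""
    for word in lexicon:
        start = sentence.find(word)
        while start != -1:
            scores[start][BMES.index("B")] += a                  # head of word
            scores[start + len(word) - 1][BMES.index("E")] += a  # tail of word
            start = sentence.find(word, start + 1)
    return scores

sentence = "悠然见南山"
scores = [[0.3, -0.4, 0.5, 0.6] for _ in sentence]
apply_lexicon_bias(sentence, scores, lexicon={"悠然", "南山"}, a=0.2)
assert round(scores[0][0], 10) == 0.5   # "悠": B score raised by a
assert round(scores[1][2], 10) == 0.7   # "然": E score raised by a
```

The scan mirrors the later Table 1/Table 2 example: lexicon words "悠然" and "南山" raise the "B" scores of "悠" and "南" and the "E" scores of "然" and "山".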
For example, still taking "南京市长江大桥建成通车" as the input text, there are two ways to divide "南京市长江大桥": "南京市长 / 江大桥" ("the mayor of Nanjing / Jiang Daqiao") and "南京市 / 长江大桥" ("Nanjing city / Yangtze River Bridge"). The initial probability of "长" obtained in step S120 is highest at the tail position of a multi-character word, which preliminarily divides the sentence as "南京市长 / 江大桥". However, "长江大桥" exists among the text vectors of the target-domain lexicon while "江大桥" does not, so the initial probability of "长" at the head position of a multi-character word is increased to 0.4, and its probability at the tail position is decreased to 0.3. The probabilities of "长" for the target classes then become 0.4, 0.1, 0.3 and 0.2, and the label corresponding to the target class with the highest probability, i.e. 0.4, is selected, which is "B".
For another example, take the input text "悠然见南山" ("leisurely I see the southern mountain"), let the weight bias be a, assume the initial probability that each word vector belongs to each target class, and establish the initial probability matrix shown in Table 1.
TABLE 1
      B      M      E      S
悠    0.3   -0.4    0.5    0.6
然    0.3   -0.4    0.5    0.6
见    0.3   -0.4    0.5    0.6
南    0.3   -0.4    0.5    0.6
山    0.3   -0.4    0.5    0.6
At this time, "悠然" and "南山" exist in the target-domain lexicon. "悠" and "南" are the beginning of a word, so a weight bias a is added to the "B" label of "悠" and "南" in the matrix; "然" and "山" are the end of a word, so a weight bias a is added to the "E" label of "然" and "山". The matrix with the weight bias added becomes Table 2:
TABLE 2
      B        M      E        S
悠    0.3+a   -0.4    0.5      0.6
然    0.3     -0.4    0.5+a    0.6
见    0.3     -0.4    0.5      0.6
南    0.3+a   -0.4    0.5      0.6
山    0.3     -0.4    0.5+a    0.6
In this embodiment, the magnitude of the weight bias a may be a preset value, for example 0.2, unchanged throughout the computation; or the average of the hidden-vector matrix of the current input sentence, which in this example is ((0.3 + 0.5 + 0.6) × 5 + (-0.4) × 5) / 20 = 0.25; or the average of all elements of that matrix with every negative element replaced by 0, which in this example is ((0.3 + 0.5 + 0.6) × 5 + 0 × 5) / 20 = 0.35.
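The three choices of the weight bias a described above can be checked directly; the matrix values below are those of Table 1.

```python
# Score matrix from Table 1: five characters, four target classes each.
matrix = [[0.3, -0.4, 0.5, 0.6] for _ in range(5)]
flat = [x for row in matrix for x in row]

a_preset = 0.2                                        # fixed preset value
a_mean = sum(flat) / len(flat)                        # plain average
a_mean_nonneg = sum(max(x, 0) for x in flat) / len(flat)  # negatives -> 0

assert round(a_mean, 2) == 0.25
assert round(a_mean_nonneg, 2) == 0.35
```

The two computed values reproduce the 0.25 and 0.35 of the worked example.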
If the label sequence of a sentence is y = (y1, y2, ..., yn), where each yi is the label of one character and may be any one of "B", "M", "E" and "S", then the score of the label sequence y for an input sentence x is

    score(x, y) = Σi T(y(i-1), yi) + Σi E(i, yi)

where E is the initial probability matrix with the weight bias added (i.e., the adjusted initial probabilities of the word vectors at the corresponding target classes) and T is the transition probability matrix. The final probability can be determined by a softmax over all candidate label sequences:

    P(y | x) = exp(score(x, y)) / Σy' exp(score(x, y'))

The label sequence (y1, ..., yn) of the whole sentence is computed from these formulas.
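A sketch of how a label sequence maximizing this score can be found with the Viterbi algorithm. The emission matrix E and transition matrix T below are toy values chosen for illustration, not the trained matrices of the model.

```python
LABELS = ("B", "M", "E", "S")

def viterbi(E, T):
    """Find the label sequence maximizing sum_i E[i][y_i] + sum_i T[y_{i-1}][y_i]."""
    n, k = len(E), len(LABELS)
    dp = [list(E[0])]          # best score of a path ending in each label
    back = []                  # backpointers for path recovery
    for i in range(1, n):
        row, ptr = [], []
        for j in range(k):
            best = max(range(k), key=lambda p: dp[-1][p] + T[p][j])
            row.append(dp[-1][best] + T[best][j] + E[i][j])
            ptr.append(best)
        dp.append(row)
        back.append(ptr)
    j = max(range(k), key=lambda p: dp[-1][p])
    path = [j]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [LABELS[i] for i in reversed(path)]

# A two-character word such as "南山" should decode as B, E.
E = [[0.9, 0.0, 0.0, 0.1],
     [0.1, 0.0, 0.8, 0.1]]
T = [[-1.0, 0.5, 0.5, -1.0],   # from B: M or E likely next
     [-1.0, 0.5, 0.5, -1.0],   # from M: M or E likely next
     [0.3, -1.0, -1.0, 0.3],   # from E: B or S likely next
     [0.3, -1.0, -1.0, 0.3]]   # from S: B or S likely next
assert viterbi(E, T) == ["B", "E"]
```

Because the softmax normalizer is the same for all candidate sequences, maximizing the score is equivalent to maximizing the final probability, so Viterbi over the score suffices for decoding.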
The method for adjusting the initial probability that the word vector belongs to each target category according to the text vector is not limited in this embodiment, and can be determined by those skilled in the art as needed.
And S160, obtaining a word segmentation sequence of the text according to the label sequence.
Illustratively, still taking "南京市长江大桥建成通车" and "悠然见南山" as the input texts, the finally obtained tag sequences are BMEBMMEBEBE and BESBE respectively, and according to these tag sequences, the word segmentation sequences of the texts are "南京市 / 长江大桥 / 建成 / 通车" ("Nanjing city / Yangtze River Bridge / completed / open to traffic") and "悠然 / 见 / 南山" ("leisurely / see / the southern mountain") respectively.
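Recovering the word segmentation from a BMES tag sequence, as in the two examples above, can be sketched as:

```python
def tags_to_words(chars, tags):
    """Group characters into words: a word ends at tag E (multi-char) or S (single)."""
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:                      # tolerate a truncated tag sequence
        words.append(cur)
    return words

assert tags_to_words("悠然见南山", "BESBE") == ["悠然", "见", "南山"]
assert tags_to_words("南京市长江大桥建成通车", "BMEBMMEBEBE") == [
    "南京市", "长江大桥", "建成", "通车"]
```

The cut points are fully determined by the E and S tags, which is why the label sequence suffices to reconstruct the segmentation.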
This embodiment provides a Chinese word segmentation method that computes and adjusts, via a long short-term memory network model and an externally introduced target-domain lexicon, the initial probability that each word vector of the input text belongs to each target class, thereby obtaining the Chinese word segmentation sequence of the text and improving the accuracy of the segmentation result.
As an optional implementation manner of this embodiment, step S110 includes:
firstly, inputting a text into a first coding layer of a long-term and short-term memory network model to obtain an initial word vector of each word corresponding to the text.
For example, the first coding layer may be a character coding layer, which encodes the input text into initial word vectors that the long short-term memory network can process; for the specific encoding manner, refer to step S110 above, which is not repeated here. This embodiment does not limit the first coding layer; those skilled in the art can determine it as needed.
Second, the initial word vector of each character is input into a second coding layer to obtain word vectors representing context, and these word vectors are used as the word vectors of the characters of the text.
Illustratively, the second coding layer may be a long short-term memory network coding layer, which encodes the word vectors obtained from the first coding layer into hidden-layer vectors, one per character of the input sentence; these hidden-layer vectors are used as the word vectors of the characters of the text. If the input sentence has 13 characters, there are 13 corresponding hidden-layer vectors. Each hidden-layer vector contains not only the information of a single character but also the context of that character in the sentence.
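As a rough illustration of "one context-aware hidden vector per character", the sketch below uses prefix and suffix averages as a toy stand-in for the second coding layer; the patent's layer is an actual long short-term memory network, which this does not implement.

```python
import numpy as np

def contextual_encode(char_vecs: np.ndarray) -> np.ndarray:
    """Toy contextual encoder: concatenate left-context and right-context averages."""
    t = len(char_vecs)
    left = np.cumsum(char_vecs, axis=0) / np.arange(1, t + 1)[:, None]
    right = np.cumsum(char_vecs[::-1], axis=0)[::-1] / np.arange(t, 0, -1)[:, None]
    return np.concatenate([left, right], axis=1)   # (t, 2h), like a BiLSTM output

x = np.random.default_rng(2).normal(size=(13, 8))  # 13 characters, dimension 8
h = contextual_encode(x)
assert h.shape == (13, 16)                         # one hidden vector per character
```

The point being illustrated is structural: each of the 13 output rows mixes information from the characters to its left and right, just as each hidden-layer vector of the real second coding layer does.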
In the Chinese word segmentation method provided by this embodiment, hidden-layer vectors containing context information are obtained by inputting the word vectors into the second coding layer, so that the initial probabilities subsequently computed for each target class are more accurate, further improving the accuracy of the segmentation result.
As an optional implementation manner of this embodiment, step S150 includes:
first, a transition probability matrix is obtained.
Illustratively, the transition probability matrix constrains the computation of the initial probabilities. For example, the probability that the first word vector of the input text is labeled B is highest, and the probability that the label following B is M or E is higher than the probability that it is S; a transition probability matrix is established to implement such constraints. The matrix may be obtained by random initialization and is then updated iteratively as the long short-term memory network model is trained. This embodiment does not limit how the transition probability matrix is obtained; those skilled in the art can determine it as needed.
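A transition matrix encoding such constraints can be sketched as below. The set of allowed BMES transitions follows the standard tagging scheme; using -inf for disallowed transitions is an illustrative hard-constraint variant, whereas the patent's matrix starts from random values and is refined during training.

```python
import numpy as np

labels = ["B", "M", "E", "S"]
# Allowed transitions under BMES: a word in progress (B/M) continues with M or E;
# a finished word (E/S) is followed by the start of the next word (B or S).
valid = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
         ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}

rng = np.random.default_rng(3)
T = np.full((4, 4), -np.inf)          # disallowed transitions get -inf
for i, a in enumerate(labels):
    for j, b in enumerate(labels):
        if (a, b) in valid:
            T[i, j] = rng.normal()    # random init, to be learned in training

assert np.isneginf(T[0, 0])           # B -> B disallowed
assert np.isfinite(T[0, 2])           # B -> E allowed
```

Any decoding that adds these transition scores, such as the Viterbi search, can then never produce an invalid sequence like "B S".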
And secondly, adjusting the initial probability that the word vector belongs to each target class according to the transition probability matrix to obtain the label of the word vector.
For example, the parameters of the transition probability matrix may be used as weights in computing the probability that each word vector belongs to each target class; the initial probabilities are adjusted by these weights to obtain the adjusted probability of each word vector for each target class, and for each word vector the label corresponding to the target class with the highest adjusted probability is selected as its label, yielding the label sequence.
In the Chinese word segmentation method provided by this embodiment, the transition probability matrix constrains the initial probabilities, and the initial probabilities are adjusted under these constraints, thereby adjusting the word-vector labels and further improving the accuracy of Chinese word segmentation.
An embodiment of the present application provides a Chinese word segmentation apparatus, as shown in FIG. 2, comprising:
a word vector obtaining module 210, configured to obtain a word vector of each word corresponding to the text; the specific implementation manner is shown in the corresponding part of step S110 of the method of this embodiment, and is not described herein again.
An initial probability obtaining module 220, configured to input each word vector to a projection layer of the long short-term memory network model, so as to obtain an initial probability that the word vector output by the projection layer belongs to each target class; the specific implementation manner is shown in the corresponding part of step S120 of the method of this embodiment, and is not described herein again.
A text vector obtaining module 230, configured to obtain a text vector corresponding to the target domain thesaurus; the specific implementation manner is shown in the corresponding part of step S130 of the method of this embodiment, and is not described herein again.
A conditional random field layer input module 240, configured to input the initial probability that the word vector belongs to each target class and the text vector to a conditional random field layer of the long short-term memory network model; the specific implementation manner is shown in the corresponding part of step S140 of the method of this embodiment, and is not described herein again.
A label obtaining module 250, configured to adjust an initial probability that the word vector belongs to each target category according to the text vector, so as to obtain a label of the word vector; the specific implementation manner is shown in the corresponding part of step S150 of the method of this embodiment, and is not described herein again.
And the word segmentation sequence obtaining module 260 is configured to obtain a word segmentation sequence of the text according to the label of each word vector. The specific implementation manner is shown in the corresponding part of step S160 of the method of this embodiment, and is not described herein again.
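The final step performed by a module such as 260 — turning a BMES label sequence back into a word segmentation sequence — can be sketched as follows (the function and variable names are illustrative, not the patent's):

```python
def labels_to_words(text, labels):
    """Group the characters of `text` into words according to BMES labels:
    B starts a word, M continues it, E closes it, S is a one-character word."""
    words, current = [], ""
    for ch, tag in zip(text, labels):
        if tag == "S":
            if current:            # flush an unfinished word defensively
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        else:                      # "E" closes the current word
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words
```

For example, the labels S, S, B, E over a four-character input produce two single-character words followed by one two-character word.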
This embodiment provides a Chinese word segmentation apparatus that uses a long short-term memory network model together with an externally introduced target domain lexicon to compute and adjust the initial probability that each word vector in the input text belongs to each target category, thereby obtaining the Chinese word segmentation sequence of the text and improving the accuracy of the segmentation result.
As an optional implementation manner of the present application, the word vector obtaining module 210 includes:
the initial word vector acquisition module is used for inputting the text into a first coding layer of the long short-term memory network model to obtain an initial word vector of each word corresponding to the text; the specific implementation manner is shown in the corresponding part of the method of the embodiment, and is not described herein again.
And the word vector acquisition submodule is used for inputting the initial word vector of each corresponding word into the second coding layer to obtain a word vector representing the context relationship, and taking the word vector representing the context relationship as the word vector of each word corresponding to the text. The specific implementation manner is shown in the corresponding part of the method of the embodiment, and is not described herein again.
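A toy stand-in for the two coding layers above — a character lookup table for the first layer, and a neighbour-mixing pass in place of the bidirectional LSTM for the second — can make the idea concrete. Everything here (dimensions, seeding, the mixing rule) is an illustrative assumption, not the patent's actual network:

```python
import random

EMB_DIM = 4
_table = {}

def embed(ch):
    """First coding layer: a fixed vector per distinct character,
    seeded deterministically from the character's code point."""
    if ch not in _table:
        rng = random.Random(ord(ch))
        _table[ch] = [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]
    return _table[ch]

def contextualize(text):
    """Second coding layer: mix each character's vector with its
    neighbours, standing in for the left/right context a bidirectional
    LSTM pass would capture."""
    vecs = [embed(c) for c in text]
    out = []
    for i, v in enumerate(vecs):
        neigh = [vecs[j] for j in (i - 1, i + 1) if 0 <= j < len(vecs)]
        if neigh:
            ctx = [sum(col) / len(neigh) for col in zip(*neigh)]
            out.append([(a + b) / 2.0 for a, b in zip(v, ctx)])
        else:
            out.append(list(v))
    return out
```

The point of the second layer is visible in the output: the same character receives different vectors in different contexts, which is what lets the downstream layers disambiguate word boundaries.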
As an optional implementation manner of this embodiment, the target categories include the first position of a multi-character word, the middle position of a multi-character word, the last position of a multi-character word, and a single-character word. The specific implementation manner is shown in the corresponding part of the method of the embodiment, and is not described herein again.
As an optional implementation manner of this embodiment, the tag obtaining module 250 includes:
the transition probability matrix acquisition module is used for acquiring a transition probability matrix; the specific implementation manner is shown in the corresponding part of the method of the embodiment, and is not described herein again.
And the label obtaining submodule is used for adjusting the initial probability that the word vector belongs to each target category according to the transition probability matrix to obtain the label of the word vector. The specific implementation manner is shown in the corresponding part of the method of the embodiment, and is not described herein again.
The embodiment of the present application also provides an electronic device, as shown in fig. 3, including a processor 310 and a memory 320, where the processor 310 and the memory 320 may be connected by a bus or in other manners.
Processor 310 may be a Central Processing Unit (CPU). The Processor 310 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or any combination thereof.
The memory 320, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the Chinese word segmentation method in the embodiments of the present invention. The processor 310 executes the various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory 320.
The memory 320 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 320 may optionally include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 320 and, when executed by the processor 310, perform the Chinese word segmentation method in the embodiment shown in fig. 1.
The details of the electronic device may be understood with reference to the corresponding related description and effects in the embodiment shown in fig. 1, and are not described herein again.
The embodiment also provides a computer storage medium, wherein the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the Chinese word segmentation method in any method embodiment. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A Chinese word segmentation method is characterized by comprising the following steps:
acquiring a word vector of each word corresponding to the text;
inputting each word vector into a projection layer of a long short-term memory network model to obtain the initial probability that the word vector output by the projection layer belongs to each target class;
acquiring a text vector corresponding to a target field word stock;
inputting the initial probability that the word vector belongs to each target category and the text vector into a conditional random field layer of the long short-term memory network model;
adjusting the initial probability that the word vector belongs to each target category according to the text vector to obtain a label sequence;
and obtaining word segmentation sequences of the text according to each label sequence.
2. The method of claim 1, wherein the obtaining a word vector for each word corresponding to text comprises:
inputting the text into a first coding layer of the long short-term memory network model to obtain an initial word vector of each word corresponding to the text;
and inputting the initial word vector of each corresponding word into a second coding layer to obtain a word vector representing a context relationship, and taking the word vector representing the context relationship as the word vector of each word corresponding to the text.
3. The method of claim 1, wherein the target categories include the first position of a multi-character word, the middle position of a multi-character word, the last position of a multi-character word, and a single-character word.
4. The method of claim 1, wherein adjusting the initial probability that the word vector belongs to each target class according to the text vector to obtain the label of the word vector comprises:
obtaining a transition probability matrix;
and adjusting the initial probability that the word vector belongs to each target category according to the transition probability matrix to obtain the label of the word vector.
5. A Chinese word segmentation device is characterized by comprising:
the word vector acquisition module is used for acquiring a word vector of each word corresponding to the text;
the initial probability acquisition module is used for inputting each word vector to a projection layer of the long short-term memory network model to obtain the initial probability that the word vector output by the projection layer belongs to each target class;
the text vector acquisition module is used for acquiring text vectors corresponding to the target field lexicon;
the conditional random field layer input module is used for inputting the initial probability that the word vector belongs to each target class and the text vector to a conditional random field layer of the long short-term memory network model;
the label obtaining module is used for adjusting the initial probability that the word vector belongs to each target category according to the text vector to obtain a label of the word vector;
and the word segmentation sequence acquisition module is used for acquiring the word segmentation sequence of the text according to the label of each word vector.
6. The apparatus of claim 5, wherein the word vector obtaining module comprises:
an initial word vector obtaining module, configured to input the text to a first coding layer of the long short-term memory network model, so as to obtain an initial word vector of each word corresponding to the text;
and the word vector acquisition submodule is used for inputting the initial word vector of each corresponding word into a second coding layer to obtain a word vector representing the context relationship, and taking the word vector representing the context relationship as the word vector of each word corresponding to the text.
7. The apparatus of claim 5, wherein the target categories comprise the first position of a multi-character word, the middle position of a multi-character word, the last position of a multi-character word, and a single-character word.
8. The apparatus of claim 5, wherein the tag acquisition module comprises:
the transition probability matrix acquisition module is used for acquiring a transition probability matrix;
and the label obtaining submodule is used for adjusting the initial probability that the word vector belongs to each target category according to the transition probability matrix to obtain the label of the word vector.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the Chinese word segmentation method according to any one of claims 1-4 are implemented when the program is executed by the processor.
10. A storage medium having stored thereon computer instructions, which when executed by a processor, carry out the steps of the Chinese word segmentation method according to any one of claims 1 to 4.
CN202010095159.1A 2020-02-14 2020-02-14 Chinese word segmentation method, device and storage medium Active CN111209751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010095159.1A CN111209751B (en) 2020-02-14 2020-02-14 Chinese word segmentation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010095159.1A CN111209751B (en) 2020-02-14 2020-02-14 Chinese word segmentation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111209751A true CN111209751A (en) 2020-05-29
CN111209751B CN111209751B (en) 2023-07-28

Family

ID=70790013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010095159.1A Active CN111209751B (en) 2020-02-14 2020-02-14 Chinese word segmentation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111209751B (en)


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886516A (en) * 2017-02-27 2017-06-23 竹间智能科技(上海)有限公司 The method and device of automatic identification statement relationship and entity
US20170229115A1 (en) * 2014-12-08 2017-08-10 Samsung Electronics Co., Ltd. Method and apparatus for training language model and recognizing speech
US20170364766A1 (en) * 2014-12-22 2017-12-21 Gonzalo Vaca First-Person Camera Based Visual Context Aware System
CN107734131A (en) * 2016-08-11 2018-02-23 中兴通讯股份有限公司 A kind of short message sorting technique and device
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment
CN109002436A (en) * 2018-07-12 2018-12-14 上海金仕达卫宁软件科技有限公司 Medical text terms automatic identifying method and system based on shot and long term memory network
CN109493977A (en) * 2018-11-09 2019-03-19 天津新开心生活科技有限公司 Text data processing method, device, electronic equipment and computer-readable medium
CN109558583A (en) * 2017-09-27 2019-04-02 株式会社理光 A kind of method, device and equipment automatically generating digest
KR20190065665A (en) * 2017-12-04 2019-06-12 주식회사 솔루게이트 Apparatus and method for recognizing Korean named entity using deep-learning
CN110263325A (en) * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋信用卡中心 Chinese automatic word-cut
CN110276052A (en) * 2019-06-10 2019-09-24 北京科技大学 A kind of archaic Chinese automatic word segmentation and part-of-speech tagging integral method and device
US20200050667A1 (en) * 2018-08-09 2020-02-13 CloudMinds Technology, Inc. Intent Classification Method and System

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZENGJIAN LIU et al.: "Chinese Clinical Entity Recognition via Attention-Based CNN-LSTM-CRF", 2018 IEEE International Conference on Healthcare Informatics Workshop *
Si Nianwen; Wang Hengjun; Li Wei; Shan Yidong; Xie Pengcheng: "Chinese Part-of-Speech Tagging Model Based on Attention-Based Long Short-Term Memory Network", Computer Science, no. 04
Tang Buzhou; Wang Xiaolong; Wang Xuan; Zhang Qiang: "Research on Evaluation Methods for Sentence-Level Chinese Character Pinyin Input Technology", Journal of Chinese Information Processing, no. 05

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052670A (en) * 2020-08-28 2020-12-08 丰图科技(深圳)有限公司 Address text word segmentation method and device, computer equipment and storage medium
CN112052670B (en) * 2020-08-28 2024-04-02 丰图科技(深圳)有限公司 Address text word segmentation method, device, computer equipment and storage medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short

Also Published As

Publication number Publication date
CN111209751B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
US11928439B2 (en) Translation method, target information determining method, related apparatus, and storage medium
CN106502985B (en) neural network modeling method and device for generating titles
CN111144110B (en) Pinyin labeling method, device, server and storage medium
CN111859964B (en) Method and device for identifying named entities in sentences
US11232263B2 (en) Generating summary content using supervised sentential extractive summarization
CN110569505B (en) Text input method and device
CN112329476B (en) Text error correction method and device, equipment and storage medium
EP3732629A1 (en) Training sequence generation neural networks using quality scores
CN110472062B (en) Method and device for identifying named entity
CN113673228B (en) Text error correction method, apparatus, computer storage medium and computer program product
US10878201B1 (en) Apparatus and method for an adaptive neural machine translation system
US11227110B1 (en) Transliteration of text entry across scripts
US20180173689A1 (en) Transliteration decoding using a tree structure
CN112417878B (en) Entity relation extraction method, system, electronic equipment and storage medium
TW201544976A (en) Natural language processing system, natural language processing method, and natural language processing program
CN111209751A (en) Chinese word segmentation method, device and storage medium
CN110489727B (en) Person name recognition method and related device
CN109858031B (en) Neural network model training and context prediction method and device
CN111274793B (en) Text processing method and device and computing equipment
US10810380B2 (en) Transliteration using machine translation pipeline
CN112784611A (en) Data processing method, device and computer storage medium
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium
CN112749551A (en) Text error correction method, device and equipment and readable storage medium
CN117034916A (en) Method, device and equipment for constructing word vector representation model and word vector representation
CN113505587B (en) Entity extraction method, related device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant