CN1102779C - Simplified Chinese character-the original complex form changingover apparatus - Google Patents

Simplified Chinese character-the original complex form changingover apparatus

Info

Publication number
CN1102779C
Authority
CN
China
Prior art keywords
chinese
complex form
word
conversion
chinese characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN96103701A
Other languages
Chinese (zh)
Other versions
CN1134568A (en
Inventor
郭俊桔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd
Publication of CN1134568A
Application granted
Publication of CN1102779C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a device that accurately converts between simplified and traditional Chinese characters. In addition to information on the characters and their readings, the device refers to a word conversion dictionary. Specifically, a word conversion part records the source document and the language flag supplied by an input part in a buffer and, referring to the language flag and the word conversion dictionary, replaces words in the source document with their corresponding words. A character-to-reading conversion part refers to the language flag recorded in the buffer and to a system dictionary, and converts the characters of the source document into reading symbols. A reading-to-character conversion part refers to the language flag, the system dictionary and a simplified/traditional character correspondence dictionary, and converts the reading symbols of the source document into the characters of a target document according to a reading-to-character conversion algorithm. The result is output from an output part.

Description

Chinese simplified/traditional character document conversion device
Technical field
The present invention relates to a device that converts between documents written in simplified Chinese characters and documents written in traditional Chinese characters.
Background technology
In recent years, exchanges between Mainland China and Taiwan have become frequent, and the traffic in documents between the two has grown accordingly. Because the two sides had no contact for about 40 years, however, they differ not only in the character forms they use but also considerably in words and vocabulary (the everyday usage of single or multiple characters), so each side finds the other's documents hard to understand. For example, Taiwan and the Mainland use different words for "laser printer". The need to convert between the simplified-character documents used on the Mainland and the traditional-character documents used in Taiwan has therefore grown greatly, and such conversion has to overcome the following technical and linguistic difficulties.
(1) About 8,000 simplified characters are in common use, and the traditional characters in common use are more numerous. For example, 13,053 traditional characters are in common use in the computer field. Consequently, several traditional characters may correspond to one simplified character: the simplified character 后 (hou, "back") corresponds to both 後 ("back") and 后 in traditional characters. When a simplified-character document is converted into a traditional-character document, the appropriate traditional character must therefore be selected.
(2) Owing to social and cultural differences, words and their usage also differ. For example, the everyday word for "level" (as in level of technical skill) is one word on the Mainland and a different word in Taiwan.
Devices have therefore been developed to convert between documents written in simplified characters and documents written in traditional characters.
One prior-art Chinese simplified/traditional document conversion device is described in "A Text Conversion System Between Simplified and Complex Chinese Characters Based on OCR Approaches", pages 187 to 201 of the proceedings of the 7th Computational Linguistics conference of the Republic of China, 1994. That paper describes an example of a device that converts simplified-character documents into traditional-character documents. Fig. 1 shows the structure of the device. In the figure, reference numeral 100 is a source document input unit that accepts a simplified- or traditional-character document as the source document. 300 is a character frequency table storing the usage frequency of each character. 350 is a feature database storing the feature values of each character. 200 is a character segmentation unit that extracts characters from image data. 210 is a feature extraction unit that computes and extracts the feature values of the segmented characters from the image data. 220 is a matching unit that matches characters by referring to the character frequency table 300 and the feature database 350. 500 is a word dictionary storing words. 510 is a basic dictionary storing information on ordinary words. 520 is a code correspondence table storing the character codes of simplified and traditional characters. 530 is a bigram table storing the connection frequencies between adjacent characters, obtained in advance from statistics. 400 is a language processing unit that converts candidate characters or words into a word lattice (a network structure built from the candidate characters and words being processed); it contains units for word conversion and for character correction. 420 is a character and word conversion unit that searches for the best conversion path. 410 is a character correction unit that corrects misrecognized characters, for example manually. 600 is an output unit that outputs the target document obtained by the conversion.
Taking the simplified-character document of Fig. 2(a) as an example, the following describes the sequence by which the above prior-art device converts simplified characters into traditional characters.
When the simplified-character document of Fig. 2(a) is supplied to the source document input unit 100, it is read in as an image by an OCR (optical character reader). The character segmentation unit 200 extracts the image of each character, and the feature extraction unit 210 computes the feature values of each character. The matching unit 220 then detects candidate simplified characters by referring to the character frequency table 300 and the feature database 350. Character conversion and post-processing follow. First, the language processing unit 400 refers to the code correspondence table 520 and takes out the candidate characters corresponding to each character, as shown in Fig. 2(b). Using combinations of the retrieved candidate target characters as search keys, it searches the word dictionary 500 and the basic dictionary 510 and takes out the candidate words shown in Fig. 2(c). The retrieved candidate words are organized into the word lattice shown in Fig. 2(d). The character and word conversion unit 420 refers to the bigram table 530 (the usage frequencies of adjacent pairs of characters or words in a corpus) and, according to a statistical bigram Markov language model, takes from the lattice the best path shown in Fig. 2(e), that is, the conversion path with the highest probability, which the output unit 600 then outputs.
The method of obtaining the best conversion path is briefly as follows.
Using the bigram statistics $P(C_i \mid C_{i-1})$ and $P(C_i \mid S_i)$, the path that maximizes the following function is found.
\[ \max P(C \mid S) = P(C_1, C_2, \ldots, C_N \mid S_1, S_2, \ldots, S_N) \rightarrow \sum_{i=1}^{N} P(C_i \mid C_{i-1})\, P(C_i \mid S_i) \]
$S$: the sets of candidate characters; for example, the candidate set for the first character read is $S_1$.
$C$: the elements of the candidate sets $S$; for example, $C_1$ denotes the first candidate character of $S_1$.
$P(C_i \mid C_{i-1})$: the probability, according to the bigram table, that a character is $C_i$ when the character before it is $C_{i-1}$.
The same method is applied to words with the following formula.
\[ \rightarrow \sum_{i=1}^{N} P(W_i \mid W_{i-1})\, P(C_i \mid S_i) \]
$P(W_i \mid W_{i-1})$: the probability, according to the bigram table, that a word is $W_i$ when the word before it is $W_{i-1}$.
The above is explained in Section 6 of the article "n-Grams and their implication to natural language understanding" by E.J. Yannakoudakis and P.J. Hutton, Pattern Recognition, vol. 23, no. 5, pages 509 to 528 (1990), and the algorithm for Markov processes is explained on page 96 of "Introduction | OR Lecture", published by Modern Mathematics Society in 1981. Both are well-known techniques, so their explanation is omitted.
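The best-path search over a candidate lattice can be sketched as a small Viterbi search. This is a minimal illustration and not the cited paper's implementation: it assumes that a path is scored by summing the logarithms of the bigram and candidate probabilities, and the function name `best_path`, the `floor` fallback probability and all probability values are invented for the example.

```python
import math


def best_path(candidates, bigram, emit, floor=1e-6):
    """Return the highest-scoring candidate sequence through the lattice.

    candidates: one list of candidate characters per position (S_1 .. S_N).
    bigram:     {(previous, current): P(current | previous)}
    emit:       {(position, current): P(C_i | S_i)}
    floor:      assumed fallback probability for entries missing from a table.
    """
    def log_p(table, key):
        return math.log(table.get(key, floor))

    # score[c] is the best log-score of any path that ends in candidate c.
    score = {c: log_p(emit, (0, c)) for c in candidates[0]}
    back_pointers = []
    for i in range(1, len(candidates)):
        new_score, pointers = {}, {}
        for cur in candidates[i]:
            # Best predecessor for this candidate.
            prev = max(score, key=lambda p: score[p] + log_p(bigram, (p, cur)))
            new_score[cur] = (score[prev] + log_p(bigram, (prev, cur))
                              + log_p(emit, (i, cur)))
            pointers[cur] = prev
        score = new_score
        back_pointers.append(pointers)

    # Trace the best path back from the best final candidate.
    path = [max(score, key=score.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy lattice: the simplified character 后 has two traditional candidates.
lattice = [["皇"], ["后", "後"], ["的"]]
bigram = {("皇", "后"): 0.8, ("皇", "後"): 0.01,
          ("后", "的"): 0.3, ("後", "的"): 0.3}
print("".join(best_path(lattice, bigram, emit={})))
```

With these invented numbers the search keeps 后 after 皇. A real table would be estimated from a corpus, which is the cost that problem (3) of the prior art refers to.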
Fig. 3 shows the flow of the above sequence.
The prior art described above has the following problems:
(1) Characters misrecognized by the OCR are hard to correct in post-processing; for example, the simplified character for "in" is often misrecognized as the character for "do".
(2) Because only character information is used, polyphonic characters cannot be handled effectively, which limits the accuracy of conversion. A "polyphonic character" here is a Chinese character with more than one reading. As a concrete example, take the simplified-character sentence 他干干净的工作 ("he does clean work"). The character 干 has the readings "gan1" and "gan4"; in traditional characters, 乾 corresponds to 干 read "gan1" and 幹 corresponds to 干 read "gan4". The prior art treats 干干 as (gan1)(gan1) and therefore generally converts the sentence to 他乾乾淨的工作 instead of the correct 他幹乾淨的工作.
(3) Because writing styles and subject matter vary, it is difficult to obtain a broad and reasonably balanced Chinese corpus, so extracting bigrams takes many man-hours and a great deal of money. Moreover, if the corpus is unbalanced, the extracted bigram table degrades the accuracy of word conversion. The corpus here means a database relating to the occurrence probabilities of adjacent characters and words.
A device that converts between simplified-character and traditional-character documents cheaply and accurately is therefore needed.
Summary of the invention
The object of the present invention is to solve these problems.
To achieve this object, the Chinese simplified/traditional document conversion device of the present invention distinguishes simplified from traditional characters by a predetermined language flag, for example "00" for the former and "01" for the latter. Using character and reading information, the device converts a source document written in simplified or traditional characters into a target document written in the other form. The device comprises: a word conversion dictionary that stores, in groups (including what are in substance groups), simplified-character vocabulary (single or multiple characters, idioms, expressions and the like) and the corresponding traditional-character vocabulary; a system dictionary that stores reading symbols (including phonetic symbols in their various notations) and the simplified and traditional characters (single or multiple) or words (including phrases, set expressions and the like) corresponding to them; a simplified/traditional character correspondence dictionary that stores simplified characters and their corresponding traditional characters in groups; a word conversion unit that searches the word conversion dictionary with the vocabulary of the source document and rewrites that vocabulary with the appropriate equivalents; a character-to-reading conversion unit that, referring to the system dictionary, converts the characters of the document produced by the word conversion unit into reading symbols (including the various phonetic notations); and a reading-to-character conversion unit that, referring to the system dictionary and the simplified/traditional character correspondence dictionary, converts the reading symbols into the characters of the target document in the other form according to a set reading-to-character conversion algorithm.
Further, in the above device, the system dictionary has a non-polyphonic section that stores the non-polyphonic simplified and traditional characters and words and a polyphonic section that stores the polyphonic ones (including the case where both kinds are held in the same memory and distinguished by a flag). The reading-to-character conversion unit has a longest-match conversion unit that uses, as one conversion algorithm, the longest match method (possibly together with other methods such as morphological analysis), which gives priority first to the candidate made up of the most syllables (characters) and second to syllables that already exist.
Further, in the above device, the reading-to-character conversion unit has a frequency-based conversion unit (possibly among others) that uses, as one conversion algorithm, preferential conversion to the characters and words with high usage frequency.
Further still, in the above device, the frequency-based conversion unit has a content-dependent frequency switching unit that, at conversion time, switches the conversion algorithm or the frequency table to suit the source document, for example according to its field or to whether it is a technical or a literary document.
With this structure, after the user supplies the source document (through an OCR, a disk drive or the like) and a language flag indicating simplified or traditional (including any indication or input that is in substance the same), the word conversion unit searches the word conversion dictionary, which stores simplified-character vocabulary and the corresponding traditional-character vocabulary in advance, with the vocabulary of the simplified (traditional) source document and replaces that vocabulary with the appropriate corresponding words. The system dictionary stores in advance the reading symbols and the simplified and traditional characters or words corresponding to them. The character-to-reading conversion unit refers to the system dictionary and converts the characters of the source document into reading symbols. The simplified/traditional character correspondence dictionary stores in advance the simplified characters and their corresponding traditional characters in groups. The reading-to-character conversion unit refers to the system dictionary and the correspondence dictionary and, by the set reading-to-character conversion algorithm, converts the reading symbols into the characters of the traditional (simplified) target document.
Further, in the above device, simplified and traditional characters are distinguished by the language flag. Whether simplified or traditional, the system dictionary stores Chinese characters and words separately, non-polyphonic ones in the non-polyphonic section and polyphonic ones in the polyphonic section (including arrangement in character-code order and division by other flags). The reading-to-character conversion unit converts using the longest match method as its conversion algorithm.
Further, in the above device, the reading-to-character conversion unit converts preferentially to characters and words with high usage frequency.
Further still, in the above device, the content-dependent frequency switching unit in the frequency-based conversion unit switches the conversion algorithm or the frequency table used for conversion according to the content of the source document.
The invention is described below by way of an embodiment.
Description of drawings
Fig. 1 shows an example of the structure of a prior-art Chinese simplified/traditional document conversion device.
Fig. 2 shows the processing flow of the prior-art Chinese simplified/traditional document conversion device.
Fig. 3 shows a processing example of the prior-art Chinese simplified/traditional document conversion device.
Fig. 4 is a structural diagram of an embodiment of the Chinese simplified/traditional document conversion device of the present invention.
Fig. 5 is a conceptual diagram of the data structure of the word conversion dictionary in the embodiment.
In this figure, (a) is used when a simplified-character document is converted into a traditional-character document.
(b) is used when a traditional-character document is converted into a simplified-character document.
Fig. 6 is a flowchart showing mainly the operation of the word conversion unit in the embodiment.
Fig. 7 is a conceptual diagram of the data structure of the system dictionary in the embodiment.
Fig. 8 is a flowchart showing mainly the operation of the character-to-reading conversion unit in the embodiment.
Fig. 9 conceptually shows the data structure of the simplified/traditional character correspondence dictionary in the embodiment.
Fig. 10 is a flowchart showing mainly the operation of the reading-to-character conversion unit in the embodiment.
Fig. 11 shows, for a concrete example of converting a simplified-character document into a traditional-character document in the embodiment, the characters and words taken out by step (c2) of Fig. 10 in the reading-to-character conversion unit.
Fig. 12 shows the characters corresponding to the reading of each character, taken out by step (c3) of Fig. 10.
Fig. 13 shows the characters corresponding to the reading of each character that remain after step (c4) of Fig. 10.
Embodiment
Fig. 4 is a block diagram of one embodiment of the Chinese simplified/traditional document conversion device according to the present invention.
In the figure, reference numeral 10 is an input unit that accepts the source document and the language flag through an OCR, a disk drive or the like, connected to the outside for example by a communication line. 20 is a word conversion dictionary storing a table of corresponding vocabulary between Mainland China and Taiwan of the kind shown in Fig. 5. 30 is a word conversion unit that, referring to the language flag and the word conversion dictionary, rewrites the vocabulary of the source document that matches a headword into the corresponding word. Fig. 6 shows the processing flow of the word conversion unit 30 and is explained in detail below. 50 is a system dictionary storing each reading symbol together with the corresponding simplified characters and words and traditional characters and words (including polyphonic ones). As shown in Fig. 7, the system dictionary 50 stores the characters and words separately for simplified and traditional. 60 is a character-to-reading conversion unit that converts the characters of the source document into reading symbols by referring to the language flag and the system dictionary 50. Fig. 8 shows the processing flow of the conversion unit 60 and is also explained in detail below. 80 is a simplified/traditional character correspondence dictionary storing the correspondence between simplified and traditional characters. 70 is a reading-to-character conversion unit that converts the reading symbols of the source document into the characters of the target document by referring to the correspondence dictionary 80. Fig. 10 shows the processing flow of the reading-to-character conversion unit 70 and is explained in detail below. 90 is an output unit that outputs the target document resulting from the conversion. 40 is a buffer that temporarily stores processing results.
A character correction unit for final proofreading by a translator is also provided; it is a well-known technique, so its illustration and explanation are omitted.
The operation of the embodiment is described below with reference to the flowcharts.
First, the operation of the word conversion unit 30 shown in Fig. 6 is described.
(a1) The language flag and the source document supplied through the input unit 10 are recorded in the buffer 40.
(a2) Referring to the language flag recorded in the buffer 40, headwords and their replacement words are taken out of the word conversion dictionary 20 in order.
(a3) It is judged whether all headwords have been taken out.
If headwords remain, processing proceeds to (a4), where the headword is used as a search key and matched against the characters of the source document so that matching vocabulary is rewritten, and then returns to (a2).
If no headwords remain, the processing of the word conversion unit 30 ends and processing passes to the character-to-reading conversion unit 60.
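Steps (a1) to (a4) of the word conversion unit 30 can be sketched as a dictionary-driven replacement loop. This is a minimal sketch under stated assumptions: the dictionary layout keyed by language flag, the name `convert_words`, and the single entry pair are hypothetical, and the replacement word is assumed to stay in the script of the source document so that the later reading conversion can still look it up.

```python
# Hypothetical word conversion dictionary keyed by the language flag of the
# source document ("00" = simplified, "01" = traditional, as in the text).
# The entries are illustrative only.
WORD_CONVERSION_DICT = {
    "00": {"软件": "软体"},   # Mainland headword -> Taiwanese equivalent
    "01": {"軟體": "軟件"},   # Taiwanese headword -> Mainland equivalent
}


def convert_words(source, language_flag, dictionary=WORD_CONVERSION_DICT):
    buffer = source                       # (a1) record the source text
    table = dictionary[language_flag]     # (a2) choose the table by flag
    # Longer headwords are tried first so that a short headword cannot
    # break up a longer one (a design choice of this sketch).
    for headword in sorted(table, key=len, reverse=True):    # (a2)/(a3) loop
        buffer = buffer.replace(headword, table[headword])   # (a4) match, rewrite
    return buffer


print(convert_words("他尚未使用软件", "00"))
```

Because replacements are applied one after another, a replacement word that itself contains a later headword would be rewritten again; a production dictionary would need to guard against that.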
Next, the operation of the character-to-reading conversion unit 60 shown in Fig. 8 is described.
(b1) The characters of the source document as converted by the word conversion unit 30 are input.
(b2) The characters of the source document are cut out syllable by syllable and recorded in the buffer 40, segmented at delimiters (for example the commas and full stops in the text).
(b3) Each syllable recorded in the buffer 40 is taken out in turn and, referring to the system dictionary 50, the non-polyphonic characters are converted into reading symbols first (unambiguous characters take priority) and recorded in the buffer 40.
(b4) For the text recorded in the buffer 40 that contains polyphonic characters, the polyphonic section of the system dictionary 50 is consulted and each polyphonic character is converted into the appropriate reading symbol.
(b5) Referring to the characters of the source document in the buffer 40, the reading symbol of each character in the buffer 40 is corrected according to Chinese grammar. For example, 妈 ("mother") is read "ma1", but the second 妈 of 妈妈 is read not in tone 1 (the highest of the four tones) but in the neutral tone (a weak, light pronunciation in which the syllable loses its original tone in connected speech), that is, "ma0", so the reading symbol of the second 妈 is corrected.
Character-to-reading conversion is a well-known technique, disclosed for example in Japanese Laid-Open Patent Publication No. 4-238397, published on August 26, 1992, so its explanation is omitted.
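Steps (b2) to (b5) can be sketched as follows. This is a rough illustration, not the method of the cited publication: the tables `NON_POLYPHONIC`, `POLYPHONIC` and `NEUTRAL_WHEN_REPEATED` are hypothetical stand-ins for the system dictionary 50, the readings use plain pinyin with tone digits, and the rule "look for a known context word, else fall back to a default reading" is an assumed way of choosing the reading of a polyphonic character.

```python
import re

# Illustrative stand-in for the non-polyphonic section of the system dictionary.
NON_POLYPHONIC = {"他": "ta1", "净": "jing4", "的": "de0",
                  "工": "gong1", "作": "zuo4", "妈": "ma1"}
# Polyphonic section: context words that fix a reading, plus an assumed default.
POLYPHONIC = {"干": {"words": {"干净": "gan1"}, "default": "gan4"}}
# Characters whose second copy in a reduplicated word takes the neutral tone.
NEUTRAL_WHEN_REPEATED = {"妈"}


def resolve_polyphonic(segment, i):
    """Pick the reading of the polyphonic character at index i."""
    entry = POLYPHONIC[segment[i]]
    for word, reading in entry["words"].items():
        # Does `word` occur in the segment at a position that covers index i?
        for start in range(max(0, i - len(word) + 1), i + 1):
            if segment.startswith(word, start):
                return reading
    return entry["default"]


def to_readings(text):
    readings = []
    for segment in re.split(r"[，。,.]", text):                 # (b2) delimiters
        symbols = [NON_POLYPHONIC.get(ch) for ch in segment]    # (b3) unambiguous first
        for i, ch in enumerate(segment):                        # (b4) polyphonic
            if ch in POLYPHONIC:
                symbols[i] = resolve_polyphonic(segment, i)
        for i in range(1, len(segment)):                        # (b5) tone correction
            if segment[i] == segment[i - 1] and segment[i] in NEUTRAL_WHEN_REPEATED:
                symbols[i] = symbols[i][:-1] + "0"
        readings.extend(symbols)
    return readings


print(to_readings("他干干净的工作"))
print(to_readings("妈妈"))
```

Characters missing from both tables come back as `None` in this sketch; a real system dictionary would have to cover them or report them.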
When the processing of the character-to-reading conversion unit 60 is finished, processing passes to the reading-to-character conversion unit 70 shown in Fig. 10, which is described below.
(c1) The reading symbols of the source document obtained by the character-to-reading conversion unit 60 are input.
(c2) Referring to the system dictionary 50, every syllable sequence that the reading symbols can form is cut out, and the candidate characters and words corresponding to each are recorded in the buffer 40.
(c3) Using the characters of the source document recorded in the buffer 40 as search keys, the characters corresponding to each source character are taken out of the simplified/traditional character correspondence dictionary 80 and stored in the buffer 40.
(c4) The candidate characters and words for each syllable are taken out of the buffer 40 and compared with the characters in the buffer 40 that correspond to each reading symbol; unsuitable candidates, that is, characters and words of low likelihood, are deleted.
(c5) Using the longest match method, suitable characters are selected from the candidate characters and words, which completes the operation of the reading-to-character conversion unit 70.
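Steps (c1) to (c5) can be sketched as a greedy longest-match search in which each candidate from the system dictionary is kept only if every one of its characters is an allowed counterpart of the source character at the same position. This is a minimal sketch under stated assumptions: it assumes one reading symbol per source character so that positions align, it interleaves the candidate lookup, filtering and selection instead of running them as separate passes, and the tables `SYSTEM_TRADITIONAL` and `SIMPLIFIED_TO_TRADITIONAL` are tiny hypothetical excerpts.

```python
# Hypothetical excerpt of the system dictionary (traditional section):
# a tuple of reading symbols -> candidate characters or words.
SYSTEM_TRADITIONAL = {
    ("ta1",): ["他", "她", "它"],
    ("gan4",): ["幹", "贛"],
    ("gan1",): ["乾", "甘", "肝"],
    ("jing4",): ["淨", "靜", "敬"],
    ("gan1", "jing4"): ["乾淨"],
}
# Hypothetical excerpt of the simplified/traditional correspondence dictionary.
SIMPLIFIED_TO_TRADITIONAL = {"他": {"他"}, "干": {"乾", "幹", "干"}, "净": {"淨"}}


def readings_to_text(readings, source, system, correspondence):
    out, i = [], 0
    while i < len(readings):
        # (c5) longest match: try the longest syllable sequence first.
        for length in range(len(readings) - i, 0, -1):
            key = tuple(readings[i:i + length])                 # (c2) candidates
            allowed = [correspondence.get(ch, {ch})             # (c3) counterparts
                       for ch in source[i:i + length]]
            matches = [word for word in system.get(key, [])     # (c4) filtering
                       if all(c in a for c, a in zip(word, allowed))]
            if matches:
                out.append(matches[0])
                i += length
                break
        else:
            # No candidate survived: keep the source character unchanged.
            out.append(source[i])
            i += 1
    return "".join(out)


print(readings_to_text(["ta1", "gan4", "gan1", "jing4"], "他干干净",
                       SYSTEM_TRADITIONAL, SIMPLIFIED_TO_TRADITIONAL))
```

Because the reading has already separated "gan4" from "gan1", the two occurrences of 干 map to different traditional characters, which is the case the character-only prior art is said to get wrong.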
The operation of the embodiment is now described concretely, taking as an example the conversion of a simplified-character source document into a traditional-character target document.
When the source document 他尚未使用软件 ("he has not yet used the software") and the language flag indicating "simplified" are supplied through the input unit 10, the word conversion unit 30 refers to the word conversion dictionary 20 shown in Fig. 5(a). Because the dictionary contains the headword "software", the Mainland word for "software" in the source document is replaced by the corresponding Taiwanese word, and the rewritten source document and the language flag are recorded in the buffer 40. The character-to-reading conversion unit 60 then refers to the language flag recorded in the buffer 40 and the system dictionary 50 shown in Fig. 7 and converts the rewritten text into the reading symbols "ta1 shang4 uei4 sh3 yueng4 ruan3 ti3". Processing then passes to the reading-to-character conversion unit 70 shown in Fig. 10. When these reading symbols are input in (c1), syllables are cut out in (c2) and every syllable sequence that can form a character, word or expression is taken out. The extracted syllables and their candidate characters and words are shown in Fig. 11.
In (c3) of Fig. 10, the simplified/traditional character correspondence dictionary 80 shown in Fig. 9 is consulted with each character of the source document in the buffer 40, and the characters corresponding to the reading of each character are taken out, as shown in Fig. 12.
In (c4), unsuitable candidate characters and words are deleted by reference to the corresponding characters and words taken out of the correspondence dictionary 80, leaving the high-likelihood characters and words shown in Fig. 13.
In (c5), the longest match method yields the conversion result 他尚未使用軟體 ("he has not yet used the software").
Finally, the output unit 90 outputs the converted character string.
Abovely the present invention has been described, has the invention is not restricted to above-mentioned embodiment, so long as in the scope that does not change its spirit, just can carry out suitable improvement and implemented according to embodiment.For example:
(1) Instead of storing Chinese characters and words without regard to whether they are simplified or traditional, the system dictionary may separate simplified entries from traditional ones, mark the distinction in a flag field, and store the entries in order of pronunciation symbol.
(2) The conversion algorithm used by the pronunciation-to-character conversion unit is not limited to the longest-match method. Morphological analysis that examines, for example, the number of unresolved characters may also be adopted, and other information such as grammar and usage frequency may be used to judge the correctness of the conversion result. Specifically, usage frequency and OCR recognition probability can be used to narrow down (identify) or exclude Chinese characters that are unlikely to be present.
(3) Depending on manufacturing circumstances and the like, an essential constituent element of the invention may be divided into several elements, several elements may conversely be treated as one, or elements may be combined as appropriate.
(4) The necessary programs may be read into an existing word processor or conversion device, and a disk or similar medium holding the dictionaries may be added, thereby forming the same structure as the present invention.
(5) Frequently used characters may be stored in a memory device that is more expensive but faster, thereby improving the embodiment.
(6) The input unit may take its input by reading a disk on which the document is stored as byte information, using the conversion device body or other L/E-type reading equipment. The output unit may then output the encoded document to a disk.
(7) The usage frequency of words and characters differs greatly with the field of the document. For example, in documents about animals, characters such as "moving", "shellfish", "horse", "cat (simplified Chinese character)", and "bird" are used frequently, whereas in documents about patents, words such as "exploitation" are used frequently. Therefore, when the conversion algorithm of the pronunciation-to-character conversion unit adopts or reflects usage frequency, it can use a frequency table prepared according to the document purpose and context entered by the user of the device. Alternatively, the usage frequency of particular words such as "exploitation" may be evaluated and the frequency table to be referenced switched automatically.
(8) When text input by OCR is the object of conversion, a Chinese character that is not recorded in the dictionaries is judged to be an OCR read error, so a function that outputs a notice to that effect should be added.
(9) For characters that have many similarly shaped counterparts, have many strokes, or are otherwise known from experience to be read by OCR with low accuracy, the weights associated with words containing such a character and with the preceding and following characters should be raised, and a function should be added that draws the translator's attention to this.
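Variation (7) can be pictured as choosing among frequency tables and ranking homophone candidates with the chosen one. The following Python sketch is only an illustration under stated assumptions: the field names, counts, marker words, threshold, and the functions choose_table and rank are hypothetical, and the text does not specify how the automatic judgment is made.

```python
# Hypothetical per-field frequency tables (word -> usage count).
FREQ_TABLES = {
    "general":  {"尚未": 90, "上尉": 5, "上位": 5},
    "military": {"尚未": 40, "上尉": 55, "上位": 5},
}

# Hypothetical marker words whose frequency signals a field.
MARKERS = {"military": ["部隊", "軍官"]}


def choose_table(text, threshold=2):
    """Pick a field automatically by counting marker words in the text."""
    for field, words in MARKERS.items():
        hits = sum(text.count(w) for w in words)
        if hits >= threshold:
            return field
    return "general"


def rank(candidates, field):
    """Order homophone candidates so that the most frequent one comes first."""
    table = FREQ_TABLES[field]
    return sorted(candidates, key=lambda w: table.get(w, 0), reverse=True)
```

A user-supplied field could be passed to `rank` directly instead of calling `choose_table`, which corresponds to the case where the document purpose is entered by the user.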
As described above, adopting the Chinese simplified/traditional document conversion device of the present invention solves the problems of the prior art and yields the following results.
(1) Because pronunciation information of the characters is used, several problems that arise when converting pronunciation to characters can be solved effectively: problems caused by words not registered in the dictionary; the adjacent two-word conversion problem (for example, because the words "useful" and "a day" exist, the pronunciation symbol string corresponding to "having one day" can be falsely converted to "useful sky"); and the homophone selection problem (for example, the three homophones "upper", "as yet not", and "captain").
By using pronunciation information together with grammar and syntax, erroneous conversion of sound-changing words can also be reduced, which improves the accuracy of conversion between simplified Chinese documents and traditional Chinese documents. The scale of the device can also be reduced.
(2) Because BIGRAM information is not used, conversion accuracy can be improved without depending on the content of a corpus. No conversion errors arise from the difficulty of collecting language data. To obtain results that are correct and suited to each field, the measures described above should be adopted to improve accuracy further.
(3) The same system dictionary can be used both when a Chinese word processor or the like converts the phonetic symbols of input words into Chinese characters in a Chinese document and when simplified and traditional Chinese characters are converted into each other. When the device is used together with a Chinese word processor or the like, no separate conversion dictionary therefore needs to be built. This reduces the man-hours for compiling dictionaries and also cuts costs.
(4) The word conversion dictionary is recorded in a single language, so users of this device in mainland China and Taiwan do not need to go through cumbersome conversions between the simplified and traditional systems, which saves both cost and labor.
The practical utility of the present invention is therefore very high.
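The reduction in erroneous conversion of sound-changing words mentioned under result (1) can be illustrated with a character-to-reading step that consults a separate block of sound-changing words before falling back to per-character readings. This Python sketch is one plausible reading only: the table contents, the toneless pinyin, and the names DEFAULT_READING, SOUND_CHANGE_BLOCK and to_readings are hypothetical and are not taken from the text.

```python
# Hypothetical per-character default readings (the "non-sound-changing" case).
DEFAULT_READING = {
    "银": "yin", "行": "xing", "人": "ren",
    "重": "zhong", "要": "yao", "复": "fu",
}

# Hypothetical block of words in which a character takes a different reading.
SOUND_CHANGE_BLOCK = {
    "银行": ["yin", "hang"],
    "重复": ["chong", "fu"],
}

MAX_WORD_LEN = max(len(word) for word in SOUND_CHANGE_BLOCK)


def to_readings(text):
    """Convert characters to readings, checking the sound-changing block first."""
    readings = []
    i = 0
    while i < len(text):
        # Try multi-character words, longest first.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in SOUND_CHANGE_BLOCK:
                readings.extend(SOUND_CHANGE_BLOCK[word])
                i += length
                break
        else:
            # No sound-changing word starts here: use the default reading,
            # or "?" when the character is not registered.
            readings.append(DEFAULT_READING.get(text[i], "?"))
            i += 1
    return readings
```

Here the character 行 is read "hang" inside 银行 but "xing" in 行人, so the reading passed to the pronunciation-to-character step depends on the surrounding word rather than on the character alone.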

Claims (4)

1. A Chinese simplified/traditional document conversion device that uses a predetermined language flag to distinguish simplified Chinese from traditional Chinese and converts a source document written in simplified or traditional Chinese into a target document written in the other script, the device comprising:
a word conversion dictionary storing simplified Chinese vocabulary and the corresponding traditional Chinese vocabulary; a system dictionary storing pronunciation symbols and the corresponding simplified and traditional Chinese characters or words;
a simplified/traditional correspondence dictionary storing simplified Chinese characters and the corresponding traditional Chinese characters;
a word conversion unit that identifies from the input language flag whether the source document is simplified or traditional, searches the word conversion dictionary with the vocabulary of the input source document, finds the appropriate corresponding words, and rewrites the vocabulary of the source document;
a character-to-pronunciation conversion unit that, with reference to the system dictionary, converts the characters in the document produced by the word conversion unit into pronunciation symbols; and
a pronunciation-to-character conversion unit that, with reference to the system dictionary and the simplified/traditional correspondence dictionary, converts the pronunciation symbols into the characters of the target document in the other script by a predetermined pronunciation-to-character conversion algorithm,
characterized in that the conversion between simplified and traditional Chinese uses both character information and pronunciation information.
2. The Chinese simplified/traditional document conversion device as claimed in claim 1, characterized in that: the system dictionary has, regardless of whether the Chinese characters and words are simplified or traditional, a non-sound-changing block device that stores non-sound-changing words in a non-sound-changing block and a sound-changing block device that stores sound-changing words in a sound-changing block; and the pronunciation-to-character conversion unit has a longest-match conversion device that adopts the longest-match method as the conversion algorithm.
3. The Chinese simplified/traditional document conversion device as claimed in claim 1 or 2, characterized in that: as the conversion algorithm, the pronunciation-to-character conversion unit has a usage-frequency conversion device that preferentially converts to characters and words with high usage frequency.
4. The Chinese simplified/traditional document conversion device as claimed in claim 3, characterized in that: the usage-frequency conversion device has a content-dependent usage-frequency conversion control device that switches the conversion algorithm, or the frequency table used in conversion, according to the content of the source document.
CN96103701A 1995-03-24 1996-03-21 Simplified Chinese character-the original complex form changingover apparatus Expired - Fee Related CN1102779C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP66117/95 1995-03-24
JP66117/1995 1995-03-24
JP7066117A JPH08263478A (en) 1995-03-24 1995-03-24 Single/linked chinese character document converting device

Publications (2)

Publication Number Publication Date
CN1134568A CN1134568A (en) 1996-10-30
CN1102779C true CN1102779C (en) 2003-03-05

Family

ID=13306625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN96103701A Expired - Fee Related CN1102779C (en) 1995-03-24 1996-03-21 Simplified Chinese character-the original complex form changingover apparatus

Country Status (2)

Country Link
JP (1) JPH08263478A (en)
CN (1) CN1102779C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786956B (en) * 2005-12-09 2010-08-25 王绯 Method for processing converting abnormal word containing unicode four byte code East Asia ideograph in searching engine

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7260780B2 (en) * 2005-01-03 2007-08-21 Microsoft Corporation Method and apparatus for providing foreign language text display when encoding is not available
JP2006252164A (en) * 2005-03-10 2006-09-21 Fuji Xerox Co Ltd Chinese document processing device
CN101131690B (en) * 2006-08-21 2012-07-25 富士施乐株式会社 Method and system for mutual conversion between simplified Chinese characters and traditional Chinese characters
CN102929852B (en) * 2012-10-15 2016-05-04 福建榕基软件股份有限公司 A kind ofly in RichText Edition device, realize the method and system that the simple complex form of Chinese characters turns mutually
KR101384139B1 (en) * 2012-11-23 2014-04-10 박선정 Transformation method for chinese simplified character, study method using the same, recoding medium, storage medium and mobile communication device including storage medium
CN103870442A (en) * 2012-12-17 2014-06-18 鸿富锦精密工业(深圳)有限公司 Converting system and method for simplified Chinese and traditional Chinese
CN103885941A (en) * 2012-12-24 2014-06-25 鸿富锦精密工业(深圳)有限公司 Patent application document conversion system and method
CN110874527A (en) * 2018-08-28 2020-03-10 游险峰 Cloud-based intelligent paraphrasing and phonetic notation system
CN112036121A (en) * 2020-08-31 2020-12-04 浪潮商用机器有限公司 Simplified Chinese character and traditional Chinese character conversion method and related device
CN113076724B (en) * 2021-04-08 2024-06-11 合肥工业大学 Method and device for converting characters
CN117252154B (en) * 2023-11-20 2024-01-23 北京语言大学 Chinese simplified and complex character conversion method and system based on pre-training language model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1038364A (en) * 1988-06-03 1989-12-27 李毅民 Letter complex form of Chinese characters compatible automatic conversion system for Chinese-character information processing
CN1045878A (en) * 1989-03-22 1990-10-03 唐懋宽 Computing machine Chinese sound-digit code input technology
JPH04238397A (en) * 1991-01-23 1992-08-26 Matsushita Electric Ind Co Ltd Chinese pronunciation symbol generation device and its polyphone dictionary

Also Published As

Publication number Publication date
JPH08263478A (en) 1996-10-11
CN1134568A (en) 1996-10-30

Similar Documents

Publication Publication Date Title
CN1176456C (en) Automatic index based on semantic unit in data file system and searching method and equipment
CN1135485C (en) Identification of words in Japanese text by a computer system
CN1159661C (en) System for Chinese tokenization and named entity recognition
CN101388012B (en) Phonetic check system and method with easy confusion tone recognition
US8239188B2 (en) Example based translation apparatus, translation method, and translation program
US6233544B1 (en) Method and apparatus for language translation
CN1029170C (en) Language translation system
US7979268B2 (en) String matching method and system and computer-readable recording medium storing the string matching method
CN103970798B (en) The search and matching of data
JP5130892B2 (en) Character encoding processing method and system
CN1102779C (en) Simplified Chinese character-the original complex form changingover apparatus
CN1008016B (en) Input processing system
CN1252575A (en) Chinese generator for computer translation
CN1282072A (en) Error correcting method for voice identification result and voice identification system
CN1227657A (en) Natural language parser with dictionary-based part-of-speech probabilities
CN1770144A (en) Machine translation system and method
CN1910573A (en) System for identifying and classifying denomination entity
US7072880B2 (en) Information retrieval and encoding via substring-number mapping
CN1282932A (en) Chinese character fragmenting device
CN1108572C (en) Mechanical Chinese to japanese two-way translating machine
CN1949211A (en) New Chinese characters spoken language analytic method and device
CN1679023A (en) Method and system of creating and using chinese language data and user-corrected data
CN1226692C (en) Machine translation system based on semanteme and its method
KR101080880B1 (en) Automatic loanword-to-korean transliteration method and apparatus
JP2958044B2 (en) Kana-Kanji conversion method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee