CN117688123A - Method and device for generating document structure tree - Google Patents
- Publication number
- CN117688123A (application CN202211039454.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- document structure
- structure tree
- semantic information
- text unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application provides a method and a device for generating a document structure tree, where the method includes: acquiring a text #A, where the text #A includes at least two text units, the at least two text units include a first text unit and a second text unit, and the first text unit is adjacent to the second text unit; acquiring at least two pieces of semantic information, where the at least two pieces of semantic information include first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining a hierarchical relationship between the first text unit and the second text unit, the first text unit corresponds to the first semantic information, and the second text unit corresponds to the second semantic information; and inputting the at least two pieces of semantic information into a neural network model and obtaining a first document structure tree of the text #A by inference. With this method, embodiments of the application can generate a corresponding document structure tree for most documents, without being limited by factors such as the document's format or character attribute information.
Description
Technical Field
The present disclosure relates to the field of document structure technologies, and in particular, to a method and an apparatus for generating a document structure tree.
Background
The document structure tree may be used to indicate the directory structure of a document, through which a user may learn the document's overall structure. The user can also use the document structure tree to locate the text he or she needs within a long document.
However, existing document structure tree generation methods are affected by many factors, such as the document's format and intrinsic document features (font size, font style, line spacing, and so on), so their generalization is poor and the accuracy of the generated document structure tree is low. In particular, for documents with deep hierarchies, the document structure tree generated by existing methods is prone to errors.
Disclosure of Invention
The method and device for generating a document structure tree provided by the application can generate a corresponding document structure tree for most documents, without being limited by factors such as the document's layout or character attribute information.
In a first aspect, a method for generating a document structure tree is provided, including: acquiring a text, where the text includes at least two text units, the at least two text units include a first text unit and a second text unit, and the first text unit is adjacent to the second text unit; acquiring at least two pieces of semantic information, where the at least two pieces of semantic information include first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining a hierarchical relationship between the first text unit and the second text unit, the first text unit corresponds to the first semantic information, and the second text unit corresponds to the second semantic information; and inputting the at least two pieces of semantic information into a neural network model and obtaining a first document structure tree of the text by inference.
Because the document structure tree is obtained by inference over the document's semantic information, a corresponding document structure tree can be generated for most documents, without being limited by factors such as the document's layout or character attribute information.
With reference to the first aspect, in certain implementations of the first aspect, the text unit includes at least one of: a sentence, or a paragraph.
Specifically, when the text unit is a sentence, the document structure tree generating apparatus of the embodiments of the application may acquire the semantic information of adjacent sentences, determine the hierarchical relationship between those sentences based on it, and generate the document structure tree of the document accordingly; when the text unit is a paragraph, the apparatus does the same with the semantic information of adjacent paragraphs. In this way, a corresponding document structure tree can be generated for most documents, without being limited by factors such as the document's format or character attribute information.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: acquiring first data, where the first data is data determined by a user verifying the first document structure tree; and updating the neural network model according to the first data.
By updating the neural network model based on the user's verification data for the generated document structure tree, embodiments of the application can iteratively optimize the generated document structure tree so that it indicates the structure of the document more accurately, and subsequently generated document structure trees better match the user's understanding.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: and updating the first document structure tree according to the document structure template to obtain a second document structure tree of the text.
In this way, the generated second document structure tree can be more accurate and better meet the user's requirements.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: acquiring second data, where the second data is data determined by a user verifying the second document structure tree; and updating the neural network model according to the second data.
By updating the neural network model based on the user's verification data for the generated document structure tree, embodiments of the application can iteratively optimize the generated document structure tree so that it indicates the structure of the document more accurately, and subsequently generated document structure trees better match the user's understanding.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: storing the verified first document structure tree or second document structure tree into a document template library.
In this way, subsequently generated document structure trees can better match the user's understanding and be more accurate.
In a second aspect, an apparatus for generating a document structure tree is provided, including: an acquisition module, configured to acquire a text, where the text includes at least two text units, the at least two text units include a first text unit and a second text unit, and the first text unit is adjacent to the second text unit; the acquisition module is further configured to acquire at least two pieces of semantic information, where the at least two pieces of semantic information include first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining the hierarchical relationship between the first text unit and the second text unit, the first text unit corresponds to the first semantic information, and the second text unit corresponds to the second semantic information; and a processing module, configured to input the at least two pieces of semantic information into a neural network model and obtain a first document structure tree of the text by inference.
With reference to the second aspect, in certain implementations of the second aspect, the text unit includes at least one of: a sentence, or a paragraph.
With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes a verification module, configured to obtain first data, where the first data is data determined by a user verifying the first document structure tree; the processing module is further configured to update the neural network model according to the first data.
With reference to the second aspect, in some implementations of the second aspect, the processing module is further configured to update the first document structure tree according to the document structure template to obtain a second document structure tree of the text.
With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes a verification module, configured to obtain second data, where the second data is data determined by a user verifying the second document structure tree; the processing module is further configured to update the neural network model according to the second data.
With reference to the second aspect, in certain implementations of the second aspect, the apparatus further includes: a storage module, configured to store the verified first document structure tree or second document structure tree into the document template library.
In a third aspect, a cluster of computing devices is provided, comprising at least one computing device, each computing device comprising a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method as described in the first aspect and any one of the possible implementations of the first aspect.
In a fourth aspect, a computer readable storage medium is provided, storing instructions that, when run on a computer, cause the computer to perform the method according to the first aspect or any one of its possible implementations.
In a fifth aspect, a computer device is provided, the computer device including a processor and a memory; the memory is configured to store computer program instructions; and the processor invokes the computer program instructions in the memory to perform the method according to the first aspect or any one of its possible implementations.
Drawings
Fig. 1 is a schematic diagram of an applicable application scenario in an embodiment of the present application.
FIG. 2 is a flow chart of a method 200 of generating a document structure tree according to an embodiment of the present application.
FIG. 3 is a schematic diagram of initial correction of a document structure tree according to an embodiment of the present application.
FIG. 4 is a flow chart of a method 400 of generating a document structure tree according to an embodiment of the present application.
FIG. 5 is a schematic diagram of user verification of a document structure tree according to an embodiment of the present application.
FIG. 6 is a schematic diagram of updating a neural network model and document template library according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a document structure tree generating apparatus 700 according to an embodiment of the present application.
FIG. 8 is a schematic diagram of one architecture of a computing device cluster in accordance with an embodiment of the application.
FIG. 9 is yet another architectural diagram of a computing device cluster in accordance with an embodiment of the present application.
FIG. 10 is yet another architectural diagram of a computing device cluster in accordance with an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
The document structure tree may indicate the directory structure of a document. Through the document structure tree, a user can obtain the directory titles at each level of the document and the text content corresponding to each title. The user may also view the overall structure of the document through the document structure tree, and may locate specific text in the document through it.
However, only a few documents come with a labeled directory structure. For documents whose directory structure is not labeled, the structure often needs to be labeled manually, which requires substantial financial and human resources. Therefore, automatic generation of document structure trees has become a popular research direction.
In brief, document structure tree auto-generation techniques derive a document structure tree (which may also be regarded as a directory structure) automatically from the characteristics of a given document. The application fields of document structure trees are extremely wide; see fig. 1 for details.
Fig. 1 is a schematic diagram of an applicable application scenario in an embodiment of the present application. As shown in fig. 1, the document structure tree can be applied to various fields such as long document information extraction, enterprise document searching, text positioning, document management, long document processing and the like.
Currently, existing methods for generating document structure trees suffer from several problems, such as:
1) The accuracy of the generated document structure tree is low; in particular, when the document has a deep hierarchy, the tree generated by existing methods is prone to errors;
2) Poor generalization: they apply only to specific types of documents rather than to a wider range of document types;
3) Dependence on intrinsic document features such as font size, font style, and line spacing; existing methods cannot generate a suitable document structure tree for plain text files that lack such features.
For example, an existing document structure tree generation method may be based on character attribute information, and applies to documents in which such information exists, for example PDF documents. Specifically, the attribute information of each character of the PDF document is obtained, where the attribute information may include the character's horizontal position, font style, font size, line spacing, and so on; statistics are then collected over the attribute information of all characters to distinguish title text from body text, and a document structure tree is generated on that basis.
Although this method works well for PDF documents, which can provide character attribute information, it cannot be applied to documents that cannot, for example plain text documents.
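The attribute-based prior art described above can be pictured with a short sketch that separates heading lines from body lines by font-size statistics. The function name, the `(text, font_size)` input shape, and the majority-size heuristic are assumptions for illustration; real systems also weigh font style, position, and line spacing:

```python
from collections import Counter

def split_titles_by_font_size(lines):
    """Sketch of the attribute-based prior art: lines whose font size
    exceeds the document's dominant (body) font size are treated as
    heading text; the rest is body text.

    `lines` is a list of (text, font_size) pairs, e.g. as extracted
    from a PDF. (Input shape and heuristic are illustrative.)
    """
    # The most common font size is assumed to be the body size.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    titles = [text for text, size in lines if size > body_size]
    body = [text for text, size in lines if size <= body_size]
    return titles, body
```

As the surrounding text notes, this sketch fails on plain text documents, where every line carries the same (or no) font size.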
In view of the above technical problems, the present application provides a method and an apparatus for generating a document structure tree, which can generate a corresponding document structure tree for most documents, and is not limited by factors such as layout of the documents and attribute information of characters.
The method and apparatus for generating a document structure tree according to the embodiments of the present application will be described below with reference to the accompanying drawings.
FIG. 2 is a flow chart of a method 200 of generating a document structure tree according to an embodiment of the present application. The execution subject of the method 200 for generating a document structure tree is a device for generating a document structure tree. As shown in fig. 2, the method 200 of generating a document structure tree includes:
s210, acquiring a text #A, wherein the text #A comprises at least two text units, the at least two text units comprise a first text unit and a second text unit, and the first text unit is adjacent to the second text unit.
Specifically, a generating device of the document structure tree (hereinafter simply referred to as the "first device") may acquire a text #A uploaded by the user, and the text #A may include at least two text units. The at least two text units may include a first text unit and a second text unit, where the first text unit is adjacent to the second text unit.
Specifically, a text unit is a basic building block of the text #A. For example, a text unit may be a sentence or a paragraph; in other words, a text may be composed of multiple sentences or multiple paragraphs. Embodiments of the application are not limited in this respect.
Illustratively, a sentence may be short, such as "It is raining today." or "The weather is cloudy today." A paragraph may include multiple sentences, for example: "The weather is overcast today, and all extracurricular activities have been cancelled. This is rather disappointing." The above is by way of example only and is not intended as limiting.
The above-mentioned "at least two" of at least two text units is only a generic term and is not intended as a limitation of a specific number.
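Dividing a text into such units could be sketched as follows. This is an illustrative implementation with an assumed function name and naive punctuation/blank-line rules, not a requirement of the application:

```python
import re

def split_into_units(text: str, unit: str = "sentence") -> list[str]:
    """Split raw text into text units (sentences or paragraphs).

    A minimal sketch: paragraphs are separated by blank lines, and
    sentences end at terminal punctuation. Real documents may need a
    more robust tokenizer.
    """
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:  # "sentence"
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]
```

For instance, `split_into_units("It is raining today. The weather is cloudy today.")` yields two sentence units.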
S220, acquiring at least two pieces of semantic information, wherein the at least two pieces of semantic information comprise first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining a hierarchical relationship between a first text unit and a second text unit, the first semantic information corresponds to the first text unit, and the second semantic information corresponds to the second text unit.
Specifically, the more semantic information the first device acquires, the more semantic relationships it can determine based on that information, and thus the more accurate the generated document structure tree can be.
Specifically, semantic information is in one-to-one correspondence with text units, that is, one piece of semantic information corresponds to one text unit. Illustratively, a text unit "Background Art" may correspond to a piece of semantic information that embodies the semantics of that text unit or similar information. In other words, semantic information may be intrinsic attribute information of a text unit that embodies the text unit's semantics.
More specifically, the first device may obtain at least two pieces of semantic information of the text #A; adjacent pieces of semantic information may be used to determine the hierarchical relationship between the adjacent text units to which they correspond. In particular, adjacent semantic information determines the semantic relationship of the corresponding adjacent text units, and that semantic relationship may in turn be used to determine the hierarchical relationship between them.
For example, the at least two pieces of semantic information may include first semantic information and second semantic information, which are adjacent. The first semantic information corresponds to the first text unit and the second semantic information corresponds to the second text unit, which are two adjacent text units; in other words, the first text unit is adjacent to the second text unit. The first semantic information and the second semantic information may determine the semantic relationship between the first text unit and the second text unit, which may be used to determine the hierarchical relationship between them.
For convenience of description, the following description will take a sentence as an example of a text unit.
It should be understood that, when the text unit is a sentence, the document structure tree generating apparatus of the embodiments of the application may acquire the semantic information of adjacent sentences, determine the hierarchical relationship between those sentences based on it, and generate the document structure tree of the document accordingly; when the text unit is a paragraph, the apparatus does the same with the semantic information of adjacent paragraphs. In this way, a corresponding document structure tree can be generated for most documents, without being limited by factors such as the document's format or character attribute information.
In S220, the first device may determine the hierarchical relationship between adjacent sentences based on the acquired semantic information. For example, the semantic information of a first sentence is "description of the drawings" and the semantic information of a second sentence is "Fig. 1 is an application scenario diagram." Based on these two pieces of semantic information, the first device may determine the semantic relationship between the two sentences: the first sentence is a summary or superordinate description of the second sentence. It can then be determined that the second sentence is subordinate to the first sentence, that is, the first sentence is at a higher level than the second sentence.
Also illustratively, the semantic information of a third sentence is "Fig. 2 is a flow chart of an implementation." The first device may determine from the semantic information of the second sentence and the third sentence that the two are at the same level.
Further, the first device may determine the semantic relationship between the first sentence and the third sentence from the relationship between the first and second sentences and the relationship between the second and third sentences, namely: the first sentence is also a summary or superordinate description of the third sentence.
From the above description, the first device may determine a hierarchical relationship between adjacent text units.
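Once the hierarchical relation between each pair of adjacent units is known, the tree itself can be assembled, for example with a stack. The relation labels ("sub", "same", "sup") and the assembly routine below are illustrative assumptions, not mandated by the application:

```python
def build_structure_tree(units, relations):
    """Assemble a document structure tree from pairwise relations.

    `relations[i]` describes unit i+1 relative to unit i:
      "sub"  - one level deeper (subordinate),
      "same" - same level,
      "sup"  - one level higher.
    Returns a nested dict: {"title": ..., "children": [...]}.
    """
    root = {"title": "<root>", "children": []}
    stack = [(-1, root)]  # (level, node); root sits below level 0
    level = 0
    for i, unit in enumerate(units):
        if i > 0:
            rel = relations[i - 1]
            if rel == "sub":
                level += 1
            elif rel == "sup":
                level = max(0, level - 1)
        # Pop until the stack top can act as this unit's parent.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"title": unit, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root
```

With the three example sentences above and relations `["sub", "same"]`, both figure captions become children of the "description of the drawings" node.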
S230, inputting the at least two pieces of semantic information into the neural network model, and obtaining a first document structure tree of the text #A by inference.
Specifically, based on the at least two pieces of semantic information acquired above, the first device performs inference on that semantic information and may thereby generate a first document structure tree of the text #A.
The first device's inference over the at least two pieces of semantic information may be implemented with a neural network model (which may also be referred to as a pre-trained model). Neural network models are widely used in the field of natural language processing (NLP). A neural network model is pre-trained on a large corpus so that it learns a large amount of semantic knowledge, and is then fine-tuned on a downstream task (in deep learning, downstream tasks are concrete tasks such as sentiment classification or named entity recognition), which can greatly improve performance on that task. The first device may perform inference over the at least two pieces of semantic information based on the neural network model to generate the first document structure tree of the text #A.
NLP refers to techniques that let machines interact using the natural language of human communication: natural language is processed so that it becomes readable and understandable by a computer. NLP research began with humanity's exploration of machine translation. Although natural language processing spans multiple dimensions such as phonetics, grammar, semantics, and pragmatics, its basic task, simply put, is to segment the corpus to be processed, based on ontology dictionaries, word frequency statistics, contextual semantic analysis, and so on, into term units that take the smallest part of speech as their unit and are rich in semantics.
NLP is also a linguistic discipline: it uses computer technology to analyze, understand, and process natural language, taking the computer as a powerful tool for language research, quantitatively studying linguistic information with computer support, and providing linguistic descriptions usable by both humans and computers. It comprises two parts: natural language understanding (NLU) and natural language generation (NLG). It is a typical cross-disciplinary subject, involving linguistics, computer science, mathematics, cognitive science, logic, and other fields, and focuses on the interaction between computers and natural language.
Illustratively, in the embodiment of the present application, the directory structure of text #A may first be labeled to obtain the semantic relationships between all adjacent text units, and the neural network model may then perform inference. The neural network model may be a pre-trained model whose input is at least two text units and whose output is a classification of the semantic relationship. For the specific inference process, reference may be made to existing algorithms or procedures, which are not described in detail here.
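The pairwise inference over adjacent text units described above can be sketched as follows. The relation label set and the heuristic standing in for the pre-trained model are assumptions for illustration, since the patent fixes neither the label set nor the model architecture:

```python
from typing import List, Tuple

# Hypothetical label set for the semantic relationship between two
# adjacent text units; the patent does not specify the exact categories.
LABELS = ["same_level", "child", "parent"]

def classify_relation(unit_a: str, unit_b: str) -> str:
    """Stand-in for the pre-trained model's inference step.

    A real implementation would encode the pair with a pre-trained
    encoder and apply a classification head; a toy indentation
    heuristic keeps this sketch self-contained.
    """
    # Treat an indented successor as a child of its predecessor;
    # otherwise treat the two units as siblings.
    if unit_b.startswith("  "):
        return "child"
    return "same_level"

def infer_relations(units: List[str]) -> List[Tuple[str, str, str]]:
    """Run pairwise inference over every adjacent pair of text units."""
    return [(a, b, classify_relation(a, b))
            for a, b in zip(units, units[1:])]
```

A tree builder would then consume these pairwise relations to assemble the first document structure tree.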
Because the document structure tree is generated by reasoning over the semantic information of the document, a corresponding document structure tree can be generated for most documents, without being limited by factors such as the layout of the document or the attribute information of its characters.
In one possible implementation manner, the method may further include: and updating the first document structure tree according to the document structure template to obtain a second document structure tree of the text #A.
Specifically, the first device may update the first document structure tree described above based on the document structure templates stored in the document structure template library to generate the second document structure tree. In this way, the document structure tree generated by the first device can be more accurate and better meet the user's requirements.
Specifically, the first device may obtain a first document structure tree based on reasoning about at least two semantic information by the neural network model. Further, the first device may compare the generated first document structure tree with the document structure template library, correct the first document structure tree according to the template stored in the document structure template library, and finally generate the second document structure tree. The specific process can be seen in fig. 3.
FIG. 3 is a schematic diagram of initial correction of a document structure tree according to an embodiment of the present application. As shown in fig. 3, the first device may first generate a first document structure tree based on the neural network model's reasoning about at least two pieces of semantic information in text #A. The italic parts of the first document structure tree may be displayed as mispredicted or unconfirmed content. The first device may then compare the first document structure tree with the first document structure template in the document structure template library and correct the italic content in the first document structure tree, so that the generated second document structure tree is more accurate and better conforms to the user's requirements.
Illustratively, the first device confirms "A. Credit quality analysis" as a first-level title. The first device may then look up, in the first document structure template, the second-level titles under a first-level title beginning with "First.", find that "(Second) A financial real estate" matches a second-level title beginning with "(Second)", and accordingly confirm that "(Second) A financial real estate" should be a second-level title rather than a first-level title, and that the following text "In recent years, A …" should be second-level body text. Combining the document structure template in this way enhances the accuracy of the second document structure tree, so that the generated second document structure tree is more accurate and better meets the user's requirements.
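The template-based correction step might look like the following sketch. The prefix patterns and the (text, level) node representation are illustrative assumptions, not taken from the patent:

```python
import re

# Hypothetical template entries: the actual prefixes stored in the
# document structure template library are not specified in the text.
TEMPLATE = {
    1: re.compile(r"^[0-9]+\."),    # e.g. "1. Credit quality analysis"
    2: re.compile(r"^\([0-9]+\)"),  # e.g. "(1) Financial real estate"
}

def correct_levels(nodes):
    """Snap each node's predicted level to the template level when a
    template prefix matches, as in the correction step above.

    `nodes` is a list of (text, predicted_level) pairs; this pair
    representation is an assumption for the sketch.
    """
    corrected = []
    for text, level in nodes:
        for tmpl_level, pattern in TEMPLATE.items():
            if pattern.match(text):
                level = tmpl_level  # the template overrides the model
                break
        corrected.append((text, level))
    return corrected
```

Here a node mispredicted as a first-level title but matching a second-level prefix would be corrected to level two, mirroring the "(Second) A financial real estate" example above.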
It will be appreciated that the present application also supports the conversion of other non-plain text documents into the aforementioned text #a by optical character recognition (optical character recognition, OCR) techniques.
The method 200 of generating a document structure tree shown in FIG. 2 is described further below in conjunction with other figures.
FIG. 4 is a flow chart of a method 400 of generating a document structure tree according to an embodiment of the present application. The execution body of the method 400 for generating a document structure tree is a device for generating a document structure tree. As shown in fig. 4, the method 400 for generating a document structure tree includes:
s410, acquiring a text #A, wherein the text #A comprises at least two text units, the at least two text units comprise a first text unit and a second text unit, and the first text unit is adjacent to the second text unit.
The specific content may refer to the description of S210, and will not be described herein.
S420, acquiring at least two pieces of semantic information, wherein the at least two pieces of semantic information comprise first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining a hierarchical relationship between a first text unit and a second text unit, the first semantic information corresponds to the first text unit, and the second semantic information corresponds to the second text unit.
The specific content may refer to the description of S220, and will not be described herein.
S430, inputting at least two semantic information into the neural network model, and reasoning to obtain a first document structure tree of the text #A.
The specific content may refer to the description of S230, and will not be described herein.
S440, acquiring first data, wherein the first data is determined by checking a first document structure tree by a user.
Specifically, after the first device generates the first document structure tree for the text #A uploaded by the user, an interface presenting the first document structure tree may be shown to the user. The user may review the first document structure tree generated by the first device and verify the first document structure tree of text #A based on the user's own experience or knowledge. See in particular fig. 5.
FIG. 5 is a schematic diagram of user verification of a document structure tree according to an embodiment of the present application. As shown in fig. 5, the user may verify the first document structure tree of text #A generated by the first device. The first device supports the user in customizing the directory structure hierarchy of text #A. The italic parts are misclassified content. The user checks the first document structure tree (also referred to as the directory structure tree) of text #A. For example, the user may select a piece of text and define its directory structure level and its type, that is, title or body text. Illustratively, the user selects "(Three) A city land strength" and changes it from a first-level title to a second-level title, and changes "During the period, A city …" from first-level body text to second-level body text.
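A single verification edit of the kind just described could be applied as in this sketch; the dict field names ("text", "level", "type") are assumptions, since the patent does not define a node schema:

```python
def apply_user_edit(nodes, index, new_level=None, new_type=None):
    """Apply one verification edit from the user to a flat node list.

    `nodes` is a list of dicts with "text", "level" and "type" keys
    ("title" or "body"); these field names are assumed for the sketch.
    """
    node = dict(nodes[index])  # copy, then replace, so the edit is explicit
    if new_level is not None:
        node["level"] = new_level
    if new_type is not None:
        node["type"] = new_type
    nodes[index] = node
    return nodes
```

The collected edits form the first data used later to update the neural network model.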
In this way, it may be supported that the user is able to correct the document structure tree generated by the first device.
S450, updating the neural network model according to the first data.
In particular, the first device may update the neural network model based on the obtained verification data of the first document structure tree by the user. For example, the first device may regenerate training data from the first data, perform fine-tuning based on the original parameters of the neural network model, and update some parameters of the neural network model to obtain an updated neural network model.
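Regenerating training data from a user-verified node list might be sketched as follows; the (text, level) representation and the relation label names are assumptions chosen to mirror the model's output space:

```python
def make_training_pairs(nodes):
    """Regenerate training examples from a user-verified node list.

    `nodes` is a list of (text, level) pairs; each adjacent pair is
    labeled by comparing levels, producing the semantic-relation
    classes the model is fine-tuned to predict.
    """
    examples = []
    for (text_a, lvl_a), (text_b, lvl_b) in zip(nodes, nodes[1:]):
        if lvl_b > lvl_a:
            label = "child"       # successor is nested one level deeper
        elif lvl_b < lvl_a:
            label = "parent"      # successor returns to a higher level
        else:
            label = "same_level"  # siblings at the same depth
        examples.append(((text_a, text_b), label))
    return examples
```

These (pair, label) examples would then feed the fine-tuning step described above.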
Because the updated neural network model is obtained iteratively by the first device from the first data, which the user determined by checking the first document structure tree, the updated neural network model can optimize the document structure trees generated by the first device and thereby indicate the structure of a document more accurately.
By updating the neural network model based on the user's verification data for the generated document structure tree, the embodiment of the present application can support optimizing the generated document structure tree through successive iterations, indicate the structure of the document more accurately, and make subsequently generated document structure trees better conform to the user's cognition.
Optionally, the method 400 may further include:
s460, storing the first document structure tree after verification into a document structure template library.
Specifically, the first device may store the user-verified first document structure tree in the document structure template library, so that document structure trees generated later are more accurate and better conform to the user's cognition. See in particular fig. 6.
FIG. 6 is a schematic diagram of updating a neural network model and a document structure template library according to an embodiment of the present application. As shown in fig. 6, the first device may summarize the structural feature information of the verified first document structure tree and store it in the document structure template library. For example, the first device may take the first several characters of each first-level title of the first document structure tree and store them in the first-level catalog of the document structure template library; it may then summarize the first several characters of each second-level title, prefix them with the corresponding first-level catalog entry, and add them to the second-level catalog of the document structure template library. When updating the neural network model, the first device may divide the training data into an input part and an output part: the input part is two adjacent text units, and the output part is the semantic relationship classification (which may be marked by a label) corresponding to those two adjacent text units. By preprocessing all texts in the training set, a number of adjacent text units can be obtained.
Illustratively, in fig. 6, statement 1 and statement 2 are two adjacent statements, tag 1 is used to indicate the category of the semantic relationship between statement 1 and statement 2, statement 2 and statement 3 are two adjacent statements, tag 2 is used to indicate the category of the semantic relationship between statement 2 and statement 3, statement 3 and statement 4 are two adjacent statements, and tag 3 is used to indicate the category of the semantic relationship between statement 3 and statement 4. After acquiring tags 1 through 3, the underlying model may be updated with this information, thereby updating the neural network model.
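The summarization of title prefixes into the template library, described above, might look like this sketch; the library layout, prefix length, and (text, level, is_title) node representation are assumptions for illustration:

```python
def summarize_templates(nodes, prefix_len=4):
    """Derive template entries from a verified document structure tree.

    Stores the first `prefix_len` characters of each first-level title
    in the first-level catalog, and second-level title prefixes keyed
    by their parent's entry, as described above. The dict layout is an
    assumption for this sketch.
    """
    library = {"level1": [], "level2": {}}
    current = None
    for text, level, is_title in nodes:
        if not is_title:
            continue  # body text does not contribute template prefixes
        if level == 1:
            current = text[:prefix_len]
            library["level1"].append(current)
        elif level == 2 and current is not None:
            # Second-level prefixes are grouped under the parent entry.
            library["level2"].setdefault(current, []).append(text[:prefix_len])
    return library
```

The resulting library is what the correction step consults when updating a first document structure tree into a second document structure tree.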
Alternatively, the method for generating the document structure tree according to the embodiment of the application can be applied to the field of long-document information extraction. Specifically, the user can locate the relevant field based on the directory content of the document structure tree generated by the method described above. If the user knows from prior knowledge that the total production value can only appear under the first-level catalog "B city government credit quality analysis" or the second-level catalog "B city economic strength", the extraction range can be greatly narrowed, saving service invocation time.
Alternatively, the method for generating the document structure tree according to the embodiment of the application can be applied to the field of enterprise document searching. Specifically, the user can screen the text under a specific directory level from the document structure tree generated by the method according to the prior knowledge, and then perform fuzzy matching on the text to retrieve the related data.
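The level-filtered fuzzy matching described for enterprise document search might be sketched with the standard library's difflib as a stand-in for whatever matcher the retrieval service actually uses; the (text, level, is_title) representation and the score threshold are assumptions:

```python
from difflib import SequenceMatcher

def search_under_level(nodes, level, query, threshold=0.5):
    """Fuzzy-match `query` against body text restricted to one
    directory level of the document structure tree.

    `nodes` is a list of (text, level, is_title) triples, an assumed
    representation; hits are returned best-first.
    """
    hits = []
    for text, lvl, is_title in nodes:
        if lvl != level or is_title:
            continue  # screen to body text at the requested level
        score = SequenceMatcher(None, query, text).ratio()
        if score >= threshold:
            hits.append((round(score, 2), text))
    return sorted(hits, reverse=True)
```

Restricting the match to one directory level before fuzzy matching is what narrows the search space, as the passage above describes.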
In general, the method for generating a document structure tree in the embodiments of the present application does not need to rely on PDF files or manually written rules, which saves a great deal of labor and time cost; it migrates well and achieves good results even for complex documents with deep directory structure levels. A verification interface is also provided, through which the user can correct the generated result, continuously iterate and optimize the underlying algorithm model (for example, update the neural network model), and update the corresponding document structure template library.
In one possible implementation, when the first device presents the second document structure tree of text #a to the user, the first device may further obtain second data determined by the user by checking the second document structure tree.
Alternatively, the first device may update the neural network model according to the second data, and the detailed description may refer to the description about the first data, which is not described herein.
Having described method embodiments of the present application, corresponding apparatus embodiments are described below.
FIG. 7 is a schematic block diagram of an apparatus 700 for generating a document structure tree according to an embodiment of the present application. The apparatus 700 includes an acquisition module 710 and a processing module 720.
Optionally, the apparatus 700 further comprises a verification module 730 and a storage module 740.
The storage module 740 includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM). The storage module 740 is used for storing associated instructions and data.
The processing module 720 may be one or more central processing units (central processing unit, CPU). In the case where the processing module 720 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
The acquiring module 710 is configured to perform the following operations: acquiring a text #A, wherein the text #A comprises at least two text units, the at least two text units comprise a first text unit and a second text unit, and the first text unit is adjacent to the second text unit; at least two semantic information is acquired, wherein the at least two semantic information comprises first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining a hierarchical relationship between a first text unit and a second text unit, the first text unit corresponds to the first semantic information, and the second text unit corresponds to the second semantic information.
A processing module 720, configured to perform the following operations: at least two semantic information are input into the neural network model, and a first document structure tree of the text #A is obtained by reasoning.
The foregoing is described by way of example only. The apparatus 700 is responsible for executing the methods or steps related to the foregoing method embodiments.
Optionally, the verification module 730 is configured to perform the following operations: acquiring first data, wherein the first data is determined by checking a first document structure tree by a user; alternatively, second data, which is data determined by a user checking the second document structure tree, is acquired.
Optionally, the storage module 740 is configured to perform the following operations: storing the checked first document structure tree or second document structure tree into a document structure template library.
Optionally, the processing module 720 may be further configured to perform the following operations: the neural network model is updated based on the first data or the second data.
The foregoing is described by way of example only.
The above description is merely exemplary in nature. Specific content can be seen from the content shown in the above method embodiment. In addition, the implementation of the respective operations in fig. 7 may also correspond to the respective descriptions of the method embodiments shown with reference to fig. 2 to 6.
The embodiment of the present application further provides a computing device cluster, whose schematic structural diagram is shown in fig. 8; the computing device cluster includes at least one computing device 800. The memory 806 of one or more computing devices 800 in the computing device cluster may store instructions of the same document structure tree generating apparatus 700 for executing the method 200 of generating a document structure tree.
It should be noted that, the memory 806 in different computing devices 800 in the computing device cluster may store different instructions for performing part of the functions of the generating apparatus 700 of the document structure tree.
FIG. 9 is yet another block diagram of a computing device cluster in accordance with an embodiment of the application. As shown in fig. 9, two computing devices 900A and 900B are connected through a communication interface 908. Instructions for performing the functions of the interaction unit 202 and the processing unit 206 are stored on a memory in the computing device 900A. Instructions for performing the functions of storage unit 204 are stored in memory in computing device 900B. In other words, the memories 906 of the computing devices 900A and 900B collectively store instructions of the document structure tree generating apparatus 700 for executing the document structure tree generating method described in the foregoing embodiment.
The connection manner between the computing devices in the cluster shown in fig. 9 may take into account that the method for generating a document structure tree provided by the foregoing embodiments of the present application needs to obtain the semantic information of at least two text units. Accordingly, it is contemplated that the storage function be performed by computing device 900B.
It should be appreciated that the functionality of computing device 900A shown in fig. 9 may also be performed by multiple computing devices 800. Likewise, the functionality of computing device 900B may also be performed by multiple computing devices 800.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. Wherein the network may be a wide area network or a local area network, etc.
FIG. 10 is a schematic block diagram of yet another architecture of a computing device cluster in accordance with an embodiment of the application. As shown in fig. 10, two computing devices 1000A and 1000B are connected by a network. Specifically, the connection to the network is made through a communication interface in each computing device. In this type of possible implementation, instructions to execute the interactive unit 202 are stored in a memory 1006 in the computing device 1000A. Meanwhile, the memory 1006 in the computing device 1000B has stored therein instructions that execute the storage unit 204 and the processing unit 206.
The present application also provides a computer readable storage medium having stored thereon computer instructions for implementing the method described in the above method embodiments.
For example, the computer program, when executed by a computer, enables the computer to implement the method described in the method embodiments above.
Embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to implement the method described in the method embodiments above.
The present application presents various aspects, embodiments, or features in terms of a system that may include a number of devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc., and/or may not include all of the devices, components, modules, etc. discussed in connection with the figures. Furthermore, combinations of these schemes may also be used.
In addition, in the embodiments of the present application, words such as "exemplary," "for example," and the like are used to indicate an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion.
In the embodiments of the present application, "corresponding" and "corresponding" may sometimes be used in combination, and it should be noted that the meaning to be expressed is consistent when the distinction is not emphasized.
The network architecture and the service scenario described in the embodiments of the present application are for more clearly describing the technical solution of the embodiments of the present application, and do not constitute a limitation on the technical solution provided in the embodiments of the present application, and those skilled in the art can know that, with the evolution of the network architecture and the appearance of the new service scenario, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: including the case where a alone exists, both a and B together, and B alone, where a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of changes or substitutions within the technical scope of the present application, and such changes and substitutions are intended to be covered by the scope of the present application. The scope of the application is therefore intended to be subject to the scope of the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (15)
1. A method for generating a document structure tree, comprising:
acquiring a text, wherein the text comprises at least two text units, the at least two text units comprise a first text unit and a second text unit, and the first text unit is adjacent to the second text unit;
acquiring at least two pieces of semantic information, wherein the at least two pieces of semantic information comprise first semantic information and second semantic information, the first semantic information and the second semantic information are used for determining a hierarchical relationship between the first text unit and the second text unit, the first text unit corresponds to the first semantic information, and the second text unit corresponds to the second semantic information;
inputting the at least two semantic information into a neural network model, and reasoning to obtain a first document structure tree of the text.
2. The method according to claim 1, wherein the method further comprises:
acquiring first data, wherein the first data is determined by checking the first document structure tree by a user;
and updating the neural network model according to the first data.
3. The method according to claim 1, wherein the method further comprises:
and updating the first document structure tree according to the document structure template to obtain a second document structure tree of the text.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring second data, wherein the second data is determined by checking the second document structure tree by a user;
updating the neural network model according to the second data.
5. The method according to claim 2 or 4, characterized in that the method further comprises:
and storing the checked first document structure tree or the second document structure tree into a document template library.
6. The method according to any one of claims 1 to 5, wherein the text unit comprises at least one of:
statement, or paragraph.
7. A document structure tree generating apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a text, the text comprises at least two text units, the at least two text units comprise a first text unit and a second text unit, and the first text unit is adjacent to the second text unit;
the obtaining module is further configured to obtain at least two pieces of semantic information, where the at least two pieces of semantic information include first semantic information and second semantic information, the first semantic information and the second semantic information are used to determine a hierarchical relationship between the first text unit and the second text unit, the first text unit corresponds to the first semantic information, and the second text unit corresponds to the second semantic information;
and the processing module is used for inputting the at least two semantic information into the neural network model and obtaining a first document structure tree of the text by reasoning.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the verification module is used for acquiring first data, wherein the first data is determined by verifying the first document structure tree by a user;
The processing module is further configured to update the neural network model according to the first data.
9. The apparatus of claim 7, wherein the processing module is further configured to update the first document structure tree according to a document structure template to obtain a second document structure tree for the text.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the verification module is used for acquiring second data, wherein the second data is determined by verifying the second document structure tree by a user;
the processing module is further configured to update the neural network model according to the second data.
11. The apparatus according to claim 8 or 10, characterized in that the apparatus further comprises:
and the storage module is used for storing the checked first document structure tree or the second document structure tree into a document template library.
12. The apparatus according to any one of claims 7 to 11, wherein the text unit comprises at least one of:
statement, or paragraph.
13. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory;
The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any one of claims 1 to 6.
14. A computer readable storage medium storing instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 6.
15. A computer device, the computer device comprising a processor and a memory;
the memory is used for storing computer program instructions;
the processor is configured to invoke the computer program instructions in the memory to perform the method of any one of claims 1 to 6.
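Outside the claim language, the pipeline the claims describe can be illustrated with a small sketch. The Python below is a hypothetical illustration, not the patented implementation: it takes a list of text units (sentences or paragraphs, per claim 12), derives a hierarchy level per unit via a stand-in heuristic (where the claims use semantic information inferred by a neural network model), and assembles the units into a document structure tree.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the document structure tree."""
    text: str
    level: int
    children: list = field(default_factory=list)

def infer_level(unit: str) -> int:
    # Stand-in for the neural-network inference of claim 7: a real
    # system would predict each unit's hierarchy level from learned
    # semantic information rather than from markup prefixes.
    if unit.startswith("## "):
        return 2
    if unit.startswith("# "):
        return 1
    return 3  # body paragraph

def build_structure_tree(units):
    """Attach each unit under the nearest preceding unit of a higher level."""
    root = Node("ROOT", 0)
    stack = [root]  # path from root to the most recent node
    for unit in units:
        level = infer_level(unit)
        # Pop back to the closest ancestor with a strictly higher level.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(unit, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

For example, `build_structure_tree(["# Intro", "Some text.", "# Methods", "## Data", "Details."])` yields a root with two top-level children, with `## Data` nested under `# Methods`. The verification loop of claims 8 and 10 would then surface this tree to a user and feed corrections back as training data for the model.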
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211039454.0A CN117688123A (en) | 2022-08-29 | 2022-08-29 | Method and device for generating document structure tree |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211039454.0A CN117688123A (en) | 2022-08-29 | 2022-08-29 | Method and device for generating document structure tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117688123A true CN117688123A (en) | 2024-03-12 |
Family
ID=90126935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211039454.0A Pending CN117688123A (en) | 2022-08-29 | 2022-08-29 | Method and device for generating document structure tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117688123A (en) |
- 2022-08-29: CN application CN202211039454.0A filed (published as CN117688123A); status: active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444320B (en) | Text retrieval method and device, computer equipment and storage medium | |
US11106873B2 (en) | Context-based translation retrieval via multilingual space | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
Kashmira et al. | Generating entity relationship diagram from requirement specification based on nlp | |
US11604929B2 (en) | Guided text generation for task-oriented dialogue | |
US20240184829A1 (en) | Methods and systems for controlled modeling and optimization of a natural language database interface | |
CN118093834B (en) | AIGC large model-based language processing question-answering system and method | |
EP4364044A1 (en) | Automated troubleshooter | |
CN111651994B (en) | Information extraction method and device, electronic equipment and storage medium | |
US20240296603A1 (en) | Systems and methods for digital ink generation and editing | |
KR102682244B1 (en) | Method for learning machine-learning model with structured ESG data using ESG auxiliary tool and service server for generating automatically completed ESG documents with the machine-learning model | |
CN116258137A (en) | Text error correction method, device, equipment and storage medium | |
US20230014904A1 (en) | Searchable data structure for electronic documents | |
CN114997288A (en) | Design resource association method | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN117271558A (en) | Language query model construction method, query language acquisition method and related devices | |
CN111951079A (en) | Credit rating method and device based on knowledge graph and electronic equipment | |
CN118228694A (en) | Method and system for realizing industrial industry number intelligence based on artificial intelligence | |
US20240320444A1 (en) | User interface for ai-guided content generation | |
US11868313B1 (en) | Apparatus and method for generating an article | |
CN113705207A (en) | Grammar error recognition method and device | |
CN117556789A (en) | Student comment generation method based on multi-level semantic mining | |
WO2023198696A1 (en) | Method for extracting information from an unstructured data source | |
CN115309995A (en) | Scientific and technological resource pushing method and device based on demand text | |
CN117688123A (en) | Method and device for generating document structure tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||