WO2017107010A1 - Information analysis system and method based on event regression test - Google Patents

Information analysis system and method based on event regression test

Publication number
WO2017107010A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
module
natural language
text
database
Prior art date
Application number
PCT/CN2015/098086
Other languages
French (fr)
Chinese (zh)
Inventor
易峥
夏炜
陶志伟
潘杭平
Original Assignee
浙江核新同花顺网络信息股份有限公司
Priority date
Filing date
Publication date
Application filed by 浙江核新同花顺网络信息股份有限公司
Priority to PCT/CN2015/098086
Publication of WO2017107010A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the invention relates to an information analysis system and method, and in particular to automatically analyzing event-related information and the natural language sentences derived from an event, thereby obtaining historical backtesting information and the like.
  • One aspect of the present invention is directed to an information analysis system including, according to one embodiment, a computer-readable storage medium storing executable modules, the modules including a collection module, a processing module, a natural language processing module, and a backtest module;
  • the collection module is capable of collecting information;
  • the processing module is capable of pre-processing the collected information and extracting an event from the pre-processed information;
  • the natural language processing module is capable of generating a natural language statement according to the extracted event;
  • the backtest module is capable of generating a backtest result according to the generated natural language statement combined with the history information; and
  • a processor capable of executing the executable modules stored on the computer-readable storage medium.
  • the information analysis system further includes a database capable of storing the collected information, pre-processed information, extracted events, natural language statements, historical information, and backtest results.
  • the database includes an original information database, a text database, a text pre-processing database, an entity database, an event attribute database, a keyword database, a text classification database, a history information database, a natural language processing database, an event recognition database, a backtest module database, a text template database, and a dictionary database.
  • the processing module of the information analysis system further includes a format conversion module, a text processing module, an attribute extraction module, and an event recognition module.
  • the processing module further comprises a text classification module.
  • the methods adopted by the processing module include chi-square statistics, information gain, mutual information, odds ratio, cross entropy, inter-class information difference, keyword statistics, decision tree, Rocchio, Naïve Bayes, neural network, support vector machine, linear least squares fit, nearest neighbor algorithm, genetic algorithm, sentiment classification, maximum entropy, Generalized Instance Set, synonym configuration, Boolean association rules, position rules, and machine learning.
  • the natural language processing module can receive information from the collection module.
  • the backtesting module further performs backtesting information determination, in which an evaluation is given according to the backtest result.
  • the backtest result can be presented to the user.
  • Another aspect of the present invention relates to an information analysis method including collecting information, extracting an event based on the information, generating a natural language sentence according to the event, and performing backtesting analysis on the natural language sentence.
  • the collected information includes user input information and non-user input information, the sources of the non-user input information including a communication terminal and a server.
  • the collected information includes announcement information and news information.
  • the extraction event further comprises entity identification and attribute extraction.
  • the entity identification further comprises format conversion, text segmentation, number and unit normalization processing.
  • the attribute extraction can be implemented by a system defined model.
  • the natural language statement may be generated based on the extraction event.
  • the natural language statement can be generated based on user input information.
  • the natural language statement is further expanded according to an event category.
  • the natural language sentence analysis includes backtesting the natural language statement.
  • the natural language sentence backtesting may generate a backtest result according to the information category.
  • Figure 1 is a schematic diagram of an exemplary system configuration of an information analysis system
  • Figure 2 is a block diagram of the information analysis system
  • Figure 3 shows the flow chart of information analysis
  • Figure 4 is a schematic structural view of the collection module
  • Figure 5 is a schematic structural view of a processing module
  • FIG. 6 is a schematic structural diagram of a format conversion module
  • Figure 7 is a schematic structural diagram of a text preprocessing module
  • Figure 8 is a schematic structural diagram of a text classification module
  • FIG. 9 is a schematic structural diagram of an attribute extraction module
  • Figure 10 is a flow chart of the processing module
  • Figure 11 is a schematic structural diagram of a natural language processing module
  • Figure 12 is a schematic structural view of the backtest module
  • Figure 13 shows the backtest flow chart
  • Figure 14 is a schematic structural diagram of a system database
  • Figure 15 shows a flow chart of information analysis
  • Figure 16 shows the online working flow chart of the information analysis system
  • Figure 17 is a schematic diagram of an interactive interface of an information analysis system for news or announcements
  • Figure 18 is a schematic diagram of an interactive interface of the information analysis system for user input
  • Figure 19 shows the announcement text used by the information analysis system.
  • the information analysis method described in this specification refers to collecting information, processing information, generating natural language sentences, analyzing data to provide reference information, and the like.
  • an aspect of the invention relates to an information analysis system.
  • the information analysis system can include a collection module, a system database, a processing module, a natural language processing module, and a backtest module.
  • Another aspect of the invention relates to an information analysis method based on event backtesting.
  • the information analysis method may include collecting information, preprocessing the information and extracting the entities in the information, processing the information and extracting related attributes in the information, determining the information category and generating a refinement event, generating a corresponding natural language statement according to the generated refinement event, backtesting the generated natural language statement, generating a backtest report, and the like.
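  • The steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the patented implementation: the names `Event`, `preprocess`, `extract_event`, `to_statement`, `backtest`, and `run_pipeline` are hypothetical, the keyword rule stands in for whichever extraction method an embodiment uses, and `history` is a made-up mapping from a statement pattern to past price changes in basis points.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    entity: str    # e.g. the company named in the text
    category: str  # e.g. "major contract"


def preprocess(text: str) -> str:
    # Stand-in for format conversion and cleanup: collapse whitespace.
    return " ".join(text.split())


def extract_event(text: str) -> Optional[Event]:
    # Toy keyword rule (an assumption); the source does not fix one method.
    if "Major Contracts" in text:
        entity = text.split(" Announcement")[0]
        return Event(entity=entity, category="major contract")
    return None


def to_statement(event: Event) -> str:
    # Generate a natural language statement from the extracted event.
    return f"{event.entity} signs a {event.category}"


def backtest(statement: str, history: dict[str, list[int]]) -> dict:
    # Look up past outcomes whose pattern occurs in the statement and
    # summarize them; the history layout is hypothetical.
    for pattern, changes in history.items():
        if pattern in statement and changes:
            mean = sum(changes) / len(changes)
            return {"statement": statement, "samples": len(changes),
                    "mean_change_bp": mean}
    return {"statement": statement, "samples": 0, "mean_change_bp": None}


def run_pipeline(raw_texts: list[str], history: dict[str, list[int]]) -> list[dict]:
    # collect -> preprocess -> extract event -> statement -> backtest report
    reports = []
    for raw in raw_texts:
        event = extract_event(preprocess(raw))
        if event is not None:
            reports.append(backtest(to_statement(event), history))
    return reports
```

  • Texts from which no event can be extracted produce no report in this sketch; an actual embodiment may handle them differently.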
  • Another aspect of the invention relates to the fact that the user can select any announcement or news to perform real-time backtesting and generate a backtest report.
  • Another aspect of the invention relates to a user inputting any natural language statement and backtesting the input natural language statement to generate a backtest report in real time.
  • Different embodiments of the present invention are applicable to a variety of fields including, but not limited to, investments in finance and derivatives (including but not limited to stocks, bonds, gold, paper gold, silver, foreign exchange, precious metals, futures, money funds, etc.), technology (including but not limited to mathematics, physics, chemistry and chemical engineering, biology and bioengineering, electrical engineering, communication systems, internet, internet of things, etc.), politics (including but not limited to politicians, political events, countries), and news (by region, including but not limited to regional news, domestic news, international news; by subject, including but not limited to political news, sports news, science and technology news, economic news, life news, weather news, etc.).
  • various information resources such as text, pictures, audio, and video content
  • the backtesting strategy is generated according to relevant historical information
  • a backtest report is generated so that the user can more quickly and easily understand the possible future impact of the information.
  • Application scenarios of different embodiments of the present invention include, but are not limited to, one or more combinations of web pages, browser plug-ins, clients, customization systems, enterprise internal analysis systems, artificial intelligence robots, and the like. The above description of the applicable fields is merely a specific example and should not be considered as the only feasible implementation.
  • the backtest report is displayed to the user in a unified text form.
  • the backtest report may also be displayed to the user in a unified audio format or video format.
  • Alternatives or modifications or variations similar to this are still within the scope of the invention.
  • FIG. 1 is a schematic diagram of an exemplary system configuration of an information analysis system.
  • the example system configuration 100 can include, but is not limited to, one or more information analysis systems 101, one or more networks 102, and one or more information sources 103.
  • the information analysis system 101 can be used in a system for analyzing and processing the collected information to generate an analysis result.
  • the information analysis system 101 can be a server or a server group.
  • a server group can be centralized, such as a data center.
  • a server group can also be distributed, such as a distributed system.
  • the information analysis system 101 can be local or remote.
  • Network 102 can provide a conduit for information exchange.
  • Network 102 can be a single network or a combination of multiple networks.
  • Network 102 may include, but is not limited to, one or more combinations of a local area network, a wide area network, a public network, a private network, a wireless local area network, a virtual network, a metropolitan area network, a public switched telephone network, and the like.
  • Network 102 may include a variety of network access points, such as wired or wireless access points, base stations, or network switching points, through which the data sources connect to network 102 and transmit information over the network.
  • the information source 103 can provide various information.
  • the information source 103 can include, but is not limited to, a server, a communication terminal.
  • the server may be a web server, a file server, a database server, an FTP server, an application server, a proxy server, etc., or any combination of the above.
  • the communication terminal may be a mobile phone, a personal computer, a wearable device, a tablet computer, a smart TV, or the like, or any combination of the above communication terminals.
  • the information source 103 can send or/and collect information to the information analysis system 101 via the network 102.
  • the information source 103 can be information input by the user or information provided by other databases or information sources.
  • FIG. 2 shows a block diagram of the information analysis system.
  • Information analysis system 101 may include, but is not limited to, one or more collection modules 201, one or more processing modules 202, one or more natural language processing modules 203, one or more backtest modules 204, and one or more system databases 205. Some or all of the above modules may be connected to the network 102. The above modules can be centralized or distributed. One or more of the above modules may be local or remote.
  • the collection module 201 can be mainly used to collect the required information in various ways. The information may be collected directly (e.g., from the one or more information sources 103 via the network 102) or indirectly (e.g., through the processing module 202, the natural language processing module 203, the backtest module 204, or the system database 205).
  • the processing module 202 can be mainly used for pre-processing of information.
  • the pre-processing of information can be manual or automatic.
  • the pre-processing of information can include but is not limited to format conversion, word segmentation, entity identification, number and unit normalization.
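  • As an illustration of the number and unit normalization step, the sketch below rewrites figures such as "1.5 billion" as plain integers. It is only a minimal example under assumptions: the function name `normalize_numbers` and the English unit words are hypothetical, and the source does not specify which units or languages an embodiment normalizes.

```python
import re
from decimal import Decimal

# Hypothetical unit table; a real system would cover its own languages and units.
_MULTIPLIERS = {
    "thousand": 1_000,
    "million": 1_000_000,
    "billion": 1_000_000_000,
}

_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)\b", re.IGNORECASE
)


def normalize_numbers(text: str) -> str:
    """Replace '<number> <unit word>' with the expanded numeric value."""

    def expand(match: re.Match) -> str:
        # Decimal avoids binary floating-point rounding on values like 2.8.
        value = Decimal(match.group(1)) * _MULTIPLIERS[match.group(2).lower()]
        if value == value.to_integral_value():
            return str(int(value))
        return str(value.normalize())

    return _PATTERN.sub(expand, text)
```

  • Normalizing figures to one representation before entity identification lets later steps compare amounts without re-parsing unit words.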
  • the natural language processing module 203 can be mainly used to generate natural language sentences, and can also receive input natural language sentences.
  • the manner in which the natural language processing module 203 processes the information may be manual or automatic.
  • the backtest module 204 can be primarily used to analyze information.
  • the analysis method can include, but is not limited to, one or more combinations of system definitions, user-defined selections, machine learning, and the like.
  • System database 205 can be broadly referred to as a device having a storage function.
  • the system database 205 is primarily used to store data collected from the information source 103 and various data generated in the operation of the information analysis system 101.
  • System database 205 can be local or remote.
  • the connection or communication between the system database and other modules of the system can be wired or wireless.
  • the collection module 201 can transmit the collected information to the processing module 202.
  • the collection module 201 can also transmit the collected information to the natural language processing module 203.
  • the collection module 201 can also transmit the collected information to the backtest module 204.
  • the collection module 201 can receive the request sent by the processing module 202, and can also access the system database 205 according to the request to obtain the required data. After the required data is acquired, the collection module 201 can transmit the data to the processing module 202.
  • the collection module 201 can receive the request sent by the natural language processing module 203, and can also access the system database 205 according to the request to obtain the required data. After the required data is acquired, the collection module 201 can transmit the data to the natural language processing module 203.
  • the collection module 201 can receive the request sent by the backtest module 204, and can also access the system database 205 according to the request to obtain the required data. After the required data is acquired, the collection module 201 can transmit the data to the backtest module 204.
  • the collection module 201 can be used primarily to collect the required information in a variety of ways.
  • the collection module 201 can obtain the required information by sending a request to the information source 103. After the collection module 201 obtains the required information, the obtained information may be processed in the next step or stored in the system database 205.
  • the collection module 201 can also obtain information stored in the system database 205 by sending a request to the system database 205. Alternatively, system database 205 may also send a request directly to information source 103, which may be stored in system database 205.
  • the information source 103 can be a server, a communication terminal, or the like.
  • the server may be a web server, a file server, a database server, an FTP server, an application server, a proxy server, etc., or any combination of the above.
  • the communication terminal may be a mobile phone, a personal computer, a wearable device, a tablet computer, a smart TV, or the like, or various combinations of the above communication terminals.
  • the above required information may include, but is not limited to, one or more of various news, research reports, announcements, messages, reports, notices, essays, periodicals, and the like.
  • the information required above may be information about various industries including, but not limited to, one or more of sports, entertainment, economics, politics, military, culture, art, science, engineering, and the like.
  • the form of the above-mentioned required information may include, but is not limited to, one or more of text, picture, audio, video, and the like.
  • for example, the video news item "World Bank lowered the global economic growth forecast to 2.8% this year" broadcast on a video website,
  • a listed-company announcement issued by a stock exchange, "A Co., Ltd. Announcement on Signing Major Contracts for Daily Operations", and a football preview released by a live sports broadcast platform, "This Saturday Chelsea will face rival Arsenal at home at Stamford Bridge."
  • the processing module 202 can communicate bidirectionally with the collection module 201.
  • the processing module 202 can process the information transmitted by the collection module 201.
  • the information processing can include, but is not limited to, one or more combinations of format conversion, text preprocessing, text classification, attribute extraction, and event recognition.
  • the processing module 202 may also send information to the collection module 201, where the information may include, but is not limited to, processed information and control information. The control information may include, but is not limited to, control information for the information collection manner, the information collection time, the information collection sources, and the like.
  • the processing module 202 can communicate bi-directionally with the natural language processing module 203.
  • the processing module 202 can transmit the processed information to the natural language processing module 203, and can also receive the information sent by the natural language processing module 203.
  • the processing module 202 can communicate bidirectionally with the backtest module 204.
  • the processing module 202 may transmit the processed information to the backtest module 204, and may also receive the information sent by the backtest module 204.
  • Processing module 202 can communicate bi-directionally with system database 205.
  • the processing module 202 may transmit the processed information to the system database 205 for storage, or may send the request information to the system database 205 and receive the information sent by the system database 205 during the processing.
  • the natural language processing module 203 can send a request to the collection module 201, and the collection module 201 can access the system database 205 or one or more information sources 103 upon request to obtain the required information. After the required information is obtained, the collection module 201 transmits the information to the natural language processing module 203. Alternatively, after receiving the request sent from the natural language processing module 203, the collection module 201 may also transmit the information in the collection module 201 to the natural language processing module 203; this information may be from the information source 103 or the system database 205.
  • the natural language processing module 203 can send a request to the processing module 202, and the processing module 202 can access the system database 205 according to the request to obtain the required information.
  • the processing module 202 transmits the information to the natural language processing module 203.
  • the processing module 202 may also transmit the information in the processing module 202 to the natural language processing module 203 after receiving the request sent from the natural language processing module 203.
  • the natural language processing module 203 can directly access the system database 205 and send a request to the system database 205 to obtain the required information, which can be transmitted to the natural language processing module 203.
  • system database 205 can send information to natural language processing module 203 without receiving a request.
  • the natural language processing module 203 can directly receive a natural language statement (not shown) from the information source 103, which can be input by the user using an input device. The input device includes, but is not limited to, one or more combinations of a keyboard, a mouse, a camera, a scanner, a handwriting tablet, a voice input device, and the like.
  • the input information of the natural language processing module 203 may be one or more of letters, numbers, characters, words, phrases, sentences, paragraphs, chapters, etc., or may be any number of identifier sets, where a set of identifiers may carry one or more semantics.
  • the input information of the natural language processing module 203 may be a customized information type.
  • the input information of the natural language processing module 203 may be characterized as a multi-tuple.
  • the input information of the natural language processing module 203 can be characterized as a quadruple {k, c, u, d}.
  • the parameter k may be configured to represent a source of information, which may include, but is not limited to, a collection module 201, a processing module 202, a system database 205, an information source 103 (not shown), or any combination of the above sources of information.
  • the parameter c can be configured to represent the communication time. For example, the parameter c can be configured to represent the year, month, date, and the like. By giving the parameter c a specific value, the information of the specific time specified by the parameter c will be input to the natural language processing module 203.
  • the parameter u can be configured to represent the user model to be used.
  • the user model is a data processing model with different functions depending on different user needs.
  • the parameter u by default means that no data model is applied.
  • the parameter d can be configured to indicate that information has been generated.
  • the generated information refers to various entities and attributes that the user has generated during the natural language processing, and various entities and attributes will be used in the subsequent natural language processing.
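  • The quadruple described above can be modeled as a small record type. The sketch below is an assumption-laden illustration: the class name `NlpInput`, the field types, and the helper `select_by_time` are hypothetical, and only the meanings of k, c, u, and d come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class NlpInput:
    k: str                   # source of the information (e.g. a module or database)
    c: date                  # time associated with the information
    u: Optional[str] = None  # user model to apply; None means no data model is applied
    d: dict = field(default_factory=dict)  # entities and attributes already generated


def select_by_time(items: list[NlpInput], c: date) -> list[NlpInput]:
    # Giving c a specific value selects only the information for that time.
    return [item for item in items if item.c == c]
```

  • Representing u as an optional field mirrors the stated default that no data model is applied unless one is named.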
  • the natural language processing module 203 can process the collected information to generate natural language statements.
  • the generated natural language statement can be transmitted to the backtest module 204 for backtesting.
  • the natural language processing module 203 may send a backtesting request to the backtesting module 204. After the request is approved, the natural language processing module 203 inputs the generated natural language statement to the backtesting module 204 for back testing.
  • the natural language processing module 203 may also not send the backtesting request, but directly input the generated natural language statement to the backtesting module 204 for back testing.
  • after receiving the natural language statement input by the natural language processing module 203, the backtest module 204 further processes the natural language statement to generate a database standard access instruction, thereby accessing or retrieving the historical data stored in the corresponding database.
  • the natural language processing module 203 can receive the event generated by the processing module 202 and assemble it into a natural language statement, and can add additional conditions as needed for the backtest. For example, for individual stock events, the "stock code or abbreviation" needs to be added; for industry event backtesting, the "industry corresponding to the individual stock" needs to be added; for whole-market events (such as a central bank interest rate cut), no additional statement needs to be added for the backtest.
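  • The scope-dependent conditions in the example above can be sketched as follows. This is a hedged illustration only: the function name `assemble_statement`, the scope labels "stock", "industry", and "market", and the output phrasing are all assumptions, since the description does not define the statement format.

```python
from typing import Optional


def assemble_statement(
    event_text: str,
    scope: str,
    stock: Optional[str] = None,
    industry: Optional[str] = None,
) -> str:
    """Attach the extra condition that the event's scope calls for."""
    if scope == "stock":
        # Individual stock events need the stock code or abbreviation.
        if not stock:
            raise ValueError("a stock code or abbreviation is required")
        return f"{stock} {event_text}"
    if scope == "industry":
        # Industry events need the industry the stock belongs to.
        if not industry:
            raise ValueError("an industry is required")
        return f"{industry} industry: {event_text}"
    if scope == "market":
        # Whole-market events are backtested without additional conditions.
        return event_text
    raise ValueError(f"unknown scope: {scope}")
```

  • Raising an error on a missing condition is a design choice of this sketch; an embodiment could instead look the value up from the extracted entities.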
  • the input information to the natural language processing module 203 is only for the purpose of facilitating understanding of the invention and should not be considered as the only feasible embodiment of the present invention.
  • various modifications and changes may be made to the content of the required information without departing from the principle, but such modifications and changes remain within the scope of the invention.
  • the input information of the natural language processing module 203 can be characterized as a 2-tuple, a triple, a quintuple, a 6-tuple, an N-tuple, etc., or any combination of the above types of information.
  • the backtest module 204 can send a backtest condition request to the collection module 201, and the collection module 201 can access the system database 205 according to the request to obtain the required information. After the required information is obtained, the collection module 201 transmits the information to the backtest module 204. Alternatively, after receiving the request sent from the backtest module 204, the collection module 201 may also transmit the information stored in the collection module 201 to the backtest module 204.
  • the backtest module 204 can send a request to the processing module 202, and the processing module 202 can access the system database 205 to obtain the required information according to the request. After the required information is obtained, the processing module 202 can transmit the information to the backtest module 204.
  • the processing module 202 may also transmit the information stored in the processing module 202 to the backtest module 204.
  • the backtest module 204 can send a request to the natural language processing module 203, and the natural language processing module 203 can access the system database 205 to obtain the required information according to the request. After the required information is obtained, the natural language processing module 203 can transmit the information to the backtest module 204.
  • the natural language processing module 203 may also transmit the information stored in the natural language processing module 203 to the backtest module 204.
  • the backtest module 204 can directly access the system database 205 and send a request to the system database 205 to obtain the required information, which can be transmitted to the backtest module 204.
  • system database 205 can send information to backtest module 204 without receiving a request.
  • the input information received by the backtest module 204 may include, but is not limited to, one or more combinations of letters, numbers, characters, words, sentences, paragraphs, chapters, natural language statements, and the like.
  • Sources of input information may include, but are not limited to, one or more combinations of collection module 201, processing module 202, natural language processing module 203, system database 205, information source 103, and the like.
  • System database 205 or other storage devices within the system generally refer to all media that can have read/write capabilities.
  • the system database 205 or other storage devices in the system may be internal to the system or external devices of the system.
  • the connection manner of the system database 205 or other storage devices in the system may be wired or wireless.
  • System database 205 or other storage devices within the system may include, but are not limited to, one or more combinations of hierarchical databases, networked databases, and relational databases.
  • the system database 205 or other storage devices within the system may digitize the information and store it in a storage device that utilizes electrical, magnetic or optical means.
  • System database 205 or other storage devices within the system can be used to store various information such as programs and data.
  • the system database 205 or other storage devices in the system may be devices that store information by means of electrical energy, such as various memories, random access memory (RAM), read only memory (ROM), and the like.
  • the system database 205 or other storage devices within the system may be devices that store information using magnetic energy, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, USB flash drives, flash memories, and the like.
  • System database 205 or other storage devices within the system may be devices that optically store information, such as CDs or DVDs.
  • the system database 205 or other storage devices within the system may be devices that store information using magneto-optical means, such as magneto-optical disks.
  • the access method of the system database 205 or other storage devices in the system may be one or more combinations of random storage, serial access storage, read-only storage, and the like.
  • the system database 205 or other storage devices within the system may be non-persistent memory or permanent memory.
  • the storage devices mentioned above are only a few examples, and the storage devices that the system can use are not limited thereto.
  • the system database 205 or other storage devices in the system may be local, remote, or on a cloud server.
  • System database 205 or other storage devices within the system can communicate or exchange information with information source 103.
  • System database 205 or other storage devices within the system may receive information from information source 103 and store it in system database 205 or other storage devices within the system.
  • Information in the system database 205 or other storage devices within the system can be extracted and passed to the information source 103.
  • the instruction may be directly from the information source 103, or may be from other modules, such as the collection module 201, the processing module 202, the natural language processing module 203, the backtest module 204, and the like.
  • System database 205 or other storage devices within the system can communicate or exchange information with collection module 201.
  • the system database 205 or other storage devices within the system may receive the information collected by the collection module 201 and store it in the system database 205 or other storage devices within the system. Based on the received instructions, the information in the system database 205 or other storage devices in the system can be extracted and passed to the collection module 201.
  • the instruction may be directly from the collection module 201, or may be from other modules, such as the processing module 202, the natural language processing module 203, the backtest module 204, the information source 103, and the like.
  • System database 205 or other storage devices within the system can communicate or exchange information with processing module 202.
  • the system database 205 or other storage devices within the system can receive the information collected by the processing module 202 and store it in the system database 205 or other storage devices within the system. Based on the received instructions, information in the system database 205 or other storage devices within the system can be extracted and passed to the processing module 202.
  • the instruction may be directly from the processing module 202, or may be from other modules, such as the collection module 201, the natural language processing module 203, the backtest module 204, the information source 103, and the like.
  • System database 205 or other storage devices within the system can communicate or exchange information with natural language processing module 203.
  • the system database 205 or other storage devices within the system can receive the information collected by the natural language processing module 203 and store it in the system database 205 or other storage devices within the system. Based on the received instructions, information in the system database 205 or other storage devices within the system can be extracted and passed to the natural language processing module 203.
  • the instruction may be directly from the natural language processing module 203, or may be from other modules, such as the collection module 201, the processing module 202, the backtest module 204, the information source 103, and the like.
  • the information sent by the system database 205 may be information obtained directly from the information source, or may be processed and analyzed.
  • the information processed by the analysis may be information stored in the system database 205 after being processed by the processing module 202, or may be information stored after being processed by the natural language processing module 203.
  • The way of transferring information between the system database 205 or other storage devices and the other modules may be wired or wireless, direct or indirect, simultaneous or sequential, periodic or aperiodic, and so on.
  • The collection module 201, the processing module 202, the natural language processing module 203, the backtest module 204, and the system database 205 may be different modules embodied in one system, or one module may implement the functions of two or more of the above modules.
  • For example, the processing module 202 can collect information and generate natural language statements; that is, the processing module implements the functions of the collection module 201 and the natural language processing module 203 at the same time. Similar modifications are still within the scope of the claims of the present invention.
  • FIG. 3 shows a flow chart of information analysis.
  • the required information is collected from the information source 103 (see Figure 1) in step 301.
  • the information source 103 can include, but is not limited to, a server, a communication terminal.
  • the server may be a web server, a file server, a database server, an FTP server, an application server, a proxy server, etc., or any combination of the above.
  • the communication terminal may be a mobile phone, a personal computer, a wearable device, a tablet computer, a smart TV, or the like, or any combination of the above communication terminals.
  • natural language sentences input by the user through various communication terminals can be received.
  • the above required information may include, but is not limited to, one or more of various news, announcements, comments, research reports, blogs, messages, reports, notices, essays, journals, and the like.
  • the information required above may be information about various industries including, but not limited to, one or more of sports, entertainment, economics, politics, military, culture, art, science, engineering, and the like.
  • the form of the above-mentioned required information may include, but is not limited to, one or more of text, picture, audio, video, and the like.
  • For example, the news can be a video news item broadcast on a video website, "The World Bank cuts its global economic growth forecast to 2.8% this year"; a news report published by a news website, "HSBC China service industry PMI rose to 53.5 in May"; an announcement of a listed company issued by a stock exchange, "A Company's Announcement on Signing Major Contracts for Daily Operations"; or a football event preview released by a sports event live broadcast platform, "This Saturday Chelsea Club will play at home at Stamford Bridge against rivals Arsenal".
  • Step 301 can be completed by the collection module 201.
  • the information collected in step 301 is processed in step 302.
  • Step 302 can be completed by processing module 202.
  • the information collected in step 301 may be textual information.
  • the textual information may be derived directly or indirectly from text, audio, video or any combination of the above sources.
  • When the text information is derived from audio, the system can convert the audio into text by speech recognition or subtitle extraction.
  • When the text information is derived from video, the system can convert the video into text by speech recognition or subtitle file extraction.
  • the text information can be Chinese, English, German, Spanish, Arabic, French, Japanese, Korean, Russian, Portuguese, etc., or any combination of the above.
  • The text information may be one or more of letters, numbers, characters, words, phrases, sentences, paragraphs, chapters, etc., or a set of any number of identifiers, and the set of identifiers may include one or more semantics.
  • the information processing performed by step 302 may include, but is not limited to, one or more of format conversion, word segmentation processing, entity recognition, number and unit normalization processing, text classification, event attribute extraction, refinement event recognition, and the like.
  • Format conversion converts text information in various formats into a uniform text format.
  • the format of the text information may include, but is not limited to, pdf, doc, epub, mobi, caj, kdh, nh, etc., or one or more of the above formats.
  • the unified text format may include, but is not limited to, one or more combinations of txt, ASCII, MIME, and the like.
  • Word segmentation can extract words in text information according to word type.
  • Word types can include but are not limited to nouns, verbs, adjectives, adverbs, auxiliary words, onomatopoeia, numbers, proprietary symbols, etc., or one or more of them.
  • the text information can also be processed using a certain word segmentation algorithm.
  • The word segmentation algorithm may include, but is not limited to, a string-matching-based (dictionary-based) word segmentation method (i.e., the mechanical segmentation method), an understanding-based word segmentation method, a statistics-based word segmentation method, or the like, or one or more of the above word segmentation methods.
  • An entity may include, but is not limited to, one or more of a product, an institution name, a person's name, a place name, a time, a date, a currency, a number, a percentage, and the like.
  • Entity recognition methods may include, but are not limited to, hidden Markov models, maximum entropy models, support vector machines, rule-based recognition methods, statistics-based recognition methods, etc., or one or more of them.
  • the system can summarize the elements in the past information and define various event categories. For example: diplomacy, finance, sports, politics, science, education, etc., or one or more of them.
  • the above categories may also include sub-categories of several levels, such as financial classes, which may include sub-categories such as stocks, funds, and futures.
  • After entity identification is completed, the system can normalize the numbers and units in the text information identified by the entity. For example, the amount in "project total investment of 30,000 yuan" is converted into a standard numeric form with a consistent unit, and "Messi completed a hat trick in Barcelona's home match against Real Madrid" is converted into "Messi scored 3 goals in Barcelona's home match against Real Madrid", and so on.
  • the system will classify the text information to obtain a large category of text information (such as financial class).
  • The system can access the system database 205 (see FIG. 2 for details) and classify the text by applying a certain calculation method to attribute features, such as the number of category keywords appearing in the text or the preset weights stored in the database 205.
  • The category keywords may be extracted by a specific method, and the extraction methods may include, but are not limited to, a combination of one or more of statistics-based chi-square statistics, synonym rules, Boolean association rules, position rules, information gain, mutual information, odds ratio, cross entropy, inter-class information difference, and the like.
  • The system may employ a machine-learning-based text classification method including, but not limited to, decision trees, Rocchio, Naïve Bayes, neural networks, support vector machines, linear least squares fitting, the nearest neighbor algorithm kNN, genetic algorithms, maximum entropy, etc., or one or more of them. That is, a classifier is obtained by training on text labeled with category labels, and the classifier is then used to perform classification, such as sentiment classification, on new text objects.
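  • As an illustration of training a classifier on labeled text and applying it to new text, the following is a minimal sketch of a multinomial Naïve Bayes classifier with Laplace smoothing. The function names, the tiny training set, and the two category labels are assumptions made for the example and do not come from the described system.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label.

    labeled_docs: iterable of (list_of_tokens, category_label) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest log posterior for the tokens."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, already segmented training texts with category labels.
training = [
    (["annual", "report", "net", "profit"], "announcement"),
    (["contract", "signing", "announcement"], "announcement"),
    (["match", "goal", "club"], "news"),
    (["film", "festival", "award"], "news"),
]
model = train_naive_bayes(training)
print(classify(["net", "profit", "report"], model))
```

  • Any of the other listed methods (for example, support vector machines or kNN) could be substituted; Naïve Bayes is used here only because it fits in a few lines.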
  • Text information can be classified into one or more categories.
  • a category can be a predefined category of the system and can contain several levels of subclasses. For example, in the financial field, text information can be divided into but not limited to announcements, news, research, blogs, forums, microblogs, and interactive investment.
  • Announcement classes can include, but are not limited to, subclasses such as contract, annual, and emotional.
  • the announcement class may include, but is not limited to, periodic reports and equity distribution announcements, transaction announcements, fundraising announcements, major events, policy preferential announcements, executive change announcements, acquisition repurchase announcements, and the like.
  • The system can divide news into reliable sources and unreliable sources according to the news source; for example, the official information source CCTV financial channel can be considered a reliable source of information. News types can include, but are not limited to, one or more combinations of financial, political, science and education, political and legal, social, sports, military, and entertainment classes.
  • the above description of the classification is merely for the purpose of facilitating the understanding of the invention and should not be construed as the only embodiment of the invention.
  • the classification step is not essential in the present invention. For some text information, the system can directly determine its category, and thus the classification step can be skipped. For example, if the title of a certain information is displayed as "A Company's Announcement on Signing a Major Operation Contract for Daily Operations," the system can directly determine that it is an announcement type.
  • The attributes in the text information can be extracted. Attributes are descriptions of the nature and relationships of entities. For example, Figure 19 shows the cover and excerpts of the announcement of the CITIC Securities Co., Ltd. Annual Report 2014. For this announcement, the extracted entity can be "CITIC Securities", and the attributes that can be extracted from the table in Figure 19 include operating income, net profit increase and decrease over the same period, total assets, total liabilities, total shareholders' equity, etc., or one or more of them.
  • the system can combine the entity and the attribute according to a certain rule method to generate a refinement event.
  • In step 303, the system may generate a natural language statement according to the obtained refinement event, such as "the net profit growth of the CITIC Securities annual report is greater than 100%", or may generate natural language statements for individual stock events, industry events, and full market events, such as "CITIC Securities annual report net profit growth rate greater than 100%, brokerage industry annual report net profit growth rate greater than 100%, annual report net profit growth rate greater than 100%."
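  • The combination of an entity and an attribute into a refinement event, and the rendering of that event as a natural language statement, can be sketched as follows. The `RefinementEvent` class, the threshold rule, and the numeric value passed in are all hypothetical; the actual rules are not specified here.

```python
from dataclasses import dataclass


@dataclass
class RefinementEvent:
    entity: str
    attribute: str
    comparator: str
    threshold: float
    unit: str


def build_event(entity, attribute, value, threshold, unit="%"):
    """Assumed rule: emit an event only when the value exceeds the threshold."""
    if value > threshold:
        return RefinementEvent(entity, attribute, "greater than", threshold, unit)
    return None


def to_statement(event):
    """Render the event as a short natural language statement."""
    return (
        f"{event.entity} {event.attribute} "
        f"{event.comparator} {event.threshold:g}{event.unit}"
    )


# The value 120.0 is illustrative only and is not taken from the annual report.
event = build_event(
    "CITIC Securities", "annual report net profit growth rate", 120.0, 100
)
print(to_statement(event))
```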
  • Step 303 can be completed by natural language processing module 203.
  • The generated natural language statement can be input into an analysis system (e.g., the backtest module 204) to analyze the identified event (step 304).
  • step 304 may be performed by the backtest module 204.
  • the above analysis may include, but is not limited to, backtesting the event. Backtesting refers to combining events and related historical events and data according to certain rules to generate a backtest report for users to refer to when investing.
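  • One possible reading of the backtesting described above is to look up past occurrences of the same event and summarize what the related data did afterwards. The sketch below assumes a daily price list, event occurrences given as indexes into that list, and a fixed holding period; these inputs, the function name, and the reported fields are assumptions, since the combining rules are not specified.

```python
from statistics import mean


def backtest(event_days, prices, holding_days):
    """Summarize returns over holding_days after each historical event.

    event_days: indexes into prices at which the event occurred.
    prices: list of daily prices (hypothetical data).
    """
    returns = []
    for day in event_days:
        exit_day = day + holding_days
        if exit_day >= len(prices):
            continue  # not enough history after this occurrence
        returns.append(prices[exit_day] / prices[day] - 1.0)
    if not returns:
        return {"samples": 0, "average_return": None, "win_rate": None}
    return {
        "samples": len(returns),
        "average_return": mean(returns),
        "win_rate": sum(r > 0 for r in returns) / len(returns),
    }


# Made-up prices used only to exercise the function.
prices = [10.0, 10.5, 11.0, 10.0, 9.0, 9.9, 12.0]
report = backtest(event_days=[0, 3, 6], prices=prices, holding_days=2)
print(report)
```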
  • the system can also convert the collected information directly into a natural language statement (step 303) and then analyze the natural language statement (step 304). Alternatively, the system may also directly analyze the collected information (step 304).
  • FIG. 4 is a schematic structural view of the collection module 201.
  • The collection module 201 can include, but is not limited to, one collection unit 401, one processing unit 402, and one storage unit 403.
  • the collection unit 401 can collect the required information from the information source 103 (see FIG. 2), or other modules in the system (eg, the processing module 202, the natural language processing module 203, the backtest module 204, the system database 205).
  • the above required information may include, but is not limited to, one or more of various news, announcements, comments, research reports, blogs, messages, reports, notices, essays, journals, and the like.
  • the information required above may be information about various industries including, but not limited to, one or more of sports, entertainment, economics, politics, military, culture, art, science, engineering, and the like.
  • the form of the above-mentioned required information may include, but is not limited to, one or more of text, pictures, audio, video, etc.
  • For example, the news may be a video news item broadcast by a video website, "The World Bank lowered its global economic growth forecast to 2.8% this year."
  • the collection unit 401 may directly receive information input by the user, and the information may include, but is not limited to, a natural language sentence, a program language, and the like.
  • Processing unit 402 can process the collected information.
  • The processing may include, but is not limited to: storing the collected information in the storage unit 403; storing the collected information in the system database 205; retrieving information from the storage unit 403 and transmitting it to other modules (e.g., the processing module 202, the natural language processing module 203, the backtest module 204, the system database 205); and retrieving information from the system database 205 and transmitting it to other modules (e.g., the processing module 202, the natural language processing module 203, and the backtest module 204).
  • The processing unit 402 can also directly send the collected information to other modules, such as the processing module 202, the natural language processing module 203, the backtest module 204, and the system database 205.
  • the storage unit 403 can store the information collected by the collection module 201.
  • the storage unit 403 can store information processed by the processing unit 402.
  • FIG. 5 is a schematic diagram showing the structure of the processing module 202.
  • the processing module 202 can include, but is not limited to, a format conversion module 501, a text preprocessing module 502, a text classification module 503, an attribute extraction module 504, and an event recognition module 505. Each of the above modules may be independent, or some modules may be combined into one module.
  • the format conversion module 501 can perform format conversion on the information collected by the collection module 201. Format conversion can be done automatically by the system or manually. Format conversion can be done in real time or at regular intervals.
  • File formats for information that can be converted include, but are not limited to, pdf, doc, docx, epub, mobi, caj, kdh, nh, bmp, jpg, tiff, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, mpeg, avi, mov, asf, wmv, navi, 3gp, RA, RAM, mkv, flv, rmvb, WebM, and the like.
  • For example, if the collected information is a picture in jpg format and the picture contains text information, the picture can be converted into a text format, such as the txt format, by OCR (Optical Character Recognition).
  • the text pre-processing module 502 can pre-process the format-converted text, and the pre-processing can include, but is not limited to, one or more of text segmentation, entity recognition, normalization processing, and the like.
  • the text classification module 503 can classify the pre-processed text, and the pre-processed text can be divided into, but not limited to, an announcement type, a news category, a research report class, a blog class, a forum class, a microblog class, and an interactive investment class.
  • Announcement classes may include, but are not limited to, contract classes, annual report classes, and the like.
  • the announcement category may include, but is not limited to, a reorganization announcement, an equity incentive announcement, a major contract announcement, a policy offer announcement, an executive change announcement, an acquisition repurchase announcement, and the like.
  • The system can divide news into reliable sources and unreliable sources based on the news source; for example, the official information source CCTV financial channel can be considered a reliable source of information. News types can include, but are not limited to, one or more combinations of financial, political, science and education, political and legal, social, sports, military, entertainment, etc.
  • The attribute extraction module 504 can automatically match and extract the attributes related to an event, and the extraction rules can be configured by the system or manually.
  • the event identification module 505 can derive the final refinement event based on the results of the attribute extraction module 504, the results of the text pre-processing module 502, the system database 205, and certain rules.
  • FIG. 6 is a schematic structural diagram of the format conversion module 501.
  • the format conversion module 501 can include, but is not limited to, a control unit 601, a text processing unit 602, a picture processing unit 603, an audio processing unit 604, and a video processing unit 605.
  • the control unit 601 can select a corresponding processing unit according to the information collected by the collection module 201; the text processing unit 602 can process the information in the text format collected by the collection module 201.
  • the picture processing unit 603 can process the information of the picture format collected by the collection module 201.
  • the audio processing unit 604 can process the information of the audio format collected by the collection module 201.
  • the video processing unit 605 can process the information in the video format collected by the collection module 201.
  • the above units in the embodiments of the present specification may be independently distributed, but in some embodiments, the above partial units may be combined into one unit, for example, the audio processing unit 604 may be combined with the video processing unit 605 to form an audio and video processing. Unit to achieve the functions of both.
  • The control unit 601 can perform type determination on the information collected by the collection module 201 and select a corresponding processing unit according to the type. For example, after the control unit 601 examines the information collected by the collection module 201 and determines that it is information in a text format, the text processing unit 602 is selected to perform the next step of processing.
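  • The type determination performed by the control unit 601 can be sketched as a lookup from file type to processing unit. Determining the type from the file extension is an assumption made for this example (the text does not say how the type is determined), and the extension sets below are a small sample of the formats listed in this description.

```python
from pathlib import Path

# Sample of the formats named in the description, grouped by processing unit.
UNITS_BY_EXTENSION = {}
for unit, extensions in {
    "text processing unit 602": {".txt", ".html", ".xhtml", ".xml", ".pdf", ".doc", ".docx"},
    "picture processing unit 603": {".jpg", ".bmp", ".tiff", ".gif"},
    "audio processing unit 604": {".mp3", ".wav", ".wma", ".aac"},
    "video processing unit 605": {".avi", ".wmv", ".mkv", ".flv", ".mov"},
}.items():
    for extension in extensions:
        UNITS_BY_EXTENSION[extension] = unit


def select_processing_unit(filename):
    """Return the name of the unit that should handle the file."""
    extension = Path(filename).suffix.lower()
    try:
        return UNITS_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"no processing unit for file type {extension!r}") from None


print(select_processing_unit("annual_report.PDF"))
print(select_processing_unit("news_broadcast.mkv"))
```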
  • the text processing unit 602 can process the information in the text format collected by the collection module 201 and convert the text data into a unified format.
  • The text format in the information collected by the collection module 201 may include, but is not limited to, one or more of Hypertext Markup Language (html), Extensible Hypertext Markup Language (xhtml), Extensible Markup Language (xml), pdf format (Portable Document Format), and doc and docx formats (proprietary formats of Microsoft Corporation).
  • The text processing unit 602 can convert the above formats into a unified text format, and the unified text format can include, but is not limited to, the txt format.
  • For example, if the information collected by the collection module 201 is the "2014 Annual Report of CITIC Securities Co., Ltd." shown in FIG. 19, the announcement is in pdf format, and the text processing unit 602 can convert the announcement from pdf format to txt format.
  • the picture processing unit 603 can process the information of the picture format collected by the collection module 201 and convert it into a unified text format.
  • the information in the picture format collected by the collection module 201 may be a book, a newspaper, a magazine, a letter, or the like.
  • The picture processing unit 603 may use OCR (Optical Character Recognition) technology to convert the picture information into a unified text format.
  • the audio processing unit 604 can process the data in the audio format in the information collected by the collection module 201 and convert it into a unified text format.
  • the audio format in the information collected by the collection module 201 may include, but is not limited to, one or more of CD, WAVE, AIFF, AU, MPEG, MP3, MIDI, WMA, RealAudio, VQF, OggVorbis, AAC, APE, and the like.
  • The audio processing unit 604 can convert it to a text format using speech recognition technology. Speech recognition techniques may include, but are not limited to, methods based on vocal tract models and speech knowledge, pattern matching methods, methods using artificial neural networks, or any combination of the above.
  • the video processing unit 605 can process the data in the video format in the information collected by the collection module 201 and convert it into a unified text format.
  • The video format in the information collected by the collection module 201 may include, but is not limited to, one or more combinations of Flash Video, AVI, WMV, MPEG, Matroska, Real Video, QuickTime File Format, Ogg, MOD, and the like.
  • The video processing unit 605 can export the subtitle portion of the video as text and convert it into a unified text format; the subtitles include subtitles built into the video and external subtitles.
  • the video processing unit 605 can also extract the audio portion of the video and perform speech recognition to convert it into a uniform text format.
  • If the video has no subtitles, the video processing unit 605 can extract the audio portion of the video for speech recognition and convert it into a unified text format. If the video carries subtitles, the subtitles can be exported and converted into a unified text format, or the audio portion can be extracted for speech recognition and converted into a unified text format.
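  • The choice between subtitle export and speech recognition can be sketched as below. The function signature, the `prefer_subtitles` flag, and the stand-in recognizer are hypothetical; a real speech recognition engine would be supplied in place of `fake_recognizer`.

```python
def video_to_text(subtitles, audio, recognize_speech, prefer_subtitles=True):
    """Convert a video to plain text.

    subtitles: list of subtitle lines, or None if the video has none.
    audio: the extracted audio portion (opaque to this function).
    recognize_speech: callable that turns audio into text.
    """
    if subtitles and prefer_subtitles:
        # Export the subtitle text, dropping blank lines.
        return "\n".join(line.strip() for line in subtitles if line.strip())
    # No subtitles (or audio preferred): fall back to speech recognition.
    return recognize_speech(audio)


def fake_recognizer(audio):
    """Stand-in for a speech recognition engine, for illustration only."""
    return f"<transcript of {audio}>"


print(video_to_text([" Hello ", "", "World"], "clip.wav", fake_recognizer))
print(video_to_text(None, "clip.wav", fake_recognizer))
```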
  • In some embodiments, the format conversion module 501 need not include all four processing units; it may include only one of the units or only some of them. In other embodiments, all four processing units are included, and the processing units may be executed sequentially, simultaneously, or in any suitable order.
  • the format conversion module 501 converts the information collected by the collection module 201 into a unified text format, and the text preprocessing module 502 performs subsequent processing on the information in the text format.
  • FIG. 7 is a schematic structural diagram of a text preprocessing module 502.
  • the text pre-processing module 502 can include, but is not limited to, a language recognition unit 701, a text segmentation unit 702, an entity identification unit 703, and a normalization unit 704.
  • the language recognition unit 701 can perform language recognition on the text information processed by the format conversion module 501.
  • the text segmentation unit 702 can perform word segmentation on the text.
  • the entity identification unit 703 can identify an entity in the text.
  • the normalization unit 704 can perform unified normalization processing on the content containing the digital information in the text and its corresponding units to form a standard digital data format.
  • the above units may be independent or may be combined into one unit.
  • For example, the language recognition unit 701 can be combined with the text segmentation unit 702 into one unit.
  • the language recognition unit 701 can perform language recognition on the text processed by the format conversion module 501.
  • The language of the information collected by the collection module 201 may include, but is not limited to, one or more of Chinese, English, French, Russian, Spanish, Arabic, Japanese, German, etc., and the language recognition unit 701 may identify the language used in the information collected by the collection module 201.
  • the text segmentation unit 702 can perform word segmentation processing on the text recognized by the language recognition unit 701 by using a certain word segmentation algorithm.
  • The languages of the information collected by the collection module 201 include languages whose basic unit is the word, such as English, French, Russian, etc., in which there is a natural separation between words; they also include languages whose basic unit is the character, in which words are composed of characters and there is no natural separation between words, such as Chinese. Therefore, before word frequency statistics are computed for Chinese text, the Chinese text must first be processed by word segmentation, while this is not required for English text.
  • The word segmentation algorithm may include, but is not limited to, a string-matching-based (dictionary-based) word segmentation method (i.e., the mechanical segmentation method), an understanding-based word segmentation method, a statistics-based word segmentation method, or the like, or any combination of the above word segmentation methods.
  • One embodiment of the present invention is a method combining a statistical-based word segmentation method and a dictionary-based word segmentation method.
  • the text segmentation unit 702 can send a request to the system database 205 to access the dictionary database, and when the system database 205 receives the request, the requested dictionary can be sent to the text segmentation unit 702.
  • the dictionary may be a dictionary for a specific domain, and may be, for example, a dictionary for an announcement or a dictionary for news.
  • the dictionary may be a dictionary for reorganization announcements, a dictionary for incentive announcements, a dictionary for major contract announcements, a dictionary for policy offer announcements, a dictionary for executive change announcements, a dictionary for purchase repurchase announcements, and the like.
  • the text segmentation unit 702 can obtain the final text segmentation result by combining the statistically obtained word segmentation result with the word segmentation result obtained by the dictionary matching.
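  • The dictionary-matching part of this approach can be illustrated with forward maximum matching, a common form of mechanical segmentation. This is a minimal sketch: the small dictionary is hypothetical, unknown characters fall back to single-character tokens, and the statistics-based part and the merging of the two results are not shown.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text by repeatedly taking the longest dictionary word.

    Characters that start no dictionary word become one-character tokens.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in dictionary), default=1)
    max_word_len = max(1, max_word_len)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical entries from a dictionary for announcements.
dictionary = {"中信", "证券", "中信证券", "年度", "报告", "年度报告"}
print(forward_max_match("中信证券年度报告", dictionary))
```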
  • the entity identification unit 703 can perform entity identification on the word-processed text by the entity identification method, and can store the identified entity set in the entity database in the system database 205.
  • The entity may include, but is not limited to, one or more of a product, an organization name, a person's name, a place name, a time, a date, a currency, a number, a percentage, etc. For example, in the title information "CITIC Securities Co., Ltd. 2014 Annual Report", the entities that can be identified are "CITIC Securities", "Company", "2014", and "Annual Report".
  • The entity identification method may include, but is not limited to, a hidden Markov model, a maximum entropy model, a support vector machine, a Boolean association rule, a synonym-based configuration rule, a location-rule-based recognition method, a statistics-based recognition method, or the like, or any combination of the above identification methods.
  • The normalization unit 704 can normalize the numbers in the text and their units, so that they have a consistent unit. For example, the normalization unit 704 can convert "five percent probability of rising" appearing in the text into "the probability of rising is 5%", and can convert "total investment of 30,000 yuan" into "the total investment of the project is 30,000 yuan", and so on.
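  • A minimal sketch of number and unit normalization on English text follows. The word lists, the regular expressions, and the choice to rewrite number words only when they precede "percent", "thousand", or "million" are assumptions for illustration; the sketch does not reorder the sentence, and real normalization rules for Chinese text would differ.

```python
import re

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
MAGNITUDES = {"thousand": 1_000, "million": 1_000_000}

# A number word is rewritten only when a unit word follows it.
_WORD_PATTERN = re.compile(
    r"\b(" + "|".join(NUMBER_WORDS) + r")(?=\s+(?:percent|thousand|million)\b)",
    re.IGNORECASE,
)
_PERCENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*percent\b", re.IGNORECASE)
_MAGNITUDE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s+(thousand|million)\b", re.IGNORECASE
)


def normalize(text):
    """Rewrite number words, percentages, and magnitudes in a standard form."""
    text = _WORD_PATTERN.sub(lambda m: str(NUMBER_WORDS[m.group(1).lower()]), text)
    text = _PERCENT_PATTERN.sub(r"\1%", text)

    def expand(match):
        value = float(match.group(1)) * MAGNITUDES[match.group(2).lower()]
        return f"{value:,.0f}"

    return _MAGNITUDE_PATTERN.sub(expand, text)


print(normalize("five percent probability of rising"))
print(normalize("total investment of 30 thousand yuan"))
```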
  • The execution order of the four processing units included in the text preprocessing module 502 may be, in order, the language recognition unit 701, the text segmentation unit 702, the entity identification unit 703, and the normalization unit 704.
  • the execution order of the processing units in the text preprocessing module 502 may also be first performed by the language recognition unit 701, and it is determined whether the text segmentation unit 702 is executed according to the recognition result of the language recognition unit 701.
  • If the recognition result is Chinese text, the text segmentation unit 702 is executed.
  • If the recognition result is another language having a fixed separator, such as English, Korean, Russian, etc., the text segmentation unit 702 may not be executed.
  • the order of execution of the subsequent entity identification unit 703 and the normalization unit 704 may be sequential, may be reversed, or may be performed simultaneously.
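  • The conditional execution order described above can be sketched as follows. The language check is a deliberately crude heuristic chosen for the example (any CJK ideograph is treated as Chinese, which would also match Japanese kanji), and the segmenter is passed in as a parameter; neither is taken from the described system.

```python
def detect_language(text):
    """Crude assumption: any CJK unified ideograph means Chinese text."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "other"


def preprocess(text, segment_chinese):
    """Run language recognition first, then segment only when needed."""
    language = detect_language(text)
    if language == "zh":
        tokens = segment_chinese(text)  # no natural word separators
    else:
        tokens = text.split()  # whitespace already separates words
    return language, tokens


# Using list() as a placeholder segmenter that splits into single characters.
print(preprocess("Chelsea plays Arsenal", list))
print(preprocess("中信证券", list))
```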
  • the text pre-processing module 502 can pre-process the text information processed by the format conversion module 501, and the text classification module 503 can perform subsequent processing on the pre-processed text.
  • FIG. 8 is a schematic diagram showing the structure of the text classification module 503.
  • the text classification module 503 can include, but is not limited to, one or more keyword extraction units 801, one or more classification units 802.
  • the keyword extracting unit 801 can perform keyword extraction on the text processed by the text preprocessing module 502.
  • Classification unit 802 can classify the extracted keywords according to predefined rules.
  • the above units may be independent or may be combined into one unit.
  • the keyword extracting unit 801 can analyze the text processed by the text preprocessing module 502 and extract keywords.
  • The extraction method of the keyword may include, but is not limited to, statistics-based chi-square statistics, synonym rules, Boolean association rules, position rules, information gain, mutual information, probability ratio, cross entropy, inter-class information difference, or the like, or any combination of the above. Specifically, for the announcement "2014 Annual Report of CITIC Securities Co., Ltd.", the keyword extracting unit 801 first performs keyword extraction, and the extracted keywords may include, but are not limited to, "CITIC Securities", "2014", "annual report", "net profit", "synchronous increase and decrease", and so on.
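  • Of the listed methods, the chi-square statistic is simple to show. For a term and a category, it is computed from a 2x2 table of document counts, and terms with higher scores are stronger keyword candidates for that category. The counts below are made up for illustration, and the function and variable names are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term and a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical document counts (a, b, c, d) for two candidate terms.
counts = {
    "net profit": (8, 1, 2, 9),
    "company": (5, 5, 5, 5),
}
ranked = sorted(counts, key=lambda term: chi_square(*counts[term]), reverse=True)
print(ranked)
```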
  • the classification unit 802 can classify the text and paste the category label according to a certain classification method by using the keyword proposed by the keyword extraction unit 801.
  • Classification methods may include, but are not limited to, decision trees, Rocchio, Na ⁇ ve Bayes, neural networks, hidden Markov models, support vector machines, linear least squares fits, nearest neighbor algorithm kNN, genetic algorithms, maximum entropy, etc., or Any combination of the above methods.
  • Classification unit 802 can send a keyword database access request to system database 205. After receiving the request, the system database 205 sends the requested keyword to the classification unit 802.
  • the classifying unit 802 can match the keywords extracted by the keyword extracting unit 801 against the keywords sent by the system database 205 according to a certain algorithm, classify the text according to the matching result, and attach the corresponding category label to the text. Specifically, for the above-mentioned full-text announcement entitled "CITIC Securities Co., Ltd. 2014 Annual Report", the classification unit assigns it to the announcement category, annual report subcategory, and labels it accordingly. A text can belong to several categories according to the matching result. In that case, one label per category is attached to the text, and one text can carry more than two labels at the same time.
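  • A minimal sketch of the keyword-matching classification described above, assuming a simple set-overlap rule. The category vocabulary, the `min_hits` threshold, and all names are hypothetical; the text only says that matching follows "a certain algorithm".

```python
# Hypothetical category vocabularies standing in for the keyword database
# held in system database 205. Keys are (category, subcategory) labels.
CATEGORY_KEYWORDS = {
    ("announcement", "annual report"): {"annual report", "net profit"},
    ("announcement", "major contract"): {"contract", "signed"},
}


def classify(extracted_keywords, category_keywords=CATEGORY_KEYWORDS, min_hits=1):
    """Return every label whose vocabulary overlaps the extracted keywords.

    A text may receive zero, one, or several labels.
    """
    found = set(extracted_keywords)
    labels = []
    for label, vocabulary in category_keywords.items():
        if len(vocabulary & found) >= min_hits:
            labels.append(label)
    return labels


annual_report_labels = classify(
    ["CITIC Securities", "2014", "annual report", "net profit"]
)
```

  • Returning a list rather than a single label mirrors the statement that one text can carry two or more labels at once.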
  • Text classification module 503 is optional in some embodiments. For example, if the information elements in the information collected by the collection module 201 are already determined, the text classification step can be skipped. Specifically, suppose the information collected is a newsletter whose content is: "The 18th Shanghai International Film Festival closed on the evening of June 21, 2015, and the Golden Goblet Awards of the main competition were announced. The Chinese film "The Sunshine Heart" became the biggest winner: Deng Chao, Guo Tao, and Duan Yihong jointly won the best actor award for the same film, and Cao Baoping won best director." The elements of the newsletter, such as time, persons, and events, are clear, so the text classification module 503 can be skipped and the text passed directly to the next module, the attribute extraction module 504, for subsequent processing.
  • the execution order of the keyword extraction unit 801 and the classification unit 802 included in the text classification module 503 may be sequential, that is, the keyword extraction unit 801 is executed first, and the classification unit 802 is executed later.
  • FIG. 9 is a schematic structural diagram of the attribute extraction module 504.
  • the attribute extraction module 504 can include, but is not limited to, one or more keyword extraction units 901, one or more attribute extraction templates 902, and one or more attribute extraction units 903.
  • the keyword extraction unit 901 can perform keyword extraction on the text;
  • the attribute extraction template 902 can store an extraction rule capable of extracting event attributes;
  • the attribute extraction unit 903 can complete the extraction of the event attributes.
  • the above units may be independent or may be combined into one unit.
  • the keyword extracting unit 901 may analyze the text processed by the text preprocessing module 502 and extract keywords. The extraction methods may include, but are not limited to, statistics-based methods such as chi-square statistics, synonym rules, Boolean association rules, position rules, information gain, mutual information, odds ratio, cross entropy, inter-class information difference, etc., or any combination of the above methods.
  • the keyword extraction unit 901 is optional in some embodiments. Since the text classification module 503 is optional, when the text classification module 503 is skipped and the text is processed directly by the attribute extraction module 504, the keyword extraction unit 901 can be executed to perform keyword extraction on the preprocessed text. If the text classification module 503 has been executed, the keyword extraction unit 901 may be skipped.
  • the attribute extraction template 902 can store extraction rules that are capable of extracting event attributes. Events are made up of entities and the attributes of those entities.
  • the entity identification unit 703 in the text pre-processing module 502 has performed entity recognition on the text and formed a set of entities, and the attribute extraction template 902 stores different attribute extraction rules for different entities.
  • the extraction rules are pre-configured.
  • the configuration method may be manual configuration, and different entity attribute extraction rules are set for each type of text according to the type of text set in advance.
  • the configuration method can also be a machine learning method. For example, a batch of training text can be selected first. The training text is text of a clearly identified type that has been manually labeled. Training on this text yields an attribute extractor (not shown).
  • the attribute extractor extracts the required attributes according to different text categories, and then uses the attribute extractor to perform attribute extraction on the new text.
  • the attribute extraction template 902 is optional in some embodiments.
  • the attribute extraction unit 903 can extract attributes from the text. After the text classification module 503 is executed, the attribute extraction unit 903 can select a corresponding attribute extraction template according to the result of the text classification and the label attached to the text. When the text has two or more labels attached, the corresponding number of attribute extraction templates can be selected at the same time; the attributes are then extracted from the text, and the obtained results are clustered. Specifically, for the "CITIC Securities Co., Ltd. 2014 Annual Report" information, the text classification module 503 assigns it to the announcement category, annual report subcategory. According to the label, the attribute extraction unit 903 selects a corresponding attribute extraction template and extracts event attributes according to the selected template.
  • Event attributes may include, but are not limited to, information on the amount of operating income and the year-on-year increase, the amount of net profit, the amount of total assets, and the magnitude of the increase.
  • the attribute information of “net profit attributable to shareholders of the parent company/current period growth (%)/116.20% over the same period of the previous year” can be extracted.
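  • The template selection described above can be sketched as a lookup from category label to a set of extraction rules. The regular expression, the attribute name, and the sample sentence below are illustrative assumptions; the actual rule format of the attribute extraction template 902 is not specified.

```python
import re

# Hypothetical templates: one dict of named regex rules per category label.
TEMPLATES = {
    ("announcement", "annual report"): {
        "net_profit_growth_pct": re.compile(
            r"net profit[^%]*?(-?\d+(?:\.\d+)?)\s*%", re.IGNORECASE
        ),
    },
}


def extract_attributes(text, labels, templates=TEMPLATES):
    """Apply the templates of every label on the text and merge the results."""
    attributes = {}
    for label in labels:
        for name, pattern in templates.get(label, {}).items():
            match = pattern.search(text)
            if match:
                attributes[name] = float(match.group(1))
    return attributes


sample = (
    "Net profit attributable to shareholders of the parent company "
    "increased by 116.20% over the same period of the previous year."
)
extracted = extract_attributes(sample, [("announcement", "annual report")])
```

  • A label without a registered template simply contributes no attributes, which matches the statement that the template 902 is optional in some embodiments.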
  • the event recognition module 505 can complete the identification of the event.
  • the entity identification unit 703 in the text preprocessing module 502 has performed entity recognition on the text and formed a set of entities.
  • the attribute extraction module 504 extracts the required set of attributes from the text.
  • the event recognition module 505 can identify the event using a certain event recognition template, based on the entity recognition result and the attribute extraction result, and generate a refinement event. Specifically, for the "CITIC Securities Co., Ltd. 2014 Annual Report" announcement, one attribute extracted by the attribute extraction module 504 is: "net profit attributable to shareholders of the parent company / current period increase or decrease (%) / 116.20% over the same period of the previous year". From the entity identification result and the attribute extraction result, the event recognition module 505 can obtain the final refinement event: "The year-on-year growth rate of net profit attributable to shareholders of the parent company in the CITIC Securities 2014 annual report is equal to 116.2%."
  • in some cases the event recognition module 505 cannot identify the event from the entity information and the attribute information alone. The final refinement event category can then be calculated based on the data of the database (such as historical data) and certain rule methods.
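  • As a rough sketch of this step, the entity result and the attribute result can be combined into a single structured event, with `None` signalling that a fallback (for example, a lookup against historical data) is needed. The class, field names, and the one-entity, one-attribute rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefinementEvent:
    """Hypothetical container for a refinement event."""
    entity: str
    attribute: str
    value: float

    def describe(self) -> str:
        return f"{self.entity}: {self.attribute} equals {self.value:g}%"


def recognize_event(
    entities: list[str], attributes: dict[str, float]
) -> Optional[RefinementEvent]:
    """Combine one entity and one attribute into an event.

    Returns None when the inputs are ambiguous or empty, so the caller can
    fall back to database-driven rules.
    """
    if len(entities) != 1 or len(attributes) != 1:
        return None
    name, value = next(iter(attributes.items()))
    return RefinementEvent(entities[0], name, value)


event = recognize_event(
    ["CITIC Securities"], {"net profit year-on-year growth rate": 116.2}
)
```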
  • Figure 10 is a flow chart of the processing module.
  • the system performs format conversion on the information transmitted by the collection module 201 and converts it into a unified text format (step 1001). Format conversion may include, but is not limited to, format conversion of one or more of text, pictures, audio, video, and the like.
  • Step 1001 can be implemented by format conversion module 501.
  • the system preprocesses the information in text format (step 1002). Pre-processing may include, but is not limited to, one or more of language recognition, text segmentation, entity recognition, normalization, and the like.
  • Step 1002 can be implemented by text pre-processing module 502.
  • the system classifies the text (step 1003).
  • the classification step may include, but is not limited to, performing keyword extraction and classification.
  • Step 1003 can be implemented by text classification module 503.
  • the system performs attribute extraction on the text (step 1004).
  • Step 1004 can be implemented by attribute extraction module 504.
  • the system identifies the event (step 1005).
  • Step 1005 can be implemented by event identification module 505.
  • the system may directly perform step 1002 without going through step 1001, and directly perform step 1004 without going through step 1003.
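  • The flow of steps 1001 to 1005, including the option to bypass steps 1001 and 1003, can be sketched as an ordered list of step functions with a skip set. The step bodies here are toy placeholders that only record which step ran; the real work is done by modules 501 to 505.

```python
def make_step(number):
    """Build a placeholder step that appends its step number to a trace."""
    def step(item):
        return {**item, "trace": item["trace"] + [number]}
    return step


# Hypothetical stand-ins for format conversion, preprocessing,
# classification, attribute extraction, and event recognition.
STEPS = [(n, make_step(n)) for n in (1001, 1002, 1003, 1004, 1005)]


def run_pipeline(item, steps=STEPS, skip=frozenset()):
    """Apply the steps in order, omitting any step number listed in `skip`."""
    for number, step in steps:
        if number not in skip:
            item = step(item)
    return item


full = run_pipeline({"trace": []})
# Input is already plain text with known elements: skip 1001 and 1003.
short = run_pipeline({"trace": []}, skip={1001, 1003})
```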
  • FIG. 11 is a schematic structural diagram of the natural language processing module 203.
  • the natural language processing module 203 can include, but is not limited to, a collection unit 1101 and a natural language generation unit 1102.
  • the collecting unit 1101 can access other modules in the system (e.g., collection module 201, processing module 202, backtest module 204, system database 205) to collect the required information.
  • the natural language generating unit 1102 can convert the information collected by the collecting unit 1101 into a natural language sentence.
  • the collection unit 1101 may receive the refinement event output by the processing module 202.
  • the collecting unit 1101 can also receive user input information from the collecting module 201.
  • the natural language generating unit 1102 can receive the refinement event and process the refinement event according to the user input information.
  • the user may choose to generate natural language statements for individual stocks, for the industry, or for the entire market, or for one or more of them.
  • users can choose to generate natural language statements for the market.
  • the above natural language statement can also be automatically generated without the user intervening.
  • the system can automatically generate natural language statements for individual stocks, natural language statements for the industry, natural language statements for the entire market, or a combination of one or more of them.
  • the system can automatically generate natural language statements for the market. Further, in the field of news, the natural language generating unit 1102 can generate natural language statements for commodity prices, for weather conditions, and for demographics. The relevant values in the above natural language statements can be omitted (in the absence of a value, only the change event is stated).
  • the natural language statement generated by the natural language generating unit 1102 can be input to the backtesting module 204 to backtest the natural language sentence.
  • the user inputs a natural language statement to be backtested to the collection module 201, and the natural language processing module 203 receives the natural language statement input by the user from the collection module 201.
  • the user can directly input the natural language statement to be backtested into the query box of the natural language processing module 203 (not shown).
  • the natural language processing module 203 can preprocess the natural language sentence input by the user to obtain a standard node sequence (the node includes at least an indicator node and a condition node), and construct a node tree according to the relationship between the indicator node and other nodes.
  • the node tree can be used to characterize the combination of indicator conditions.
  • Data query instructions can be generated based on the node tree.
  • the data query command can be input to the backtest module 204 for backtest analysis.
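  • A minimal sketch of turning a natural language statement into a node tree and then into a query instruction. It assumes a very small grammar (clauses separated by semicolons, each of the form "indicator greater than / less than number"); the node layout, the grammar, and the query syntax are hypothetical and far simpler than real preprocessing would be.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "and", "indicator", or "condition"
    value: object = None
    children: list = field(default_factory=list)


CLAUSE = re.compile(
    r"^\s*(?P<indicator>.+?)\s+(?P<op>greater than|less than)\s+"
    r"(?P<number>\d+(?:\.\d+)?)\s*%?\s*$"
)
OPERATORS = {"greater than": ">", "less than": "<"}


def build_tree(statement: str) -> Node:
    """Attach one condition node under each indicator node."""
    root = Node("and")
    for clause in statement.split(";"):
        match = CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        indicator = Node("indicator", match["indicator"])
        condition = (OPERATORS[match["op"]], float(match["number"]))
        indicator.children.append(Node("condition", condition))
        root.children.append(indicator)
    return root


def to_query(root: Node) -> str:
    """Render the tree as a conjunctive query string."""
    parts = []
    for indicator in root.children:
        operator, number = indicator.children[0].value
        parts.append(f"{indicator.value} {operator} {number:g}")
    return " AND ".join(parts)


query = to_query(
    build_tree("dividend rate greater than 3%; earnings per share greater than 2")
)
```

  • Keeping the tree separate from the query string means the same tree could be rendered for a different backend without re-parsing the user's statement.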
  • through the natural language processing module 203, the user can call the historical data stored in the system database 205. The user can use Boolean operators (AND, OR, NOT, etc.) to combine a certain number of natural language statements.
  • the natural language processing module 203 can also receive the backtest result output by the backtest module 204.
  • FIG 12 shows the structure of the backtest module.
  • the backtest module 204 can include a standard question unit 1201, an other question unit 1202, an optimization unit 1203, and an extension unit 1204.
  • the standard question unit 1201, the other question unit 1202, the optimization unit 1203, and the extension unit 1204 may be independent. Some of the above units may also be combined into one unit to work.
  • the standard question unit 1201 can receive natural language statement events of the system standard.
  • the standard question unit 1201 may receive the natural language statement generated by the natural language processing module 203, or may receive natural language statements derived from other modules.
  • Other modules include, but are not limited to, one or more combinations of collection module 201, processing module 202, natural language processing module 203, and system database 205.
  • Other question unit 1202 can receive non-systematic natural language statements.
  • Non-systematic natural language statements may include, but are not limited to, one or more combinations of user input, expert definition, system extracted results, and the like.
  • the standard question unit 1201 and the other question unit 1202 can be combined into one question unit, which can receive natural language sentence events input by the system and the user, and the like.
  • the optimization unit 1203 can optimize the policy combination according to the received information and the backtesting algorithm.
  • the optimization method can be automatic or manual.
  • the basic backtesting data obtained by the above backtesting algorithm includes but is not limited to the holding period, the single return average, the single return maximum, the single return minimum, the expected annualized return, and the transaction.
  • the optimization unit 1203 can further provide an optimal strategy according to the backtest result, as well as a report rating, etc.
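  • The basic backtesting figures listed above can be illustrated with a small computation over closing prices. The sketch assumes that each event is entered at the close of the event day and exited at the close `holding_days` later; that convention, the function name, and the returned keys are assumptions, not the disclosed backtesting algorithm.

```python
from statistics import mean


def backtest(closes, event_indices, holding_days=1):
    """Summarize the returns that follow each event.

    closes: daily closing prices in chronological order.
    event_indices: positions in `closes` on which the event occurred.
    Events too close to the end of the series to be held are ignored.
    """
    returns = []
    for start in event_indices:
        end = start + holding_days
        if end < len(closes):
            returns.append(closes[end] / closes[start] - 1.0)
    if not returns:
        return None
    return {
        "holding_days": holding_days,
        "trades": len(returns),
        "average_return": mean(returns),
        "max_return": max(returns),
        "min_return": min(returns),
        "rise_probability": sum(r > 0 for r in returns) / len(returns),
    }


report = backtest([10.0, 11.0, 11.0, 9.9], event_indices=[0, 2, 3])
```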
  • the extension unit 1204 can be configured to provide a subscription function, and can also be configured to provide an information sharing function.
  • the subscription function may be that the extension unit 1204 subscribes to the information including the specific keyword according to the user's selection, and pushes the information content analyzed by the system to the user in various manners.
  • the sharing function can be that the user shares the information of interest to friends in various ways.
  • Step 1301 receives the information.
  • the received source of information may include, but is not limited to, a collection module 201, a processing module 202, a natural language processing module 203, and a system database 205.
  • the information received can be either a natural language statement or a machine statement.
  • Step 1301 may be that the backtest module 204 receives the natural language statement.
  • Natural language statement information can be directly input by the user or generated by other modules.
  • Step 1301 can be accomplished by the standard question unit 1201 and/or the other question unit 1202.
  • in step 1302, the received natural language statement is backtested against the historical data; the historical data may be stored in the system database 205 or in the backtest module 204.
  • Backtesting analysis of natural language statements and historical data can be achieved by certain optimization methods. Optimization methods may include, but are not limited to, one or more combinations of system definitions, user-defined selections, machine learning, and the like.
  • step 1303 the information result of the optimized analysis will match the corresponding text template.
  • Text templates can be system-defined or user-defined.
  • the result content matching the template may include, but is not limited to, one or more combinations of backtest reports, report ratings, optimal strategies, trend predictions, and the like.
  • the backtested historical information may be matched with a text template, or may be matched with a template such as a voice, a video, or a picture.
  • the backtesting process may be completed by the backtesting module 204, or may be completed by the natural language processing module 203, the processing module 202, the collecting module 201, and the like.
  • the backtest module 204 can further include an extension unit 1204.
  • the extension unit may include, but is not limited to, various types of application program interfaces (APIs), such as an object-oriented API, a library and framework API, an API and protocol, an API and device interface, a web API, or one or more of them.
  • the subscription function of the extension unit 1204 may include, but is not limited to, providing push information to the user, and may also recommend users who are interested in similar interests, may also recommend comments of the information, and provide ratings for whether or not the information is helpful.
  • the manner in which the extension unit 1204 pushes may include, but is not limited to, mobile client software, email, SMS, RSS portal, online single-user aggregator, search engine, browser, instant messaging software, social network, and the like.
  • the push period of the extension unit 1204 may be set by the system or may be user-defined.
  • the push cycle can be periodic or irregular. Push can be real-time or delayed.
  • the information content form pushed by the extension unit 1204 may include, but is not limited to, one or more of text, voice, picture, animation, video, and the like.
  • the information content pushed by the extension unit 1204 may include, but is not limited to, one or more of: updates to information the user has browsed, information the user follows, information recommended by the system according to the user's records, and trending items among information of the same type.
  • the sharing function of the extension unit 1204 may be a method of publishing information used by the user, sharing to a designated place, selecting who can view the information, and the like.
  • the content of the information sharing may be a single piece of information or a plurality of pieces of information, may be a selected part of the content or the entire content of a page, and may be the information content itself, comments on the information, or ratings of whether the information is helpful, etc.
  • the manner of information sharing may include, but is not limited to, one or more of SMS, MMS, email, QQ, MSN, WeChat, Weibo, Douban, Twitter, Facebook, Instagram, Renren, instant messaging software tools, and the like.
  • the information sharing receiving object may include, but is not limited to, one or more of a single friend, a plurality of friends, a circle of friends, a public social circle, a forum, other users, and the like.
  • the content format of information sharing may include, but is not limited to, one or more of text, pictures, voice, animations, videos, web links, and more.
  • for the extension unit 1204, various modifications and changes in form and detail may be made to the specific manner and steps of implementing the extension unit, and to the functions the extension unit can implement, without departing from this principle, but such modifications and changes are still within the scope of the above description.
  • Figure 14 is a block diagram of the system database 205.
  • the system database 205 can include, but is not limited to, an original information database, a text database, a text pre-processing database, an entity database, an event attribute database, a keyword database, a text classification database, a historical information database, a natural language processing database, an event identification database, a backtest module database, a text template database, a dictionary database, and the like, or any combination of one or more of them.
  • the system database 205 can store templates as well as data. For example, historical information collected in the historical information database 1409 can be classified and stored in the database.
  • the information in each database is also updated in real time; for example, synonym updates are applied in the keyword database 1406.
  • various types of databases in the system database 205 may be implemented in the collection module 201, the processing module 202, the natural language processing module 203, and the backtest module 204, respectively.
  • a single database in the system database 205 can also implement the functions of two or more types of databases. For example, the text pre-processing database 1403 can simultaneously store pre-processed data, entity data, event attributes, keywords, etc., thereby implementing the functions of the entity database 1404, the event attribute database 1405, and the keyword database 1406.
  • Figure 15 shows the flow chart of information analysis.
  • the information analysis system collects the required information in step 1501.
  • Step 1501 can be completed by the collection module 201.
  • the required information may include, but is not limited to, one or more of various news, announcements, reviews, research reports, blogs, messages, reports, notices, essays, journals, and the like.
  • the information required above may be information about various industries including, but not limited to, one or more of sports, entertainment, economics, politics, military, culture, art, science, engineering, and the like.
  • the form of the above-mentioned required information may include, but is not limited to, one or more of text, picture, audio, video, and the like.
  • the information collected by the system at step 1501 may be textual information.
  • the text information includes, but is not limited to, the following formats: pdf, doc, epub, mobi, caj, etc., or one or more of them.
  • the system may preprocess the textual information at step 1502.
  • Step 1502 can be completed by processing module 202.
  • Text pre-processing may include, but is not limited to, one or more combinations of format conversion, word segmentation processing, entity recognition, number and unit normalization processing, and the like.
  • the information collected by the system in step 1501 is the “2014 Annual Report of CITIC Securities Co., Ltd.” announcement, which is a PDF file and can be downloaded from the website of the Shanghai Stock Exchange.
  • the system converts the announcement into text in txt format through format conversion to facilitate word segmentation and subsequent text processing.
  • the table part inside the PDF format is analyzed and processed, and part of the formatting information and context information are retained.
  • the system will process the word segmentation according to a certain method.
  • the word segmentation process can be performed in accordance with a statistical model and a dictionary database 1407.
  • word segmentation can also be implemented by applying certain rules. Rules may include, but are not limited to, synonym configuration, Boolean association rules, location rules, or one or more of them.
  • the system will identify the entity. Entity identification includes, but is not limited to, product, institution name, person name, place name, time, date, currency, number, percentage, and the like.
  • the system can summarize the elements in the past information and define various event categories. For example: diplomacy, finance, sports, politics, science, education, etc., or one or more of them.
  • the above categories may also include sub-categories of several levels.
  • the financial category includes treasury bonds, stocks, funds, and the like.
  • the system will normalize the numbers and units in the announcement. For example, convert "net profit growth of 3 percent" to "net profit growth of 3%".
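  • Number and unit normalization can be sketched with two regular-expression rewrites. The unit vocabulary and the target forms (a "%" sign for percentages, plain integers for scaled amounts) are assumptions chosen for illustration; an actual normalization unit would need rules for the source language's own number words and units.

```python
import re

# Hypothetical scale words and their multipliers.
UNIT_MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*percent\b")
SCALED = re.compile(r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)\b")


def normalize(text: str) -> str:
    """Rewrite 'N percent' as 'N%' and expand scale words into digits."""
    text = PERCENT.sub(r"\1%", text)

    def expand(match):
        amount = float(match.group(1)) * UNIT_MULTIPLIERS[match.group(2)]
        return f"{amount:.0f}"

    return SCALED.sub(expand, text)
```

  • Normalizing before attribute extraction means the extraction rules only have to recognize one spelling of each quantity.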
  • the system will perform text processing on the bulletin (step 1503).
  • Text processing can be done by the processing module 202.
  • the system performs keyword matching on the preprocessed announcement. Keyword matching can be combined with synonym configuration, Boolean association rules, location rules, and the like.
  • the system can make a category determination for the announcement (step 1504). For example, in the "2014 Annual Report of CITIC Securities Co., Ltd." shown in Figure 19, the keywords extracted may be "CITIC Securities", "2014", "Annual Report", "Net Profit", "Year-on-Year Increase and Decrease", etc. The announcement can be judged as belonging to the broad category of financial reports.
  • Step 1504 can be completed by processing module 202.
  • the system can generate a refinement event in accordance with a rule (step 1505).
  • the attributes in the textual information may be extracted after the category determination is completed.
  • An attribute is a description of the nature or relationship of an entity. For example, for the "2014 Annual Report of CITIC Securities Co., Ltd." shown in Figure 19, the extracted entity can be "CITIC Securities", and the attributes that can be extracted from the table in Figure 19 can be operating income, net profit, the year-on-year increase or decrease, total assets, total liabilities, total shareholders' equity, etc., or one or more of them.
  • the system can combine the entity and the attribute according to a certain rule method to generate a refinement event.
  • Step 1505 can be completed by processing module 202.
  • the rules for generating refinement events may include, but are not limited to, synonym configuration, Boolean association rules, location rules, or one or more of them.
  • for the entity "CITIC Securities" extracted from the above announcement, the system can combine it with one of its attributes, "net profit increase and decrease", to generate the refinement event "CITIC Securities' 2014 annual net profit grew by 116.2%".
  • Step 1506 can be performed by natural language processing module 203.
  • the system can generate three natural language sentences: "the net profit in the CITIC Securities annual report increased by more than 100% year-on-year", "the year-on-year growth rate of net profit in brokerage industry annual reports is greater than 100%", and "the year-on-year growth rate of net profit in annual reports is greater than 100%". These correspond to three levels: the individual stock, the industry, and the entire stock market.
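  • One way to read this example is that the exact value (116.2%) is generalized to a threshold (100%) so that comparable historical events can be found, and the same condition is then stated at three scopes. The sketch below follows that reading; the threshold list and the sentence wording are assumptions, since the text does not say how the threshold is chosen.

```python
import bisect

# Hypothetical growth thresholds, in percent, in ascending order.
THRESHOLDS = [10, 20, 50, 100, 200]


def bucket(value: float):
    """Return the largest threshold not exceeding `value`, or None."""
    index = bisect.bisect_right(THRESHOLDS, value)
    return THRESHOLDS[index - 1] if index else None


def generate_statements(stock: str, industry: str, attribute: str, value: float):
    """State the same thresholded condition for stock, industry, and market."""
    threshold = bucket(value)
    if threshold is None:
        return []
    condition = f"{attribute} greater than {threshold}%"
    return [
        f"{stock}: {condition}",
        f"{industry} industry: {condition}",
        f"entire market: {condition}",
    ]


statements = generate_statements(
    "CITIC Securities", "brokerage", "annual report net profit year-on-year growth", 116.2
)
```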
  • the system can perform individual stock event backtesting, industry event backtesting, and full market backtesting for the three natural language sentences generated above (step 1507).
  • Step 1507 can be completed by the backtest module 204.
  • Step 1508 can be completed by the backtest module 204.
  • an example backtest report generated by the system may be: "Over the past year, for all financial report announcements of A shares, the average return at the next day's close was 0.48%, and the probability of a rise was 47.77%. The securities industry released 165 similar announcements, with an average next-day return of 0.72% and a probability of a rise of 46.67%; the probability of the stock price falling the next day is too large, and the probability of profit is low." The probability of a rise is distributed around 50%, so the rating is "insignificant".
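  • The rating logic implied by this example can be sketched as a band around 50%. The width of the band (`margin`) and the label strings are assumptions; the text only indicates that a rise probability near 50% is rated "insignificant".

```python
def rate_report(rise_probability: float, margin: float = 0.05) -> str:
    """Rate an event by how far its rise probability sits from 50%.

    `margin` is a hypothetical half-width of the "insignificant" band.
    """
    if rise_probability >= 0.5 + margin:
        return "positive"
    if rise_probability <= 0.5 - margin:
        return "negative"
    return "insignificant"
```

  • With the default margin, the two probabilities quoted in the example report (47.77% and 46.67%) both fall inside the band.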
  • the steps above are not all mandatory. For example, the system can directly collect information entered by the user and convert the information into a natural language statement (step 1506). Similarly, after the system completes the collection of information, it can directly perform text processing (step 1503). Likewise, after the text processing is completed, the system can generate a refinement event directly from the processed text, in which case step 1504 is not necessary.
  • Figure 16 shows the flow chart of the interaction of the information analysis system.
  • the information analysis system receives the natural language statement at step 1601.
  • the natural language statement can be directly input by the user, or can be a natural language statement obtained by processing texts such as announcements and news.
  • the user can input the natural language statement through the interactive interface provided by the information analysis system (see FIG. 18).
  • Users can enter any natural language statement. For example, in the financial field, the user can enter "000826, sign a major contract"; the system can retrieve the major contract signed by the company corresponding to code 000826 based on the input natural language statement.
  • the numbers and dates of natural language statements can be in any format (see Figure 18).
  • the information analysis system processes the natural language statement received in step 1601 in step 1602. Processing may include, but is not limited to, word segmentation processing, entity recognition, number and unit normalization processing, text classification, event attribute extraction, refinement event recognition, and the like.
  • at step 1603, the information analysis system backtests the natural language statements processed in step 1602, and then generates a backtest report at step 1604. The details of the backtest report will be described in detail in FIG
  • the interactive interface can be generated by the backtest module 204 and used to display backtest results (backtested impressions).
  • the interactive interface can be displayed on a variety of electronic devices.
  • the electronic device may include, but is not limited to, a mobile phone, a personal computer, a tablet, a PDA, a smart watch, a smart home appliance, a smart vehicle, etc., or one or more of them.
  • the user can enter a Uniform Resource Locator (URL) for any announcement or news to read the announcement or news and its analysis results.
  • the user can also enter an IP (Internet Protocol) address for any announcement or news.
  • the user can enter the full name of the announcement or news. Alternatively, the user can also enter keywords in the announcement or news to select the specified news or announcement in the results list. After the specified announcement or news is selected, the title, body content, and backtest report of the announcement or news may be displayed on the interactive interface.
  • the area 1702 can display all or part of the body content of the announcement or news, and the user can view the body content through a mouse, a button, a touch screen, a voice control, or a touch pad.
  • Area 1703 can display a backtest report for the selected announcement or news.
  • the backtest report may include, but is not limited to, display of historical data, for example,
  • rating and suggestion strategies for announcements or news may also be displayed in area 1703.
  • the suggested strategy is based on the best historical performance after the latest time in the backtesting cycle.
  • the suggested strategy may be, for example: the highest probability of a rise comes from closing the position after holding for 1 day, with an average return of 0.29%.
  • Report ratings can be positive, negative, or insignificant.
  • Area 1704 can display the latest announcements or news for the user to view.
  • the display mode can be a list display, that is, only the title and time of the latest announcement or news are displayed.
  • the area 1705 can display relevant information about the selected announcement or news, such as the type of announcement (news), the time of publication, the name of the securities involved in the news or announcement, the stock code, the announcement (news) number, and the like.
  • the interactive interface can be generated by the backtest module 204 and used to display backtest results (backtested impressions).
  • the interactive interface can be displayed on various electronic devices, including but not limited to, a mobile phone, a personal computer, a tablet computer, a PDA, a smart watch, a smart home appliance, a smart vehicle, etc., or one or more of them.
  • the user can enter the Uniform Resource Locator (URL) of any text to read the text and its analysis results.
  • the user can enter any natural language statement. For example, in the financial field, users can enter "December 10, 20-day moving average bonding; amplitude is less than 3%; two ratios are ranked from large to small" or "dividend rate is greater than 3% for two consecutive years; earnings per share is greater than 2 yuan; market value is less than 5 billion; operating income growth is ranked from small to large".
  • the numbers and dates in the input natural language statement can be in any format. The date can be the previous day, last week, last weekend, last week 5, last month, last quarter, big year, the last N days, the last N weeks, the last N trading days, and so on. The number can be 1, 3, 3, 5, 5, 5%, etc.
  • the number can also be a range, such as 5 to 10 yuan, 5-10%, and the like.
  • a sorting rule can also be added to the input natural language statement.
  • the sorting rule can be, for example, the ratio is from large to small, the quantity ratio is from small to large, the increase and decrease is from small to large, the increase and decrease is from large to small, the turnover rate is from large to small, the turnover rate is from small to large, and the capital is small.
  • the user can set an analysis strategy. For example, the user can set the time range, position stock, buying opportunity, shareholding period, take-profit condition, stop-loss condition, transaction rate, etc. Specifically, the user can set the buying opportunity to buy after the opening of the next day, or set the take-profit condition to "retract 5% take profit when greater than 25%".
  • the user can click on the query button, and a backtest report on the input natural language statement will be generated.
  • Region 1803 can present a rating for the generated backtest report and a suggested strategy. The report rating can be an estimate of the maximum expected annualized rate of return and the maximum success rate.
  • Region 1804 can display a backtest report for the input natural language statement.
  • the backtest report may include, but is not limited to, backtest data analysis, cumulative income graph, revenue distribution graph, historical transaction query, and the like.
  • Backtest data can include, but is not limited to, holding period, single return average, single return minimum, expected annualized return, number of transactions, profit-loss ratio, success rate, maximum volatility, weekly win rate, Sharpe ratio, the maximum number of consecutive stock selection results, the average number of stocks per day, etc.
  • a plug-in can be used to backtest the historical information related to a news item or announcement and give predictions; similarly, the system can be embedded in a company's system for intelligent data analysis of financial statements. In addition, where various sensors collect data (for example, temperature sensors, humidity sensors, and wind sensors reading environmental data), the system can be used to analyze historical environmental trends and predict future environmental changes. In medicine, the effect of the same drug in different age groups can be backtested; for example, for the symptoms of a cold, historical data can be analyzed to see how many days they take to pass.
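
The suggested strategy and the backtest statistics described in the bullets above (success rate, average return per holding period) can be illustrated with a small sketch. This is only an assumption about how such figures might be computed, not the patent's actual method: the function names, the data layout, and the tie-breaking rule are all hypothetical, and the sample returns are made up for illustration.

```python
from statistics import mean


def summarize_holding_periods(returns_by_period):
    """Compute win rate and average return for each holding period.

    returns_by_period maps a holding period in days to the list of
    historical returns (in percent) observed after past occurrences
    of the same kind of event. Hypothetical data layout.
    """
    summary = {}
    for days, returns in returns_by_period.items():
        if not returns:
            continue  # no history for this period, nothing to report
        wins = sum(1 for r in returns if r > 0)
        summary[days] = {
            "win_rate": wins / len(returns),
            "avg_return": mean(returns),
            "samples": len(returns),
        }
    return summary


def suggest_strategy(summary):
    """Pick the holding period with the highest win rate.

    Ties are broken by the higher average return (an assumption).
    """
    if not summary:
        return None
    best = max(
        summary,
        key=lambda d: (summary[d]["win_rate"], summary[d]["avg_return"]),
    )
    return best, summary[best]


# Illustrative, made-up history: returns in percent per holding period.
history = {
    1: [0.8, -0.2, 0.5, 0.1],
    5: [2.0, -1.5, -0.4, 1.1],
    20: [4.0, -3.0, -2.0, -1.0],
}
summary = summarize_holding_periods(history)
best_days, stats = suggest_strategy(summary)
print(best_days, stats["win_rate"], round(stats["avg_return"], 2))
```

With this sample history the one-day holding period wins, which mirrors the shape of the example strategy quoted above ("close after 1 day of holding"), though the numbers themselves are invented.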

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One aspect of the present invention relates to an information analysis method based on an event regression test. The information analysis method can comprise: collecting information, pre-processing the information and extracting an entity in the information, processing the information and extracting a relevant attribute in the information, determining the type of information and generating a refinement event, generating a corresponding natural language sentence according to the generated refinement event, performing a regression test on the generated natural language sentence, and generating a regression testing report, etc. In addition, the present invention relates to a user being able to select any announcement or news to perform a regression test, and generate a regression testing report. Moreover, the present invention relates to a user being able to input a natural language sentence and perform a regression test on an input natural language sentence so as to generate a regression testing report. Furthermore, the present invention relates to an information analysis system based on an event regression test, comprising a collection module, a system database, a processing module, a natural language processing module and a regression testing module.

Description

Information analysis system and method based on event backtesting
Technical field
The invention relates to an information analysis system and method, and in particular to automatically analyzing the information and natural language statements associated with an event in order to obtain historical backtesting information and the like.
Background
With the growing popularity of the Internet, people are increasingly accustomed to using it to obtain information or to analyze information based on data. As Internet coverage expands and the volume of information grows, the amount of data keeps increasing. When people try to obtain particular information from the Internet, they often find that the information is lengthy and varied in content, so reading and analyzing it takes time. At the same time, some industries or users need to backtest historical data against current information in order to make decisions. For example, in the financial industry, when studying a trading or investment strategy, backtesting can evaluate the strategy's performance and effectiveness over a past period, helping investors analyze investment decisions. As another example, in weather forecasting, real-time temperature, humidity, wind direction, and air pressure data can be used to analyze historical weather under equivalent conditions and thereby predict future weather.
Brief description
One aspect of the present invention relates to an information analysis system. According to one embodiment, the information analysis system includes a computer-readable storage medium storing executable modules, the storage medium including: a collection module capable of collecting information; a processing module capable of preprocessing the collected information and extracting an event from the preprocessed information; a natural language processing module capable of generating a natural language statement from the extracted event; and a backtest module capable of generating a backtest result from the generated natural language statement combined with historical information. The system also includes a processor capable of executing the executable modules stored on the computer-readable storage medium.
According to another embodiment of the present invention, the information analysis system further includes a database capable of storing the collected information, the preprocessed information, the extracted events, the natural language statements, the historical information, and the backtest results.
According to another embodiment of the present invention, the database includes an original information database, a text database, a text preprocessing database, an entity database, an event attribute database, a keyword database, a text classification database, a historical information database, a natural language processing database, an event recognition database, a backtest module database, a text template database, and a dictionary database.
According to another embodiment of the present invention, the processing module of the information analysis system further includes a format conversion module, a text processing module, an attribute extraction module, and an event recognition module.
According to another embodiment of the present invention, the processing module further includes a text classification module.
According to another embodiment of the present invention, the methods used by the processing module include chi-square statistics, information gain, mutual information, odds ratio, cross entropy, inter-class information difference, keyword statistics, decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least squares fit, nearest neighbor algorithms, genetic algorithms, sentiment classification, maximum entropy, Generalized Instance Set, synonym configuration, Boolean association rules, position rules, and machine learning.
According to another embodiment of the present invention, the natural language processing module can receive information from the collection module.
According to another embodiment of the present invention, the backtest module further includes backtest information judgment, which gives an evaluation based on the backtest result.
According to another embodiment of the present invention, the backtest result can be presented to the user.
Another aspect of the present invention relates to an information analysis method, which includes: collecting information; extracting an event from the information; generating a natural language statement from the event; and performing backtest analysis on the natural language statement.
According to another embodiment of the present invention, the collected information includes user input information and non-user input information, and the sources of the non-user input information include communication terminals and servers.
According to another embodiment of the present invention, the collected information includes announcement information and news information.
According to another embodiment of the present invention, extracting the event further includes entity recognition and attribute extraction.
According to another embodiment of the present invention, the entity recognition further includes format conversion, text segmentation, and number and unit normalization.
According to another embodiment of the present invention, the attribute extraction can be implemented by a system-defined model.
According to another embodiment of the present invention, the natural language statement can be generated from the extracted event.
According to another embodiment of the present invention, the natural language statement can be generated from user input information.
According to another embodiment of the present invention, the natural language statement is further expanded according to the event category.
According to another embodiment of the present invention, analysis of the natural language statement includes backtesting the natural language statement.
According to another embodiment of the present invention, backtesting of the natural language statement can generate a backtest result according to the information category.
Description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can apply the present invention to other similar scenarios based on these drawings without creative effort. Unless it is obvious from the context or otherwise stated, the same reference numerals in the drawings denote the same structure and operation.
Figure 1 is a schematic diagram of an example system configuration of the information analysis system;
Figure 2 is a module diagram of the information analysis system;
Figure 3 is an information analysis flowchart;
Figure 4 is a schematic structural diagram of the collection module;
Figure 5 is a schematic structural diagram of the processing module;
Figure 6 is a schematic structural diagram of the format conversion module;
Figure 7 is a schematic structural diagram of the text preprocessing module;
Figure 8 is a schematic structural diagram of the text classification module;
Figure 9 is a schematic structural diagram of the attribute extraction module;
Figure 10 is a flowchart of the processing module;
Figure 11 is a schematic structural diagram of the natural language processing module;
Figure 12 is a schematic structural diagram of the backtest module;
Figure 13 is a backtest flowchart;
Figure 14 is a schematic structural diagram of the system database;
Figure 15 is an information analysis flowchart;
Figure 16 is an online workflow diagram of the information analysis system;
Figure 17 is a schematic diagram of an interactive interface of the information analysis system for news or announcements;
Figure 18 is a schematic diagram of an interactive interface of the information analysis system for user input;
Figure 19 shows an announcement text used by the information analysis system.
Detailed description
As used in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", "one", and/or "the" do not specifically refer to the singular and may also include the plural. In general, the terms "comprising" and "including" only indicate that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also contain other steps or elements.
The information analysis method described in this specification refers to collecting information, processing information, generating natural language statements, and analyzing data to provide reference information. In some embodiments, one aspect of the invention relates to an information analysis system, which can include a collection module, a system database, a processing module, a natural language processing module, and a backtest module. Another aspect of the invention relates to an information analysis method based on event backtesting. The method may include collecting information; preprocessing the information and extracting the entities in it; processing the information and extracting its relevant attributes; determining the information category and generating a refinement event; generating a corresponding natural language statement from the generated refinement event; backtesting the generated natural language statement; and generating a backtest report. In another aspect of the invention, a user can select any announcement or news item for real-time backtesting and generate a backtest report. In another aspect of the invention, a user can input any natural language statement, which is then backtested to generate a backtest report in real time.
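
The method steps listed above (collect, extract an event, generate a natural language statement, backtest) can be sketched end to end. The following is a toy illustration under stated assumptions, not the patent's implementation: the keyword table, the `Event` class, the statement format, and the history records are all hypothetical, and a real system would use the classification and extraction techniques named in the embodiments rather than keyword matching.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword table standing in for text classification.
CATEGORY_KEYWORDS = {
    "major contract": ["major contract"],
    "dividend": ["dividend"],
}


@dataclass
class Event:
    entity: str    # e.g. a six-digit stock code
    category: str  # e.g. "major contract"


def extract_event(text):
    """Toy extraction: find a six-digit code and a known category keyword."""
    code = re.search(r"\b\d{6}\b", text)
    if code is None:
        return None
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return Event(entity=code.group(), category=category)
    return None


def to_statement(event):
    """Render the event as a short natural language statement."""
    return f"{event.entity}, {event.category}"


def backtest(statement, history):
    """Average the recorded returns of past events in the same category."""
    _, category = statement.split(", ", 1)
    returns = [h["return_pct"] for h in history if h["category"] == category]
    if not returns:
        return {"statement": statement, "samples": 0, "avg_return_pct": None}
    return {
        "statement": statement,
        "samples": len(returns),
        "avg_return_pct": sum(returns) / len(returns),
    }


# Made-up inputs for illustration only.
text = "Company 000826 announced that it signed a major contract."
history = [
    {"category": "major contract", "return_pct": 1.0},
    {"category": "major contract", "return_pct": -0.5},
    {"category": "dividend", "return_pct": 2.0},
]
event = extract_event(text)
statement = to_statement(event)
report = backtest(statement, history)
print(report)
```

Passing a plain statement string between the last two stages reflects the idea that the backtest operates on natural language statements, whether they were generated from an event or typed in by a user.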
Different embodiments of the present invention are applicable to a variety of fields, including but not limited to investment in finance and its derivatives (including but not limited to stocks, bonds, gold, paper gold, silver, foreign exchange, precious metals, futures, and money funds), science and technology (including but not limited to mathematics, physics, chemistry and chemical engineering, biology and bioengineering, electronic engineering, communication systems, the Internet, and the Internet of Things), politics (including but not limited to political figures, political events, and countries), and news (by region, including but not limited to regional, domestic, and international news; by subject, including but not limited to political, sports, technology, economic, lifestyle, and weather news). According to at least one embodiment of the present invention, the content of various information resources, such as text, pictures, audio, and video, can be quickly distilled, and, based on relevant historical information, a backtesting strategy can be derived and a backtest report generated, so that users can quickly and easily understand the possible future impact of the information. Application scenarios of different embodiments of the present invention include, but are not limited to, one or more combinations of web pages, browser plug-ins, clients, customized systems, internal enterprise analysis systems, artificial intelligence robots, and the like. The above description of applicable fields is merely a specific example and should not be regarded as the only feasible implementation. Obviously, those skilled in the art, after understanding the basic principles of an information analysis method and system based on event backtesting, may make various modifications and changes in form and detail to the fields in which the method and system are applied without departing from these principles, but such modifications and changes remain within the scope described above. For example, in one embodiment of the present invention, the backtest report is presented to the user in a unified text form; for those skilled in the art, the backtest report may also be presented to the user in a unified audio or video format. Similar substitutions, modifications, or changes remain within the protection scope of the present invention. In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can apply the present invention to other similar scenarios based on these drawings without creative effort. Unless it is obvious from the context or otherwise stated, the same reference numerals in the drawings denote the same structure and operation.
Figure 1 is a schematic diagram of an example system configuration of the information analysis system. The example system configuration 100 can include, but is not limited to, one or more information analysis systems 101, one or more networks 102, and one or more information sources 103. The information analysis system 101 can be a system that analyzes and processes the collected information to generate analysis results. The information analysis system 101 can be a single server or a server group. A server group can be centralized, such as a data center, or distributed, such as a distributed system. The information analysis system 101 can be local or remote. The network 102 can provide a channel for information exchange. The network 102 can be a single network or a combination of multiple networks. The network 102 may include, but is not limited to, one or more combinations of a local area network, a wide area network, a public network, a private network, a wireless local area network, a virtual network, a metropolitan area network, a public switched telephone network, and the like. The network 102 may include a variety of network access points, such as wired or wireless access points, base stations, or network switching points, through which data sources connect to the network 102 and send information over the network. The information source 103 can provide various information. The information source 103 can include, but is not limited to, a server or a communication terminal. Further, the server may be a web server, a file server, a database server, an FTP server, an application server, a proxy server, etc., or any combination of the above servers. The communication terminal may be a mobile phone, a personal computer, a wearable device, a tablet computer, a smart TV, etc., or any combination of the above communication terminals. The information source 103 can send information to, and/or collect information for, the information analysis system 101 via the network 102. The information provided by the information source 103 may be entered by a user or supplied by other databases or information sources.
Figure 2 is a module diagram of the information analysis system. The information analysis system 101 may include, but is not limited to, one or more collection modules 201, one or more processing modules 202, one or more natural language processing modules 203, one or more backtest modules 204, and one or more system databases 205. Some or all of these modules may be connected to the network 102. The modules can be centralized or distributed, and one or more of them may be local or remote. The collection module 201 is mainly used to collect the required information in various ways. Collection may be direct (for example, obtaining information from one or more information sources 103 directly over the network 102) or indirect (for example, obtaining information through the processing module 202, the natural language processing module 203, the backtest module 204, or the system database 205). The processing module 202 is mainly used for preprocessing information. Preprocessing can be manual or automatic, and can include, but is not limited to, one or more combinations of format conversion, word segmentation, entity recognition, number and unit normalization, text classification, event attribute extraction, refinement event recognition, and decryption of encrypted documents. The natural language processing module 203 is mainly used to generate natural language statements, and can also receive input natural language statements. The natural language processing module 203 may process information manually or automatically. The backtest module 204 is mainly used to analyze information. The analysis method can include, but is not limited to, one or more combinations of system definition, user-defined selection, machine learning, and the like. Information analysis can be performed manually, completed automatically, or done by a combination of both. The system database 205 can refer broadly to any device with a storage function. The system database 205 is mainly used to store the data collected from the information source 103 and the various data generated during the operation of the information analysis system 101. The system database 205 can be local or remote. The connection or communication between the system database and the other modules of the system can be wired or wireless.
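
One of the preprocessing steps named above, number and unit normalization, can be sketched as follows. This is a minimal illustration that assumes English-language input and a small hypothetical unit table; the regular expression, the `normalize_quantity` name, and the choice to return a `(low, high)` pair are assumptions, and currency units and Chinese numerals are deliberately left out.

```python
import re

# Hypothetical scale words; a real system would cover many more units.
UNIT_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

_NUMBER = r"(\d+(?:\.\d+)?)"
_PATTERN = re.compile(
    rf"{_NUMBER}(?:\s*(?:-|to)\s*{_NUMBER})?\s*(%|thousand|million|billion)?",
    re.IGNORECASE,
)


def normalize_quantity(text):
    """Normalize a number or range such as "5-10%" or "5 billion".

    Returns a (low, high) pair of floats; a single value yields
    low == high. Percentages become fractions and scale words are
    expanded to plain numbers.
    """
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {text!r}")
    low_text, high_text, unit = match.groups()
    low = float(low_text)
    high = float(high_text) if high_text is not None else low
    unit = (unit or "").lower()
    if unit == "%":
        return low / 100, high / 100
    scale = UNIT_SCALE.get(unit, 1.0)
    return low * scale, high * scale


print(normalize_quantity("5-10%"))
print(normalize_quantity("5 billion"))
print(normalize_quantity("5 to 10"))
```

Representing every quantity as a range keeps downstream comparisons uniform: a condition such as "greater than 3%" and a range such as "5-10%" can be checked against the same pair of bounds.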
The collection module 201 can transmit the collected information to the processing module 202, to the natural language processing module 203, or to the backtest module 204. The collection module 201 can receive a request sent by the processing module 202 and access the system database 205 according to that request to obtain the required data; once the data has been obtained, the collection module 201 can transmit it to the processing module 202. Likewise, the collection module 201 can receive a request sent by the natural language processing module 203, access the system database 205 according to that request, and transmit the obtained data to the natural language processing module 203. The collection module 201 can also receive a request sent by the backtest module 204, access the system database 205 according to that request, and transmit the obtained data to the backtest module 204.
The collection module 201 is mainly used to collect the required information in various ways. The collection module 201 can obtain the required information by sending a request to the information source 103. After obtaining the required information, the collection module 201 can pass it on for further processing or store it in the system database 205. The collection module 201 can also obtain information stored in the system database 205 by sending a request to the system database 205. Alternatively, the system database 205 may send a request directly to the information source 103, and the information obtained can be stored in the system database 205. The information source 103 can be a server, a communication terminal, or the like. Further, the server may be a web server, a file server, a database server, an FTP server, an application server, a proxy server, etc., or any combination of the above servers. The communication terminal may be a mobile phone, a personal computer, a wearable device, a tablet computer, a smart TV, etc., or any combination of the above communication terminals. The required information may include, but is not limited to, one or more of news, research reports, announcements, messages, reports, notices, papers, periodicals, and the like. The required information may relate to various industries, including but not limited to one or more of sports, entertainment, economics, politics, military, culture, art, science, engineering, and the like. The form of the required information may include, but is not limited to, one or more of text, pictures, audio, video, and the like. Examples include the video news item "World Bank lowers this year's global economic growth forecast to 2.8%" played on a video website, the web news item "HSBC China services PMI rose to 53.5 in May" reported by a news website, the listed company announcement "A Co., Ltd. Announcement on Signing a Major Contract for Daily Operations" published by a stock exchange, and the football match preview "This Saturday Chelsea will host city rival Arsenal at Stamford Bridge" published by a live sports broadcasting platform.
The processing module 202 can communicate bidirectionally with the collection module 201. The processing module 202 can process the information transmitted by the collection module 201. The information processing can include, but is not limited to, one or a combination of format conversion, text preprocessing, text classification, attribute extraction, and event recognition. The processing module 202 may also send information to the collection module 201. The information sent may include, but is not limited to, processed information and control information, and the control information may include, but is not limited to, control information on the manner of information collection, on the time of information collection, and on the sources of information collection. The processing module 202 can communicate bidirectionally with the natural language processing module 203. The processing module 202 can transmit the processed information to the natural language processing module 203 and can also receive information sent by the natural language processing module 203. The processing module 202 can communicate bidirectionally with the backtest module 204. The processing module 202 may transmit the processed information to the backtest module 204 and may also receive information sent by the backtest module 204. The processing module 202 can communicate bidirectionally with the system database 205. The processing module 202 may transmit the processed information to the system database 205 for storage, and may also, during processing, send request information to the system database 205 and receive the information sent by the system database 205.
The natural language processing module 203 can send a request to the collection module 201, and the collection module 201 can, according to the request, access the system database 205 or one or more information sources 103 to obtain the required information. After the required information is obtained, the collection module 201 transmits the information to the natural language processing module 203. Alternatively, after receiving the request from the natural language processing module 203, the collection module 201 may transmit information already held in the collection module 201 to the natural language processing module 203; this information may come from the information source 103 or the system database 205. The natural language processing module 203 can send a request to the processing module 202, and the processing module 202 can access the system database 205 according to the request to obtain the required information. After the required information is obtained, the processing module 202 transmits the information to the natural language processing module 203. Alternatively, after receiving the request from the natural language processing module 203, the processing module 202 may transmit information already held in the processing module 202 to the natural language processing module 203. Alternatively, the natural language processing module 203 can directly access the system database 205 and send a request to the system database 205 to obtain the required information, which can then be transmitted to the natural language processing module 203. Alternatively, the system database 205 can send information to the natural language processing module 203 without receiving a request. In one embodiment of the present invention, the natural language processing module 203 can directly receive a natural language statement from the information source 103 (not shown in the figure). The natural language statement can be entered by the user through an input device, and the input device includes, but is not limited to, one or a combination of a keyboard, a mouse, a camera, a scanner, a handwriting tablet, a voice input device, and the like.
The input information of the natural language processing module 203 may be one or more of letters, numbers, characters, words, phrases, sentences, paragraphs, chapters, etc., or a set composed of any number of identifiers, where the set of identifiers may carry one or more semantics. Alternatively, the input information of the natural language processing module 203 may be of a user-defined information type. In some embodiments of the invention, the input information of the natural language processing module 203 may be represented as a tuple. For example, the input information may be represented as a quadruple {k, c, u, d}. The parameter k may be configured to represent the information source, which may include, but is not limited to, the collection module 201, the processing module 202, the system database 205, the information source 103 (not shown in the figure), or any combination of these sources. The parameter c can be configured to represent the communication time. For example, the parameter c can be configured to represent the year, month, date, and the like. When the parameter c is given a specific value, the information for the specific time designated by c is input to the natural language processing module 203. The parameter u can be configured to represent the user model to be used. A user model is a data processing model whose function varies with the needs of different users. When the parameter u is left at its default, no data model is applied. The parameter d can be configured to represent already-generated information. Already-generated information refers to the various entities, attributes, and the like that the user has already produced during natural language processing; these entities and attributes are used in subsequent natural language processing.
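The quadruple {k, c, u, d} described above can be pictured as a small record type. The following is only a minimal sketch in Python: the class name `NlpInput`, the helper `select_by_time`, and the string labels used for sources are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, List, Optional


@dataclass
class NlpInput:
    """Hypothetical container for the quadruple {k, c, u, d}."""
    k: str                     # information source, e.g. "collection_module_201"
    c: date                    # communication time
    u: Optional[str] = None    # user model; None means no data model is applied
    d: Dict[str, Any] = field(default_factory=dict)  # already-generated entities and attributes


def select_by_time(items: List[NlpInput], when: date) -> List[NlpInput]:
    """Keep only the inputs whose communication time c equals `when`."""
    return [item for item in items if item.c == when]
```

Giving `when` a specific value mirrors the role of parameter c: only information for that time is passed on. An N-tuple variant would simply add or remove fields.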
The natural language processing module 203 can process the collected information to generate natural language statements. The generated natural language statements can be transmitted to the backtest module 204 for backtesting. Specifically, the natural language processing module 203 may send a backtest request to the backtest module 204; after the request is approved, the natural language processing module 203 inputs the generated natural language statement into the backtest module 204 for backtesting. Alternatively, the natural language processing module 203 may skip the backtest request and input the generated natural language statement directly into the backtest module 204 for backtesting. In one embodiment of the present invention, after receiving the natural language statement from the natural language processing module 203, the backtest module 204 further processes the statement to generate a standard database access instruction, and uses it to access or retrieve the historical data stored in the corresponding database.
In another embodiment of the present invention, the natural language processing module 203 can receive the events generated by the processing module 202. The natural language processing module 203 can assemble the received natural language statements (events) and can also add extra conditions as the backtest requires. For example, for an event concerning an individual stock, the stock code or short name must be added; for an industry event backtest, the industry corresponding to the stock must be added; for a market-wide event backtest (such as a central bank interest rate cut), no additional statement is needed.
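The assembly rule above (add a stock identifier for stock events, an industry for industry events, and nothing for market-wide events) can be sketched as follows. This is an illustrative sketch only: the function name `assemble_statement`, the scope labels, and the way the condition is prepended to the event are assumptions, not the patent's implementation.

```python
from typing import Optional


def assemble_statement(event: str, scope: str,
                       stock: Optional[str] = None,
                       industry: Optional[str] = None) -> str:
    """Assemble a backtest statement from an event plus any extra condition.

    scope is one of "stock", "industry", or "market" (hypothetical labels).
    """
    if scope == "stock":
        if not stock:
            raise ValueError("a stock code or short name is required for stock events")
        return f"{stock} {event}"
    if scope == "industry":
        if not industry:
            raise ValueError("an industry is required for industry events")
        return f"{industry} {event}"
    if scope == "market":
        # Market-wide events need no additional statement.
        return event
    raise ValueError(f"unknown scope: {scope}")
```

For instance, `assemble_statement("signs major contract", "stock", stock="A Co., Ltd.")` yields a statement restricted to that stock, while a market-wide event is passed through unchanged.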
It should be noted that the above description of the input information of the natural language processing module 203 is provided only to facilitate understanding of the invention and should not be regarded as the only feasible embodiment. Those skilled in the art, having understood the basic principle of the required information, may make various modifications and changes to its content without departing from that principle, and such modifications and changes remain within the scope of the above description. For example, the input information of the natural language processing module 203 can be represented as a 2-tuple, a triple, a 5-tuple, a 6-tuple, an N-tuple, etc., or any combination of the above information types.
The backtest module 204 can send a backtest condition request to the collection module 201, and the collection module 201 can access the system database 205 according to the request to obtain the required information. After the required information is obtained, the collection module 201 transmits the information to the backtest module 204. Alternatively, after receiving the request from the backtest module 204, the collection module 201 may transmit information stored in the collection module 201 to the backtest module 204. The backtest module 204 can send a request to the processing module 202, and the processing module 202 can access the system database 205 according to the request to obtain the required information. After the required information is obtained, the processing module 202 can transmit the information to the backtest module 204. Alternatively, after receiving the request from the backtest module 204, the processing module 202 may transmit information stored in the processing module 202 to the backtest module 204. The backtest module 204 can send a request to the natural language processing module 203, and the natural language processing module 203 can access the system database 205 according to the request to obtain the required information. After the required information is obtained, the natural language processing module 203 can transmit the information to the backtest module 204. Alternatively, after receiving the request from the backtest module 204, the natural language processing module 203 may transmit information stored in the natural language processing module 203 to the backtest module 204. Alternatively, the backtest module 204 can directly access the system database 205 and send a request to the system database 205 to obtain the required information, which can then be transmitted to the backtest module 204. Alternatively, the system database 205 can send information to the backtest module 204 without receiving a request. The input information received by the backtest module 204 may include, but is not limited to, one or a combination of letters, numbers, characters, words, sentences, paragraphs, chapters, natural language statements, and the like. Sources of the input information may include, but are not limited to, one or a combination of the collection module 201, the processing module 202, the natural language processing module 203, the system database 205, the information source 103, and the like.
The system database 205 or other storage devices within the system refer generally to any medium with read/write capability. The system database 205 or other storage devices in the system may be internal to the system or external devices attached to it. The system database 205 or other storage devices in the system may be connected by wire or wirelessly. The system database 205 or other storage devices within the system may include, but are not limited to, one or a combination of hierarchical databases, network databases, and relational databases. The system database 205 or other storage devices within the system may digitize information and then store it in a storage device that uses electrical, magnetic, or optical means. The system database 205 or other storage devices within the system can be used to store various information such as programs and data. The system database 205 or other storage devices in the system may be devices that store information electrically, such as various memories, random access memory (RAM), read only memory (ROM), and the like. The system database 205 or other storage devices within the system may be devices that store information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, USB flash drives, flash memories, and the like. The system database 205 or other storage devices within the system may be devices that store information optically, such as CDs or DVDs. The system database 205 or other storage devices within the system may be devices that store information magneto-optically, such as magneto-optical disks. The access method of the system database 205 or other storage devices in the system may be one or a combination of random access storage, serial access storage, read-only storage, and the like. The system database 205 or other storage devices within the system may be non-permanent memory or permanent memory. The storage devices mentioned above are only examples, and the storage devices that the system can use are not limited to them. The system database 205 or other storage devices in the system may be local, remote, or located on a cloud server.
The operations that the system database 205 or other storage devices in the system perform on information may include, but are not limited to, one or a combination of storing, classifying, sorting, filtering, and the like. The system database 205 or other storage devices within the system can transfer or exchange information with the information source 103. The system database 205 or other storage devices within the system may receive information from the information source 103 and store it in the system database 205 or other storage devices within the system. According to a received instruction, information in the system database 205 or other storage devices within the system can be retrieved and passed to the information source 103. The instruction may come directly from the information source 103 or from other modules, such as the collection module 201, the processing module 202, the natural language processing module 203, the backtest module 204, and the like. The system database 205 or other storage devices within the system can transfer or exchange information with the collection module 201. The system database 205 or other storage devices within the system may receive the information collected by the collection module 201 and store it in the system database 205 or other storage devices within the system. According to a received instruction, information in the system database 205 or other storage devices in the system can be retrieved and passed to the collection module 201. The instruction may come directly from the collection module 201 or from other modules, such as the processing module 202, the natural language processing module 203, the backtest module 204, the information source 103, and the like.
The system database 205 or other storage devices within the system can transfer or exchange information with the processing module 202. The system database 205 or other storage devices within the system can receive the information collected by the processing module 202 and store it in the system database 205 or other storage devices within the system. According to a received instruction, information in the system database 205 or other storage devices within the system can be retrieved and passed to the processing module 202. The instruction may come directly from the processing module 202 or from other modules, such as the collection module 201, the natural language processing module 203, the backtest module 204, the information source 103, and the like.
The system database 205 or other storage devices within the system can transfer or exchange information with the natural language processing module 203. The system database 205 or other storage devices within the system can receive the information collected by the natural language processing module 203 and store it in the system database 205 or other storage devices within the system. According to a received instruction, information in the system database 205 or other storage devices within the system can be retrieved and passed to the natural language processing module 203. The instruction may come directly from the natural language processing module 203 or from other modules, such as the collection module 201, the processing module 202, the backtest module 204, the information source 103, and the like.
The information sent by the system database 205 may be information obtained directly from the information source or information that has been processed and analyzed. The processed and analyzed information may be information stored in the system database 205 after processing by the processing module 202, or information stored after processing by the natural language processing module 203. Information transfer between the system database 205 or other storage devices within the system and the other modules may be wired or wireless, direct or indirect, simultaneous or sequential, periodic or aperiodic, and so on.
Obviously, those skilled in the art, having understood the principles of the information analysis system and method, may, without departing from those principles, combine the modules arbitrarily or form subsystems connected to other modules, and may make various modifications and changes in form and detail to the fields of application in which the above method and system are implemented; such modifications and changes remain within the scope of the above description. For example, the collection module 201, the processing module 202, the natural language processing module 203, the backtest module 204, and the system database 205 may be different modules embodied in one system, or a single module may implement the functions of two or more of the above modules. For instance, the processing module 202 may collect information and generate natural language statements, thereby implementing the functions of both the collection module 201 and the natural language processing module 203. Similar variations remain within the scope of protection of the claims of the present invention.
Figure 3 shows the information analysis flowchart. In step 301, the required information is collected from the information source 103 (see Figure 1). The information source 103 can include, but is not limited to, a server and a communication terminal. Further, the server may be a web server, a file server, a database server, an FTP server, an application server, a proxy server, etc., or any combination of the above servers. The communication terminal may be a mobile phone, a personal computer, a wearable device, a tablet computer, a smart TV, or the like, or any combination of the above communication terminals. Further, in step 301, natural language statements entered by the user through various communication terminals can be received. The required information may include, but is not limited to, one or more of various news items, announcements, comments, research reports, blogs, messages, reports, notices, papers, journals, and the like. The required information may concern various industries, including, but not limited to, one or more of sports, entertainment, economics, politics, military, culture, art, science, engineering, and the like. The form of the required information may include, but is not limited to, one or more of text, pictures, audio, video, and the like. For example, the news can be a video news item broadcast on a video website, "World Bank Lowers This Year's Global Economic Growth Forecast to 2.8%"; a web news item reported by a news website, "HSBC China Services PMI Rises to 53.5 in May"; a listed-company announcement published by a stock exchange, "Announcement of A Co., Ltd. on Signing a Major Contract for Daily Operations"; or a football match preview released by a live sports broadcasting platform, "This Saturday Chelsea Will Host Cross-Town Rival Arsenal at Stamford Bridge". Step 301 can be performed by the collection module 201.
The information collected in step 301 is processed in step 302. Step 302 can be performed by the processing module 202. In some embodiments of the invention, the information collected in step 301 may be text information. The text information may be derived directly or indirectly from text, audio, video, or any combination of these sources. Further, when the text information is derived from audio, the system can convert the audio into text by speech recognition or subtitle extraction. When the text information is derived from video, the system can convert the video into text by speech recognition or subtitle file extraction. The text information can be in Chinese, English, German, Spanish, Arabic, French, Japanese, Korean, Russian, Portuguese, etc., or any combination of the above languages. Further, the text information may be one or more of letters, numbers, characters, words, phrases, sentences, paragraphs, chapters, etc., or a set composed of any number of identifiers, where the set of identifiers may carry one or more semantics. The information processing performed in step 302 may include, but is not limited to, one or more of format conversion, word segmentation, entity recognition, number and unit normalization, text classification, event attribute extraction, fine-grained event recognition, and the like. Format conversion converts text information in various formats into a unified text format. The formats of the text information may include, but are not limited to, one or more of pdf, doc, epub, mobi, caj, kdh, nh, etc. The unified text format may include, but is not limited to, one or a combination of txt, ASCII, MIME, and the like.
Word segmentation can extract the words in the text information by word type. Word types may include, but are not limited to, one or more of nouns, verbs, adjectives, adverbs, particles, onomatopoeia, numerals, special symbols, etc. Alternatively, the text information can be processed with a particular word segmentation algorithm. Word segmentation algorithms may include, but are not limited to, one or more of string-matching-based segmentation (i.e., mechanical segmentation), understanding-based segmentation, statistics-based segmentation, and the like. After word segmentation is completed, entity recognition is performed on the text information. Entities may include, but are not limited to, one or more of products, organization names, person names, place names, times, dates, currencies, numbers, percentages, and the like. Entity recognition methods may include, but are not limited to, one or more of hidden Markov models, maximum entropy models, support vector machines, rule-based recognition methods, statistics-based recognition methods, and the like. Specifically, the system can summarize and distill the elements found in past information and define various event categories, for example, one or more of diplomacy, finance, sports, politics, science, education, etc. These categories may also contain several levels of subcategories; for example, the finance category may include subcategories such as stocks, funds, and futures. After entity recognition is completed, the system normalizes the numbers and units in the entity-recognized text information. For example, "total project investment of thirty thousand yuan" (with the amount written in words) is converted into "total project investment of 30000 yuan", and "Messi completed a hat-trick in Barcelona's home match against Real Madrid" is converted into "Messi scored 3 goals in Barcelona's home match against Real Madrid".
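The number normalization step described above can be illustrated with a small converter from Chinese numerals to Arabic digits, matching the "三万元" to "30000元" example in the original Chinese text. This is a minimal sketch under stated assumptions: the names `chinese_to_int` and `normalize_numbers` are hypothetical, only integers built from the listed characters are handled, and numeral characters used in ordinary words (such as 一些) would also be rewritten, so a practical system would restrict the conversion to spans already recognized as number entities.

```python
import re

DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
NUMERAL_RUN = re.compile("[零一二两三四五六七八九十百千万亿]+")


def chinese_to_int(text: str) -> int:
    """Convert a run of Chinese numeral characters to an integer."""
    total = 0    # value of completed 万 / 亿 sections
    section = 0  # value accumulated inside the current section (below 10000)
    digit = 0    # most recent digit still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 counts as one ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10_000
            section = digit = 0
        elif ch == "亿":
            total = (total + section + digit) * 100_000_000
            section = digit = 0
    return total + section + digit


def normalize_numbers(text: str) -> str:
    """Replace every run of Chinese numerals in `text` with Arabic digits."""
    return NUMERAL_RUN.sub(lambda m: str(chinese_to_int(m.group())), text)
```

Calling `normalize_numbers("项目总投资三万元")` rewrites the amount in digits while leaving the surrounding text and the unit 元 untouched. Semantic rewrites such as "hat-trick" to "3 goals" would need a separate phrase dictionary and are not covered here.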
After the numbers and units have been normalized, the system classifies the text information to obtain its broad category (for example, the financial category). In one embodiment of the present invention, the system may access the system database 205 (see FIG. 2), compute a value from attribute features such as the number of occurrences, or the preset weights, of the category keywords stored in database 205 that appear in the text, and classify the text according to the computed value. The category keywords may be extracted by a specific method, including but not limited to one or a combination of statistics-based chi-square statistics, synonym rules, Boolean association rules, position rules, information gain, mutual information, odds ratio, cross entropy, and inter-class information difference.
In another embodiment of the present invention, the system may employ a machine-learning-based text classification method, including but not limited to one or more of decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least-squares fit, the k-nearest-neighbor algorithm (kNN), genetic algorithms, and maximum entropy. That is, a classifier is obtained by training on text annotated with category labels and is then used to perform sentiment classification on new text objects. After classification, text information may be assigned to one or more categories. A category may be predefined by the system and may contain several levels of subclasses. In the financial field, for example, text information may be divided into categories including but not limited to announcements, news, research reports, blogs, forums, microblogs, and interactive investment. The announcement category may include, but is not limited to, subclasses such as contracts, annual reports, and sentiment. Alternatively, the announcement category may include, but is not limited to, periodic reports and equity distribution announcements, transaction announcements, fundraising announcements, major events, preferential policy announcements, executive change announcements, and acquisition and repurchase announcements. The system may divide news into reliable and unreliable sources according to its origin; for example, the official information source CCTV Finance Channel may be regarded as a reliable source. News types may include, but are not limited to, one or more of finance, current politics, science and education, politics and law, society, sports, military, and entertainment.
It should be noted that the above description of classification is provided only to facilitate understanding of the invention and should not be regarded as its only embodiment. The classification step is not mandatory in the present invention: for some text information the system can determine the category directly, so the classification step can be skipped. For example, if the title of a piece of information is "Announcement of Company A on Signing a Major Contract for Daily Operations", the system can directly determine that it belongs to the announcement category.
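The keyword-based embodiment above (scoring a text by the counts and preset weights of the category keywords it contains) can be sketched as follows. This is only an illustrative sketch: the keyword table, the weights, the threshold, and the function names are assumptions made for the example, and the description does not specify the actual calculation method.

```python
from collections import defaultdict

# Hypothetical category keywords with preset weights. In the described system
# these would be stored in the system database 205; the values are invented.
CATEGORY_KEYWORDS = {
    "announcement": {"announcement": 3.0, "contract": 1.5, "annual report": 2.0},
    "news": {"reported": 2.0, "reporter": 2.0, "channel": 1.0},
}


def score_categories(text, category_keywords=CATEGORY_KEYWORDS):
    """Return {category: score}, where score = sum(weight * occurrences)."""
    lowered = text.lower()
    scores = defaultdict(float)
    for category, keywords in category_keywords.items():
        for keyword, weight in keywords.items():
            scores[category] += weight * lowered.count(keyword)
    return dict(scores)


def classify(text, threshold=2.0):
    """Return every category whose score reaches the (assumed) threshold.

    A text may receive zero, one, or several category labels.
    """
    scores = score_categories(text)
    return sorted(c for c, s in scores.items() if s >= threshold)


title = "Announcement of Company A on Signing a Major Contract for Daily Operations"
print(classify(title))
```

Because the title alone already contains high-weight keywords, a check like this could also serve as the shortcut in which the category is determined directly and the full classification step is skipped.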
After text classification is completed, attributes can be extracted from the text information. An attribute is a description of the nature of an entity or of its relationships. For example, FIG. 19 shows the cover and an excerpt of the announcement "CITIC Securities Co., Ltd. 2014 Annual Report". For this announcement, the extracted entity may be "CITIC Securities", and the attributes that can be extracted from the table in FIG. 19 include one or more of operating income, year-on-year change in net profit, total assets, total liabilities, total shareholders' equity, and so on. After attribute extraction, the system may combine the entity with an attribute according to certain rules to generate a refined event. For example, combining the entity "CITIC Securities" with one of its attributes, "year-on-year change in net profit", yields the refined event "CITIC Securities' net profit in the 2014 annual report grew by 116.2%".
After obtaining the refined event, the system may generate a natural language statement from it (step 303), such as "CITIC Securities annual report net profit growth greater than 100%". The statement may also cover individual-stock events, industry events, and market-wide events, such as "CITIC Securities annual report net profit year-on-year growth rate greater than 100%, brokerage industry annual report net profit year-on-year growth rate greater than 100%, annual report net profit year-on-year growth rate greater than 100%". Step 303 may be performed by the natural language processing module 203. The generated natural language statement may be input into an analysis system (for example, the backtest module 204) to analyze the identified event (step 304). In one embodiment of the present invention, step 304 may be performed by the backtest module 204. The analysis may include, but is not limited to, backtesting the event. Backtesting refers to combining the event with related historical events and data according to certain rules to generate a backtest report that users can refer to when investing.
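To make the path from entity and attribute to a natural language statement concrete, the sketch below combines an entity with one attribute into a refined event and then renders a threshold-style statement like the one quoted above. The `RefinedEvent` class, the threshold ladder, and the wording template are assumptions for illustration; the description only says the combination follows "certain rules".

```python
from dataclasses import dataclass


@dataclass
class RefinedEvent:
    entity: str       # e.g. "CITIC Securities"
    period: str       # e.g. "annual report"
    attribute: str    # e.g. "net profit"
    change_pct: float # year-on-year change in percent, e.g. 116.2


def to_statement(event, thresholds=(100, 50, 20, 0)):
    """Render the event as a coarse statement suitable for backtesting.

    The thresholds are hypothetical; the first one that the change strictly
    exceeds is used, so 116.2 becomes "greater than 100%".
    """
    for threshold in thresholds:
        if event.change_pct > threshold:
            return (f"{event.entity} {event.period} {event.attribute} "
                    f"growth greater than {threshold}%")
    return f"{event.entity} {event.period} {event.attribute} did not grow"


event = RefinedEvent("CITIC Securities", "annual report", "net profit", 116.2)
print(to_statement(event))
```

Bucketing the exact figure into a threshold is one plausible reason for this step: a statement such as "greater than 100%" matches many historical events, whereas "116.2%" would match almost none.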
It should be noted that the above description of the information analysis system's workflow is provided only to facilitate understanding of the invention and should not be regarded as the only feasible embodiment. The system may also convert the collected information directly into natural language statements (step 303) and then analyze those statements (step 304). Alternatively, the system may analyze the collected information directly (step 304).
FIG. 4 is a schematic structural diagram of the collection module 201. The collection module 201 may include, but is not limited to, an acquisition unit 401, a processing unit 402, and a storage unit 403. The acquisition unit 401 may acquire the required information from the information source 103 (see FIG. 2) or from other modules in the system (for example, the processing module 202, the natural language processing module 203, the backtest module 204, or the system database 205). The required information may include, but is not limited to, one or more of news, announcements, comments, research reports, blogs, messages, reports, notices, papers, journals, and the like. It may concern any industry, including but not limited to one or more of sports, entertainment, economics, politics, military affairs, culture, art, science, and engineering. Its form may include, but is not limited to, one or more of text, pictures, audio, and video. For example, the news may be a video news item played on a video website, "World Bank Lowers This Year's Global Economic Growth Forecast to 2.8%"; a web news item reported by a news website, "HSBC China Services PMI Rose to 53.5 in May"; a listed-company announcement issued by a stock exchange, "Announcement of A Co., Ltd. on Signing a Major Contract for Daily Operations"; or a football preview published by a live sports platform, "This Saturday Chelsea Will Face City Rivals Arsenal at Home at Stamford Bridge". Alternatively, the acquisition unit 401 may directly receive information input by the user, which may include, but is not limited to, natural language statements and programming languages.
The processing unit 402 may process the acquired information. Processing may include, but is not limited to: storing the acquired information in the storage unit 403; storing the acquired information in the system database 205; retrieving information from the storage unit 403 and sending it to other modules (for example, the processing module 202, the natural language processing module 203, the backtest module 204, or the system database 205); and retrieving information from the system database 205 and sending it to other modules (for example, the processing module 202, the natural language processing module 203, or the backtest module 204). Alternatively, the processing unit 402 may send the acquired information directly to other modules, such as the processing module 202, the natural language processing module 203, the backtest module 204, or the system database 205. The storage unit 403 may store the information collected by the collection module 201 as well as the information processed by the processing unit 402.
The above description of the collection module is merely a specific example and should not be regarded as the only feasible implementation. Obviously, after understanding the basic principle of the required information, those skilled in the art may make various modifications and changes to the content of the required information without departing from this principle, but such modifications and changes remain within the scope described above.
FIG. 5 is a schematic structural diagram of the processing module 202. The processing module 202 may include, but is not limited to, a format conversion module 501, a text preprocessing module 502, a text classification module 503, an attribute extraction module 504, and an event recognition module 505. These modules may be independent of one another, or some of them may be combined into a single module. Within the processing module 202, the format conversion module 501 may convert the format of the information collected by the collection module 201. Format conversion may be performed automatically by the system or manually, and may be performed in real time or at fixed time intervals. The file formats that can be converted include, but are not limited to, one or a combination of pdf, doc, docx, epub, mobi, caj, kdh, nh, bmp, jpg, tiff, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, mpeg, avi, mov, asf, wmv, navi, 3gp, RA, RAM, mkv, flv, rmvb, WebM, and the like. For example, if the collected information is a picture in jpg format and the picture contains text, the picture may be converted into a text format, such as txt, by OCR (Optical Character Recognition).
The text preprocessing module 502 may preprocess the format-converted text. Preprocessing may include, but is not limited to, one or more of word segmentation, entity recognition, and normalization. The text classification module 503 may classify the preprocessed text into categories including but not limited to announcements, news, research reports, blogs, forums, microblogs, and interactive investment. The announcement category may include, but is not limited to, subclasses such as contracts and annual reports. Alternatively, the announcement category may include, but is not limited to, reorganization announcements, equity incentive announcements, major contract announcements, preferential policy announcements, executive change announcements, and acquisition and repurchase announcements. The system may divide news into reliable and unreliable sources according to its origin; for example, the official information source CCTV Finance Channel may be regarded as a reliable source. News types may include, but are not limited to, one or more of finance, current politics, science and education, politics and law, society, sports, military, and entertainment. The attribute extraction module 504 may automatically match and extract the attributes related to an event; the extraction rules may be configured by the system or manually. The event recognition module 505 may derive the final refined event from the results of the attribute extraction module 504, the results of the text preprocessing module 502, the system database 205, and certain rules.
FIG. 6 is a schematic structural diagram of the format conversion module 501. The format conversion module 501 may include, but is not limited to, a control unit 601, a text processing unit 602, a picture processing unit 603, an audio processing unit 604, and a video processing unit 605. The control unit 601 may select the appropriate processing unit according to the information collected by the collection module 201. The text processing unit 602, the picture processing unit 603, the audio processing unit 604, and the video processing unit 605 may process the information collected by the collection module 201 in text, picture, audio, and video formats, respectively. In the embodiments described in this specification these units may be separate from one another, but in some embodiments some of them may be combined into a single unit; for example, the audio processing unit 604 and the video processing unit 605 may be combined into one audio-video processing unit that performs the functions of both.
The control unit 601 may determine the type of the information collected by the collection module 201 and select the corresponding processing unit according to that type. For example, if the control unit 601 determines that the collected information is in a text format, it selects the text processing unit 602 to perform the next processing step.
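The type-based selection performed by the control unit 601 can be pictured as a simple dispatch table. The sketch below decides by file extension, which is an assumption made for illustration (the description does not say how the type is determined), and the unit names are placeholder strings rather than real components.

```python
from pathlib import Path

# Hypothetical mapping from file extension to processing unit. Only a few of
# the formats listed in the description are included.
UNIT_BY_EXTENSION = {
    **dict.fromkeys(["txt", "html", "xhtml", "xml", "pdf", "doc", "docx"],
                    "text_processing_unit_602"),
    **dict.fromkeys(["bmp", "jpg", "gif", "tiff"], "picture_processing_unit_603"),
    **dict.fromkeys(["mp3", "wma", "aac", "ape"], "audio_processing_unit_604"),
    **dict.fromkeys(["avi", "wmv", "flv", "mkv"], "video_processing_unit_605"),
}


def select_processing_unit(filename):
    """Return the name of the unit that should handle the given file."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return UNIT_BY_EXTENSION.get(extension, "unknown")


for name in ["annual_report_2014.pdf", "scan.JPG", "interview.mp3", "news.flv"]:
    print(name, "->", select_processing_unit(name))
```

A production system would more likely inspect the file content (for example its leading bytes or declared MIME type), since extensions can be missing or wrong.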
The text processing unit 602 may process the text-format information collected by the collection module 201 and convert it into text data in a unified format. Specifically, the text formats in the collected information may include, but are not limited to, one or more of Hypertext Markup Language (html), Extensible Hypertext Markup Language (xhtml), Extensible Markup Language (xml), Portable Document Format (pdf), and doc and docx (proprietary formats of Microsoft). The text processing unit 602 may convert these formats into a unified text format, which may include, but is not limited to, txt. For example, if the information collected by the collection module 201 is the "CITIC Securities Co., Ltd. 2014 Annual Report" shown in FIG. 19, which is in pdf format, the text processing unit 602 may convert the announcement from pdf to txt.
The picture processing unit 603 may process the picture-format information collected by the collection module 201 and convert it into the unified text format. Specifically, the picture-format information may be pictures of books, newspapers, magazines, letters, and the like. Such pictures contain text, and the picture processing unit 603 may use OCR (Optical Character Recognition) technology to convert the picture information into the unified text format.
The audio processing unit 604 may process the audio-format data in the information collected by the collection module 201 and convert it into the unified text format. Specifically, the audio formats in the collected information may include, but are not limited to, one or a combination of CD, WAVE, AIFF, AU, MPEG, MP3, MIDI, WMA, RealAudio, VQF, OggVorbis, AAC, APE, and the like, and the audio processing unit 604 may convert them into text using speech recognition technology. Speech recognition techniques may include, but are not limited to, one or more of methods based on vocal tract models and speech knowledge, pattern matching methods, and methods using artificial neural networks, or any combination of these.
The video processing unit 605 may process the video-format data in the information collected by the collection module 201 and convert it into the unified text format. Specifically, the video formats in the collected information may include, but are not limited to, one or a combination of Flash Video, AVI, WMV, MPEG, Matroska, Real Video, QuickTime File Format, Ogg, MOD, and the like. The video processing unit 605 may export the subtitles of the video, including both embedded and external subtitles, as text and convert them into the unified text format. The video processing unit 605 may also extract the audio track of the video and apply speech recognition to convert it into the unified text format. In a specific embodiment, if the video carries no subtitles, the video processing unit 605 extracts the audio track, applies speech recognition, and converts the result into the unified text format; if the video carries subtitles, the subtitles are exported and converted into the unified text format, or alternatively the audio track may be extracted, recognized, and converted into the unified text format.
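The subtitle-or-audio decision made by the video processing unit 605 can be sketched as below. The `Video` container, the `prefer_audio` flag, and the injected `recognize_speech` callable are all assumptions for illustration; real subtitle extraction and speech recognition are outside the scope of the sketch and are represented by plain data and a stand-in function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Video:
    embedded_subtitles: List[str] = field(default_factory=list)
    external_subtitles: List[str] = field(default_factory=list)
    audio: Optional[bytes] = None  # raw audio track, if any


def video_to_text(video: Video,
                  recognize_speech: Callable[[bytes], str],
                  prefer_audio: bool = False) -> str:
    """Convert a video to plain text.

    Subtitles are exported when present, unless the caller prefers speech
    recognition and an audio track exists. Without subtitles, the audio
    track is recognized instead.
    """
    subtitles = video.embedded_subtitles + video.external_subtitles
    if subtitles and (not prefer_audio or video.audio is None):
        return "\n".join(subtitles)
    if video.audio is None:
        raise ValueError("video has neither subtitles nor an audio track")
    return recognize_speech(video.audio)


def fake_recognizer(audio: bytes) -> str:
    """Stand-in for a real speech recognizer."""
    return "recognized speech"


print(video_to_text(Video(embedded_subtitles=["World Bank lowers forecast"]),
                    fake_recognizer))
print(video_to_text(Video(audio=b"\x00\x01"), fake_recognizer))
```

Passing the recognizer in as a parameter keeps the decision logic testable without committing to any particular speech recognition library.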
The four processing units of the format conversion module 501, namely the text processing unit 602, the picture processing unit 603, the audio processing unit 604, and the video processing unit 605, need not all be present in every embodiment; some embodiments may include only one of them, or only some of them. In embodiments that include all four, the units may execute one after another, simultaneously, or in any other suitable order. After the format conversion module 501 has converted the information collected by the collection module 201 into the unified text format, the text preprocessing module 502 performs subsequent processing on that text.
The above description is merely a specific embodiment of the present invention and should not be regarded as the only embodiment. Obviously, after understanding the content and principles of the present invention, those skilled in the art may make various modifications and changes in form and detail without departing from the principles and structure of the invention, but such modifications and changes remain within the scope of protection of the claims of the present invention.
FIG. 7 is a schematic structural diagram of the text preprocessing module 502. The text preprocessing module 502 may include, but is not limited to, a language recognition unit 701, a word segmentation unit 702, an entity recognition unit 703, and a normalization unit 704. The language recognition unit 701 may identify the language of the text information processed by the format conversion module 501. The word segmentation unit 702 may segment the text into words. The entity recognition unit 703 may recognize the entities in the text. The normalization unit 704 may uniformly normalize the numeric content of the text together with its units, producing a standard numeric form. These units may be independent of one another, or some of them may be combined into a single unit; for example, the language recognition unit 701 may be combined with the word segmentation unit 702.
The language recognition unit 701 may identify the language of the text processed by the format conversion module 501. The languages used in the information collected by the collection module 201 may include, but are not limited to, one or more of Chinese, English, French, Russian, Spanish, Arabic, Japanese, German, and the like, and the language recognition unit 701 can identify which language the collected information uses.
The word segmentation unit 702 may apply a word segmentation algorithm to the text whose language has been identified by the language recognition unit 701. The languages used in the collected information include word-based languages, such as English, French, and Russian, in which words are naturally separated from one another, and character-based languages, such as Chinese, in which words are composed of characters and there is no natural separator between words. Therefore, Chinese text must be segmented before word-frequency statistics are computed, whereas English text need not be. Segmentation algorithms may include, but are not limited to, string-matching-based segmentation (that is, mechanical segmentation), understanding-based segmentation, statistics-based segmentation, or any combination of these. One embodiment of the present invention combines a statistics-based segmentation method with a dictionary-based segmentation method. The word segmentation unit 702 may send the system database 205 a request to access the dictionary database, and upon receiving the request the system database 205 may send the requested dictionary to the word segmentation unit 702.
The dictionary may be specific to a particular domain; for example, it may be a dictionary for announcements or a dictionary for news. More specifically, it may be a dictionary for reorganization announcements, incentive announcements, major contract announcements, preferential policy announcements, executive change announcements, acquisition and repurchase announcements, and so on. The word segmentation unit 702 may combine the segmentation result obtained statistically with the segmentation result obtained by dictionary matching to produce the final segmentation.
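As an illustration of the dictionary-matching half of the segmentation approach described above, the sketch below implements forward maximum matching, a common string-matching (mechanical) segmentation technique. It is not stated in the description that this particular algorithm is used, the statistical half is omitted entirely, and the small dictionary is invented for the example.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text by greedily taking the longest dictionary word at each
    position; characters that start no dictionary word become single tokens."""
    if max_word_len is None:
        max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical domain dictionary for announcements.
announcement_dictionary = {"中信证券", "股份有限公司", "年度报告", "净利润"}
title = "中信证券股份有限公司2014年年度报告"
print(forward_max_match(title, announcement_dictionary))
```

Note that the digits of "2014" come out as separate one-character tokens because they are not in the dictionary; a fuller implementation would group digit runs or defer to the statistical segmenter for such spans.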
The entity recognition unit 703 may apply an entity recognition method to the segmented text and may store the set of recognized entities in the entity database within the system database 205. Entities may include, but are not limited to, one or more of products, organization names, personal names, place names, times, dates, currencies, numbers, percentages, and the like. For example, the entities that can be recognized in the title "CITIC Securities Co., Ltd. 2014 Annual Report" are "CITIC Securities", "Co., Ltd.", "2014", and "Annual Report". Entity recognition methods may include, but are not limited to, hidden Markov models, maximum entropy models, support vector machines, recognition methods based on Boolean association rules, on synonym configuration rules, or on position rules, and statistics-based recognition methods, or any combination of these.
The normalization unit 704 may normalize the numbers in the text and their units so that they are expressed consistently. For example, the normalization unit 704 may convert "probability of rising: five percent" into "probability of rising: 5%", and "total project investment of thirty thousand yuan" into "total project investment of 30,000 yuan", and so on.
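The two normalization examples above are Chinese in the original (百分之五 becomes 5%, and 三万元 becomes 30000元). A minimal sketch of such a conversion is shown below. The parsing rules, the regular expressions, and the function names are assumptions for illustration; the sketch handles only simple numerals up to the 万 (ten thousand) unit and only the two patterns from the examples.

```python
import re

DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
WAN = 10_000  # 万

NUMERAL = "[零一二两三四五六七八九十百千万]+"


def chinese_to_int(numeral):
    """Convert a simple Chinese numeral such as 三万 or 一百零五 to an int."""
    total = 0    # value of the completed 万 section
    section = 0  # value accumulated below 万
    digit = 0    # digit waiting for its unit
    for ch in numeral:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]  # bare 十 means 10
            digit = 0
        elif ch == "万":
            total += (section + digit) * WAN
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def normalize(text):
    """Rewrite 百分之N as N% and N元 with Arabic digits."""
    text = re.sub("百分之(" + NUMERAL + ")",
                  lambda m: f"{chinese_to_int(m.group(1))}%", text)
    text = re.sub("(" + NUMERAL + ")元",
                  lambda m: f"{chinese_to_int(m.group(1))}元", text)
    return text


print(normalize("上涨概率百分之五"))   # probability of rising: five percent
print(normalize("项目总投资三万元"))   # total project investment: thirty thousand yuan
```

Larger units such as 亿, decimals, and other measurement units would need additional rules.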
The four processing units of the text preprocessing module 502 may execute in the order language recognition unit 701, word segmentation unit 702, entity recognition unit 703, and normalization unit 704. Alternatively, the language recognition unit 701 may execute first, and its result then determines whether the word segmentation unit 702 executes: when the text is identified as Chinese, the word segmentation unit 702 executes; when the text is identified as another language with fixed separators, such as English, Korean, or Russian, the word segmentation unit 702 may be skipped. The subsequent entity recognition unit 703 and normalization unit 704 may execute in that order, in reverse order, or simultaneously. The text preprocessing module 502 preprocesses the text information produced by the format conversion module 501, and the text classification module 503 then performs subsequent processing on the preprocessed text.
The above description is merely a specific embodiment of the present invention and should not be regarded as the only embodiment. Obviously, after understanding the content and principles of the present invention, those skilled in the art may make various modifications and changes in form and detail without departing from the principles and structure of the invention, but such modifications and changes remain within the scope of protection of the claims of the present invention.
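The conditional execution order of the four preprocessing units described above can be summarized in a small pipeline sketch. Everything here is hypothetical scaffolding: the language detector is a toy that only checks for CJK ideographs, and the segmenter, entity recognizer, and normalizer are passed in as plain callables standing in for units 702, 703, and 704.

```python
def detect_language(text):
    """Toy stand-in for unit 701: 'zh' if any CJK ideograph is present."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "other"


def preprocess(text, segment, recognize_entities, normalize):
    """Run the preprocessing units in the order described in the text.

    Word segmentation (unit 702) runs only for Chinese; text in languages
    with natural separators is simply split on whitespace.
    """
    language = detect_language(text)
    tokens = segment(text) if language == "zh" else text.split()
    return {
        "language": language,
        "tokens": tokens,
        # Units 703 and 704 do not depend on each other here, so their
        # relative order (or running them in parallel) does not matter.
        "entities": recognize_entities(tokens),
        "normalized": normalize(text),
    }


result = preprocess(
    "Messi scored 3 goals",
    segment=list,  # placeholder segmenter: one token per character
    recognize_entities=lambda toks: [t for t in toks if t[:1].isupper()],
    normalize=str.strip,
)
print(result)
```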
FIG. 8 is a schematic structural diagram of the text classification module 503. The text classification module 503 may include, but is not limited to, one or more keyword extraction units 801 and one or more classification units 802. The keyword extraction unit 801 may extract keywords from the text processed by the text preprocessing module 502. The classification unit 802 may classify the text according to predefined rules on the basis of the extracted keywords. These units may be independent of one another, or some of them may be combined into a single unit.
The keyword extraction unit 801 may analyze the text processed by the text preprocessing module 502 and extract keywords from it. Keyword extraction methods may include, but are not limited to, statistics-based chi-square statistics, synonym rules, Boolean association rules, position rules, information gain, mutual information, odds ratio, cross entropy, and inter-class information difference, or any combination of these. For example, for the announcement "CITIC Securities Co., Ltd. 2014 Annual Report", the keyword extraction unit 801 first extracts keywords, which may include, but are not limited to, "CITIC Securities", "2014", "annual report", "net profit", "year-on-year change", and so on.
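Of the keyword extraction methods listed above, the chi-square statistic is easy to show in a few lines. For a term $t$ and category $c$, with $A$ documents in $c$ containing $t$, $B$ documents outside $c$ containing $t$, $C$ documents in $c$ without $t$, $D$ documents outside $c$ without $t$, and $N = A + B + C + D$, the standard statistic is $\chi^2(t, c) = \frac{N (AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$. The sketch below computes it over a tiny invented corpus; the corpus, the function names, and the tie-breaking rule are assumptions for illustration.

```python
def chi_square(term, category, documents):
    """Chi-square association between a term and a category.

    documents is a list of (set_of_terms, category_label) pairs.
    """
    a = b = c = d = 0
    for terms, label in documents:
        has_term = term in terms
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is absent or universal
    return n * (a * d - b * c) ** 2 / denominator


def top_keywords(category, documents, k=3):
    """Rank the vocabulary by chi-square score (ties broken alphabetically)."""
    vocabulary = set().union(*(terms for terms, _ in documents))
    ranked = sorted(vocabulary,
                    key=lambda t: (-chi_square(t, category, documents), t))
    return ranked[:k]


corpus = [
    ({"net profit", "annual report"}, "announcement"),
    ({"annual report", "contract"}, "announcement"),
    ({"goal", "match"}, "news"),
    ({"match", "contract"}, "news"),
]
print(chi_square("annual report", "announcement", corpus))
print(top_keywords("announcement", corpus, k=2))
```

The statistic measures the strength of association but not its direction, so a term that never appears in the category (here "match") scores as high as one that always does; implementations often add a sign check on $AD - BC$ to keep only positively associated terms.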
The classification unit 802 may use the keywords extracted by the keyword extraction unit 801 to classify the text according to a given classification method and attach category labels. Classification methods may include, but are not limited to, decision trees, Rocchio, naive Bayes, neural networks, hidden Markov models, support vector machines, linear least-squares fit, the k-nearest-neighbor (kNN) algorithm, genetic algorithms, maximum entropy, or any combination of these methods. The classification unit 802 may send a keyword database access request to the system database 205. Upon receiving the request, the system database 205 sends the requested keywords to the classification unit 802. The classification unit 802 may match, according to a given algorithm, the keywords extracted by the keyword extraction unit 801 against the keywords sent by the system database 205, classify the text according to the matching result, and attach the corresponding category label to the text. Specifically, for the aforementioned full-text announcement titled "CITIC Securities Co., Ltd. 2014 Annual Report", the classification unit assigns it to the announcement category and the annual report subcategory and labels it accordingly. Depending on the matching result, a text may belong to several categories at once; in that case the text simply receives one label per category, and a single text may carry two or more labels.
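The keyword-matching path with multiple labels can be sketched as follows. This is a deliberately simple set-overlap matcher, not one of the named classifiers; `CATEGORY_KEYWORDS` is a hypothetical stand-in for the keywords returned by the system database 205, and `min_overlap` is an assumed tuning parameter.

```python
from typing import Dict, List, Set

# Hypothetical category -> keyword table standing in for the keyword database.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "announcement/annual report": {"annual report", "net profit"},
    "announcement/bid award": {"winning bid", "contract"},
}


def classify(keywords: List[str], min_overlap: int = 1) -> List[str]:
    """Return every label whose keyword set shares at least min_overlap keywords."""
    extracted = set(keywords)
    return sorted(
        label
        for label, vocabulary in CATEGORY_KEYWORDS.items()
        if len(extracted & vocabulary) >= min_overlap
    )
```

Because the function returns a list, a text that matches several categories naturally receives several labels, and a text that matches none receives an empty list.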
The text classification module 503 is optional in some embodiments. For example, if the information elements in the information collected by the collection module 201 are already well defined, the text classification step may be skipped. Specifically, suppose the collected information is a news brief reading: "The 18th Shanghai International Film Festival closed on the evening of June 21, 2015, and the Golden Goblet Awards of the main competition were announced. The Chinese film 'The Dead End' was the biggest winner: Deng Chao, Guo Tao, and Duan Yihong shared the Best Actor award, and Cao Baoping won Best Director." The elements of this brief, such as time, persons, and events, are clear, so the text classification module 503 may be skipped and the text passed directly to the next module, the attribute extraction module 504, for subsequent processing. The keyword extraction unit 801 and the classification unit 802 of the text classification module 503 may execute sequentially, that is, the keyword extraction unit 801 executes first and the classification unit 802 executes afterwards. The above description is merely a specific embodiment of the present invention and should not be regarded as the only embodiment. Obviously, those skilled in the art, having understood the content and principles of the present invention, may make various modifications and changes in form and detail without departing from the principles and structure of the invention, but such modifications and changes remain within the scope of the claims of the present invention.
FIG. 9 is a schematic structural diagram of the attribute extraction module 504. The attribute extraction module 504 may include, but is not limited to, one or more keyword extraction units 901, one or more attribute extraction templates 902, and one or more attribute extraction units 903. The keyword extraction unit 901 may extract keywords from the text; the attribute extraction template 902 may store extraction rules capable of extracting event attributes; and the attribute extraction unit 903 may perform the extraction of event attributes. These units may be independent of one another, or some of them may be combined into a single unit.
The keyword extraction unit 901 may analyze the text processed by the text preprocessing module 502 and extract keywords. Extraction methods may include, but are not limited to, statistics-based chi-square statistics, synonym rules, Boolean association rules, position rules, information gain, mutual information, odds ratio, cross entropy, inter-class information difference, or any combination of these methods. The keyword extraction unit 901 is optional in some embodiments. Because the text classification module 503 is optional, when the text classification module 503 is skipped and the information is processed directly by the attribute extraction module 504, the keyword extraction unit 901 may be executed to extract keywords from the preprocessed text. If the text classification module 503 has already been executed, the keyword extraction unit 901 may be skipped.
The attribute extraction template 902 may store extraction rules capable of extracting event attributes. An event consists of entities and the attributes of those entities. The entity recognition unit 703 in the text preprocessing module 502 has already performed entity recognition on the text and formed an entity set; the attribute extraction template 902 stores different attribute extraction rules for different entities. The extraction rules are configured in advance. They may be configured manually, by setting different entity attribute extraction rules for each predefined text category. They may also be configured by machine learning. For example, a batch of training texts may first be selected; these are manually annotated texts with clearly defined categories. Training on these texts yields an attribute extractor (not shown in the figure). The attribute extractor can extract the required attributes according to the text category and is then used to extract attributes from new texts. The attribute extraction template 902 is optional in some embodiments. When the text classification module 503 has not been executed, the text has no category and therefore no corresponding attribute extraction template.
The attribute extraction unit 903 may perform the extraction of attributes from the text. After the text classification module 503 has executed, the attribute extraction unit 903 may select the corresponding attribute extraction template according to the classification result and the labels attached to the text. When the text carries two or more labels, the corresponding number of attribute extraction templates may be selected at the same time; attributes are then extracted from the text and the results are clustered. As a specific example, for the item "CITIC Securities Co., Ltd. 2014 Annual Report", the label assigned by the text classification module 503 is the announcement category, annual report subcategory. Based on this label, the attribute extraction unit 903 selects the corresponding attribute extraction template and extracts event attributes according to the selected template. Event attributes may include, but are not limited to, operating revenue and its year-on-year change, net profit, total assets and their change, and similar information. For example, the attribute "net profit attributable to shareholders of the parent company / growth over the same period of the previous year (%) / 116.20%" may be extracted from the above announcement. The above description is merely a specific embodiment of the present invention and should not be regarded as the only embodiment. Obviously, those skilled in the art, having understood the content and principles of the present invention, may make various modifications and changes in form and detail without departing from the principles and structure of the invention, but such modifications and changes remain within the scope of the claims of the present invention.
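Template selection by label can be illustrated with a small sketch in which each template is a set of regular expressions. The `TEMPLATES` table, the attribute names, and the patterns are all hypothetical; the specification leaves the rule format open (manual rules or a learned extractor).

```python
import re
from typing import Dict, List

# Hypothetical templates: label -> {attribute name: regex with one capture group}.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "annual report": {
        "net_profit_yoy_pct": r"net profit[^%]*?(-?\d+(?:\.\d+)?)\s*%",
    },
    "bid award": {
        "bid_amount": r"bid amount of (\d+(?:\.\d+)?)",
    },
}


def extract_attributes(text: str, labels: List[str]) -> Dict[str, float]:
    """Apply the template of every label attached to the text and merge the hits."""
    found: Dict[str, float] = {}
    for label in labels:
        for name, pattern in TEMPLATES.get(label, {}).items():
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                found[name] = float(match.group(1))
    return found
```

A text with two labels simply has both templates applied; an unlabeled text (classification skipped) yields no template and therefore an empty result, which mirrors the behavior described for the optional template 902.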
Returning to FIG. 5, the event recognition module 505 may perform the recognition of events. The entity recognition unit 703 in the text preprocessing module 502 has already performed entity recognition on the text and formed an entity set, and the attribute extraction module 504 has extracted the required attribute set from the text. Based on the entity recognition result and the attribute extraction result, the event recognition module 505 may recognize the event according to a given event recognition template and generate a refined event. As a specific example, for the "CITIC Securities Co., Ltd. 2014 Annual Report", the entities recognized by the text preprocessing module 502 include "CITIC Securities", "Co., Ltd.", "2014", "annual report", "net profit", "increase or decrease over the same period", and so on, and one event attribute extracted by the attribute extraction module 504 is "net profit attributable to shareholders of the parent company / increase or decrease over the same period of the previous year (%) / 116.20%". From the entity recognition result and the attribute extraction result, the event recognition module 505 may obtain the final refined event: "CITIC Securities 2014 annual report announcement: year-on-year growth rate of net profit attributable to shareholders of the parent company equals 116.2%".
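The combination of entity and attribute results into a refined event can be sketched as a simple template fill. The `Attribute` class, the `refine_event` function, and the wording of the output are assumptions for illustration only; the actual event recognition template is not specified.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    indicator: str   # e.g. "net profit attributable to shareholders of the parent company"
    measure: str     # e.g. "year-on-year growth rate"
    value: float     # percentage value


def refine_event(company: str, year: str, doc_type: str, attr: Attribute) -> str:
    """Fill a hypothetical event template from entity and attribute results."""
    return (
        f"{company} {year} {doc_type} announcement: "
        f"{attr.measure} of {attr.indicator} equals {attr.value:g}%"
    )
```

The `:g` format drops the trailing zero, so an extracted value of 116.20 appears as 116.2 in the refined event, matching the example in the text.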
In some embodiments, when the collection module 201 collects information input by the user and that information is to be backtested, complex logic may be involved, and the event recognition module 505 cannot recognize the event at a fine-grained level from the entity information and attribute information alone. In that case the event must be recognized using the event library in the system database 205 (the event attribute database 1405 or the event recognition database 1411) together with a given rule-based method. As a specific example, when the user input is "companies that won contracts whose bid amount accounts for more than 50% of operating revenue", the bid award announcement will very likely not contain the company's operating revenue. The final refined event category can then be computed from database data (such as historical data) and a given rule-based method. The above description is merely a specific embodiment of the present invention and should not be regarded as the only embodiment. Obviously, those skilled in the art, having understood the content and principles of the present invention, may make various modifications and changes in form and detail without departing from the principles and structure of the invention, but such modifications and changes remain within the scope of the claims of the present invention.
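The bid-amount example can be sketched as a lookup plus a ratio test. `HISTORICAL_REVENUE`, the company names, and the figures are invented placeholders for data that would come from the system database 205; the strict "greater than" comparison follows the "more than 50%" wording and is itself an assumption.

```python
from typing import Dict, Optional

# Hypothetical historical data: company -> latest annual operating revenue.
HISTORICAL_REVENUE: Dict[str, float] = {"Company A": 200.0, "Company B": 1000.0}


def bid_exceeds_share(company: str, bid_amount: float,
                      threshold: float = 0.5) -> Optional[bool]:
    """Return whether the bid exceeds `threshold` of revenue, or None if revenue is unknown."""
    revenue = HISTORICAL_REVENUE.get(company)
    if not revenue:
        return None  # the announcement alone cannot decide the event
    return bid_amount / revenue > threshold
```

Returning `None` rather than `False` keeps "revenue unknown" distinct from "ratio below the threshold", which matters when the result feeds a backtest.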
FIG. 10 is a flowchart of the processing module. The system converts the format of the information transmitted by the collection module 201 into a unified text format (step 1001). Format conversion may include, but is not limited to, converting one or more of text, pictures, audio, video, and the like. Step 1001 may be implemented by the format conversion module 501. The system preprocesses the information in text format (step 1002). Preprocessing may include, but is not limited to, one or more of language recognition, text segmentation, entity recognition, normalization, and the like. Step 1002 may be implemented by the text preprocessing module 502. The system classifies the text (step 1003). The classification step may include, but is not limited to, keyword extraction and classification. Step 1003 may be implemented by the text classification module 503. The system extracts attributes from the text (step 1004). Step 1004 may be implemented by the attribute extraction module 504. The system recognizes the event (step 1005). Step 1005 may be implemented by the event recognition module 505. Optionally, the system may proceed directly to step 1002 without performing step 1001, and directly to step 1004 without performing step 1003.
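The flow of steps 1001 to 1005, including the two optional skips, can be sketched as a small driver. The `is_text` and `elements_clear` flags and the step-table shape are assumptions for illustration; the specification does not say how the skip decision is represented.

```python
from typing import Callable, Dict, List

Step = Callable[[dict], dict]


def run_pipeline(doc: dict, steps: Dict[int, Step]) -> List[int]:
    """Run steps 1001-1005 in order and return the numbers of the steps executed."""
    executed: List[int] = []
    for number in (1001, 1002, 1003, 1004, 1005):
        if number == 1001 and doc.get("is_text"):
            continue  # input is already text: skip format conversion
        if number == 1003 and doc.get("elements_clear"):
            continue  # information elements already clear: skip classification
        doc = steps[number](doc)
        executed.append(number)
    return executed
```

Each entry in `steps` would wrap the corresponding module (501 to 505); returning the executed step numbers makes the skip behavior easy to check.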
FIG. 11 is a schematic structural diagram of the natural language processing module 203. The natural language processing module 203 may include, but is not limited to, a collection unit 1101 and a natural language generation unit 1102. The collection unit 1101 may collect the required information by accessing other modules in the system (for example, the collection module 201, the processing module 202, the backtest module 204, and the system database 205). The natural language generation unit 1102 may convert the information collected by the collection unit 1101 into natural language statements. In some embodiments of the present invention, the collection unit 1101 may receive the refined event output by the processing module 202. The collection unit 1101 may also receive user input information from the collection module 201. The natural language generation unit 1102 may receive the refined event and process it according to the user input information. In one embodiment of the present invention, for example in the stock domain, for an announcement the user may choose to generate natural language statements for the individual stock, for the industry, for the whole market, or for one or more of these. In the news domain, for example, for an IPO the user may choose to generate natural language statements for the broad market. Note that the above natural language statements may also be generated automatically without user intervention. For example, for an individual stock, the system may automatically generate natural language statements for the individual stock, for the industry, for the whole market, or for a combination of one or more of these. In the news domain, for an IPO, the system may automatically generate natural language statements for the broad market. Further, in the news domain, the natural language generation unit 1102 may generate natural language statements about commodity prices, weather conditions, and demographics. The numerical values in the above natural language statements may be omitted (without a value, the statement only indicates whether the event occurred). The natural language statements generated by the natural language generation unit 1102 may be input to the backtest module 204 for backtesting. In other embodiments of the present invention, the user inputs the natural language statement to be backtested into the collection module 201, and the natural language processing module 203 receives it from the collection module 201. Alternatively, the user may enter the natural language statement to be backtested directly into a query box of the natural language processing module 203 (not shown in the figure). The natural language processing module 203 may preprocess the user's natural language statement to obtain a standard node sequence (the nodes include at least indicator nodes and condition nodes) and construct a node tree according to the relationships between the indicator nodes and the other nodes. The node tree may be used to represent a combination of indicator conditions. A data query instruction may be generated from the node tree and input to the backtest module 204 for backtest analysis. The user may retrieve the historical data stored in the system database 205 through the natural language processing module 203, and may combine a number of natural language statements using Boolean operators (AND, OR, NOT, etc.).
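A node tree that combines indicator conditions with Boolean operators can be sketched as follows. The `Condition` and `BoolNode` classes and the `evaluate` function are hypothetical; the specification does not define the tree's data structure or how the query instruction is generated from it, so the sketch evaluates the tree directly against one record instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Condition:
    indicator: str                   # indicator node, e.g. "net_profit_yoy_pct"
    test: Callable[[float], bool]    # condition node, e.g. lambda v: v > 100


@dataclass
class BoolNode:
    op: str                          # "AND", "OR" or "NOT"
    children: List["Node"]


Node = Union[Condition, BoolNode]


def evaluate(node: Node, record: Dict[str, float]) -> bool:
    """Evaluate the node tree against one record of indicator values."""
    if isinstance(node, Condition):
        return node.indicator in record and node.test(record[node.indicator])
    results = [evaluate(child, record) for child in node.children]
    if node.op == "AND":
        return all(results)
    if node.op == "OR":
        return any(results)
    if node.op == "NOT":
        return not results[0]
    raise ValueError(f"unknown operator: {node.op}")
```

In a backtest, `evaluate` would be applied to each historical record to select the dates or stocks that satisfy the combined statement.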
Note that the above description of the natural language processing module 203 is merely a specific example and should not be regarded as the only feasible implementation. Obviously, those skilled in the art, having understood the basic principles of the required information, may make various modifications and changes to the content of the required information without departing from these principles, but such modifications and changes remain within the scope of the above description. For example, the natural language processing module 203 may also receive the backtest results output by the backtest module 204.
FIG. 12 is a schematic structural diagram of the backtest module. The backtest module 204 may include a standard question unit 1201, an other-question unit 1202, an optimization unit 1203, and an extension unit 1204. The standard question unit 1201, the other-question unit 1202, the optimization unit 1203, and the extension unit 1204 may be independent of one another. Some of these units may also be combined to operate as a single unit.
The standard question unit 1201 may receive system-standard natural language statement events. In some embodiments of the present invention, the standard question unit 1201 may receive natural language statements generated by the natural language processing module 203, and may also receive natural language statements originating from other modules. The other modules include, but are not limited to, one or more of the collection module 201, the processing module 202, the natural language processing module 203, and the system database 205. The other-question unit 1202 may receive non-system natural language statements. Non-system natural language statements may include, but are not limited to, one or more of user input, expert definitions, system extraction results, and the like. The standard question unit 1201 and the other-question unit 1202 may be merged into a single question unit that can receive natural language statement events from both the system and the user.
The optimization unit 1203 may derive an optimized strategy combination from the received information and a backtest algorithm. Optimization may be automatic or manual. For example, in the financial domain, the basic backtest data produced by the backtest algorithm include, but are not limited to, one or more of the holding period, average return per trade, maximum return per trade, minimum return per trade, expected annualized return, number of trades, profit/loss ratio, success rate, maximum drawdown, weekly win rate, Sharpe ratio, maximum number of consecutive days without stock-selection results, and average number of stocks selected per day. The optimization unit 1203 may further give an optimal strategy based on the backtest results, along with a report rating and the like. The extension unit 1204 may be configured to provide a subscription function and may also be configured to provide an information sharing function. With the subscription function, the extension unit 1204 subscribes to information containing specific keywords selected by the user and pushes the information content analyzed by the system to the user in various ways. With the sharing function, the user shares information of interest with friends and others in various ways.
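A few of the listed backtest figures can be computed from a series of per-trade returns, as in the sketch below. The function name and the exact definitions (for instance, success rate as the share of trades with a positive return, and maximum drawdown measured on a compounded equity curve) are assumptions; the specification names the metrics but does not define them.

```python
from typing import Dict, List


def backtest_summary(returns: List[float]) -> Dict[str, float]:
    """Summarize per-trade returns given as fractions (0.05 means 5%)."""
    if not returns:
        raise ValueError("at least one trade is required")
    equity, peak, max_drawdown = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r                      # compound the trade result
        peak = max(peak, equity)               # highest equity seen so far
        max_drawdown = max(max_drawdown, (peak - equity) / peak)
    wins = [r for r in returns if r > 0]
    return {
        "trades": len(returns),
        "mean_return": sum(returns) / len(returns),
        "max_return": max(returns),
        "min_return": min(returns),
        "success_rate": len(wins) / len(returns),
        "max_drawdown": max_drawdown,
    }
```

Metrics such as the Sharpe ratio or annualized return would additionally need holding periods and a risk-free rate, which this sketch leaves out.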
FIG. 13 is a flowchart of the backtest. In step 1301, information is received. The sources of the received information may include, but are not limited to, the collection module 201, the processing module 202, the natural language processing module 203, and the system database 205. The received information may be a natural language statement or a machine statement. In step 1301, the backtest module 204 may receive a natural language statement. The natural language statement may be input directly by the user or generated by other modules. Step 1301 may be performed by the standard question unit 1201 and/or the other-question unit 1202.
In step 1302, the received natural language statement is backtested against historical data. The historical data may be stored in the system database 205 or in the backtest module 204. The backtest analysis of the natural language statement against the historical data may be carried out with a given optimization method. Optimization methods may include, but are not limited to, one or more of system definition, user-defined selection, machine learning, and the like.
In step 1303, the result of the optimization analysis is matched with the corresponding text template. The text template may be system-defined or user-defined. The content produced by template matching may include, but is not limited to, one or more of a backtest report, a report rating, an optimal strategy, a trend forecast, and the like.
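Matching a result to a text template can be sketched with Python's standard `string.Template`. The template wording, the field names, and `render_report` are hypothetical; the specification does not give the content of any template.

```python
from string import Template

# Hypothetical system-defined text template for a backtest report sentence.
REPORT_TEMPLATE = Template(
    "Over $trades historical occurrences, the average return was $mean_pct% "
    "with a success rate of $success_pct%."
)


def render_report(stats: dict) -> str:
    """Fill the text template from backtest statistics given as fractions."""
    return REPORT_TEMPLATE.substitute(
        trades=stats["trades"],
        mean_pct=f"{stats['mean_return'] * 100:.1f}",
        success_pct=f"{stats['success_rate'] * 100:.0f}",
    )
```

A user-defined template would only need to replace `REPORT_TEMPLATE`; `substitute` raises `KeyError` if the template asks for a field the statistics do not provide.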
Note that the above description of the information analysis system flow is provided only to aid understanding of the invention and should not be regarded as the only feasible embodiment of the present invention. Obviously, those skilled in the art, having understood the underlying principles, may make various modifications and changes to the content of the required information without departing from these principles, but such modifications and changes remain within the scope of the above description. For example, in step 1303, the backtested historical information may be matched with a text template, or with a voice, video, picture, or other template. As another example, the backtest process may be performed by the backtest module 204, or by the natural language processing module 203, the processing module 202, the collection module 201, and so on.
Returning to FIG. 12, the backtest module 204 may further include the extension unit 1204. The extension unit 1204 may be configured to provide a subscription function and may also be configured to provide an information sharing function. The extension unit may include, but is not limited to, various types of application programming interfaces (APIs), such as object-oriented APIs, library and framework APIs, APIs and protocols, APIs and device interfaces, web APIs, or one or more of these. With the subscription function, the extension unit 1204 subscribes to information containing specific keywords selected by the user and pushes the information content analyzed by the system to the user in various ways. With the sharing function, the user shares information of interest with friends in various ways. The subscription function of the extension unit 1204 may include, but is not limited to, providing pushed information to the user, recommending users with similar interests to follow, recommending comments on the information, and providing ratings of whether the information was helpful. The push channels of the extension unit 1204 may include, but are not limited to, mobile client software, email, SMS, RSS portal websites, online single-user aggregators, search engines, browsers, instant messaging software, social networks, and the like. The push schedule of the extension unit 1204 may be set by the system or defined by the user. Pushes may be periodic or irregular, and may be real-time or delayed. The form of the information content pushed by the extension unit 1204 may include, but is not limited to, one or more of text, voice, pictures, animation, video, and the like. The information content pushed by the extension unit 1204 may include, but is not limited to, one or more of updates to content the user has browsed, information the user follows, information recommended by the system based on the user's records, and the popularity of similar information.
The sharing function of the extension unit 1204 may be a way for the user to publish information: sharing it to a specified place, choosing who can see it, and so on. The shared content may be a single item or multiple items of information, a selected portion or the entire content of a page, the information content itself or comments on it, or the attention level or helpfulness rating of the information. The sharing channels may include, but are not limited to, one or more of SMS, MMS, email, QQ, MSN, WeChat, Weibo, Douban, Twitter, Facebook, Instagram, Renren, instant messaging software tools, and the like. The recipients of shared information may include, but are not limited to, one or more of a single friend, multiple friends, a circle of friends, a public social circle, a forum, other users, and the like. The content format of shared information may include, but is not limited to, one or more of text, pictures, voice, animation, video, web links, and the like. The above description of the information sharing mode and the functions it implements is merely a specific example and should not be regarded as the only feasible implementation. The above description of the extension unit 1204 is merely a specific example and should not be regarded as the only feasible implementation. Obviously, those skilled in the art, having understood the basic principles of the extension unit 1204, may make various modifications and changes in form and detail to the specific manner and steps of implementing the extension unit and to the functions it can provide without departing from these principles, but such modifications and changes remain within the scope of the above description.
Figure 14 is a block diagram of the system database 205. The system database 205 may include, but is not limited to, one or more of the following: a raw information database, a text database, a text preprocessing database, an entity database, an event attribute database, a keyword database, a text classification database, a historical information database, a natural language processing database, an event recognition database, a backtest module database, a text template database, a dictionary database, and the like. The system database 205 can store data and templates, and can also process data. For example, the historical information collected by the historical information database 1409 can be classified and stored in that database. Similarly, any update that occurs during information processing is applied to the relevant databases in real time; for example, synonym updates are applied in the keyword database 1406. Obviously, for those skilled in the art, after understanding the database principle of the information analysis system and method, various modifications and changes in form and detail may be made to the application of the above method and system without departing from this principle; such modifications and changes remain within the scope of the above description. For example, the databases in the system database 205 may instead be implemented within the collection module 201, the processing module 202, the natural language processing module 203, and the backtest module 204, respectively. A single database may also serve the function of two or more databases; for example, the text preprocessing database 1403 may store preprocessed data, entity data, event attributes, and keywords at the same time, thereby fulfilling the functions of the entity database 1404, the event attribute database 1405, and the keyword database 1406.
Figure 15 is a flowchart of information analysis. At step 1501 the information analysis system collects the required information. Step 1501 may be performed by the collection module 201. The required information may include, but is not limited to, one or more of news, announcements, commentary, research reports, blogs, messages, reports, notices, papers, journals, and the like. The required information may relate to any industry, including but not limited to one or more of sports, entertainment, economics, politics, military affairs, culture, art, science, engineering, and the like. The form of the required information may include, but is not limited to, one or more of text, pictures, audio, video, and the like. In some embodiments of the invention, the information collected at step 1501 may be text information. The text information may be in one or more formats including, but not limited to, pdf, doc, epub, mobi, caj, and the like.
At step 1502 the system may preprocess the text information. Step 1502 may be performed by the processing module 202. Text preprocessing may include, but is not limited to, one or more of format conversion, word segmentation, entity recognition, and number and unit normalization. For example, suppose the information collected at step 1501 is the announcement "CITIC Securities Co., Ltd. 2014 Annual Report", a PDF file that can be downloaded from the Shanghai Stock Exchange website. Through format conversion, the system converts the announcement into plain text in txt format to facilitate word segmentation and subsequent text processing. At the same time, the tables inside the PDF are analyzed and processed, retaining part of the formatting and context information. After format conversion, the system segments the announcement into words according to a given method. Word segmentation may be based on a statistical model and the dictionary database 1407. Alternatively, it may be implemented by applying certain rules, which may include, but are not limited to, one or more of synonym configuration, Boolean association rules, and position rules. After word segmentation, the system performs entity recognition on the announcement. Recognized entities include, but are not limited to, products, organization names, person names, place names, times, dates, currencies, numbers, percentages, and the like. Specifically, the system may summarize and distill the elements of past information and define various event categories, for example one or more of diplomacy, finance, sports, politics, science, education, and the like. These categories may also contain several levels of subcategories; for example, the finance category contains treasury bonds, stocks, funds, and so on. After recognition is complete, the system normalizes the numbers and units in the announcement, for example converting "净利润增长百分之三" ("net profit grew by three percent", with the percentage spelled out in words) to "净利润增长3%" ("net profit grew by 3%").
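The number and unit normalization described above can be illustrated with a minimal sketch. It handles only the spelled-out percentage form from the example (百分之 followed by a numeral below 100); the function names and the digit table are assumptions made for illustration and are not taken from the patent.

```python
import re

# Assumed lookup table for single Chinese digits (illustrative, not exhaustive).
CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

def cn_to_int(text):
    """Convert a Chinese numeral below 100 (e.g. 三, 十五, 二十) to an int."""
    if "十" in text:
        tens, _, ones = text.partition("十")
        return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
    return CN_DIGITS[text]

def normalize_percent(sentence):
    """Rewrite every 百分之<numeral> in the sentence as <number>%."""
    pattern = re.compile(r"百分之([零一二两三四五六七八九十]+|[0-9]+(?:\.[0-9]+)?)")

    def repl(match):
        token = match.group(1)
        # ASCII tokens are already Arabic numerals; others are Chinese numerals.
        value = token if token.isascii() else str(cn_to_int(token))
        return value + "%"

    return pattern.sub(repl, sentence)

print(normalize_percent("净利润增长百分之三"))
```

A production system would also need to cover larger numerals, decimals written in words, and currency units, which this sketch leaves out.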
After text preprocessing is complete, the system performs text processing on the announcement (step 1503). Text processing may be performed by the processing module 202. The system performs keyword matching on the preprocessed announcement. Keyword matching may be combined with methods such as synonym configuration, Boolean association rules, and position rules. Based on the results of keyword matching or other processing methods, the system may determine the category of the announcement (step 1504). For example, for the "CITIC Securities Co., Ltd. 2014 Annual Report" shown in Figure 19, the extracted keywords may be "CITIC Securities", "2014", "annual report", "net profit", "year-on-year change", and so on, and the announcement may therefore be assigned to the broad category of financial reports. Step 1504 may be performed by the processing module 202.
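As a rough illustration of keyword-based category determination, the sketch below counts how many of each category's keywords appear among the extracted keywords. The rule table, the category names, and the `min_hits` threshold are hypothetical; the patent does not specify how matches are scored.

```python
# Hypothetical rule table: category -> keywords that indicate it.
CATEGORY_RULES = {
    "financial report": {"annual report", "net profit", "year-on-year change"},
    "major contract": {"major contract", "signed"},
}

def classify(keywords, rules=CATEGORY_RULES, min_hits=2):
    """Return the category with the most keyword hits, or None if too few match."""
    found = set(keywords)
    hits = {category: len(required & found) for category, required in rules.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None

extracted = ["CITIC Securities", "2014", "annual report", "net profit", "year-on-year change"]
print(classify(extracted))
```

Synonym configuration or Boolean association rules could be layered on top, for example by mapping synonyms to a canonical keyword before counting.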
After category determination, the system may generate a refined event according to a rule (step 1505). In some embodiments of the invention, once the category has been determined, attributes can be extracted from the text information. An attribute is a description of a property of an entity or of a relationship involving it. For example, for the "CITIC Securities Co., Ltd. 2014 Annual Report" shown in Figure 19, the extracted entity may be "CITIC Securities", and the attributes that can be extracted from the table in Figure 19 may be one or more of operating revenue, year-on-year change in net profit, total assets, total liabilities, total shareholders' equity, and so on. After attribute extraction, the system may combine the entity with its attributes according to a given rule to generate a refined event. Step 1505 may be performed by the processing module 202. The rules for generating refined events may include, but are not limited to, one or more of synonym configuration, Boolean association rules, and position rules. For the entity "CITIC Securities" extracted from the above announcement, combining it with one of its attributes, "year-on-year change in net profit", yields the refined event "CITIC Securities 2014 annual report net profit grew 116.2%".
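Combining an entity with one of its attributes can be sketched as follows. The `Attribute` class, the `refine_event` function, and the sentence pattern are assumptions chosen to reproduce the example above; they are not the rule actually used by the system.

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    name: str              # e.g. "net profit"
    change_percent: float  # year-on-year change, in percent

def refine_event(entity, year, attribute):
    """Join an entity and one attribute into a refined event string."""
    direction = "grew" if attribute.change_percent >= 0 else "fell"
    magnitude = abs(attribute.change_percent)
    return f"{entity} {year} annual report {attribute.name} {direction} {magnitude:.1f}%"

event = refine_event("CITIC Securities", 2014, Attribute("net profit", 116.2))
print(event)
```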
After the refined event is generated, the system may generate natural language statements for it (step 1506). Step 1506 may be performed by the natural language processing module 203. For the above announcement, the system may generate three natural language statements: "CITIC Securities annual report net profit year-on-year growth greater than 100%", "brokerage industry annual report net profit year-on-year growth rate greater than 100%", and "annual report net profit year-on-year growth rate greater than 100%", corresponding respectively to three levels: the individual stock, the industry, and the entire stock market. For the three statements generated above, the system may run an individual-stock event backtest, an industry event backtest, and a whole-market backtest, respectively (step 1507). Step 1507 may be performed by the backtest module 204. After the backtest, the system may fit the backtest results into a text template to generate a backtest report (step 1508). Step 1508 may be performed by the backtest module 204. For the above announcement, an example backtest report generated by the system may be: "Over the past year, across all A-share financial report announcements, the average next-day closing return was 0.48% and the probability of a rise was 47.77%. Within this, the securities industry published 165 announcements of the same type, with an average next-day closing return of 0.72% and a probability of a rise of 46.67%; the share price is somewhat more likely to fall the next day, and the probability of profit is on the low side. Optimal strategy: the highest probability of a rise comes from selling at the close after holding for 11 days, with an average return of 9.40%." Because the probability of a rise is around 50%, the announcement is rated "insignificant".
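The two figures quoted in the example report, the average next-day closing return and the probability of a rise, can be computed from a list of historical returns and then fitted into a text template. The sketch below assumes the returns have already been retrieved for the matching historical announcements; the function names, the template wording, and the sample data are illustrative only.

```python
def backtest_next_day(returns):
    """Summarize next-day closing returns (in percent), one per past announcement."""
    count = len(returns)
    average = sum(returns) / count
    rise_probability = 100.0 * sum(1 for r in returns if r > 0) / count
    return {"count": count, "average": average, "rise_probability": rise_probability}

# Hypothetical text template with named slots for the backtest figures.
TEMPLATE = ("{scope} published {count} similar announcements; average next-day "
            "closing return {average:.2f}%, probability of a rise {rise_probability:.2f}%.")

def render_report(scope, stats):
    """Fit the backtest statistics into the text template."""
    return TEMPLATE.format(scope=scope, **stats)

stats = backtest_next_day([1.0, -0.5, 2.0, -0.5])  # made-up sample returns
print(render_report("The securities industry", stats))
```

Running the same two functions over the individual stock, the industry, and the whole market would give the three levels of backtest described above.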
It should be noted that the above description of the information analysis flowchart is provided only to aid understanding of the invention and should not be regarded as the only embodiment. Obviously, for those skilled in the art, after understanding the content and principles of the invention, various modifications and changes in form and detail may be made without departing from its principles and structure; such modifications and changes remain within the scope of protection of the claims. For example, at step 1504 the system may directly collect information entered by the user and convert that information into a natural language statement (step 1506). Similarly, after collecting information, the system may proceed directly to text processing (step 1503). Similarly, after text processing, the system may generate a refined event directly from the processed text; step 1504 is not mandatory.
Figure 16 is a flowchart of the interactive operation of the information analysis system. At step 1601 the information analysis system receives a natural language statement. The statement may be entered directly by the user, or it may be obtained by processing text such as announcements and news. Specifically, the user may enter a natural language statement through the interactive interface provided by the information analysis system (see Figure 18). The user may enter any natural language statement. For example, in the financial field, the user may enter "000826, signed a major contract"; based on this statement, the system can retrieve the major contracts signed by the company with stock code 000826. Numbers and dates in the natural language statement may be in any format (see Figure 18).
At step 1602 the information analysis system processes the natural language statement received at step 1601. Processing may include, but is not limited to, one or more of word segmentation, entity recognition, number and unit normalization, text classification, event attribute extraction, and refined event recognition. At step 1603 the system backtests the statement processed at step 1602, and at step 1604 it generates a backtest report. The content of the backtest report is described in detail with reference to Figure 18.
Figure 17 is a schematic diagram of an interactive interface of the information analysis system for news or announcements. Referring to Figure 2, the interface may be generated by the backtest module 204 and used to present backtest results (backtest display). The interface can be displayed on various electronic devices, which may include, but are not limited to, one or more of mobile phones, personal computers, tablet computers, PDAs, smart watches, smart home appliances, smart vehicles, and the like. In the address bar, the user may enter the Uniform Resource Locator (URL) of any announcement or news item in order to read it together with its analysis results. Alternatively, the user may enter the IP (Internet Protocol) address of any announcement or news item.
In the query box 1701, the user may enter the full title of an announcement or news item. Alternatively, the user may enter keywords from the announcement or news item and then select the desired item from the list of results. Once an announcement or news item has been selected, its title, body text, and backtest report can be displayed on the interface. Area 1702 may display all or part of the body text of the announcement or news item; the user may view the body text using a mouse, keys, a touch screen, voice control, or a touch pad. Area 1703 may display the backtest report for the selected announcement or news item. The backtest report may include, but is not limited to, a presentation of historical data, for example:
"Over the past year, across all A-share announcements of signed major contracts, the average next-day closing return was 0.38% and the probability of a rise was 47.62%.
Within this, XX Pharmaceutical published 16 announcements of the same type, with an average next-day closing return of 0.29% and a probability of a rise of 68.75%; the share price is very likely to rise the next day, and the probability of profit is relatively high."
In addition to historical data, a rating of the announcement or news item and a suggested strategy may also be displayed in area 1703. The suggested strategy is based on the best historical performance after the latest time in the backtest period; it may be, for example, that the highest probability of a rise comes from selling at the close after holding for 1 day, with an average return of 0.29%. The report rating may be positive, negative, insignificant, and so on. Area 1704 may display the latest announcements or news for the user's convenience. They may be displayed as a list, showing only the title and time of each item. Area 1705 may display information related to the selected announcement or news item, such as its type, its publication time, the names and codes of the securities it involves, and its announcement (news) number.
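One way to derive a suggested strategy and a rating from backtest data is sketched below: choose the holding period with the highest probability of a rise, then rate the event by how far that probability lies from 50%. The width of the neutral band (`band`) and the sample returns are assumptions; the patent states only that a probability around 50% is rated "insignificant".

```python
def rise_probability(returns):
    """Percentage of returns that are strictly positive."""
    return 100.0 * sum(1 for r in returns if r > 0) / len(returns)

def best_holding_period(returns_by_days):
    """returns_by_days maps holding days -> list of returns in percent.

    Returns (days, rise probability in percent, average return in percent)
    for the holding period with the highest probability of a rise.
    """
    days = max(returns_by_days, key=lambda d: rise_probability(returns_by_days[d]))
    returns = returns_by_days[days]
    return days, rise_probability(returns), sum(returns) / len(returns)

def rate(probability, band=5.0):
    """Rate an event; `band` is a hypothetical half-width of the neutral zone."""
    if probability >= 50.0 + band:
        return "positive"
    if probability <= 50.0 - band:
        return "negative"
    return "insignificant"

history = {1: [0.5, -0.2, 0.3, 0.6], 5: [1.0, -1.0, -0.5, 2.0]}  # made-up data
days, probability, average = best_holding_period(history)
print(days, probability, rate(probability))
```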
The above description is merely a specific embodiment of the classification display module of the invention and should not be regarded as the only embodiment. Obviously, for those skilled in the art, after understanding the content and principles of the invention, various modifications and changes in form and detail may be made without departing from its principles and structure; such modifications and changes remain within the scope of protection of the claims.
Figure 18 is a schematic diagram of an interactive interface of the information analysis system for user input. Referring to Figure 2, the interface may be generated by the backtest module 204 and used to present backtest results (backtest display). The interface can be displayed on various electronic devices, which may include, but are not limited to, one or more of mobile phones, personal computers, tablet computers, PDAs, smart watches, smart home appliances, smart vehicles, and the like. In the address bar, the user may enter the Uniform Resource Locator (URL) of any text in order to read it together with its analysis results. Alternatively, the user may enter the IP (Internet Protocol) address of any text.
In area 1801, the user may enter any natural language statement. In the financial field, for example, the user may enter "on December 10 the 20-day moving averages converge; amplitude less than 3%; sorted by volume ratio from high to low" or "dividend yield greater than 3% for two consecutive years; earnings per share greater than 2 yuan; market capitalization less than 5 billion; year-on-year operating revenue growth rate from low to high". Numbers and dates in the entered statement may be expressed in any format. A date may be, for example, three days ago, the week before last, last weekend, last Friday, last month, last quarter, three years ago, the last N days, the last N weeks, the last N trading days, and so on. A number may be written as 3分之1 (one third in words), 1/3, 5元 (5 yuan), 百分之5 (five percent in words), 5%, and so on. A number may also be a range, such as 5 to 10 yuan or 5-10%. Sorting rules may also be added to the entered statement. A sorting rule may be, for example, one or more of: volume ratio from high to low, volume ratio from low to high, price change from low to high, price change from high to low, turnover rate from high to low, turnover rate from low to high, capital flow from high to low, tradable share capital from low to high, DDE from high to low, market capitalization from low to high, basic earnings per share from high to low, gross profit margin on sales from high to low, net assets (year-on-year growth rate) from high to low, return on equity (ROE) from high to low, times interest earned from high to low, operating revenue (year-on-year growth rate) from high to low, and so on.
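Accepting numbers "in any format" implies mapping the different written forms onto one value. The following is a minimal sketch for the fraction and percentage forms listed above (3分之1, 1/3, 百分之5, 5%); the function name and the set of supported patterns are assumptions, and currency amounts and ranges are left out.

```python
import re
from fractions import Fraction

def parse_number(text):
    """Parse a fraction or percentage written in one of several forms."""
    text = text.strip()
    m = re.fullmatch(r"(\d+)分之(\d+)", text)   # denominator first: 3分之1 is 1/3
    if m:
        return Fraction(int(m.group(2)), int(m.group(1)))
    m = re.fullmatch(r"(\d+)/(\d+)", text)      # numerator first: 1/3
    if m:
        return Fraction(int(m.group(1)), int(m.group(2)))
    m = re.fullmatch(r"百分之(\d+(?:\.\d+)?)", text)  # percent prefix form
    if m:
        return Fraction(m.group(1)) / 100
    m = re.fullmatch(r"(\d+(?:\.\d+)?)%", text)       # percent sign form
    if m:
        return Fraction(m.group(1)) / 100
    raise ValueError(f"unsupported number format: {text!r}")

print(parse_number("3分之1") == parse_number("1/3"))
print(parse_number("百分之5") == parse_number("5%"))
```

Using `Fraction` keeps the equivalent forms exactly equal, which avoids floating-point mismatches when comparing user input against thresholds.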
In area 1802, the user may set the analysis strategy. For example, the user may set the time range, the stocks held, the buy timing, the holding period, the take-profit condition, the stop-loss condition, the transaction fee rate, and so on. Specifically, the user may set the buy timing to buying after the market opens on the following day, or set the take-profit condition to "when the gain exceeds 25%, take profit on a 5% pullback". After setting the contents of areas 1801 and 1802, the user may click the query button, and a backtest report for the entered natural language statement will be generated. Area 1803 may display the rating of the generated backtest report and the suggested strategy. The report rating may be an estimate of the maximum expected annualized return, the maximum success rate, and so on. Area 1804 may display the backtest report for the entered natural language statement. The backtest report may include, but is not limited to, backtest data analysis, a cumulative return chart, a return distribution chart, historical trade queries, and the like. The backtest data may include, but is not limited to, the holding period, the average return per trade, the minimum return per trade, the expected annualized return, the number of trades, the profit-loss ratio, the success rate, the maximum drawdown, the weekly win rate, the Sharpe ratio, the maximum number of consecutive days with no stock-selection result, the average number of stocks selected per day, and so on.
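The take-profit condition quoted above can be read as a trailing rule: once the gain over the entry price has exceeded 25%, sell as soon as the price falls 5% from its highest level. That reading is an assumption, as are the function and parameter names in this sketch; the patent does not define the rule precisely.

```python
def take_profit_exit(prices, entry_price, trigger=0.25, pullback=0.05):
    """Return the index of the price at which the position is closed, or None.

    The rule is armed once the peak price exceeds entry_price * (1 + trigger);
    after that, the position is closed at the first price that is at least
    `pullback` below the running peak.
    """
    peak = entry_price
    armed = False
    for index, price in enumerate(prices):
        peak = max(peak, price)
        if peak > entry_price * (1 + trigger):
            armed = True
        if armed and price <= peak * (1 - pullback):
            return index
    return None

closes = [10.5, 12.0, 13.0, 12.6, 12.3]  # made-up closing prices after entry at 10.0
print(take_profit_exit(closes, entry_price=10.0))
```

A stop-loss condition could be added in the same loop by also closing the position when the price drops a set fraction below the entry price.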
The above description is merely a specific embodiment of the classification display module of the invention and should not be regarded as the only embodiment. Obviously, for those skilled in the art, after understanding the content and principles of the invention, various modifications and changes in form and detail may be made without departing from its principles and structure; such modifications and changes remain within the scope of protection of the claims.
The above description of applicable fields is merely a set of specific examples and should not be regarded as the only feasible implementation. Obviously, for those skilled in the art, after understanding the basic principles of the information analysis method and system, various modifications and changes in form and detail may be made to the fields in which the method and system are applied without departing from these principles; such modifications and changes remain within the scope of the above description. Any system whose content can be organized into data can use the system described in the invention to perform information analysis. For example, the invention may be provided as a browser plug-in: when a user browsing a website needs to analyze the news or announcements on the current page, the plug-in can backtest the historical information for that news or announcement and give a prediction. Similarly, the system may be embedded in a company's systems to perform intelligent data analysis on financial statements. In addition, data collected by sensors, such as temperature, humidity, and wind sensors that read environmental data, can be analyzed by the system to identify historical environmental trends and predict future environmental changes. In medicine, the effect of the same drug on different age groups can be backtested; for example, given the symptoms of a cold, historical data can be analyzed to estimate how many days recovery will take.

Claims (18)

  1. An information analysis system, comprising:
    a computer-readable storage medium storing executable modules, including:
    a collection module capable of collecting information;
    a processing module capable of preprocessing the collected information and extracting an event from the preprocessed information;
    a natural language processing module capable of generating a natural language statement based on the extracted event;
    a backtest module capable of obtaining historical information based on the generated natural language statement and generating a backtest result in combination with the historical information; and
    a processor capable of executing the executable modules stored on the computer-readable storage medium.
  2. The information analysis system according to claim 1, further comprising a database capable of storing the collected information, the preprocessed information, the extracted event, the natural language statement, the historical information, and the backtest result.
  3. The system according to claim 2, wherein the database comprises a raw information database, a text database, a text preprocessing database, an entity database, an event attribute database, a keyword database, a text classification database, a historical information database, a natural language processing database, an event recognition database, a backtest module database, a text template database, and a dictionary database.
  4. The system according to claim 1, wherein the processing module further comprises a format conversion module, a text processing module, an attribute extraction module, and an event recognition module.
  5. The system according to claim 4, wherein the processing module further comprises a text classification module.
  6. The system according to claim 1, wherein the methods used by the processing module include chi-square statistics, information gain, mutual information, odds ratio, cross entropy, inter-class information difference, keyword statistics, decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least squares fit, the k-nearest neighbor algorithm (kNN), genetic algorithms, sentiment classification, maximum entropy, Generalized Instance Set, synonym configuration, Boolean association rules, position rules, and machine learning.
  7. The system according to claim 1, wherein the natural language processing module can receive information from the collection module.
  8. The system according to claim 1, wherein the backtest module further comprises a backtest information judgment, which gives an evaluation based on the backtest result.
  9. The system according to claim 1, wherein the backtest result can be presented to a user.
  10. An information analysis method, comprising:
    collecting information;
    extracting an event based on the information;
    generating a natural language statement based on the event;
    obtaining historical information based on the natural language statement; and
    performing backtest analysis on the natural language statement in combination with the historical information.
  11. The method according to claim 10, wherein collecting information comprises collecting user input information and non-user input information, the sources of the non-user input information including communication terminals and servers.
  12. The method according to claim 11, wherein collecting information comprises collecting announcement information and news information.
  13. The method according to claim 10, wherein extracting an event further comprises entity recognition and attribute extraction.
  14. The method according to claim 13, wherein the entity recognition further comprises format conversion, text word segmentation, and number and unit normalization.
  15. The method according to claim 13, wherein the attribute extraction can be implemented by a system-defined model.
  16. The method according to claim 10, wherein the natural language statement can be generated based on user input information.
  17. The method according to claim 10, wherein the natural language statement can be further expanded according to the event category.
  18. The method according to claim 10, wherein the backtest of the natural language statement can generate a backtest result according to the information category.
PCT/CN2015/098086 2015-12-21 2015-12-21 Information analysis system and method based on event regression test WO2017107010A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/098086 WO2017107010A1 (en) 2015-12-21 2015-12-21 Information analysis system and method based on event regression test

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/098086 WO2017107010A1 (en) 2015-12-21 2015-12-21 Information analysis system and method based on event regression test

Publications (1)

Publication Number Publication Date
WO2017107010A1 (en) 2017-06-29

Family

ID=59088809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/098086 WO2017107010A1 (en) 2015-12-21 2015-12-21 Information analysis system and method based on event regression test

Country Status (1)

Country Link
WO (1) WO2017107010A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120240020A1 (en) * 2002-09-16 2012-09-20 Mckeown Kathleen R System and method for document collection, grouping and summarization
CN103473263A (en) * 2013-07-18 2013-12-25 大连理工大学 News event development process-oriented visual display method
CN103488663A (en) * 2012-06-11 2014-01-01 国际商业机器公司 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
US20150339269A1 (en) * 2014-05-23 2015-11-26 Alon Konchitsky System and method for generating flowchart from a text document using natural language processing

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840281A (en) * 2019-02-27 2019-06-04 浪潮软件集团有限公司 Self-learning intelligent decision method based on random forest algorithm
CN110286835A (en) * 2019-06-21 2019-09-27 济南大学 Interactive intelligent container with intention understanding function
CN110286835B (en) * 2019-06-21 2022-06-17 济南大学 Interactive intelligent container with intention understanding function
CN110750643A (en) * 2019-09-29 2020-02-04 上证所信息网络有限公司 Method and device for classifying non-periodic announcements of listed companies and storage medium
CN110750643B (en) * 2019-09-29 2024-02-09 上证所信息网络有限公司 Method, device and storage medium for classifying non-periodic announcements of listed companies
CN111522906A (en) * 2020-04-22 2020-08-11 电子科技大学 Financial event main body extraction method based on question-answering mode
CN111522906B (en) * 2020-04-22 2023-03-28 电子科技大学 Financial event main body extraction method based on question-answering mode
CN112464162A (en) * 2020-11-25 2021-03-09 易方达基金管理有限公司 Fund comparison method, apparatus, computer device and medium based on historical data
CN113962444A (en) * 2021-09-29 2022-01-21 湖北美和易思教育科技有限公司 Student quality literacy prediction system based on reinforcement learning

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 15911010; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 15911010; Country of ref document: EP; Kind code of ref document: A1