CN106445988A - Intelligent big data processing method and system

Info

Publication number: CN106445988A
Application number: CN201610382955.7A
Authority: CN
Inventors: 程明强; 蒋朦; 曹国梁; 耿志贤
Original assignee: COEUSYS Inc
Current assignee: Silver Li'an Financial Information Services (beijing) Co Ltd
Priority date: 2016-06-01
Filing date: 2016-06-01
Publication date: 2017-02-22

Abstract

Embodiments of the invention provide an intelligent big data processing method and system. The system comprises a data structuring module, a representation learning module and an application algorithm module. The data structuring module is used for preprocessing raw big data and networking the preprocessed raw big data to obtain a relationship network comprising nodes and edges. The representation learning module is used for obtaining high-dimensional vectors of the nodes of the relationship network by applying a representation learning algorithm based on embedded mapping to the relationship network. The application algorithm module is used for obtaining an application service request of a user, determining a processing algorithm corresponding to the application service request, and determining a result of the application service request by using that processing algorithm and the high-dimensional vectors of the nodes of the relationship network obtained by the representation learning module. The system provided by the embodiments of the invention can effectively extract the feature information in big data and express it uniformly in the form of high-dimensional vectors, offers high calculation efficiency, high accuracy and fast response to user requests, and can provide a unified and effective processing method for a plurality of application services.

Description

Intelligent processing method and system for big data

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to an intelligent processing method and system for big data.

Background

Technological progress is greatly changing the insurance industry, and the wide application of big data is changing the way insurance companies deliver their services. Existing insurance industry websites and software usually collect massive amounts of data containing a large amount of useful information, including users' personal information, consumption habits and the like. Only by fully utilizing insurance big data can insurers meet the requirements of the big data era in aspects such as risk pricing, product design, marketing strategy, customer service, and risk management and control.

Currently, in the insurance industry, database systems are generally used to store and manage insurance data. The database system usually stores data in a table mode, a large amount of relational data and text information exist in the table, and the stored data can be in various formats. For example, a user's personal profile and product description information are typically stored in a database in the form of text strings, while the user's age and product price are typically stored in the form of non-negative numbers. Although current data processing techniques can extract and match numerical values such as formatted numbers and categories, useful feature information cannot be extracted from unstructured data such as text.

Common insurance services include accurate product recommendation based on insurance business data, classification of insurance purchasing users, insurance fraud detection and the like. In insurance marketing, users either find insurance products by searching and then purchase them, or insurance products are actively recommended to users by methods such as popularity recommendation, association rule recommendation and collaborative filtering recommendation. Popularity recommendation recommends the currently most popular insurance products to users; its defects are that it lacks personalization and its accuracy is low. Association rule recommendation learns, through data analysis, rules that relate users' purchasing interests to the characteristics of users and products, for example that women over 40 years old are more likely to purchase health insurance; its recommendation accuracy is not high. Collaborative filtering recommendation is based on the basic assumption that users who have shown interest in similar insurance products will go on to purchase similar insurance products, and that products purchased by a user will later be purchased by similar users.

When insurance purchasing users are classified, different user features need to be extracted for different categories, because user categories may describe the users' living habits, social habits, consumption habits and the like. The usual approach is to extract features such as monthly income, monthly expenditure, standard deviation of annual income and standard deviation of annual expenditure from the users' consumption records, label a large number of users with category labels, train a supervised learning model, and classify test users with it. This approach requires a large number of features to be extracted on the basis of experience and a large amount of labeled data to be collected, which leads to problems such as high cost and poor accuracy.

Fraud detection judges whether a given user's insurance application is fraudulent; its core task is to collect the features of the user's application behavior. Existing insurance fraud detection systems mainly extract a large number of numerical statistics from users' personal information, information on the insured products, information on the application process and the like, have a subset of users manually labeled as fraudulent or not, train a supervised learning model, and classify application behaviors with it. However, such a system relies on experience to extract features and on collected label data, and is inefficient to implement.

Therefore, existing intelligent processing systems for insurance industry big data have at least the following disadvantages: 1) existing insurance data technology lacks analysis of unstructured data, so a large amount of effective information is lost and the analysis results of insurance services are affected; 2) existing insurance recommendation systems, insurance purchasing user classification systems, insurance fraud detection systems and the like rely excessively on manual feature extraction, have low accuracy and poor calculation efficiency, and respond slowly to user requests, which affects the user experience; 3) different insurance services typically employ different data processing and feature extraction methods, resulting in a large amount of redundant data processing, and the features of data units are not compatible across services.

Disclosure of Invention

The embodiment of the invention aims to provide an intelligent processing method and system for big data, which can effectively extract characteristic information from various big data sources without manual participation, have high calculation efficiency and high accuracy, respond sensitively to a user request and provide a unified and effective processing method for various application services.

The technical scheme adopted by the embodiment of the invention is as follows:

the embodiment of the invention discloses an intelligent processing system of big data, which comprises a data structuring module, a representation learning module and an application algorithm module;

the data structuring module is used for preprocessing the original big data and networking the preprocessed original big data to obtain a relational network comprising nodes and edges;

the representation learning module is used for obtaining a high-dimensional vector of a node of the relational network by applying a characterization learning algorithm based on embedded mapping to the relational network;

the application algorithm module is used for acquiring an application service request of a user; and determining a processing algorithm corresponding to the application service request, and determining a result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vector of the node of the relational network obtained by the representation learning module.

Optionally, if the relationship network includes a high-dimensional relationship network, the representation learning module is specifically configured to perform embedded mapping on the high-dimensional relationship network to obtain a high-dimensional vector of a node of the high-dimensional relationship network.

Optionally, if the relationship network includes a semantic network, the representation learning module is specifically configured to perform embedded mapping on the semantic network to obtain a high-dimensional vector of a node of the semantic network.

Optionally, the data structuring module is specifically configured to perform networking on the behavior data in the preprocessed raw big data to obtain a behavior network including nodes and edges;

networking the attribute data in the preprocessed original big data to obtain an attribute network containing nodes and edges; and

networking the text data in the preprocessed original big data to obtain a semantic network containing nodes and edges;

wherein the behavior network, the attribute network and the semantic network together form the relationship network.

The embodiment of the invention also provides an intelligent processing method of big data, which comprises the following steps:

preprocessing original big data;

networking the preprocessed original big data to obtain a relational network comprising nodes and edges;

adopting a characterization learning algorithm based on embedded mapping to the relational network to obtain a high-dimensional vector of the nodes of the relational network;

acquiring an application service request of a user;

determining a processing algorithm corresponding to the application service request;

and determining the result of the application service request by using a processing algorithm corresponding to the application service request and the high-dimensional vector of the node of the relational network.

Optionally, if the relationship network includes a high-dimensional relationship network, obtaining a high-dimensional vector of a node of the relationship network by applying a characterization learning algorithm based on embedded mapping to the relationship network includes: performing embedded mapping on the high-dimensional relationship network to obtain a high-dimensional vector of a node of the high-dimensional relationship network.

Optionally, if the relationship network includes a semantic network, obtaining a high-dimensional vector of a node of the relationship network by applying a characterization learning algorithm based on embedded mapping to the relationship network includes: performing embedded mapping on the semantic network to obtain a high-dimensional vector of a node of the semantic network.

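Optionally, the step of networking the preprocessed raw big data to obtain a relationship network including nodes and edges includes: networking the behavior data in the preprocessed original big data to obtain a behavior network comprising nodes and edges;

networking the attribute data in the preprocessed original big data to obtain an attribute network containing nodes and edges; and

networking the text data in the preprocessed original big data to obtain a semantic network containing nodes and edges;

wherein the behavior network, the attribute network and the semantic network together form the relationship network.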

acquiring an application service request of a user and a high-dimensional vector of a node of a relational network converted from original big data;

Optionally, the relationship network converted from the raw big data is the relationship network obtained by networking the preprocessed raw big data.

The technical scheme of the embodiment of the invention has the following advantages: the data structuring module can preprocess and network original big data to convert the original big data into network data or structure data, so that the representation learning module can utilize a representation learning algorithm of the network data to realize rapid and uniform feature extraction of the data and express the data in a high-dimensional vector form; the application algorithm module can determine a corresponding processing algorithm according to an application service request of a user, and calculate by using the characteristics expressed in a vector form extracted by the representation learning module to determine a processing result. Different from the prior art, the whole feature extraction process in the embodiment of the invention is automatically completed by using a characterization learning algorithm based on embedded mapping without human participation, and the calculation efficiency is high; structural information (namely effective information) in the original big data is greatly reserved in the characteristic extraction process, so that the accuracy of tasks such as classification or prediction is improved; moreover, because the characterization learning algorithm based on the embedded mapping is adopted, the data characteristic system mined from the original big data can be uniformly represented in a high-dimensional vector form, so that the system in the embodiment of the invention is not limited to a certain specific application service, and can provide a uniform and effective processing method for various application services.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart of an intelligent big data processing method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a behavioral network;

FIG. 3 is a flowchart of another method for intelligently processing big data according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an intelligent big data processing system according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of another intelligent big data processing system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Before describing the embodiments of the present invention, the related concepts are explained in order to better explain the embodiments of the present invention.

A data unit refers to a basic unit that is indivisible when representing relational data, such as a certain "client or user", a certain "age group", a certain "product classification", and the like. These basic units exist as real entities in everyday life. The counterpart of a data unit is a non-data unit, which refers to a structure composed of data units, such as relationships between customers, customers' behaviors toward products, a series of products belonging to the same category, and the like.

Behavior data refers to data generated by a user acting on a product, such as data generated by a user purchasing, unsubscribing from, or evaluating an insurance product. Behavior data describes the relationship between two or more data units, typically the relationship between "users" and "products".

Attribute data refers to the relationship between data units such as users and products and their attributes, such as the ages of users, the types of products and the like. Attribute data describes the relationship of a data unit to its attributes, typically between a "user" and its attributes, or between a "product" and its attributes.

Text data refers to text containing words or phrases. Words or phrases may be used as data units.

Structured data refers to data that can be represented numerically or with a uniform structure, such as numbers or symbols, and that is stored in a database and can be logically represented by a two-dimensional table structure.

Unstructured data, as opposed to structured data, refers to data that cannot be represented numerically or with a uniform structure, and is not conveniently represented by a database two-dimensional logical table, such as text, images, voice, web pages, various types of reports, and the like.

A high-dimensional relationship refers to a relationship involving multiple data units (or multiple nodes in a network), that is, an interaction among multiple data units. A two-dimensional relationship is an interaction between only two data units. When information is rich, a purchasing behavior is a high-dimensional relationship, which may generally include a user, a product, a place of purchase, a purchasing mode and the like; if the collected information is incomplete, it may be only a two-dimensional relationship, for example one containing only the user and the product. Conventional data processing systems can take only two-dimensional relationships into account and cannot handle high-dimensional relationships, even though behaviors of high-dimensional relationships generate high-dimensional relationship data that is ubiquitous in many fields today.

In addition, with the development of network technology, the amount of unstructured data keeps increasing, and the limitations of data processing systems that can manage and analyze only structured data are becoming more apparent. Moreover, in many industries, not only the insurance industry, feature extraction from big data still relies on experts and cannot be completed by a computer alone. Systems for processing big data also commonly suffer from a series of problems such as low accuracy, poor computing efficiency and slow response to user requests.

In order to solve the above problem, an embodiment of the present invention provides an intelligent processing method for big data, as shown in fig. 1, the method includes:

S101: preprocessing the raw big data.

The raw big data may be collected through various websites or applications (APPs), and thus may include structured data such as behavior data and attribute data as well as unstructured data such as text data, and may come in various formats. The raw big data may therefore be preprocessed before representation learning is performed on it or application services are provided. Data preprocessing methods include data cleaning, data integration, data transformation, data analysis, data reduction and the like.

Optionally, in the embodiment of the present invention, preprocessing the raw big data may consist of analyzing and cleaning it, that is, performing statistical analysis on the raw big data to remove unqualified or erroneous content. This may include filtering out illegally formatted data, for example removing a value that should be a floating point number but was filled in as a character string; it may also include unifying times or units, filling in missing values, smoothing noisy data and the like. In this way the format of the big data is standardized, abnormal data is cleaned, errors are corrected and duplicate data is removed.
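The cleaning operations described above can be illustrated with a minimal Python sketch. The record schema (`user`, `product`, `price`), the function name `preprocess`, the choice of the `(user, product)` pair as the duplicate key and the mean-fill policy for missing values are all assumptions made for illustration; the embodiment does not prescribe any of them.

```python
from statistics import mean

def preprocess(records):
    """Clean raw record dicts (hypothetical schema: user, product, price)."""
    cleaned, seen = [], set()
    for rec in records:
        price = rec.get("price")
        # Filter illegally formatted values: a price that should be a float
        # but holds a non-numeric string is treated as missing.
        try:
            price = float(price) if price is not None else None
        except (TypeError, ValueError):
            price = None
        key = (rec.get("user"), rec.get("product"))
        # Drop records with an incomplete key and duplicate records
        # (assumption: one record per user/product pair).
        if None in key or key in seen:
            continue
        seen.add(key)
        cleaned.append({"user": key[0], "product": key[1], "price": price})
    # Fill missing prices with the mean of the valid ones (one possible policy).
    valid = [r["price"] for r in cleaned if r["price"] is not None]
    fill = mean(valid) if valid else 0.0
    for r in cleaned:
        if r["price"] is None:
            r["price"] = fill
    return cleaned

raw = [
    {"user": "u1", "product": "p1", "price": "100.0"},
    {"user": "u1", "product": "p1", "price": "100.0"},  # duplicate record
    {"user": "u2", "product": "p2", "price": "n/a"},    # illegal format
    {"user": "u3", "product": "p1", "price": 300},
]
clean = preprocess(raw)
```

Here the duplicate record is dropped and the malformed price of the second user is replaced by the mean of the two valid prices.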

S102: networking the preprocessed raw big data to obtain a relational network containing nodes and edges.

The nodes in the relational network are converted from the data units in the preprocessed original big data, and the edges in the relational network are used for representing the relationship between the nodes in the network.

A conventional table-based data storage method, however, cannot uniformly store and manage data on a large scale, and may lose a large amount of the semantic information contained in text data (useful information that is important for providing accurate application services to users). Most importantly, fragmented table storage cannot be accessed and used conveniently and quickly by subsequent application services, and cannot meet the requirements of application services that are invoked frequently and must respond quickly.

In the embodiment of the invention, the big data or mass data in the table can be converted into the relational network by networking the original big data, so that the problems are effectively solved. Firstly, after the preprocessed original big data is networked, the data can be processed uniformly in a node and edge mode, and the cost of data storage and management is greatly reduced. Secondly, the preprocessed text data such as words and phrases in the original big data are networked to construct a semantic network, so that semantic information in the text is reserved, the text data can be effectively utilized in the subsequent process, and the accuracy of application service is improved. In addition, after the preprocessed original big data is expressed as a relational network containing nodes and edges, the rapid and uniform feature extraction of the data can be realized by utilizing a representation learning algorithm of the network data, so that different application service requests can be responded rapidly.

Optionally, the preprocessed raw big data may include behavior data, attribute data, and text data, and the networking the preprocessed raw big data may include: networking the behavior data in the preprocessed original big data, for example, converting the behavior data of purchasing, evaluation and the like into a behavior network; or, the method may further include performing networking on attribute data in the preprocessed raw big data, for example, converting attribute information such as age and price into an attribute network; or, the method may further include networking text data in the preprocessed raw big data, for example, converting text data such as product introduction or evaluation content into a semantic network with words and phrases as nodes. The behavioral network, the attribute network, and the semantic network together form the relationship network.
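As a rough illustration of this networking step, the sketch below builds one adjacency map from behavior, attribute and text records. The node encoding (tuples such as `("user", "u1")`), the function name `build_relationship_network`, the link from a data unit to the words of its text, and the use of adjacent-word co-occurrence as semantic edges are assumptions for the example, not details given in the embodiment.

```python
import re
from collections import defaultdict

def build_relationship_network(behaviors, attributes, texts):
    """Return {node: set(neighbor nodes)} covering the three sub-networks."""
    edges = defaultdict(set)

    def add_edge(a, b):
        edges[a].add(b)
        edges[b].add(a)

    # Behavior network: user -- product edges (purchase, evaluation, ...).
    for user, product in behaviors:
        add_edge(("user", user), ("product", product))

    # Attribute network: a data unit is linked to each attribute value.
    for unit, attr_name, attr_value in attributes:
        add_edge(unit, ("attr", attr_name, attr_value))

    # Semantic network: words are nodes. Assumption: a unit is linked to the
    # words of its text, and adjacent words are linked to each other.
    for unit, text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for w in words:
            add_edge(unit, ("word", w))
        for w1, w2 in zip(words, words[1:]):
            if w1 != w2:
                add_edge(("word", w1), ("word", w2))
    return edges

net = build_relationship_network(
    behaviors=[("u1", "p1")],
    attributes=[(("user", "u1"), "age_group", "40-50")],
    texts=[(("product", "p1"), "Health insurance plan")],
)
```

Because all three kinds of data end up in one node-and-edge structure, a single representation learning pass can later be run over the whole network.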

S103: obtaining the high-dimensional vectors of the nodes of the relational network by applying a characterization learning algorithm based on embedded mapping to the relational network.

Characterization learning is one of the core research problems in machine learning and data mining. In the embodiment of the invention, by applying a characterization learning algorithm based on embedded mapping to the relational network, the nodes in the relational network, such as users, products and phrases, are uniformly represented by vectors of higher dimensionality while the structural information in the raw big data is retained. Each vector represents one node in the relational network, and each dimension of the vector represents one feature of that node. The relationships (edges) between nodes in the relational network are converted into similarities between the nodes' high-dimensional vectors: if node 1 and node 2 have a relationship (that is, they are connected by an edge in the relational network), the similarity between the high-dimensional vector of node 1 and that of node 2 is high; otherwise, it is low.

This representation learning approach avoids the manual feature extraction of the prior art, which depends on expert experience. Features that conform to the regularities of the data are obtained in a data-driven manner and expressed in vector form, so they can be applied directly to various subsequent tasks, including classification, clustering, prediction and the like.

Further, a characterization learning algorithm based on embedded mapping retains as much of the structural information in the relationship network as possible, and different structural information can be retained for different networks. For example, for a "user-product" behavior network, purchasing behavior information may be retained, so that users with similar vector features have similar purchasing habits and products with similar vector features have similar purchasing populations. For instance, 50 dimensions of the high-dimensional vector may be selected to store the structural information of the "purchasing behavior relation", so that the vector similarity between the high-dimensional vectors of two nodes (a user and a product) having the "purchasing behavior relation" is high; another 50 dimensions of the high-dimensional vector may be selected to store the structural information of "similar purchasing tendency", so that the vector similarity between the high-dimensional vectors of two nodes (a user and another user) having a "similar purchasing tendency" is high. The accuracy of tasks such as classification and prediction in later application services can thereby be greatly improved, which addresses the problems in the prior art that structural information in data cannot be effectively extracted and a large amount of effective information is lost.
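The idea of reserving separate groups of dimensions for different kinds of structural information can be sketched as follows. The vectors are hand-written toy values, two dimensions stand in for each 50-dimensional group, and cosine similarity is an assumed choice of similarity measure; none of these come from the embodiment.

```python
import math

def block_similarity(u, v, start, stop):
    """Cosine similarity restricted to dimensions [start, stop)."""
    a, b = u[start:stop], v[start:stop]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

PURCHASE = (0, 2)  # stands in for the dimensions holding "purchasing behavior relation"
TENDENCY = (2, 4)  # stands in for the dimensions holding "similar purchasing tendency"

# Toy vectors chosen by hand for illustration only.
user_a = [0.9, 0.1, 0.7, 0.2]
user_b = [0.1, 0.8, 0.6, 0.3]
product = [0.8, 0.2, 0.0, 0.0]

purchase_score = block_similarity(user_a, product, *PURCHASE)  # user vs product
tendency_score = block_similarity(user_a, user_b, *TENDENCY)   # user vs user
```

In this toy data, `user_a` is close to `product` in the purchase block, while `user_a` and `user_b` are close only in the tendency block.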

In addition, a common learning method is to use matrix or tensor decomposition to obtain a high-dimensional representation of a node. However, such methods often have excessively high complexity (cubic), cannot be widely applied to industrial scenarios with massive data, and have low computational efficiency. The embodiment of the invention adopts a learning method based on embedded mapping that uses the negative sampling technique, so that a large amount of data is sampled and learned in a reasonable proportion and the learning process can achieve a good result in less time. Once the relational network is expressed as high-dimensional vectors, the learning time is shortened, the calculation efficiency is greatly improved, and user requests can be answered quickly.
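To make the role of negative sampling concrete, here is a minimal pure-Python sketch of embedding learning on a small "user-product" network. The embodiment does not specify its objective function, so the logistic (word2vec-style) loss, the single shared embedding table, the hyperparameters, and the choice to draw negatives only from products the user has no edge to are all assumptions for illustration.

```python
import math
import random

def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))

def score(emb, a, b):
    """Dot product of two node vectors, used as the similarity score."""
    return sum(x * y for x, y in zip(emb[a], emb[b]))

def train_embeddings(edges, dim=8, negatives=2, epochs=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    targets = sorted({b for _, b in edges})  # candidate pool for negatives
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    emb = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in nodes}

    def step(a, b, label):
        # One SGD step on the logistic loss for the pair (a, b).
        va, vb = emb[a], emb[b]
        g = lr * (label - sigmoid(score(emb, a, b)))
        for i in range(dim):
            old_a = va[i]
            va[i] += g * vb[i]
            vb[i] += g * old_a

    for _ in range(epochs):
        for a, b in edges:
            step(a, b, 1.0)  # observed edge: pull the two vectors together
            # Negative sampling: only a few non-neighbors are pushed apart,
            # instead of every non-edge in the network.
            candidates = [t for t in targets if t not in neighbors[a]]
            for neg in rng.sample(candidates, min(negatives, len(candidates))):
                step(a, neg, 0.0)
    return emb

edges = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"),
         ("u3", "p3"), ("u3", "p4"), ("u4", "p3"), ("u4", "p4")]
emb = train_embeddings(edges)
```

Each update touches only one positive pair and a handful of sampled negatives, which is why the cost per step stays small compared with decomposing a full matrix or tensor.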

Besides embedded mapping, the characterization learning algorithm can be implemented in other ways, such as singular value decomposition and non-negative matrix factorization, but these methods are limited to two-dimensional relational networks and are very slow to compute. In application scenarios such as the insurance, financial, shopping and e-commerce industries, the collected big data is becoming more and more diverse, so the relational network obtained by the processing of the embodiment of the invention is not limited to a two-dimensional relational network and is in most cases a high-dimensional relational network. The scale of the data is also quite large. A characterization learning algorithm based on embedded mapping is therefore selected: it can be applied to both two-dimensional and high-dimensional relationship networks, speeds up computation, greatly shortens the calculation time, and allows application requirements to be answered quickly.

Specifically, an "embedded mapping" characterization learning algorithm may be adopted, which uses the "morphism" of category theory to realize a dimension-reducing "embedding" as a structure-preserving mapping, and thereby implements characterization learning. That is, for the data in the relational network, a high-dimensional vector representation of each node is obtained through a learning algorithm that retains the structural information in the relational network.

S104: an application service request of a user is obtained.

When a user browses a web page, uses an APP, or clicks a function button of an operation interface, etc., the user may trigger an application service request, so that the application service request may be obtained to determine a related algorithm to be subsequently used.

S105: determining a processing algorithm corresponding to the application service request.

S106: determining the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network.

The services of the application layer can be defined as tasks such as ranking, classification, clustering, prediction, correlation analysis and anomaly detection, each of which can be completed by a specific processing algorithm. Using the high-dimensional vectors obtained by characterization learning, the processing algorithm corresponding to the task (that is, the processing algorithm corresponding to the application service request) produces an accurate and efficient solution, which is returned to the user.

Specifically, the correspondence between application service requests and processing algorithms may be specified or obtained in advance. For example, when the application service request is product recommendation, recommending products is in effect a prediction task: predicting the series of products the user is most likely to purchase. The corresponding processing algorithm calculates the degree of similarity between the high-dimensional vector of the user node and the high-dimensional vectors of the product nodes. Because the correspondence is specified or obtained in advance, once the application service request is received it can be determined that this similarity calculation is the processing algorithm corresponding to the request. Finally, the similarity calculation is performed with the high-dimensional vector of the user node and the high-dimensional vectors of the product nodes to obtain the series of products with the highest similarity to the user, which is the result of the application service request.
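Steps S104 to S106 for the product recommendation example might be sketched as below. The request name `"product_recommendation"`, the `ALGORITHMS` lookup table, the function names, the toy vectors and the use of cosine similarity are hypothetical; the embodiment only states that the correspondence is fixed in advance and that similarity between user and product vectors is computed.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_products(vectors, user, top_k=2):
    """Rank product nodes by similarity to the given user node."""
    u = vectors[user]
    scored = [(cosine(u, vec), node)
              for node, vec in vectors.items() if node[0] == "product"]
    scored.sort(reverse=True)
    return [node for _, node in scored[:top_k]]

# Correspondence between request types and processing algorithms,
# specified in advance (S105 looks the algorithm up here).
ALGORITHMS = {"product_recommendation": recommend_products}

def handle_request(request_type, vectors, **params):
    algorithm = ALGORITHMS[request_type]   # S105: determine the algorithm
    return algorithm(vectors, **params)    # S106: compute the result

# Toy node vectors standing in for the output of characterization learning.
vectors = {
    ("user", "u1"): [1.0, 0.0],
    ("product", "health"): [0.9, 0.1],
    ("product", "car"): [0.1, 0.9],
}
result = handle_request("product_recommendation", vectors,
                        user=("user", "u1"), top_k=1)
```

Other services such as classification or anomaly detection could be added as further entries in the same table while reusing the same node vectors.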

In the embodiment of the invention, the preprocessed original big data is networked to obtain the relational network comprising nodes and edges, and the high-dimensional vectors of the nodes of the relational network are obtained by adopting the characterization learning algorithm based on the embedded mapping for the relational network, so that the feature extraction of the original big data is realized, the whole process does not need to depend on the experience of experts and does not need human participation, and the feature extraction is automatically completed by utilizing the characterization learning algorithm based on the embedded mapping, and the calculation efficiency is high. Different from the prior art, effective information is greatly reserved in the characteristic extraction process in the embodiment of the invention, so that the accuracy of subsequent tasks such as classification or prediction is improved. Further, in the embodiment of the present invention, since the features of the data are uniformly expressed in the form of the high-dimensional vector, a processing algorithm may be determined according to the application service request, so as to determine the result of the application service request by using the features expressed in the form of the high-dimensional vector.

It should be noted that the intelligent processing method for big data according to the embodiment of the present invention may be applied not only to the insurance industry but also to other fields, such as the financial field and the shopping and consumption field. It is particularly suitable for processing data that includes both structured and unstructured data, and for processing data with high-dimensional relationships, where it has obvious advantages over the prior art.

It should be noted that, in S106, the result of the application service request may be determined using the high-dimensional vectors of all the nodes of the relational network, or using only the high-dimensional vectors of some of the nodes. In particular, only the nodes associated with the application service request may be used. For example, when the application service request is a product recommendation, the calculation may be performed using only the high-dimensional vectors of the product nodes and the high-dimensional vectors of the user nodes.

Optionally, in step S102, the behavior data, the text data or the attribute data may be networked in the following manner.

1. Networking the behavior data in the preprocessed original big data

Specifically, behavior data describes a relationship between two or more data units. Networking behavior data means representing the relationship as an edge of the network and the data units as nodes of the network; that is, behaviors such as purchase, unsubscription or evaluation are represented as edges of the network. The network may be a two-dimensional relationship network or a high-dimensional relationship network; accordingly, when the behavior data is networked, a relationship may be represented as a two-dimensional edge or a high-dimensional edge. A two-dimensional edge is an edge that contains two nodes, and a high-dimensional edge is an edge that contains a plurality of (more than two) nodes.

For example, simplified user behavior data can be represented in the two-dimensional relationship form "user-product". In addition, user behavior may carry rich context information, and the context information may be converted into nodes to form a multivariate relationship graph, such as the three-dimensional relationship graph "user-product-evaluation". Take the behavior of Mr. Zhang purchasing insurance product A as an example, where Mr. Zhang's evaluation of insurance product A is: "the price is expensive, but it is worth it." Networking this behavior data yields the behavior network shown in fig. 2. In fig. 2, "Mr. Zhang" and "insurance product A" are represented as nodes of the behavior network, and the purchase behavior constitutes an edge between the two nodes. Furthermore, the phrases or words of the evaluation, "expensive" and "worthy", are also represented as nodes of the network; this actually belongs to the networking of text data and is explained in detail below. A "user-product-evaluation" behavior network, namely a three-dimensional relationship network, is thus formed.

2. Performing networking on text data in the preprocessed original big data

The networking of text data means representing data units formed by words or phrases as nodes of a network, so that the text is constructed into a relational network with words or phrases as nodes. An edge between two word or phrase nodes describes how often they co-occur in sentences or documents. For example, if the two phrases "expensive" and "worthy" both appear in 3 sentences, they may be two nodes of the relationship network connected by an edge whose weight is set to 3; if "expensive" and "really cheap" never co-occur in a sentence, there is no edge between these two nodes. In addition, edges between word or phrase nodes and other nodes (such as users or products) belong to behavior data and describe the relationship between two or more data units.

Taking the above example of Mr. Zhang buying and evaluating insurance product A, text data such as the evaluation content can be structured, i.e., subjected to word segmentation, phrase extraction, category labeling, emotion analysis and the like, so that natural language is expressed as a processable data structure. Specifically, from "the price is expensive, but it is worth it", it can be determined that "expensive" and "worthy" are the core words: "expensive" describes a characteristic of the product at the "price" level, and "worthy" reflects the user's positive purchasing mood and emotion. When the text data is networked, "expensive" and "worthy" are therefore represented as nodes of the network, while the edges they form with other nodes, such as users and products, belong to the behavior data.

Networking the text data thus enables the analysis of unstructured data: words, phrases and the like can be associated with the behavior data, and useful information is retained.
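The construction of the co-occurrence edges can be sketched as follows. This is a minimal sketch assuming that word segmentation and phrase extraction have already been performed; the sample sentences and the function name are hypothetical, and an edge weight is taken to be the number of sentences in which both words or phrases appear.

```python
from collections import Counter
from itertools import combinations

# Hypothetical pre-segmented sentences: each is the list of extracted words or phrases.
sentences = [
    ["expensive", "worthy"],
    ["expensive", "worthy", "price"],
    ["worthy", "expensive"],
    ["really cheap", "price"],
]


def build_cooccurrence_edges(sentences):
    """Return {(node_a, node_b): weight}; weight counts sentences containing both nodes."""
    edges = Counter()
    for sentence in sentences:
        # Sort the distinct words so that each unordered pair maps to a single key.
        for a, b in combinations(sorted(set(sentence)), 2):
            edges[(a, b)] += 1
    return edges


edges = build_cooccurrence_edges(sentences)
print(edges[("expensive", "worthy")])
print(("expensive", "really cheap") in edges)
```

Pairs that never co-occur simply have no key, which corresponds to the absence of an edge between the two nodes.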

3. Networking the attribute data in the preprocessed original big data

The attribute data describes the relationship between a data unit and its attributes. Networking the attribute data means representing this relationship as an edge of the network and the data unit as a node of the network. The attribute data may be category information, such as health insurance or travel insurance, or numerical information, such as age or price. Thus, the attribute data may be networked by representing category information directly as nodes of the network, and by dividing numerical information such as age and price into intervals and representing each interval as a node.

For example, a 25-year-old man purchased an insurance product priced at 2000. In this example, the age interval containing 25 may be represented as a node; e.g., ages between 24 and 30 may be represented as the node "young adult". Likewise, the price interval containing 2000 may be represented as a node; e.g., prices between 1000 and 5000 may be represented as the node "entry level insurance product". After this processing, the attribute data is converted into attribute networks of "user-age level" and "product-price interval".
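The division of numerical attributes into interval nodes can be sketched as below. Only the intervals "young adult" (24 to 30) and "entry level insurance product" (1000 to 5000) come from the example above; the remaining intervals, the identifiers "user_1" and "product_1", and the function names are hypothetical.

```python
# Interval tables: (lower bound, upper bound, node label), bounds inclusive.
# Only "young adult" and "entry level insurance product" are taken from the text.
AGE_LEVELS = [(0, 23, "youth"), (24, 30, "young adult"), (31, 200, "adult")]
PRICE_LEVELS = [
    (0, 999, "budget insurance product"),
    (1000, 5000, "entry level insurance product"),
]


def to_interval_node(value, levels):
    """Map a numerical attribute value to the label of the interval containing it."""
    for low, high, label in levels:
        if low <= value <= high:
            return label
    raise ValueError(f"no interval defined for value {value}")


def network_attributes(user, age, product, price):
    """Return the 'user-age level' and 'product-price interval' edges."""
    return [
        (user, to_interval_node(age, AGE_LEVELS)),
        (product, to_interval_node(price, PRICE_LEVELS)),
    ]


attribute_edges = network_attributes("user_1", 25, "product_1", 2000)
print(attribute_edges)
```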

Optionally, after the preprocessed original big data is networked, the nodes and edges of the relational network can be stored and managed at scale in a normalized format, so as to facilitate subsequent feature extraction and use. Therefore, after S102, the method may further include:

S102': storing the nodes and edges of the relational network in a database.

For example, two tables may be kept in the database to store the nodes and the edges of the relational network respectively. Each row of the node table holds the ID, name, query frequency and the like of a node. Each row of the edge table holds the ID of an edge, the IDs of the related nodes, the generation time and the like. After the preprocessed original big data is networked, all the data has in fact been converted into structured data. In practical applications, several data management technologies are available for structured data management (Structured Data Management), such as distributed storage, cloud databases, NoSQL (non-relational) databases, mobile databases and the like. For example, BaseX, MongoDB and No2DB are three popular NoSQL databases developed in Java, C++ and C# respectively; MySQL and HBase are common database software; and AllegroGraph, DEX, Neo4j and FlockDB are graph databases for network relationship storage, based on SPARQL, Java and Scala.
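The two-table layout can be sketched with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the table and column names, the sample rows and the timestamp are hypothetical, and the IDs of the related nodes are stored as a JSON list so that one row can hold either a two-dimensional or a high-dimensional edge.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        query_frequency INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE edges (
        id INTEGER PRIMARY KEY,
        node_ids TEXT NOT NULL,      -- JSON list of the IDs of the related nodes
        generated_at TEXT NOT NULL   -- generation time of the edge
    );
    """
)

conn.executemany(
    "INSERT INTO nodes (id, name) VALUES (?, ?)",
    [(1, "Mr. Zhang"), (2, "insurance product A"), (3, "expensive")],
)
# One high-dimensional "user-product-evaluation" edge; the timestamp is illustrative.
conn.execute(
    "INSERT INTO edges (id, node_ids, generated_at) VALUES (?, ?, ?)",
    (1, json.dumps([1, 2, 3]), "2016-06-01T00:00:00"),
)
conn.commit()


def nodes_of_edge(conn, edge_id):
    """Return the names of the nodes joined by the given edge, ordered by node ID."""
    (node_ids,) = conn.execute(
        "SELECT node_ids FROM edges WHERE id = ?", (edge_id,)
    ).fetchone()
    ids = json.loads(node_ids)
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT name FROM nodes WHERE id IN ({placeholders}) ORDER BY id", ids
    )
    return [name for (name,) in rows]


edge_members = nodes_of_edge(conn, 1)
print(edge_members)
```

A separate link table between edges and nodes would be the more conventional relational design; the JSON column is used here only to stay with two tables.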

Optionally, when step S103 is implemented, the relationship network may include a semantic network, and may also include an attribute network and a behavior network. These networks may be homogeneous networks, two-dimensional relationship networks or high-dimensional relationship networks. Therefore, applying a characterization learning algorithm based on embedded mapping to the relational network to obtain the high-dimensional vectors of the nodes of the relational network may include: embedding and mapping a high-dimensional relationship network in the relationship network to obtain the high-dimensional vectors of the nodes of the high-dimensional relationship network; or embedding and mapping a two-dimensional relationship network in the relationship network to obtain the high-dimensional vectors of the nodes of the two-dimensional relationship network; or embedding and mapping a semantic network in the relational network to obtain the high-dimensional vectors of the nodes of the semantic network; or embedding and mapping a homogeneous network in the relational network to obtain the high-dimensional vectors of the nodes of the homogeneous network.

First, embedding mapping of the semantic network (Text Embedding)

Nodes in the form of words and phrases in the semantic network are represented as high-dimensional vectors by an embedding mapping method. After embedding mapping, the high-dimensional vectors of nodes that represent similar words or phrases have high similarity; that is, similar words and phrases have similar semantics.

Specifically, a word embedding mapping method based on the Skip-gram model can be used, which learns vector representations of words so that adjacent words can be predicted accurately. The most effective learning objective (i.e., the maximized objective function) is that, after a word in a sentence is hidden, the most suitable vector for the hidden word can be obtained from the other adjacent words in the sentence. In natural language, words that fill the same vacant position have similar semantics, so the vectors of such words have high similarity after embedding mapping.

In short, in the embedding mapping of the semantic network, the objective function that maximizes the conditional probability is: given the vectors of neighboring (connected) nodes, predict the vector of the target node, so that nodes connected to the same given nodes have similar vectors. The method can be further extended to incorporate various elements such as words, phrases and phrase categories, thereby realizing characterization learning at the semantic level.

The scale c of the context information of the training text, namely the window size, is selected. With the current word w_t as input and the neighboring words as the output layer, the maximized objective function of the training model is:

wherein w_i refers to the i-th word in the text.

By maximizing the objective function, a vector representation w_(i) is learned for each word, so that, given the vector w_(t) at position t, the vector predicted for position (t + j) has high similarity to the vector of the word actually at that position in the document (the probability is maximized). Similar words and phrases thus have similar vectors, and the semantics of the words are retained.
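The formula itself is not reproduced in this text. For reference, the standard Skip-gram objective that matches the symbols used here (window size c, current word w_t, neighboring word w_(t+j)) can be written as follows; the text length T is an assumed symbol, and the exact form used in the embodiment may differ.

```latex
\max \; \frac{1}{T} \sum_{t=1}^{T} \;
\sum_{\substack{-c \le j \le c \\ j \ne 0}}
\log p\left(w_{t+j} \mid w_{t}\right)
```

In this standard formulation the conditional probability p(w_(t+j) | w_t) is commonly modeled as a softmax over inner products of word vectors.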

For example, suppose that several adjacent words, "today", "noon" and "eaten", appear in the semantic network, possibly originating from two sentences in the original big data that both read "ate rice at noon today" but use two different words for "rice". With the method of the embodiment of the invention, the vector of "rice" is w_(t), and the vectors of "today", "noon" and "eaten" are w_(t+j), i.e., w_(t-3), w_(t-2) and w_(t-1). Through the characterization learning algorithm based on embedded mapping, the vectors corresponding to the two words for "rice" have high similarity; that is, the two words or phrases have similar semantics. In the prior art, the two words for "rice" are simply treated as different words, so the semantic information cannot be retained.

Second, embedding mapping of the two-dimensional relationship network (binary Network Embedding)

The two-dimensional relationship network refers to that each edge in the network corresponds to two nodes, and the nodes in the network are only of two types, for example, "user-product" is a two-dimensional relationship network.

The embedding and mapping of the two-dimensional relationship network refers to that nodes (such as nodes of a user, a product, an age layer, a price layer and the like) in a behavior network and an attribute network with two-dimensional relationships (such as a user-product, a user-age, a product-price and the like) are expressed as high-dimensional vectors by using an embedding and mapping method.

As with the embedding mapping of the semantic network, in the embedding mapping of the two-dimensional relationship network the objective function that maximizes the conditional probability is: given the vectors of neighboring (connected) nodes, predict the vector of the target node, so that the nodes v_i connected to the same given node v_j have similar vectors.

Assume that the two-dimensional relationship network contains class A nodes and class B nodes. Then, by maximizing the objective function, given a class B node v_j, the vector obtained for the nodes connected to v_j is similar to the vector of the class A node v_i; that is, the conditional probability is maximized.

The conditional probability that the class A node v_i is generated by the class B node v_j may be defined as:

wherein u_i is the high-dimensional vector of v_i, and u_j is the high-dimensional vector of v_j.
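The formula for this conditional probability is not reproduced in this text. One common choice consistent with the symbols u_i and u_j is a softmax over inner products, taken over all class A nodes; the sketch below assumes that form and should not be read as the exact definition used in the embodiment. The vectors and the function name are hypothetical.

```python
import numpy as np


def conditional_probability(u_class_a, u_j):
    """Assumed form: p(v_i | v_j) = exp(u_i . u_j) / sum over class A nodes of exp(u_k . u_j).

    u_class_a: array of shape (number of class A nodes, K)
    u_j:       array of shape (K,), vector of the given class B node
    """
    logits = u_class_a @ u_j
    logits = logits - logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Hypothetical vectors: three class A nodes (users) and one class B node (product).
u_users = np.array([[1.0, 0.0], [0.8, 0.2], [-1.0, 0.5]])
u_product = np.array([1.0, 0.0])

probs = conditional_probability(u_users, u_product)
print(probs)
```

Under this assumed form, users whose vectors point in the same direction as the product vector receive the highest probability.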

Taking a two-dimensional relationship network composed of "user-product" as an example, and assuming that class A nodes represent users and class B nodes represent products, the above manner can be used to predict which users may buy a certain product, or to calculate the probability that a user purchases the product.

For example, after the data is networked, the following two-dimensional relationship network exists: user A-product C, user A-product D, user B-product C. The objective function then works as follows: given the product C node, the vectors corresponding to the user A node and the user B node are changed (learned) so that the vectors of all nodes connected to the product C node are similar; that is, the vectors of the user A node and the user B node become similar. In this way, the structural information in the network is preserved, which improves the accuracy of subsequently solving the corresponding problems.

Third, embedding mapping of the high-dimensional relationship network (Tensor Network Embedding)

The high-dimensional relationship network refers to a network with edges corresponding to three nodes, for example, the "user-product-evaluation" network shown in fig. 2 belongs to the high-dimensional relationship network. High-dimensional relationships (High-order relationships) are also common in data, such as evaluation behaviors involving users, products and evaluation texts, and therefore tensors rather than matrices, ternary relationships rather than simple bipartite graphs are required to represent such behavior data.

The embedding and mapping of the high-dimensional relationship network refers to that nodes in a behavior network and an attribute network with high-dimensional relationship (such as user-product-evaluation) are represented as high-dimensional vectors by using an embedding and mapping method.

As with the embedding mapping of the semantic network, in the embedding mapping of the high-dimensional relationship network the objective function that maximizes the conditional probability is: given the vectors of neighboring (connected) nodes, predict the vector of the target node, so that nodes connected to the same given nodes have similar vectors.

To realize the embedding mapping of the high-dimensional relationship network, the objective function needs to be updated, and two processing methods are available. One is to update the vector representations of the associated nodes once per sampled multivariate relationship, in which case the maximized objective function is as follows:

wherein S is the set of nodes, A_j is the set of high-dimensional relationships associated with node j, r_(m/j) is one of these high-dimensional relationships, m is the index of the high-dimensional relationship, λ_(m/j) is the weight of the high-dimensional relationship, P_1 is the probability of the associated node given the high-dimensional relationship, and L_1 maximizes, for each node j, the similarity of vectors between every two nodes in the associated high-dimensional relationships.

The other is to split each sampled multivariate relationship into a plurality of binary relationships and then update the vector representations of the associated nodes, in which case the maximized objective function is as follows:

wherein the set over which r_m ranges is the set of all two-dimensional relationships obtained by splitting the high-dimensional relationships into a plurality of two-dimensional relationships, r_m is the m-th two-dimensional relationship, λ_m is the weight of the m-th two-dimensional relationship, P_2 is the probability of the associated node given the high-dimensional relationship, and L_2 maximizes, for each split two-dimensional relationship, the vector similarity between the two nodes of the relationship.

For example, assume that the high-dimensional relationship network after the data is networked is: user A-product C-purchase location E, user A-product C-purchase location F, user B-product C-purchase location E.

The objective is that, given the "product C" node and the "purchase location E" node, the vectors of the nodes associated with them (i.e., connected to them by edges) become similar, so that the vector of the "user A" node is similar to the vector of the "user B" node. Of course, each piece of given information may be traversed in turn; for example, given the "user A" node and the "product C" node, the vectors corresponding to the "purchase location E" node and the "purchase location F" node become similar.

If the maximized objective function L_1 is adopted, then, given the other nodes of a relationship (e.g., product C and purchase location E), the hidden node (e.g., the user node) is learned.

If the maximized objective function L_2 is adopted, the high-dimensional relationships are split into 9 two-dimensional relationships, such as A-C, C-E, A-E, A-C, A-F and C-F, and the embedding mapping of the two-dimensional relationship network is then invoked.
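The splitting step used with L_2 can be sketched as follows: each three-node relationship yields three node pairs, so the three relationships of the example yield 9 two-dimensional relationships, some of them repeated. The function name is hypothetical, and keeping repeated pairs (rather than merging them into weights) is an assumption of this sketch.

```python
from itertools import combinations

# The high-dimensional relationships of the example above.
hyperedges = [
    ("user A", "product C", "purchase location E"),
    ("user A", "product C", "purchase location F"),
    ("user B", "product C", "purchase location E"),
]


def split_into_pairs(hyperedges):
    """Split every high-dimensional edge into all of its two-node relationships."""
    pairs = []
    for edge in hyperedges:
        pairs.extend(combinations(edge, 2))
    return pairs


pairs = split_into_pairs(hyperedges)
print(len(pairs))
for pair in pairs:
    print(pair)
```

The resulting pairs could then be passed to whatever two-dimensional embedding routine is in use; repeated pairs such as ("user A", "product C") naturally receive more updates.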

Through the embedding mapping of the semantic network, of the two-dimensional relationship network and of the high-dimensional relationship network, the characterization learning algorithm based on embedded mapping uniformly represents the nodes of the relational network as vectors of higher dimensionality, each dimension of which represents a feature of the node, thereby realizing feature extraction from the original big data. Structural information in the original big data, such as semantic information and purchasing behavior information, is retained in the high-dimensional vectors, which improves the accuracy of tasks such as classification and prediction corresponding to later application services. In addition, the characterization learning algorithm based on embedded mapping in the embodiment of the invention can also be applied to data with high-dimensional relationships, is suitable for various complex application environments, has a high calculation speed and can respond quickly to application requirements.

Optionally, when steps S105 to S106 are implemented, the application service request may be converted into a task such as ranking, classification, clustering, prediction, association analysis or anomaly detection, and the task may be completed by a specific processing algorithm. The correspondence between tasks and processing algorithms (i.e., the correspondence between application service requests and processing algorithms) may be specified or obtained in advance, so that when an application service request is obtained, it is known which processing algorithm to use. To make clear which processing algorithm corresponds to each task and how the processing algorithm completes the task, the relevant contents are described in detail below.

1. Ranking task

The ranking task is often implemented based on a specific similarity measure and generally involves similarity calculation between nodes of the relationship network, for example Pearson correlation (Pearson Correlation) and cosine similarity (Cosine Similarity).

For example, when the problem to be solved by an application service request is, given a certain product, to list the products most similar to it in terms of being purchased, the problem can be converted into a ranking task.

Processing algorithm: the high-dimensional vector u_i of the product node can be found among the high-dimensional vectors obtained by executing S101 to S103, and the problem is then converted into finding the series of product nodes with the highest similarity to u_i. Since each product node has a high-dimensional vector representation, usually K-dimensional (K is usually a number between 200 and 500), the similarity between nodes can be obtained by taking the scalar (dot) product of their vectors. The problem is finally converted into finding the series of vectors whose dot product with the vector u_i is largest. The above algorithm completes the ranking task and yields the result of the application service request.
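The ranking step can be sketched as below. This is only a minimal sketch: the product names and the three-dimensional vectors are hypothetical (real vectors would be K-dimensional), and the optional cosine variant assumes that no vector is the zero vector.

```python
import numpy as np


def rank_similar(vectors, names, query, top_n=3, metric="dot"):
    """Return the top_n node names most similar to the query node, excluding the query."""
    q = vectors[names.index(query)]
    scores = vectors @ q  # dot product of every node vector with the query vector
    if metric == "cosine":
        # Normalize by vector lengths; assumes no zero vectors.
        scores = scores / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = [i for i in np.argsort(-scores) if names[i] != query]
    return [names[i] for i in order[:top_n]]


# Hypothetical product vectors.
names = ["product A", "product B", "product C", "product D"]
vectors = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.5, 0.5, 0.0],
    ]
)

ranking = rank_similar(vectors, names, "product A", top_n=2)
print(ranking)
```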

2. Classification (Classification) task

Classification tasks include binary classification and multi-class classification, and can be solved effectively by supervised learning algorithms such as the support vector machine (Support Vector Machine, SVM) and logistic regression (Logistic Regression).

For example, the problem that the application service request needs to solve may be: given a large number of users, determine each user's category based on information such as age and income interval. In practical applications, however, data is often incomplete, and how to assign users whose age, income or other information is unknown to the correct age group and income interval is an important problem. This problem can be converted into a classification task.

Processing algorithm: the high-dimensional vectors of nodes such as users, age groups and income intervals can be obtained through characterization learning. It is therefore only necessary to calculate the similarity between the high-dimensional vector of the user node and the high-dimensional vector of each age group node, and between the high-dimensional vector of the user node and the high-dimensional vector of each income interval node, and to select the age group node and the income interval node with the highest similarity to the user node. The user can thereby be classified into the correct age group and income interval.
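Assigning a user to the most similar attribute node can be sketched as follows. Cosine similarity is assumed as the similarity measure, and the vectors, the label "middle-aged" and the function names are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def classify(user_vec, class_vectors):
    """Return the label of the class node whose vector is most similar to the user vector."""
    return max(class_vectors, key=lambda label: cosine(user_vec, class_vectors[label]))


# Hypothetical age group node vectors and one user with unknown age.
age_nodes = {
    "young adult": np.array([1.0, 0.1]),
    "middle-aged": np.array([0.1, 1.0]),
}
user_vec = np.array([0.9, 0.2])

predicted = classify(user_vec, age_nodes)
print(predicted)
```

The same function can be applied a second time with the income interval node vectors to fill in a missing income interval.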

3. Clustering (Clustering) task

The clustering task is usually completed by unsupervised learning algorithms such as nearest neighbor and spectral clustering.

For example, the problem that an application service request needs to solve may be: given a large number of users, the users are grouped into K classes according to purchasing behavior habits under the condition of unknown classes, so that the same strategy can be established for the users of the same class. The problem can be translated into a clustering task.

Processing algorithm: clustering can be performed quickly by applying the K-means or KNN algorithm to the high-dimensional feature representations of the users. The difficulty of a general clustering problem lies in reducing the dimensionality of the structured information, which can be as high as the number of users, i.e., the number of nodes N; the embedding mapping has already reduced the dimensionality to K.
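Clustering the user vectors with K-means can be sketched as below. This is a plain illustrative implementation of the usual assign-and-update iteration, not a tuned one; the two-dimensional sample vectors, the fixed iteration count and the random seed are assumptions, and a library implementation would normally be preferred in practice.

```python
import numpy as np


def kmeans(x, k, iterations=20, seed=0):
    """Cluster the rows of x into k groups; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of x (fancy indexing returns a copy).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels, centers


# Hypothetical user vectors forming two well-separated groups.
user_vectors = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)

labels, centers = kmeans(user_vectors, k=2)
print(labels)
```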

4. Prediction (Prediction) task

The prediction task usually uses matrix factorization (Matrix Factorization) or tensor factorization (Tensor Factorization) to fill in a matrix or a high-dimensional tensor, so as to predict missing values (missing values) in the data.

For example, the problem that an application service request needs to solve may be: it is predicted whether a user will purchase a product in the future. In fact, the recommendation problem may translate into a predictive problem, i.e. giving a predicted series of products that the user is most likely to buy.

Processing algorithm: the high-dimensional vector of the given user node and the high-dimensional vectors of the product nodes are obtained by the method of the embodiment of the invention; by calculating the similarity between the high-dimensional vector of the given user node and the high-dimensional vector of each product node, the products with the highest similarity to the user node can be recommended to the given user.

5. Correlation Analysis (Correlation Analysis) task

The problem that the application service request needs to solve may be: judging whether the age level and income interval of users are correlated with the price interval of products.

Processing algorithm: by the method of the embodiment of the invention, the high-dimensional vectors of the age level nodes, income interval nodes and price interval nodes can be obtained, so that the association relationship and association strength between different user attributes (age level and income) and product attributes (price interval) can be determined by quickly calculating the similarity between the age level nodes, the income interval nodes and the price interval nodes.

6. Exception Detection (Outlier Detection) task

The problem that the application service request needs to solve may be: judging whether a user is an abnormal user, such as a fraudulent user, within the user group to which the user belongs.

Processing algorithm: by the method of the embodiment of the invention, the high-dimensional vectors of all user nodes can be obtained. The similarity between the high-dimensional vector of the current user node and the high-dimensional vectors of the other user nodes is calculated; if the similarity is low, that is, if the current user differs greatly from the other users, the current user can be considered an abnormal user.
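Anomaly detection on the user vectors can be sketched as follows, under the assumption that an abnormal user is one whose vector has low average similarity to the vectors of the other users in the group. The cosine measure, the threshold of 0.5, the sample vectors and the function names are all hypothetical choices for illustration.

```python
import numpy as np


def mean_similarity_to_others(vectors):
    """Average cosine similarity of each row to all other rows (assumes non-zero rows)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, 0.0)  # ignore each user's similarity to itself
    return sims.sum(axis=1) / (len(vectors) - 1)


def find_abnormal(vectors, threshold=0.5):
    """Return the indices of users whose average similarity falls below the threshold."""
    scores = mean_similarity_to_others(vectors)
    return np.where(scores < threshold)[0].tolist()


# Hypothetical user vectors: the last user points in a different direction from the rest.
user_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-0.2, 1.0]])

abnormal = find_abnormal(user_vectors)
print(abnormal)
```

In practice the threshold would have to be chosen for the user group at hand, for example from the distribution of the average similarities.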

Optionally, after steps S101 to S103 are performed, that is, after data mining on the original big data is completed to obtain data features represented by uniform high-dimensional vectors, if the original big data has an update, steps S101 to S103 may be performed only on the updated data, and S101 to S103 need not be performed again on all data.

Optionally, steps S101 to S103 may be executed on new data as soon as the data is updated, so as to realize data mining on the new data; they may be executed only when the new data has accumulated to a certain amount; or they may be executed on the new data periodically.

The embodiment of the invention also provides an intelligent processing method of big data. As shown in fig. 3, the method comprises the following steps:

S301: obtaining the application service request of the user and the high-dimensional vectors of the nodes of the relational network converted from the original big data.

In the embodiment of the invention, the features represented by high-dimensional vectors can be directly obtained, so that feature mining by using original big data is not needed. The process of feature mining using raw big data may be performed on other devices, and the embodiments of the present invention are not limited herein. The process of feature mining by using the original big data can refer to S101 to S103, which is not described herein in detail in the embodiments of the present invention.

S302: determining a processing algorithm corresponding to the application service request.

S303: determining the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network.

The specific implementation of S302 and S303 can refer to S105-S106.

In the embodiment of the invention, the characteristics expressed by high-dimensional vectors are directly obtained, and the result of the application service request is determined by utilizing the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network. The intelligent processing method for big data in the embodiment of the invention is not limited to a specific application service, and can provide a unified and effective processing method for various application services.

Corresponding to the embodiment of the method shown in fig. 1, the present invention also provides an intelligent big data processing system, as shown in fig. 4, which includes a data structuring module 401, a characterization learning module 402, and an application algorithm module 403.

The data structuring module 401 is configured to preprocess the original big data and to network the preprocessed original big data to obtain a relationship network including nodes and edges. The nodes in the relational network are converted from the data units in the preprocessed original big data, and the edges in the relational network represent the relationships between the nodes. By networking the original big data, big data or mass data held in tables can be converted into a relational network, so that the data can be processed uniformly in the form of nodes and edges, which reduces the cost of data storage and management. Secondly, text data such as words and phrases in the preprocessed original big data is networked to construct a semantic network, so that the semantic information in the text is retained, the text data can be used effectively in subsequent processing, and the accuracy of the application service is improved. In addition, once the preprocessed original big data is expressed as a relational network containing nodes and edges, rapid and uniform feature extraction can be performed on the data using a representation learning algorithm for network data, so that different application service requests can be answered quickly.

The representation learning module 402 is configured to apply a representation learning algorithm based on embedded mapping to the relationship network to obtain the high-dimensional vectors of the nodes of the relationship network. By applying this algorithm, the representation learning module 402 uniformly represents the nodes in the relationship network, such as users, products and phrases, as vectors of higher dimensionality, where each vector represents a node in the relationship network and each dimension of the vector represents a feature of that node. The relationships (edges) between nodes in the relational network are converted into similarities between the high-dimensional vectors of those nodes, so that the structural information in the original big data is retained and the accuracy of tasks such as classification and prediction corresponding to later application services is improved.

The application algorithm module 403 is configured to obtain an application service request of a user, determine a processing algorithm corresponding to the application service request, and determine a result of the application service request by using that processing algorithm and the high-dimensional vectors of the nodes of the relationship network obtained by the representation learning module 402. That is, after the representation learning module 402 uniformly represents the features in the big data as high-dimensional vectors, the application algorithm module 403 may use these features to provide solutions for various application services or to return the results of the problems that the application services need to solve.
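As a rough illustration of how an application algorithm module might map a request to a processing algorithm, the following Python sketch uses a simple registry. The names `ALGORITHMS`, `register`, `handle_request`, the request layout, and the `nearest_node` algorithm are all hypothetical and are not taken from the embodiment; they only show one possible shape of the dispatch step.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embeddings = Dict[str, Vector]

# Hypothetical registry: request type -> processing algorithm.
ALGORITHMS: Dict[str, Callable[[Embeddings, dict], object]] = {}


def register(request_type: str):
    """Decorator that registers a processing algorithm for a request type."""
    def decorator(func):
        ALGORITHMS[request_type] = func
        return func
    return decorator


@register("nearest_node")
def nearest_node(embeddings: Embeddings, params: dict) -> str:
    """Return the node whose vector has the largest dot product with the query node."""
    query = params["node"]
    q = embeddings[query]
    best, best_score = None, float("-inf")
    for name, vec in embeddings.items():
        if name == query:
            continue
        score = sum(a * b for a, b in zip(q, vec))
        if score > best_score:
            best, best_score = name, score
    return best


def handle_request(request: dict, embeddings: Embeddings):
    """Look up the algorithm for the request and run it on the node vectors."""
    algorithm = ALGORITHMS.get(request["type"])
    if algorithm is None:
        raise ValueError(f"no processing algorithm for request type {request['type']!r}")
    return algorithm(embeddings, request.get("params", {}))
```

New application services could then be supported by registering further functions, without changing how the node vectors are produced.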

In this embodiment of the present invention, the data structuring module 401 preprocesses and networks the raw big data, so that the representation learning module 402 can use a representation learning algorithm for network data to extract features from the data quickly and uniformly. The application algorithm module 403 can then determine the corresponding processing algorithm according to the application service request of the user, compute with the vector-form features extracted by the representation learning module 402, and return the processing result to the user. Unlike the prior art, the whole feature extraction process in this embodiment is completed automatically by the representation learning algorithm based on embedded mapping, without human participation, and is computationally efficient. The structural information (that is, the effective information) in the raw big data is largely preserved during feature extraction, which improves the accuracy of tasks such as classification or prediction. Moreover, because the representation learning algorithm based on embedded mapping is used, the data features mined from the raw big data can be represented uniformly as high-dimensional vectors, so the system in this embodiment is not limited to one specific application service and can provide a uniform and effective processing method for various application services.

Optionally, the relationship network may include a semantic network, an attribute network, and a behavior network. Each of these may be a homogeneous network, a two-dimensional relationship network, or a high-dimensional relationship network. Accordingly, the representation learning module 402 may be specifically configured to perform embedded mapping on a high-dimensional relationship network in the relationship network to obtain high-dimensional vectors of the nodes of the high-dimensional relationship network; or to perform embedded mapping on a two-dimensional relationship network in the relationship network to obtain high-dimensional vectors of the nodes of the two-dimensional relationship network; or to perform embedded mapping on a semantic network in the relationship network to obtain high-dimensional vectors of the nodes of the semantic network; or to perform embedded mapping on a homogeneous network in the relationship network to obtain high-dimensional vectors of the nodes of the homogeneous network.

Optionally, in this embodiment of the present invention, the raw big data may be collected from websites or apps, and may include structured data such as behavior data and attribute data as well as unstructured data such as text data, which is not limited herein.

The preprocessing of the raw big data by the data structuring module 401 may consist of data analysis and cleaning of the raw big data. That is, statistical analysis is performed on the raw big data to remove unqualified or erroneous content and to filter out illegal data formats, for example by removing illegal values such as floating-point numbers or character-string prices, unifying time formats or units, filling in missing values, and smoothing noisy data, so as to standardize the format of the big data, clean up abnormal data, correct errors, and remove duplicate data.

Optionally, the preprocessed raw big data may include behavior data, attribute data, and text data, and the networking of the preprocessed raw big data by the data structuring module 401 may include: networking the behavior data in the preprocessed raw big data, for example converting behavior data such as purchases and evaluations into a behavior network; networking the attribute data in the preprocessed raw big data, for example converting attribute information such as age and price into an attribute network; or networking the text data in the preprocessed raw big data, for example converting text data such as product introductions or evaluation content into a semantic network with words and phrases as nodes. The behavior network, the attribute network, and the semantic network together form the relationship network.

Optionally, when the application algorithm module 403 determines the result of the application service request by using the high-dimensional vectors of the nodes of the relationship network, it may use the high-dimensional vectors of all the nodes of the relationship network, or only the high-dimensional vectors of some of the nodes. In particular, it may use only the vectors of the nodes associated with the application service request. For example, when the application service request is a product recommendation, the calculation may be performed using only the high-dimensional vectors of the product nodes and the high-dimensional vectors of the user nodes.

It should be noted that, for the specific implementation of each module in this embodiment of the present invention, reference may be made to the description of the method embodiment, for example for how the representation learning algorithm based on embedded mapping is carried out; details are not repeated here.

The system according to the embodiment of the present invention may be implemented in one or more computers or servers in the form of software or programs, and the embodiment of the present invention is not limited herein.

For a better understanding of the embodiment of the present invention, the following describes an example in which the intelligent processing system for big data according to the embodiment of the present invention is applied to the insurance industry.

When a user performs operations such as completing personal information, checking insurance rules, buying insurance, withdrawing from insurance, or establishing social relations on a personal computer (PC) or a mobile terminal, the operation information can be collected by a server to form raw big data, and the raw big data can be stored in a database in table form. The system of the embodiment of the invention can obtain this raw big data.

For example, by collecting the operation information, the database may store a user personal information table as shown in table 1, a product information table as shown in table 2, a purchase behavior table as shown in table 3, and an insurance withdrawal behavior table as shown in table 4.

TABLE 1 user personal information Table

TABLE 2 product information Table

Risk name	Category	Price	Selling company	Product introduction	……
Danger A	Vehicle insurance	……	……	Low premium and convenient claim settlement	……
Danger B	Life insurance	……	……	Life-long insurance, wide application age range	……
Danger C	Health insurance	……	……	The major disease pay amount is high	……
……	……	……	……	……	……

TABLE 3 Purchase behavior Table

User ID	Risk name	Purchase danger spot (GPS)	Amount of purchase	User rating
User 1	Danger A	XX Corp Ltd	……	Purchase convenience:)
User 2	Danger C	XX Enterprise	……	Always outside and buy one good
User 3	Danger B	1.765	……	……
User 4	Danger A	XX route	……	To give love car a lot of danger!
User 5	Danger B	XX cell	……	……
……	……	……	……	……

TABLE 4 Insurance withdrawal behavior table

User ID	Risk name	Withdrawal place (GPS)	Amount of refund	Reason for refunding
User 3	Danger B	XX street (in home)	……	……
……	……	……	……	……

First, the data structuring module in the system performs data analysis and cleaning on the data. Take the purchase behavior table shown in table 3 as an example. Data analysis means obtaining more information through data statistics and association; for instance, the data structuring module can annotate the geographical location information with labels such as 'workplace', 'home', or 'near a marketing point'. Data cleaning means removing illegal values, or even entire illegal records. For example, when the purchase place is a real number, the data structuring module can blank out the value; when the value of "user ID" or "risk name" in a record of table 3 is illegal, the data structuring module can remove that purchase record. Table 5 shows the result of the data analysis and cleaning performed on the data in table 3 by the data structuring module.

TABLE 5 purchasing behavior after data analysis and cleaning

User ID	Risk name	Purchase danger spot (GPS)	Amount of purchase	User rating
User 1	Danger A	XX Corp Ltd [workplace]	……	Purchase convenience:)
User 2	Danger C	XX Enterprise [workplace]	……	Always outside and buy one good
User 3	Danger B	[missing]	……	……
User 4	Danger A	XX route [near a marketing point]	……	To give love car a lot of danger!
User 5	Danger B	XX cell [home]	……	……
……	……	……	……	……
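The cleaning rules described for the purchase behavior table can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the field names (`user_id`, `risk_name`, `place`), the "User N" ID pattern, and the set of valid product names are hypothetical choices made for illustration, and the location annotation step (for example "workplace" or "home") is omitted.

```python
import re

# Assumed validity rules, for illustration only.
VALID_USER_ID = re.compile(r"^User \d+$")
VALID_RISK_NAMES = {"Danger A", "Danger B", "Danger C"}


def is_real_number(text: str) -> bool:
    """True if the text parses as a number, which is not a legal place name."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_purchase_records(records):
    """Drop records with an illegal user ID or risk name; blank out numeric places."""
    cleaned = []
    for record in records:
        if not VALID_USER_ID.match(record["user_id"]):
            continue  # illegal user ID: remove the whole record
        if record["risk_name"] not in VALID_RISK_NAMES:
            continue  # illegal risk name: remove the whole record
        record = dict(record)  # copy so the input is not modified
        if is_real_number(record["place"]):
            record["place"] = None  # illegal place value: keep the record, blank the field
        cleaned.append(record)
    return cleaned
```

In this sketch a blanked field is stored as `None`, which corresponds to the missing purchase place of User 3 after cleaning.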

Then, the raw big data after data analysis and cleaning can be networked to obtain a relationship network containing nodes and edges. As the tables show, the raw big data contains a large amount of text, so the data structuring module can network the text data to obtain nodes consisting of phrases or words and edges between those nodes, yielding a semantic network. The subsequent representation learning module can learn the semantic information in the semantic network by using a representation learning method. For example, a word segmentation tool can extract phrases from the text data in tables 1 to 4 after data analysis and cleaning, giving text data in "document-phrase" form as shown in table 6; each phrase in table 6 can be represented as a node in the semantic network. If the phrases of two nodes co-occur in a sentence or document, an edge may connect the two nodes, and the weight of the edge is determined by how often the two phrases co-occur in sentences or documents. For example, the 'travel' node is connected by an edge to the 'great business trip' node, and the 'great business trip' node is connected to the 'overwork' node.

TABLE 6

The normal work is busy and goes a lot
Many frequent trips with excessive exertion
A bad body has one
Travel to great extent
Bad body
Low premium and convenient claim settlement
The life insurance age range is wide
The major disease pay amount is high
Convenient to purchase
Always outside and buy one good
To add a danger to love cars
Is too high to be suitable
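The construction of the semantic network from "document-phrase" data can be sketched as a co-occurrence count. In the following minimal Python sketch, each document is assumed to be already segmented into a list of phrases (the word segmentation tool itself is not shown), and the function name `build_semantic_network` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_semantic_network(documents):
    """Build weighted edges between phrases that co-occur in the same document.

    documents: list of phrase lists, one list per sentence or document.
    Returns a Counter mapping an alphabetically ordered phrase pair to the
    number of documents in which both phrases appear (the edge weight).
    """
    edges = Counter()
    for phrases in documents:
        # Each unordered pair of distinct phrases in a document forms an edge.
        for a, b in combinations(sorted(set(phrases)), 2):
            edges[(a, b)] += 1
    return edges


documents = [
    ["travel", "great business trip"],
    ["great business trip", "overwork"],
    ["travel", "great business trip"],
]
network = build_semantic_network(documents)
```

Here `network` holds an edge of weight 2 between 'great business trip' and 'travel', an edge of weight 1 between 'great business trip' and 'overwork', and no direct edge between 'travel' and 'overwork'.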

In addition, the contents of the tables can be converted into relationship networks by networking. For example, the contents of table 1 are converted into a plurality of two-dimensional relationship networks such as "user ID-gender", "user ID-age group", "user ID-occupation", and "user ID-self introduction phrase"; the contents of table 2 are converted into a plurality of two-dimensional relationship networks such as "risk name-category", "risk name-price interval", "risk name-sale company", and "risk name-product introduction phrase"; the contents of table 3 are converted into a high-dimensional relationship network of "user ID-risk name-purchase place-amount interval-evaluation phrase"; and the contents of table 4 are converted into a high-dimensional relationship network of "user ID-risk name-withdrawal place-amount interval-withdrawal reason phrase".
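The conversion of table rows into two-dimensional and high-dimensional relationship networks can be sketched as follows. This Python sketch rests on assumptions: rows are dictionaries, column names such as `risk_name` and `category` are hypothetical, and every node is named "column:value" so that the same user or product becomes the same node in every network.

```python
def node(column, value):
    """Name a node by its column and value, e.g. 'risk_name:Danger A'."""
    return f"{column}:{value}"


def rows_to_two_dimensional_networks(rows, key_column, attribute_columns):
    """One two-dimensional network (edge list) per key-attribute column pair."""
    networks = {}
    for column in attribute_columns:
        networks[f"{key_column}-{column}"] = [
            (node(key_column, row[key_column]), node(column, row[column]))
            for row in rows
            if row.get(column) is not None  # skip missing values
        ]
    return networks


def rows_to_high_dimensional_network(rows, columns):
    """One hyperedge per row, linking all of the row's nodes at once."""
    return [tuple(node(c, row[c]) for c in columns) for row in rows]


products = [
    {"risk_name": "Danger A", "category": "Vehicle insurance"},
    {"risk_name": "Danger B", "category": "Life insurance"},
]
two_dimensional = rows_to_two_dimensional_networks(products, "risk_name", ["category"])

purchases = [{"user_id": "User 1", "risk_name": "Danger A", "place": "XX Corp Ltd"}]
high_dimensional = rows_to_high_dimensional_network(
    purchases, ["user_id", "risk_name", "place"]
)
```

Because 'risk_name:Danger A' appears in both results, the two networks share that node, which is what allows them to be merged into one larger network.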

The finally formed relationship network comprises the semantic network and a plurality of high-dimensional relationship networks and two-dimensional relationship networks which are converted from the contents of tables 1-4, wherein the high-dimensional relationship networks and the two-dimensional relationship networks comprise attribute networks and behavior networks; in the relationship network, user ID, user attribute, product attribute, place, phrase and the like are used as nodes, and the interaction/relationship among the nodes is used as an edge of the relationship network.

It should be noted that nodes are allowed to overlap in the relationship network, and the two-dimensional relationship networks and the high-dimensional relationship networks can be fused, through shared nodes such as "user ID", "risk name", and "phrase", into a relationship network containing nodes of multiple categories, that is, a multi-source heterogeneous network. After the data structuring module converts the raw big data into a relationship network, the representation learning module can perform representation learning on the data in the relationship network. Assuming that the dimension of the high-dimensional vectors is K (K is usually between 200 and 500), the result of the representation learning is that the nodes in the relationship network (such as phrase nodes, user attribute nodes, and product nodes) are represented as high-dimensional vectors, and these vectors retain the association relationships (that is, the edges) in the relationship network.

As can be seen from the foregoing analysis, in the embodiment of the present invention, the relationship network includes a semantic network, two-dimensional relationship networks, and high-dimensional relationship networks. The representation learning module may apply a representation learning algorithm based on embedded mapping to the semantic network. Specifically, nodes in the semantic network such as 'travel', 'great business trip', 'overwork', 'bad body', and 'serious disease' are each represented as a high-dimensional vector, for example u = [u₁, u₂, …, u_K]. Through the representation learning algorithm, the vector of the 'travel' node is made similar to the vector of the 'great business trip' node, the vector of the 'great business trip' node is made similar to the vector of the 'overwork' node, and the vectors of the 'overwork' node and the 'bad body' node are made similar to the vector of the 'serious disease' node, thereby preserving the structural information of the data in the network.

The representation learning module may apply a representation learning algorithm based on embedded mapping to a two-dimensional relationship network, and the representation learning result of the two-dimensional relationship network may be as follows: nodes such as user ID, user attribute, product ID (risk name), and product attribute are represented as high-dimensional vectors, and the structural information in the relationship network is preserved by making the vectors of user nodes with similar attributes highly similar and the vectors of product nodes with similar attributes highly similar. As a result, for example, user nodes for users who travel frequently have similar vectors, and product nodes of the same category have similar vectors.

The representation learning module may apply a representation learning algorithm based on embedded mapping to a high-dimensional relationship network, and the representation learning result of the high-dimensional relationship network may be as follows: nodes such as the user, the risk name, the place, and the evaluation phrase are represented as high-dimensional vectors, such that user nodes with similar purchase and withdrawal habits have highly similar vectors and product nodes with similar purchasing and withdrawing users have highly similar vectors, so that the structural information in the relationship network is preserved.

In the embodiment of the invention, the representation learning module can be implemented based on embedded mapping (Embedding) combined with Skip-gram and Negative Sampling, which keeps the computational complexity of the algorithm low and makes the algorithm highly scalable.
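To make the combination of Skip-gram and Negative Sampling more concrete, the following Python sketch trains node vectors from a plain edge list with NumPy. It is a simplified illustration, not the algorithm of the embodiment: negatives are drawn uniformly at random, edge weights are ignored, each edge is treated as a (node, context) pair in both directions, and the function name `train_sgns` and all hyperparameter values are assumptions.

```python
import numpy as np


def train_sgns(edges, dim=8, negatives=2, epochs=50, lr=0.05, seed=0):
    """Skip-gram with negative sampling over the edges of a network.

    edges: list of (node_a, node_b) pairs.
    Returns a dict mapping each node to its learned vector of length dim.
    """
    rng = np.random.default_rng(seed)
    nodes = sorted({n for edge in edges for n in edge})
    index = {n: i for i, n in enumerate(nodes)}
    emb = rng.normal(scale=0.1, size=(len(nodes), dim))  # node vectors
    ctx = np.zeros((len(nodes), dim))                    # context vectors
    # Use every edge in both directions as a (center, context) training pair.
    pairs = [(index[a], index[b]) for a, b in edges]
    pairs += [(b, a) for a, b in pairs]
    for _ in range(epochs):
        for k in rng.permutation(len(pairs)):
            center, positive = pairs[k]
            targets = [(positive, 1.0)]
            for t in rng.integers(0, len(nodes), size=negatives):
                if t != positive and t != center:
                    targets.append((int(t), 0.0))  # uniformly sampled negative
            grad_center = np.zeros(dim)
            for t, label in targets:
                score = 1.0 / (1.0 + np.exp(-emb[center] @ ctx[t]))  # sigmoid
                g = lr * (label - score)
                grad_center += g * ctx[t]
                ctx[t] += g * emb[center]
            emb[center] += grad_center
    return {n: emb[index[n]].copy() for n in nodes}


vectors = train_sgns(
    [("travel", "great business trip"), ("great business trip", "overwork")],
    dim=4,
    epochs=5,
)
```

Each training step touches only one positive pair and a few sampled negatives rather than all nodes, which is why this kind of update stays cheap as the network grows. In practice K would be far larger than the `dim=4` used in this toy call.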

In the embodiment of the invention, each node of the relational network is uniformly represented by a high-dimensional vector, the structural information in the relational network is reserved, and the application algorithm module can call the high-dimensional vectors of part of the nodes to calculate in the subsequent application service aiming at different task requirements, so that the calculation complexity is low.

For example, assuming that the problem to be solved by the user's application service request is insurance product recommendation, the recommendation can be implemented by the application algorithm module. Insurance product recommendation means, for a given user, finding the products that are most similar to the user in purchase behavior and most dissimilar in withdrawal behavior. The processing algorithm corresponding to this application service request is: select the vector of the user node and the vectors of the product nodes, and calculate the similarity between the user vector and each product vector by using a vector similarity measure such as cosine similarity. For example, during representation learning, the "purchase behavior" information may be stored in dimensions 1 to 100 of the user node vectors and the product node vectors; that is, if user A purchases product A, dimensions 1 to 100 of the vector of the user A node are similar to dimensions 1 to 100 of the vector of the product A node. Likewise, the "withdrawal behavior" information may be stored in dimensions 101 to 200 of the user node vectors and the product node vectors; that is, if user A withdraws from product B, dimensions 101 to 200 of the vector of the user A node are similar to dimensions 101 to 200 of the vector of the product B node. Therefore, to recommend a product to user A, the system finds the product node vectors that are similar to the vector of the user A node in dimensions 1 to 100 and dissimilar to it in dimensions 101 to 200.
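The recommendation rule described above can be sketched in Python with NumPy. The text does not say how the two similarities are combined into one ranking; subtracting the withdrawal similarity from the purchase similarity is an assumption made here for illustration, as are the function names and the toy vectors.

```python
import numpy as np

BUY = slice(0, 100)     # dimensions 1 to 100: purchase behavior
QUIT = slice(100, 200)  # dimensions 101 to 200: withdrawal behavior


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def recommend(user_vec, product_vecs, top_n=1):
    """Rank products by purchase similarity minus withdrawal similarity (assumed score)."""
    scores = {
        name: cosine(user_vec[BUY], vec[BUY]) - cosine(user_vec[QUIT], vec[QUIT])
        for name, vec in product_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def toy_vector(buy_value, quit_value):
    """Toy 200-dimensional vector with one active dimension per behavior block."""
    vec = np.zeros(200)
    vec[0] = buy_value
    vec[100] = quit_value
    return vec


user_a = toy_vector(1.0, 1.0)
products = {
    "product A": toy_vector(1.0, -1.0),  # similar purchases, opposite withdrawals
    "product B": toy_vector(1.0, 1.0),   # similar purchases, similar withdrawals
}
best = recommend(user_a, products)
```

With these toy vectors, product A scores 1 - (-1) = 2 and product B scores 1 - 1 = 0, so `best` is `["product A"]`.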

Similarly, the application algorithm module may also use the high-dimensional vectors obtained through representation learning to implement user classification, detection of users who commit insurance fraud, and the like, which is not described in detail here.

Corresponding to the intelligent big data processing method described in fig. 3, an embodiment of the present invention provides an intelligent big data processing system, and as shown in fig. 5, the system may include:

an obtaining module 501, configured to obtain an application service request of a user and a high-dimensional vector of a node of a relationship network converted from original big data.

A determining module 502, configured to determine a processing algorithm corresponding to the application service request, and determine a result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vector of the node of the relationship network.

In this embodiment of the present invention, the obtaining module 501 may directly obtain the features represented by the high-dimensional vectors, so that the determining module determines the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network. The intelligent processing system for big data provided by the embodiment of the invention is not limited to a specific application service, and can provide a unified and effective processing method for various application services.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An intelligent big data processing system, comprising:

the characterization learning module: the high-dimensional vector of the node of the relational network is obtained by adopting a characterization learning algorithm based on embedded mapping for the relational network;

an application algorithm module: the method comprises the steps of obtaining an application service request of a user; and determining a processing algorithm corresponding to the application service request, and determining a result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vector of the node of the relational network obtained by the representation learning module.

2. The system according to claim 1, wherein the relational network includes a high-dimensional relational network, and the characterization learning module is specifically configured to perform embedded mapping on the high-dimensional relational network to obtain a high-dimensional vector of a node of the high-dimensional relational network.

3. The system according to claim 1, wherein the relational network includes a semantic network, and the representation learning module is specifically configured to perform embedded mapping on the semantic network to obtain a high-dimensional vector of a node of the semantic network.

4. The system according to claim 2 or 3, wherein the relational network includes a two-dimensional relational network, and the characterization learning module is specifically configured to perform embedded mapping on the two-dimensional relational network to obtain a high-dimensional vector of a node of the two-dimensional relational network.

5. The system of claim 1, wherein the raw big data comprises behavioral data, attribute data, and textual data.

6. The system according to claim 1 or 5, wherein the data structuring module is specifically configured to network behavior data in the preprocessed raw big data to obtain a behavior network including nodes and edges;

7. The system of claim 1, wherein the data structuring module is specifically configured to perform data analysis and cleaning on the raw big data.

8. The system according to claim 1, wherein the application algorithm module is specifically configured to determine the result of the application service request by using a high-dimensional vector of a part of nodes in the relational network and a processing algorithm corresponding to the application service request.

9. An intelligent big data processing system, comprising:

the acquisition module is used for acquiring an application service request of a user and a high-dimensional vector of a node of a relational network converted from original big data;

and the determining module is used for determining a processing algorithm corresponding to the application service request and determining the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vector of the node of the relational network.

10. The system of claim 9, wherein the relationship network transformed from the raw big data is a relationship network obtained by networking the preprocessed raw big data.

11. An intelligent big data processing method is characterized by comprising the following steps:

preprocessing original big data;

acquiring an application service request of a user;

12. The method according to claim 11, wherein if the relational network includes a high-dimensional relational network, the obtaining the high-dimensional vectors of the nodes of the relational network by using a characterization learning algorithm based on embedded mapping for the relational network includes:

and carrying out embedded mapping on the high-dimensional relationship network to obtain a high-dimensional vector of the node of the high-dimensional relationship network.

13. The method according to claim 11, wherein the relational network includes a semantic network, and the obtaining the high-dimensional vectors of the nodes of the relational network by using a characterization learning algorithm based on embedded mapping for the relational network includes:

and carrying out embedded mapping on the semantic network to obtain a high-dimensional vector of the nodes of the semantic network.

14. The method according to claim 12 or 13, wherein the relational network includes a two-dimensional relational network, and the obtaining the high-dimensional vectors of the nodes of the relational network by using a characterization learning algorithm based on embedded mapping for the relational network includes:

and performing embedded mapping on the two-dimensional relationship network to obtain a high-dimensional vector of the node of the two-dimensional relationship network.

15. The method of claim 11, wherein the raw big data comprises behavioral data, attribute data, and textual data.

16. The method according to claim 11 or 15, wherein the networking the preprocessed raw big data to obtain a relationship network including nodes and edges comprises:

networking the behavior data in the preprocessed original big data to obtain a behavior network comprising nodes and edges;

17. The method of claim 11, wherein the pre-processing of the raw big data comprises data analysis and cleaning of the raw big data.

18. The method of claim 11, wherein determining the result of the application service request using the processing algorithm corresponding to the application service request and the high-dimensional vector of the node of the relational network comprises:

and determining the result of the application service request by using the high-dimensional vectors of part of nodes in the relational network and the processing algorithm corresponding to the application service request.

19. An intelligent big data processing method is characterized by comprising the following steps:

20. The method of claim 19, wherein the relationship network transformed from the raw big data is a relationship network obtained by networking the preprocessed raw big data.