CN115374223B - Intelligent blood margin identification recommendation method and system based on rules and machine learning

Info

Publication number: CN115374223B
Application number: CN202210766523.1A
Authority: CN
Inventors: 金震; 张京日; 穆宇浩; 詹焕哲
Original assignee: Beijing SunwayWorld Science and Technology Co Ltd
Current assignee: Beijing SunwayWorld Science and Technology Co Ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2023-06-13
Anticipated expiration: 2042-06-30
Also published as: CN115374223A

Abstract

The invention discloses an intelligent blood margin (data lineage) identification recommendation method and system based on rules and machine learning. The method comprises the following steps: constructing a machine learning model, and identifying a plurality of characteristic information of all data fields based on the machine learning model, the characteristic information comprising a unique value, a maximum value and a minimum value of a field; clustering the data fields based on the machine learning model to obtain a plurality of clusters; comparing the unique values of the data fields in each cluster based on a data pattern comparison rule, and determining an intersection coverage relationship based on the unique values; sorting the intersection coverage relationships; and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering. By combining the data pattern comparison rule with machine learning capability, the invention realizes the blood margin identification and discovery of data and helps an enterprise construct a data network, which greatly reduces the cost of enterprise data management and effectively improves the efficiency of data management.

Description

Intelligent blood margin identification recommendation method and system based on rules and machine learning

Technical Field

The invention relates to the technical field of data management, in particular to an intelligent blood-margin identification recommendation method and system based on rules and machine learning.

Background

The data blood margin, that is, data lineage, is a key point in the actual data management process. It can effectively solve the problem that data governance and data development are disconnected from each other (the "two skins" phenomenon), and it effectively supports traceability analysis, impact assessment and similar analyses in the data management and development processes. However, data development tools differ; for example, data blood margin identification is commonly performed through SQL parsing and similar means. SQL (Structured Query Language) is a special-purpose database query and programming language used for accessing data and for querying, updating and managing relational database systems.

The prior art has the following defects: the data are scattered, so the data blood margins cannot be effectively identified and managed; in many cases the blood margins are identified manually, which causes a huge waste of cost and at the same time greatly reduces the degree of intelligence of the data management process.

Disclosure of Invention

The invention provides an intelligent blood margin identification recommendation method and system based on rules and machine learning, which are used for solving the problems in the prior art.

The invention provides an intelligent blood margin identification recommendation method based on rules and machine learning, which comprises the following steps:

S100, constructing a machine learning model, and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field;

S200, clustering data fields based on a machine learning model to obtain a plurality of clusters;

S300, comparing unique values of data fields in each cluster based on a data pattern comparison rule, and determining intersection coverage relation based on the unique values;

S400, sorting the intersection coverage relation;

S500, sorting and filtering are carried out based on the sorting, and a blood relationship list among the physical tables is formed after filtering.
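As an illustration of the characteristic information named in S100, the following is a minimal sketch that profiles one field with plain Python. The names `FieldProfile` and `profile_field` and the sample values are hypothetical, and the sketch computes the statistics directly rather than through the machine learning model the method refers to.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class FieldProfile:
    """Characteristic information for one data field (hypothetical structure)."""
    table: str
    field: str
    unique_values: frozenset
    minimum: Any
    maximum: Any


def profile_field(table: str, field: str, values: Iterable[Any]) -> FieldProfile:
    """Collect the unique values, minimum and maximum of a field.

    None is treated as a missing value and ignored. The remaining values are
    assumed to be hashable and mutually comparable (for example all integers).
    """
    unique = frozenset(v for v in values if v is not None)
    return FieldProfile(
        table=table,
        field=field,
        unique_values=unique,
        minimum=min(unique) if unique else None,
        maximum=max(unique) if unique else None,
    )


# Hypothetical sample column with one duplicate and one missing value.
orders_customer_id = profile_field("orders", "customer_id", [3, 1, 3, None, 2])
```

The unique value set is kept alongside the minimum and maximum because the later comparison step (S300) works on unique values.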

Preferably, after step S500, the method further includes:

S600, recommending the content which is ranked ahead in the blood relationship list to a user for selection, wherein the user selects according to the recommended upstream and downstream physical tables, and the selected tables are added as new features into the calculation of the intersection coverage relationship sorting.

Preferably, the S200 includes:

S201, text semantic extraction is carried out on the content of the data field based on a machine learning model, and the semantics of the data field are obtained;

S202, clustering the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

Preferably, the method for calculating the cluster comprises the following steps:

forming the data fields into view data;

extracting a feature matrix of data from the views, and learning a similarity graph of all the views by adopting a dynamic neighbor graph construction method; calculating a transition probability matrix corresponding to each view; taking the transition probability matrix as input of a Markov chain spectral clustering algorithm to obtain a clustering result;

specifically, the transition probability matrix is calculated as follows: stacking the transition probability matrices of all views to construct a target tensor; rotating the tensor; dividing the rotated tensor into a clean tensor and an error tensor; constraining the clean tensor with a tensor nuclear norm based on t-SVD to obtain a low-rank clean tensor; and summing all lateral slices of the low-rank clean tensor to obtain the transition probability matrix;

the construction premise of the target tensor is that an objective function is constructed, and the target tensor is determined based on the objective function.

Optimization of the objective function includes optimization of tensor A, which is constructed from the low-rank matrices, and of error tensor B, which is constructed from the noise matrices decomposed from each view;

the optimization formula for tensor A is as follows:

wherein A^(t+1) represents the optimized value of tensor A at the (t+1)-th iteration; A represents the low-rank tensor; μ^t represents the penalty parameter at the t-th iteration, μ^t > 0; t represents the number of iterations; Y^t represents the Lagrangian multiplier at the t-th iteration; T represents the rotated tensor of the target tensor, and tensor T includes tensor A and tensor B; F denotes the Frobenius norm; B^t represents the t-th iteration value of tensor B;

the optimization formula for tensor B is as follows:

wherein B_(3) represents tensor B matricized along mode-3; B is the error tensor, and γ represents a non-negative balance parameter; B_(3)^(t+1) represents the optimized value after matricization along mode-3 in the (t+1)-th iteration; μ^t represents the penalty parameter at the t-th iteration, μ^t > 0; t represents the number of iterations; Y_(3)^t represents the Lagrangian multiplier at the t-th iteration after matricization along mode-3; T_(3) represents the rotated tensor of the target tensor after matricization along mode-3; F denotes the Frobenius norm; A_(3)^(t+1) represents the optimized value of tensor A after matricization along mode-3 in the (t+1)-th iteration.

The optimization result of the objective function is calculated and determined based on the optimization of tensor A and tensor B.
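For orientation only, the symbols defined above have the shape of a standard augmented Lagrangian (ADMM) split of the rotated tensor into a low-rank part and an error part. The block below is a generic sketch of such updates and is an assumption, not the formulas of this document, which are not reproduced in the text. In particular, the constraint T = A + B, the t-SVD tensor nuclear norm written as ‖·‖⊛, the ℓ2,1 norm on B and the multiplier update are assumed choices that are common for this kind of problem.

```latex
% Assumed model:  min_{A,B}  \|A\|_{\circledast} + \gamma \|B\|_{2,1}   s.t.  T = A + B
\begin{aligned}
A^{t+1} &= \arg\min_{A}\; \|A\|_{\circledast}
  + \frac{\mu^{t}}{2}\,\Bigl\| A - \Bigl(T - B^{t} + \frac{Y^{t}}{\mu^{t}}\Bigr) \Bigr\|_{F}^{2} \\
B_{(3)}^{t+1} &= \arg\min_{B_{(3)}}\; \gamma\,\|B_{(3)}\|_{2,1}
  + \frac{\mu^{t}}{2}\,\Bigl\| B_{(3)} - \Bigl(T_{(3)} - A_{(3)}^{t+1} + \frac{Y_{(3)}^{t}}{\mu^{t}}\Bigr) \Bigr\|_{F}^{2} \\
Y^{t+1} &= Y^{t} + \mu^{t}\,\bigl(T - A^{t+1} - B^{t+1}\bigr)
\end{aligned}
```

Under these assumptions both subproblems come from completing the square in the augmented Lagrangian, which is why each update compares the variable with the residual of T shifted by the scaled multiplier.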

The calculation formula has good convergence, and the calculation complexity is reduced.
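The per-view transition probability matrix mentioned above can be illustrated with a small sketch. It row-normalizes a non-negative similarity matrix so that each row becomes a probability distribution, which is the usual way to turn a similarity graph into a Markov chain. The function names and sample matrices are hypothetical, and the element-wise averaging of views is a deliberately simple stand-in for the low-rank tensor fusion described in this section, not an implementation of it.

```python
def transition_matrix(similarity):
    """Row-normalize a non-negative similarity matrix into a transition matrix."""
    matrix = []
    for row in similarity:
        degree = sum(row)
        size = len(row)
        if degree == 0:
            # Isolated node: fall back to a uniform distribution.
            matrix.append([1.0 / size] * size)
        else:
            matrix.append([weight / degree for weight in row])
    return matrix


def average_views(matrices):
    """Element-wise mean of several transition matrices of equal shape.

    Simple stand-in for multi-view fusion; the mean of row-stochastic
    matrices is itself row-stochastic.
    """
    count = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(size)]
        for i in range(size)
    ]


# Two hypothetical similarity graphs (views) over the same three data fields.
view_a = [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
view_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
fused = average_views([transition_matrix(view_a), transition_matrix(view_b)])
```

The fused matrix could then be passed to a spectral clustering routine; that step is omitted here.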

Preferably, the S400 includes:

sorting the intersection coverage relationships by adopting the PageRank ranking method.

Preferably, the S500 includes:

S501, setting a sorting threshold value to form a blood margin relation between physical tables;

S502, filtering based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.
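The threshold filtering of S501 and S502 can be sketched as follows. This is a minimal sketch under assumptions: each candidate relation is represented as a hypothetical `(upstream_table, downstream_table, score)` tuple, and the scores and the threshold value are invented for illustration.

```python
def build_lineage_list(scored_relations, threshold):
    """Keep relations whose score reaches the threshold, highest score first."""
    kept = [relation for relation in scored_relations if relation[2] >= threshold]
    return sorted(kept, key=lambda relation: relation[2], reverse=True)


# Hypothetical scored candidate relations between physical tables.
scored = [
    ("customers", "orders", 0.42),
    ("customers", "refunds", 0.08),
    ("orders", "shipments", 0.31),
]
lineage = build_lineage_list(scored, threshold=0.10)
```

A higher threshold yields a shorter list with fewer weak candidates, while a lower threshold keeps more candidates for the user to review.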

The invention provides an intelligent blood margin identification recommendation system based on rules and machine learning, which comprises:

the characteristic information identification unit is used for constructing a machine learning model and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field;

the clustering unit is used for clustering the data fields based on the machine learning model to obtain a plurality of clusters;

an intersection coverage relation determining unit for comparing unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relation based on the unique values;

the ordering unit is used for ordering the intersection coverage relation;

and the blood relationship list forming unit is used for sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.

Preferably, the system further comprises:

and the recommending unit is used for recommending the content which is ranked ahead in the blood relationship list to the user for the user to select, the user selects according to the recommended upstream and downstream physical tables, and the selected tables are added into the calculation of the intersection covering relationship ranking as new features.

Preferably, the clustering unit includes:

the semantic extraction subunit is used for extracting text semantic from the content of the data field based on the machine learning model to obtain the semantic of the data field;

and the feature clustering subunit is used for clustering the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

Preferably, the sorting unit includes:

and the PageRank ordering subunit is used for ordering the intersection coverage relationship by adopting a PageRank ordering method.

Preferably, the blood relationship list forming unit includes:

a sorting threshold setting subunit, configured to set a sorting threshold to form a blood-edge relationship between the physical tables;

and the filtering subunit is used for filtering based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.

Compared with the prior art, the invention has the following advantages:

the invention provides an intelligent blood margin identification recommendation method and system based on rules and machine learning, comprising the following steps: constructing a machine learning model, and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field; clustering the data fields based on a machine learning model to obtain a plurality of clusters; comparing the unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relationship based on the unique values; sorting the intersection coverage relationships; and sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.

The scheme adopted by the invention is based on the data pattern comparison rule and combines the machine learning capability to realize the blood margin identification and discovery of the data and to help enterprises construct a data network, which greatly reduces the cost of enterprise data management and effectively improves the efficiency of data management.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flowchart of an intelligent blood margin recognition recommendation method based on rules and machine learning in an embodiment of the invention;

FIG. 2 is a diagram showing an identification recommendation interface of an intelligent blood margin identification recommendation method based on rules and machine learning in an embodiment of the invention;

FIG. 3 is a schematic structural diagram of an intelligent blood margin recognition recommendation system based on rules and machine learning in an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

The embodiment of the invention provides an intelligent blood margin identification recommendation method based on rules and machine learning, referring to fig. 1, the method comprises the following steps:

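S100, constructing a machine learning model, and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field;

S200, clustering data fields based on a machine learning model to obtain a plurality of clusters;

S300, comparing unique values of data fields in each cluster based on a data pattern comparison rule, and determining intersection coverage relation based on the unique values;

S400, sorting the intersection coverage relation;

S500, sorting and filtering are carried out based on the sorting, and a blood relationship list among the physical tables is formed after filtering.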

The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is that a machine learning model is constructed, and a plurality of characteristic information of all data fields are identified based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field; clustering the data fields based on a machine learning model to obtain a plurality of clusters; comparing the unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relationship based on the unique values; sorting the intersection coverage relationships; and sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.
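The comparison of unique values within a cluster can be sketched as below. The text does not give a formula for the intersection coverage relationship, so the definition used here (the fraction of one field's unique values that also occur in another field) is an assumption, as are the function names and the sample data.

```python
def intersection_coverage(source_values, target_values):
    """Fraction of the target's unique values that also occur in the source."""
    source, target = set(source_values), set(target_values)
    if not target:
        return 0.0
    return len(source & target) / len(target)


def coverage_edges(cluster):
    """Compare every ordered pair of fields from different tables in a cluster.

    cluster maps (table, field) to that field's unique values. Returns
    (source, target, coverage) tuples for pairs with a non-empty intersection.
    """
    edges = []
    for source, source_values in cluster.items():
        for target, target_values in cluster.items():
            if source[0] == target[0]:
                continue  # skip fields of the same table (and the field itself)
            score = intersection_coverage(source_values, target_values)
            if score > 0:
                edges.append((source, target, score))
    return edges


# Hypothetical cluster of fields that were grouped together by the clustering step.
cluster = {
    ("customers", "id"): {1, 2, 3, 4},
    ("orders", "customer_id"): {1, 2, 3},
    ("refunds", "customer_id"): {2, 9},
}
edges = coverage_edges(cluster)
```

With this definition the measure is directional: `customers.id` fully covers `orders.customer_id`, but not the other way round, which is one possible hint about which table is upstream.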

The beneficial effects of the technical scheme are as follows: the scheme provided by the embodiment is adopted to construct a machine learning model, and a plurality of characteristic information of all data fields are identified based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field; clustering the data fields based on a machine learning model to obtain a plurality of clusters; comparing the unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relationship based on the unique values; sorting the intersection coverage relationships; and sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.

The scheme adopted by the embodiment realizes the blood margin identification and discovery of the data based on the data pattern comparison rule combined with the machine learning capability, and helps enterprises build a data network, which greatly reduces the cost of enterprise data management and effectively improves the efficiency of data management.

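In another embodiment, after step S500, the method further includes:

S600, recommending the content which is ranked ahead in the blood relationship list to a user for selection, wherein the user selects according to the recommended upstream and downstream physical tables, and the selected tables are added as new features into the calculation of the intersection coverage relationship sorting.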

The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is that the content which is ranked at the front in the blood relationship list is recommended to the user for the user to select, the user selects according to the recommended upstream and downstream physical tables, and the selected tables are added as new features to the calculation of the intersection coverage relationship ranking.

Referring to fig. 2, by generating a corresponding list of blood-relationship, the data relationship system may provide and recommend upstream and downstream physical tables (automatic classification results) to the user, the user may select a corresponding physical table according to the classification results, and the table selected by the user may participate in subsequent calculations as a new feature.

The beneficial effects of the technical scheme are as follows: by adopting the scheme provided by the embodiment, the content which is ranked at the front in the blood relationship list is recommended to the user for the user to select, the user selects according to the recommended upstream and downstream physical tables, and the selected tables are added as new features to the calculation of the intersection coverage relationship ranking.

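In another embodiment, the S200 includes:

S201, text semantic extraction is carried out on the content of the data field based on a machine learning model, and the semantics of the data field are obtained;

S202, clustering the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.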

The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is that text semantic extraction is carried out on the content of the data field based on a machine learning model, so as to obtain the semantics of the data field; clustering the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

The clustering methods that can be used include: the K-means clustering algorithm, hierarchical clustering algorithms and spectral clustering algorithms.
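As one example of the clustering families listed above, the sketch below applies single-linkage hierarchical (agglomerative) clustering to data fields. It is a minimal sketch under assumptions: each field is described by a hypothetical set of feature strings standing in for its type, semantics and labels, the Jaccard distance and the `max_distance` cutoff are illustrative choices, and the field names are invented.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two feature sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def single_linkage(features, max_distance=0.5):
    """Merge the two closest clusters until the closest pair is too far apart.

    features maps a field name to its set of feature strings.
    """
    clusters = [[name] for name in features]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(
                    jaccard_distance(features[a], features[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        if best[0] > max_distance:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical fields described by type, semantic and label features.
fields = {
    "customers.id": {"type:int", "sem:customer", "tag:key"},
    "orders.customer_id": {"type:int", "sem:customer", "tag:key"},
    "refunds.cust": {"type:int", "sem:customer"},
    "orders.created": {"type:date", "sem:time"},
    "shipments.shipped": {"type:date", "sem:time"},
}
clusters = single_linkage(fields, max_distance=0.5)
```

Restricting the later unique value comparison to fields inside the same cluster avoids comparing every field with every other field.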

In addition, text semantic extraction can be realized by adopting a semantic extraction model. The semantic extraction model converts the input text into word vectors, applies a one-dimensional convolution structure with a pooling layer to the word vectors to obtain dual-granularity features, and uses a dropout layer to prevent overfitting. A global attention mechanism then uses the context information and the hidden unit information to obtain the weight vector of each part and to distribute the weights, and the text classification is obtained based on an activation function and a fully connected layer, so as to realize text semantic extraction.
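The weight distribution step of a global attention mechanism can be illustrated in isolation. The sketch below scores each hidden state against a context vector, normalizes the scores with softmax and forms the weighted sum. The dot-product scoring, the function names and the toy vectors are assumptions for illustration; the convolution, dropout and fully connected layers of the model are not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_pool(hidden_states, context):
    """Return (weights, pooled), where pooled is the weighted sum of hidden states.

    Each weight is softmax(dot(hidden_state, context)), so states that align
    with the context vector contribute more to the pooled representation.
    """
    scores = [sum(h * c for h, c in zip(state, context)) for state in hidden_states]
    weights = softmax(scores)
    pooled = [
        sum(weight * state[k] for weight, state in zip(weights, hidden_states))
        for k in range(len(context))
    ]
    return weights, pooled


# Two toy hidden states and a context vector aligned with the first one.
weights, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

In a full model the pooled vector would be passed on to the fully connected layer and activation function to produce the text classification.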

The beneficial effects of the technical scheme are as follows: the scheme provided by the embodiment is adopted to extract text semantics of the content of the data field based on the machine learning model, so as to obtain the semantics of the data field; clustering the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

In another embodiment, the S400 includes:

The working principle of the technical scheme is as follows: the scheme adopted in this embodiment is that the intersection coverage relationships are sorted by adopting the PageRank ranking method.

PageRank computes the ranking of web pages based on the link relationships among them and is a method used to reflect the relevance or importance of web pages. The PageRank algorithm computes a PageRank value for each web page and then ranks the importance of the web pages according to the magnitude of this value.
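The PageRank computation can be sketched with a short power iteration. How the method maps lineage data onto a graph is not spelled out in the text, so treating physical tables as nodes and intersection coverage scores as edge weights is an assumption here, as are the function name, the damping factor of 0.85 and the sample edges.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration.

    edges is an iterable of (source, target, weight) with positive weights.
    Returns a dict mapping each node to its score; the scores sum to 1.
    """
    edges = [(s, t, w) for s, t, w in edges if w > 0]
    nodes = sorted({node for s, t, _ in edges for node in (s, t)})
    out_weight = {node: 0.0 for node in nodes}
    for source, _, weight in edges:
        out_weight[source] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes) if nodes else 0.0
        new_rank = {node: base for node in nodes}
        for source, target, weight in edges:
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical coverage edges between physical tables.
scores = pagerank([
    ("customers", "orders", 1.0),
    ("customers", "refunds", 0.5),
    ("orders", "shipments", 0.8),
])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the nodes by score, as in the last line, gives the ordering that a subsequent threshold filter could work on.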

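In another embodiment, the S500 includes:

S501, setting a sorting threshold value to form a blood margin relation between physical tables;

S502, filtering based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.

The working principle of the technical scheme is as follows: the scheme adopted in this embodiment is that a sorting threshold value is set to form a blood margin relation between physical tables, and filtering is carried out based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.

The beneficial effects of the technical scheme are as follows: by adopting the scheme provided by the embodiment, a sorting threshold value is set to form a blood margin relation between physical tables, and filtering is carried out based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.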

In another embodiment, the present embodiment further provides an intelligent blood-margin identification recommendation system based on rules and machine learning, referring to fig. 3, the system includes:

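the characteristic information identification unit is used for constructing a machine learning model and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field;

the clustering unit is used for clustering the data fields based on the machine learning model to obtain a plurality of clusters;

an intersection coverage relation determining unit for comparing unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relation based on the unique values;

the ordering unit is used for ordering the intersection coverage relation;

and the blood relationship list forming unit is used for sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.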

The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is that the system comprises: the characteristic information identification unit is used for constructing a machine learning model and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field; the clustering unit is used for clustering the data fields based on the machine learning model to obtain a plurality of clusters; an intersection coverage relation determining unit for comparing unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relation based on the unique values; the ordering unit is used for ordering the intersection coverage relation; and the blood relationship list forming unit is used for sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.

The beneficial effects of the technical scheme are as follows: the scheme provided by the embodiment is that the system comprises: the characteristic information identification unit is used for constructing a machine learning model and identifying a plurality of characteristic information of all data fields based on the machine learning model; the characteristic information comprises a unique value, a maximum value and a minimum value of a field; the clustering unit is used for clustering the data fields based on the machine learning model to obtain a plurality of clusters; an intersection coverage relation determining unit for comparing unique values of the data fields in each cluster based on the data pattern comparison rule, and determining an intersection coverage relation based on the unique values; the ordering unit is used for ordering the intersection coverage relation; and the blood relationship list forming unit is used for sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering.

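In another embodiment, the system further comprises:

the recommending unit is used for recommending the content which is ranked ahead in the blood relationship list to the user for the user to select, the user selects according to the recommended upstream and downstream physical tables, and the selected tables are added as new features into the calculation of the intersection coverage relationship sorting.

In another embodiment, the clustering unit includes:

the semantic extraction subunit is used for extracting text semantics from the content of the data field based on the machine learning model to obtain the semantics of the data field;

and the feature clustering subunit is used for clustering the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

The working principle of the technical scheme is as follows: the scheme adopted by the embodiment is that the clustering unit comprises the semantic extraction subunit, which extracts text semantics from the content of the data field based on the machine learning model to obtain the semantics of the data field, and the feature clustering subunit, which clusters the data fields according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

The beneficial effects of the technical scheme are as follows: with the clustering unit provided by the embodiment, the semantics of the data field are obtained through text semantic extraction based on the machine learning model, and the data fields are clustered according to the content, the type, the semantics and the labels to form a plurality of clusters containing different features.

In another embodiment, the sorting unit includes:

and the PageRank ordering subunit is used for ordering the intersection coverage relationship by adopting a PageRank ordering method.

In another embodiment, the blood relationship list forming unit includes:

a sorting threshold setting subunit, configured to set a sorting threshold to form a blood margin relation between the physical tables;

and the filtering subunit is used for filtering based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.

The working principle of the technical scheme is as follows: the scheme adopted in this embodiment is that the blood relationship list forming unit includes the sorting threshold setting subunit, which sets a sorting threshold to form a blood margin relation between the physical tables, and the filtering subunit, which filters based on the sorting and the sorting threshold value to form a blood relationship list between the physical tables.

The beneficial effects of the technical scheme are as follows: with the blood relationship list forming unit provided by the embodiment, a sorting threshold is set to form a blood margin relation between the physical tables, and filtering based on the sorting and the sorting threshold value forms a blood relationship list between the physical tables.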

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. An intelligent blood-margin identification recommendation method based on rules and machine learning is characterized by comprising the following steps:

S400, sorting the intersection coverage relation;

S500, sorting and filtering based on the sorting, and forming a blood margin relation list among the physical tables after filtering;

the S500 includes:

2. The intelligent blood-margin recognition recommendation method based on rules and machine learning according to claim 1, further comprising, after step S500:

3. The intelligent blood-margin recognition recommendation method based on rules and machine learning of claim 1, wherein S200 comprises:

4. The intelligent blood-margin recognition recommendation method based on rules and machine learning of claim 1, wherein S400 comprises:

5. An intelligent blood-margin recognition recommendation system based on rules and machine learning, which is characterized by comprising:

the ordering unit is used for ordering the intersection coverage relation;

the blood relationship list forming unit is used for sorting and filtering based on the sorting, and forming a blood relationship list among the physical tables after filtering;

the blood relationship list forming unit includes:

6. The intelligent blood-margin recognition recommendation system based on rules and machine learning of claim 5, further comprising:

7. The intelligent blood-margin recognition recommendation system based on rules and machine learning of claim 5, wherein the clustering unit comprises:

8. The intelligent blood-margin recognition recommendation system based on rules and machine learning of claim 5, wherein the ranking unit comprises: