CN118069637A - Metadata analysis method and device, electronic equipment and storage medium - Google Patents

Metadata analysis method and device, electronic equipment and storage medium

Info

Publication number
CN118069637A
Authority
CN
China
Prior art keywords
metadata
vector
data
hyperplane
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410033602.0A
Other languages
Chinese (zh)
Inventor
李玮
孙洪龙
朱德福
张晓岩
仇志伟
侯绍君
崔路凯
周江涛
柳行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan Rockontrol Industrial Co ltd
Original Assignee
Taiyuan Rockontrol Industrial Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan Rockontrol Industrial Co ltd filed Critical Taiyuan Rockontrol Industrial Co ltd
Priority to CN202410033602.0A priority Critical patent/CN118069637A/en
Publication of CN118069637A publication Critical patent/CN118069637A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a metadata analysis method, a metadata analysis device, electronic equipment and a storage medium. The method comprises the following steps: acquiring metadata to be analyzed; querying, based on a preset hash map, the target data bucket in which the metadata to be analyzed is located, wherein the hash map stores the mapping relation between metadata and data buckets, and the mapping relation is obtained by classifying the metadata vectors corresponding to the metadata and storing metadata vectors of the same class in the same data bucket; and performing similarity calculation on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain a data analysis result. During data analysis, the target data bucket in which the metadata to be analyzed is located is found first, and similarity calculation is then performed only against the target metadata vectors in that bucket. One piece of metadata does not need to be compared with all other metadata, so the amount of computation and the labor cost are reduced and the data analysis efficiency is improved.

Description

Metadata analysis method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a metadata analysis method, a metadata analysis device, an electronic device, and a storage medium.
Background
Metadata is data that defines and describes data. It serves as a translation link between services and systems and provides semantics and logic that are consistent between the two. Metadata analysis is applied in many fields, such as data acquisition, data development, data quality monitoring and data set querying. In the prior art, analyzing metadata usually requires manually sorting the metadata and finding similar tables and fields according to the metadata definitions, so data processing efficiency is low.
Disclosure of Invention
The embodiments of the present application aim to provide a metadata analysis method, a device, electronic equipment and a storage medium in which metadata is classified in advance and mapped to corresponding data buckets according to the classification results, so that each data bucket stores metadata with a certain degree of similarity. During data analysis, the target data bucket in which the metadata to be analyzed is located is found, and similarity calculation is performed on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain a data analysis result, which reduces the amount of computation and the labor cost and improves the data analysis efficiency.
In a first aspect, an embodiment of the present application provides a metadata analysis method, including: acquiring metadata to be analyzed; querying, based on a preset hash map, the target data bucket in which the metadata to be analyzed is located, wherein the hash map stores the mapping relation between metadata and data buckets, and the mapping relation is obtained by classifying the metadata vectors corresponding to the metadata and storing metadata vectors of the same class in the same data bucket; and performing similarity calculation on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain a data analysis result.
In the implementation process, metadata are classified in advance, and mapped to corresponding data buckets according to classification results, and metadata with a certain degree of similarity are stored in each data bucket. When data analysis is carried out, a target data bucket in which metadata to be analyzed is located is searched first, then similarity calculation is carried out on the metadata to be analyzed and target metadata vectors in the target data bucket, and a data analysis result is obtained, and comparison of one piece of metadata with all pieces of metadata is not needed, so that the operation amount and labor cost are reduced, and the data analysis efficiency is improved.
Optionally, in an embodiment of the present application, before querying a target data bucket where metadata to be analyzed is located based on a preset hash map, the method further includes: vector conversion is carried out on metadata acquired in advance, and metadata vectors corresponding to each metadata are obtained; obtaining at least two pre-constructed hyperplanes; classifying the metadata vector by using a hyperplane to obtain a vector classification result; based on the vector classification result, the mapping relation between the metadata and the data bucket is stored by utilizing the hash map.
In the implementation process, metadata is converted into metadata vectors, and the metadata vectors can be divided into categories with similar characteristics by selecting proper hyperplanes in a data space, so that the classification of the metadata vectors is realized. And storing the mapping relation between the metadata and the data bucket by utilizing the hash map, thereby improving the calculation efficiency.
Optionally, in an embodiment of the present application, classifying metadata vectors by using a hyperplane to obtain a vector classification result includes: performing dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector; and classifying the metadata vector based on the hash value corresponding to the metadata vector to obtain a vector classification result.
In the implementation process, the cosine value of the included angle between the vector and the two-dimensional hyperplane is calculated by utilizing the dot product, so that the similarity between the vector and the two-dimensional hyperplane is determined, the metadata vector is classified by utilizing the hyperplane, a vector classification result is obtained, and the data analysis efficiency is improved.
Optionally, in an embodiment of the present application, performing dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector includes: respectively carrying out dot product calculation on the metadata vector and each hyperplane by using a vector dot product formula to obtain a dot product result of the metadata vector and each hyperplane; and ordering dot product results corresponding to the metadata vectors according to a preset hyperplane sequence to obtain hash values corresponding to the metadata vectors.
In the implementation process, the vector dot product formula is utilized to respectively calculate dot products of the metadata vector and each hyperplane, and the dot product results are ordered according to a preset hyperplane sequence to obtain hash values corresponding to the metadata vector, so that more accurate support is provided for subsequent data analysis.
Optionally, in an embodiment of the present application, the metadata vector includes a vector abscissa, a vector ordinate, and a vector vertical coordinate; the hyperplane includes a hyperplane abscissa, a hyperplane ordinate, and a hyperplane vertical coordinate; the vector dot product formula includes:
$$V_n \cdot S_n = X_1 X_2 + Y_1 Y_2 + Z_1 Z_2$$
where $V_n$ is the metadata vector; $S_n$ is the hyperplane; $V_n \cdot S_n$ is the dot product result; $X_1$ is the vector abscissa; $Y_1$ is the vector ordinate; $Z_1$ is the vector vertical coordinate; $X_2$ is the hyperplane abscissa; $Y_2$ is the hyperplane ordinate; $Z_2$ is the hyperplane vertical coordinate.
In the implementation process, the vector dot product formula is used to calculate the dot product of the metadata vector with each hyperplane, and the dot product results are ordered according to a preset hyperplane sequence to obtain the hash value corresponding to the metadata vector. Each dot product result can be represented by 1 or 0, and the number of hyperplanes can be set according to actual conditions. The hash values generated in this way provide more accurate support for subsequent data analysis.
Optionally, in an embodiment of the present application, storing, based on the vector classification result, a mapping relationship between metadata and a data bucket using a hash map includes: based on the vector classification result, storing the metadata vectors of the same category into the same data bucket; and storing the mapping relation between the metadata corresponding to the metadata vector and the data bucket by utilizing the hash map.
In the implementation process, after the metadata vectors are classified, the metadata vectors classified into the same category are data with certain similarity, so that before consistency analysis, data without relevance are filtered to a great extent, more accurate data are provided for further consistency analysis, and accuracy and efficiency of data analysis are improved.
Optionally, in an embodiment of the present application, performing similarity calculation on metadata to be analyzed and a target metadata vector in a target data bucket to obtain a data analysis result, where the method includes: obtaining a target metadata vector in a target data bucket; performing similarity calculation on the metadata to be analyzed and each target metadata vector to obtain similarity data corresponding to each target metadata vector; and sorting the similarity data, and obtaining a data analysis result based on the similarity sorting result.
In the implementation process, similarity calculation is performed on the metadata to be analyzed and the target metadata vector in the target data bucket, and comparison of one metadata with all metadata is not needed, so that the operation amount and the labor cost are reduced, and the data analysis efficiency is improved.
In a second aspect, an embodiment of the present application further provides a metadata analysis apparatus, including: an acquisition module, used for acquiring metadata to be analyzed; a data bucket query module, used for querying, based on a preset hash map, the target data bucket in which the metadata to be analyzed is located, wherein the hash map stores the mapping relation between metadata and data buckets, and the mapping relation is obtained by classifying the metadata vectors corresponding to the metadata and storing metadata of the same class in the same data bucket; and an analysis module, used for performing similarity calculation on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain a data analysis result.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory storing machine-readable instructions executable by the processor to perform the method as described above when executed by the processor.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method described above.
With the metadata analysis method, device, electronic equipment and storage medium described above, metadata is classified in advance and mapped to corresponding data buckets according to the classification results, so that each data bucket stores metadata with a certain degree of similarity. During data analysis, the target data bucket in which the metadata to be analyzed is located is found first, and similarity calculation is then performed on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain a data analysis result. One piece of metadata does not need to be compared with all other metadata, so the amount of computation and the labor cost are reduced and the data analysis efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a metadata analysis method according to an embodiment of the present application;
FIG. 2 is a schematic view of a hyperplane provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a metadata analysis device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the technical scheme of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and thus are merely examples, and are not intended to limit the scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In the description of embodiments of the present application, the technical terms "first," "second," and the like are used merely to distinguish between different objects and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated, a particular order or a primary or secondary relationship. In the description of the embodiments of the present application, the meaning of "plurality" is two or more unless otherwise specifically defined.
Metadata management has become an indispensable part of data management, particularly in data acquisition, data development, data quality and data application. Metadata management includes similar-metadata recommendation, which helps data managers quickly find consistently defined fields across different tables in a data warehouse and thus facilitates application scenarios such as data comparison and data consistency analysis.
In existing metadata management, table information and field information in databases are collected into a data warehouse in a unified way through data acquisition, and the business attributes, technical attributes and management attributes of the metadata are defined through inventory and cataloging. In data consistency analysis and data comparison, similar tables and fields need to be found according to the metadata definitions. Consistent metadata is found by manually sorting through the definitions and descriptions of metadata in thousands of data tables and fields. Metadata in a data warehouse often involves multiple systems, departments and organizational structures, and the systems do not define the same data in a uniform way, so accurate and complete matching is difficult.
On the other hand, in the prior art, when data consistency analysis is performed, one piece of metadata is compared with all pieces of metadata, so that the operand is overlarge, the resource consumption is serious, and the data processing efficiency is low.
The embodiment of the application provides a metadata analysis method, a device, electronic equipment and a storage medium, metadata are classified in advance, the metadata are mapped to corresponding data barrels according to classification results, and metadata with a certain degree of similarity are stored in each data barrel. When data analysis is carried out, a target data bucket in which metadata to be analyzed is located is searched first, then similarity calculation is carried out on the metadata to be analyzed and target metadata vectors in the target data bucket, and a data analysis result is obtained, and comparison of one piece of metadata with all pieces of metadata is not needed, so that the operation amount and labor cost are reduced, and the data analysis efficiency is improved.
Please refer to fig. 1, which illustrates a flowchart of a metadata analysis method according to an embodiment of the present application. The metadata analysis method provided by the embodiment of the application can be applied to electronic equipment, which may include a terminal and a server; the terminal may be a smart phone, a tablet computer, a personal digital assistant (PDA) and the like; the server may be an application server or a Web server. The metadata analysis method may include:
Step S110: acquiring metadata to be analyzed.
Step S120: querying, based on a preset hash map, the target data bucket in which the metadata to be analyzed is located; the hash map stores the mapping relation between metadata and data buckets; the mapping relation is obtained by classifying the metadata vectors corresponding to the metadata and storing metadata vectors of the same class in the same data bucket.
Step S130: performing similarity calculation on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain a data analysis result.
In step S110, metadata to be analyzed is metadata that is specified by a user and that is required to make a similarity recommendation. Metadata is data that describes and defines data, and may provide attributes, features, and other detailed information about the data. For example, metadata may include information about the source, format, structure, content, quality, usage rules, and processing of the data.
In step S120, the target data bucket in which the metadata to be analyzed is located is queried in a preset hash map, which stores the mapping relation between metadata and data buckets. A hash map is a data structure that associates keys with values and maps each key to an index location through a hash function, enabling efficient data access and lookup. For example, the metadata may be used as the key and the data bucket in which the metadata is located as the value, thereby associating the metadata with the data bucket.
A data bucket (or "bucket") is a basic element for storing data in a database and may be understood as a storage unit in the database, for example a contiguous block of memory space.
The process of obtaining the mapping relation between metadata and data buckets is described below. First, metadata is collected and vector conversion is performed on it to obtain the metadata vector corresponding to each piece of metadata. The metadata vectors are then classified, and metadata vectors of the same class are stored in the same data bucket. This yields the association between metadata vectors and data buckets, and the mapping relation between the metadata corresponding to each metadata vector and its data bucket can then be stored using the hash map.
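As a minimal sketch of this lookup, the hash map can be modeled as a Python dictionary whose keys are metadata identifiers and whose values are bucket identifiers. The field names, the bucket labels and the helper `find_target_bucket` are hypothetical and serve only as an illustration.

```python
# Hypothetical mapping from metadata identifiers (keys) to bucket identifiers (values).
# All names and labels below are invented for illustration.
metadata_to_bucket = {
    "orders.customer_id": "110010",
    "crm.client_id": "110010",
    "orders.created_at": "001101",
}


def find_target_bucket(metadata_key, mapping):
    """Return the bucket id stored for a piece of metadata, or None if it is unknown."""
    return mapping.get(metadata_key)


target_bucket = find_target_bucket("crm.client_id", metadata_to_bucket)
print(target_bucket)
```

Because a dictionary lookup takes constant time on average, locating the target bucket does not require scanning the other metadata.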
In step S130, metadata vectors of the same category are stored in the same data bucket, and then the target data bucket in which the metadata to be analyzed is located includes at least one target metadata vector, and similarity calculation is performed on the metadata to be analyzed and the target metadata vectors in the target data bucket to obtain similarity data of the metadata to be analyzed and each target metadata vector, and a data analysis result can be obtained based on the similarity data.
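The text does not name a specific similarity measure. The sketch below assumes cosine similarity and compares the vector of the metadata to be analyzed only with the vectors held in the target bucket. The function names, field names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_bucket(query_vector, bucket):
    """Compute the similarity of the query vector to every vector in one bucket."""
    return {name: cosine_similarity(query_vector, vec) for name, vec in bucket.items()}


# Hypothetical target bucket: metadata name -> metadata vector.
target_bucket = {
    "crm.client_id": [1.0, 0.0, 0.0],
    "billing.cust_no": [1.0, 1.0, 0.0],
}
scores = score_bucket([1.0, 0.0, 0.0], target_bucket)
print(scores)
```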
The data analysis result may represent a consistency analysis result or similarity recommendation data for the metadata to be analyzed, and the like. For example, target metadata vectors whose similarity data is greater than a preset threshold may be used as the data analysis result; alternatively, the similarity data may be ranked and a preset number of top-ranked target metadata vectors used as the data analysis result.
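The two selection strategies mentioned here can be sketched as follows. The similarity scores, the threshold and the helper names `select_by_threshold` and `select_top_k` are hypothetical.

```python
def select_by_threshold(similarities, threshold):
    """Names whose similarity exceeds the threshold, most similar first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


def select_top_k(similarities, k):
    """The k most similar names, most similar first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical similarity data for three target metadata vectors.
similarities = {"a": 0.91, "b": 0.42, "c": 0.77}
print(select_by_threshold(similarities, 0.7))
print(select_top_k(similarities, 2))
```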
In the implementation process, metadata are classified in advance, and mapped to corresponding data buckets according to classification results, and metadata with a certain degree of similarity are stored in each data bucket. When data analysis is carried out, a target data bucket in which metadata to be analyzed is located is searched first, then similarity calculation is carried out on the metadata to be analyzed and target metadata vectors in the target data bucket, and a data analysis result is obtained, and comparison of one piece of metadata with all pieces of metadata is not needed, so that the operation amount and labor cost are reduced, and the data analysis efficiency is improved.
Optionally, in an embodiment of the present application, before querying a target data bucket where metadata to be analyzed is located based on a preset hash map, the method further includes: vector conversion is carried out on metadata acquired in advance, and metadata vectors corresponding to each metadata are obtained; obtaining at least two pre-constructed hyperplanes; classifying the metadata vector by using a hyperplane to obtain a vector classification result; based on the vector classification result, the mapping relation between the metadata and the data bucket is stored by utilizing the hash map.
In the specific implementation process: before inquiring a target data bucket where metadata to be analyzed are located based on a preset hash map, storing the mapping relation between the metadata and the data bucket by using the hash map.
Specifically, for example, metadata is collected in advance using a metadata collection tool, which can extract metadata by scanning a data source (e.g., a database, a file system or an API). Alternatively, the tool can connect directly to the database system and obtain the metadata information of the database.
The process of performing vector conversion on the collected metadata includes: parsing the collected metadata to obtain metadata attributes, and combining the metadata attributes to generate metadata features, where the combination may be done by text splicing. Vector conversion is then performed on the metadata features to obtain the metadata vector corresponding to each piece of metadata. Vector conversion can be implemented using word2vec, or in other ways as required.
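The sketch below illustrates text splicing followed by vector conversion. It does not use word2vec: `toy_vectorize` is a deliberately simple stand-in that counts tokens into a fixed number of slots by CRC32, chosen only so the example runs with the standard library. The attribute names and both helper functions are hypothetical.

```python
import zlib


def build_feature_text(attributes):
    """Splice attribute values into one text, in a fixed (sorted) key order."""
    return " ".join(str(attributes[key]) for key in sorted(attributes))


def toy_vectorize(text, dim=8):
    """Stand-in for a real embedding model: count tokens into `dim` slots by CRC32."""
    vector = [0.0] * dim
    for token in text.lower().split():
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vector


# Hypothetical metadata attributes for one field.
attributes = {"table": "orders", "field": "customer_id", "comment": "customer identifier"}
feature_text = build_feature_text(attributes)
metadata_vector = toy_vectorize(feature_text)
print(feature_text)
print(metadata_vector)
```

In practice a trained embedding model would replace `toy_vectorize`, so that semantically similar descriptions yield nearby vectors.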
Please refer to fig. 2, which illustrates a hyperplane diagram according to an embodiment of the present application.
At least two hyperplanes are constructed. A hyperplane is a subspace whose dimension is one less than that of the space containing it. In two-dimensional space, a hyperplane is a straight line; in three-dimensional space, a hyperplane is a plane.
In an embodiment of the present application, the hyperplane may be a two-dimensional embedding. As shown in FIG. 2, in two dimensions a hyperplane can be thought of as a linear classifier that maps different metadata vectors to different regions of a plane and classifies them by region to obtain a vector classification result.
It will be appreciated that if two-dimensional hyperplanes cannot separate the metadata vectors, a non-linear approach such as the kernel trick may be required to map the metadata vectors into a higher-dimensional space in order to classify them.
The number of hyperplanes may be determined based on the number of metadata vectors or on the required classification accuracy. The more hyperplanes there are, the higher the classification accuracy and the lower the probability that metadata is classified as similar; conversely, the fewer hyperplanes there are, the lower the accuracy and the higher the probability that metadata is classified as similar. For example, assuming there are 700 metadata vectors and one hyperplane is to be constructed per 100 pieces of data, the number of hyperplanes may be 7. As shown in FIG. 2, the hyperplanes may be represented in a rectangular coordinate system on a two-dimensional plane, and there may be 7 hyperplanes, denoted S1 to S7.
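The arithmetic of this example can be sketched as follows. The helper names are hypothetical, and the bucket-count bound assumes that each hyperplane contributes one binary digit to the hash value, so that n hyperplanes yield at most 2^n distinct hash values.

```python
import math


def hyperplane_count(num_vectors, vectors_per_hyperplane, minimum=2):
    """One hyperplane per `vectors_per_hyperplane` vectors, but never fewer than `minimum`."""
    return max(minimum, math.ceil(num_vectors / vectors_per_hyperplane))


def max_bucket_count(num_hyperplanes):
    """Upper bound on distinct hash values when each hyperplane yields one bit."""
    return 2 ** num_hyperplanes


n = hyperplane_count(700, 100)
print(n, max_bucket_count(n))
```

The bound makes the trade-off concrete: each additional hyperplane doubles the number of possible buckets, so fewer vectors share any one bucket.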
After the vector classification result is obtained, the metadata vectors of the same category are stored in the same data bucket, and the mapping relation between the metadata and the data bucket is stored by utilizing the hash map based on the association relation between the metadata vectors and the data bucket.
In the implementation process, metadata is converted into metadata vectors, and the metadata vectors can be divided into categories with similar characteristics by selecting proper hyperplanes in a data space, so that the classification of the metadata vectors is realized. And storing the mapping relation between the metadata and the data bucket by utilizing the hash map, thereby improving the calculation efficiency.
Optionally, in an embodiment of the present application, classifying metadata vectors by using a hyperplane to obtain a vector classification result includes: performing dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector; and classifying the metadata vector based on the hash value corresponding to the metadata vector to obtain a vector classification result.
In the specific implementation process: the metadata vectors are classified by using the hyperplanes. For example, the dot product of a metadata vector with each hyperplane is calculated to obtain a dot product result for each hyperplane, and the hash value corresponding to the metadata vector is obtained based on these dot product results.
The dot product, also known as the scalar product or inner product, is a mathematical operation on two vectors. In Euclidean space, the dot product is the sum of the products of the corresponding elements of two vectors of equal length. The dot product can be used to determine the similarity between two vectors, or the cosine of the angle between a vector and a two-dimensional hyperplane.
For example, with 6 hyperplanes in total, the dot product of the metadata vector with each of the 6 hyperplanes is calculated to obtain 6 dot product results; these 6 results are then combined, ordered or otherwise processed to generate the hash value corresponding to the metadata vector. This calculation is performed for each metadata vector to generate its hash value.
Based on the hash value corresponding to the metadata vector, the metadata vector is classified to obtain a vector classification result, for example, the metadata vector with the same hash value may be used as a class, or the hash value may be divided into data segments, and the metadata vector corresponding to the hash value of each data segment may be used as a class.
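The first option, treating metadata vectors with the same hash value as one class, can be sketched as a grouping step. The hash values, field names and the helper `group_by_hash` are hypothetical.

```python
from collections import defaultdict


def group_by_hash(hash_values):
    """Group metadata names by hash value; each group corresponds to one data bucket."""
    buckets = defaultdict(list)
    for name, value in hash_values.items():
        buckets[value].append(name)
    return dict(buckets)


# Hypothetical hash values already computed for three pieces of metadata.
hash_values = {
    "orders.customer_id": "110010",
    "crm.client_id": "110010",
    "orders.created_at": "001101",
}
buckets = group_by_hash(hash_values)
print(buckets)
```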
In the implementation process, the cosine value of the included angle between the vector and the two-dimensional hyperplane is calculated by utilizing the dot product, so that the similarity between the vector and the two-dimensional hyperplane is determined, the metadata vector is classified by utilizing the hyperplane, a vector classification result is obtained, and the data analysis efficiency is improved.
Optionally, in an embodiment of the present application, performing dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector includes: respectively carrying out dot product calculation on the metadata vector and each hyperplane by using a vector dot product formula to obtain a dot product result of the metadata vector and each hyperplane; and ordering dot product results corresponding to the metadata vectors according to a preset hyperplane sequence to obtain hash values corresponding to the metadata vectors.
In the specific implementation process: and respectively carrying out dot product calculation on the metadata vector and each hyperplane by using a vector dot product formula to obtain a dot product result of the metadata vector and each hyperplane. As one embodiment, the metadata vector includes a vector abscissa, a vector ordinate, and a vector ordinate; the hyperplane comprises a hyperplane abscissa, a hyperplane ordinate and a hyperplane ordinate; the vector dot product formula includes:
Vn·Sn=X1·X2+Y1·Y2+Z1·Z2
Where Vn is the metadata vector; Sn is the hyperplane; Vn·Sn is the dot product result; X1 is the vector abscissa; Y1 is the vector ordinate; Z1 is the vector vertical coordinate; X2 is the hyperplane abscissa; Y2 is the hyperplane ordinate; Z2 is the hyperplane vertical coordinate.
The dot product of the metadata vector with each hyperplane can be obtained using the above formula, where the dot product Vn·Sn represents the product of the length of the projection of the metadata vector Vn onto the hyperplane Sn and the length of Sn. If the dot product result is greater than 0, the angle between the two is less than 90 degrees, that is, the metadata vector can be projected onto the hyperplane; in this case the dot product result may be recorded as 1, representing a similarity of 1.
Similarly, a dot product result equal to 0 indicates that the two are perpendicular to each other, and a dot product result less than 0 indicates that the angle between the two is greater than 90 degrees. In both cases the metadata vector cannot, or can only with difficulty, be projected onto the hyperplane; when the dot product result is not greater than 0, it may be recorded as 0, representing a similarity of 0.
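The dot product and the 1/0 recording rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `dot` and `sign_bit` are hypothetical, and both the vector and the hyperplane are represented as plain coordinate tuples (X, Y, Z).

```python
def dot(vector, hyperplane):
    """Dot product X1*X2 + Y1*Y2 + Z1*Z2 of two coordinate tuples."""
    if len(vector) != len(hyperplane):
        raise ValueError("vector and hyperplane must have the same length")
    return sum(a * b for a, b in zip(vector, hyperplane))


def sign_bit(vector, hyperplane):
    """Record 1 when the dot product is greater than 0, otherwise 0."""
    return 1 if dot(vector, hyperplane) > 0 else 0


same_side = sign_bit((1, 0, 0), (1, 0, 0))       # angle below 90 degrees
perpendicular = sign_bit((1, 0, 0), (0, 1, 0))   # dot product equals 0
opposite = sign_bit((1, 0, 0), (-1, 0, 0))       # angle above 90 degrees
```

Only the first case is recorded as 1; the perpendicular and opposite cases are both recorded as 0, matching the rule that a result not greater than 0 is recorded as 0.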
The dot product results corresponding to a metadata vector are ordered according to a preset hyperplane sequence. For example, suppose there are 6 hyperplanes in total, numbered S1-S6. Using the dot product formula, the 6 dot product results of the metadata vector with S1-S6 are obtained, and these 6 results are arranged in the hyperplane numbering order S1-S6 to obtain the hash value corresponding to the metadata vector.
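A minimal Python sketch of this ordering step is given below, assuming each dot product result has already been reduced to 1 or 0 and that the hash value is the string formed by those digits in hyperplane order. The function name `hash_from_hyperplanes` and the six hyperplane coordinates are hypothetical values chosen only for illustration.

```python
def hash_from_hyperplanes(vector, hyperplanes):
    """Concatenate one 0/1 digit per hyperplane, in the preset order."""
    digits = []
    for plane in hyperplanes:  # list order is the preset sequence S1..Sm
        product = sum(a * b for a, b in zip(vector, plane))
        digits.append("1" if product > 0 else "0")
    return "".join(digits)


# Hypothetical hyperplanes S1-S6, each given by three coordinates.
S = [
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
    (-1, 0, 0),
    (1, 1, 1),
    (0, -1, 0),
]
hash_value = hash_from_hyperplanes((1, 2, -1), S)  # "110010"
```

Because the digits are appended in list order, changing the preset hyperplane sequence changes the hash value, so the same sequence has to be used for every metadata vector.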
For ease of recording, the dot product results and hash values of each metadata vector with each hyperplane may be tabulated, as shown in Table 1.
        S1    S2    S3    S4    S5    S6    Hash value
V1      1     1     1     0     0     0     111000
V2      1     1     1     0     1     0     111010
V3      1     1     1     0     0     0     111000
...     ...   ...   ...   ...   ...   ...   ...
Vn      0     1     1     1     0     1     011101
Table 1. Dot product results and hash values of the metadata vectors and hyperplanes
In Table 1, the first row lists the 6 hyperplane numbers and the hash value heading, and the first column lists the n metadata vectors. The second row gives the dot product results of the metadata vector V1 with the 6 hyperplanes, together with the hash value of V1 obtained from these 6 dot product results. The third row gives the dot product results of the metadata vector V2 with the 6 hyperplanes, together with the hash value of V2 obtained from these 6 dot product results.
As can be seen from Table 1, the hash values of the metadata vector V1 and the metadata vector V3 are identical. If the category division rule is to place metadata vectors with identical hash values into one category, the metadata vector V1 and the metadata vector V3 are classified as metadata vectors of the same category.
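The grouping of V1 and V3 can be reproduced with a short Python sketch that uses the three fully listed rows of Table 1. The names `TABLE_1` and `group_by_hash` are hypothetical, and the rule applied is the identical-hash rule described above.

```python
from collections import defaultdict

# Recorded 1/0 dot product results for S1..S6, taken from Table 1.
TABLE_1 = {
    "V1": (1, 1, 1, 0, 0, 0),
    "V2": (1, 1, 1, 0, 1, 0),
    "V3": (1, 1, 1, 0, 0, 0),
}


def group_by_hash(rows):
    """Place vectors with identical hash values into the same category."""
    groups = defaultdict(list)
    for name, bits in rows.items():
        hash_value = "".join(str(bit) for bit in bits)
        groups[hash_value].append(name)
    return dict(groups)


groups = group_by_hash(TABLE_1)
```

Here `groups` holds V1 and V3 under the hash value 111000 and V2 alone under 111010.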
In the above implementation, the vector dot product formula is used to calculate the dot product of the metadata vector with each hyperplane, and the dot product results are ordered according to the preset hyperplane sequence to obtain the hash value corresponding to the metadata vector. Each dot product result can be represented by 1 or 0, and the number of hyperplanes can be set according to actual conditions, so that the generated hash values provide more accurate support for subsequent data analysis.
Optionally, in an embodiment of the present application, storing, based on the vector classification result, a mapping relationship between metadata and a data bucket using a hash map includes: based on the vector classification result, storing the metadata vectors of the same category into the same data bucket; and storing the mapping relation between the metadata corresponding to the metadata vector and the data bucket by utilizing the hash map.
In a specific implementation, after the hash value of each metadata vector is obtained as in the above embodiment, the metadata vectors with identical hash values are placed into one category to obtain the vector classification result. Alternatively, the metadata vectors whose hash values fall in the same data segment are treated as one category to obtain the vector classification result.
The metadata vectors of the same category are stored in the same data bucket, which yields the association between metadata vectors and data buckets. Since each metadata vector corresponds to one piece of metadata, the mapping relationship between metadata and data buckets can be derived from the association between metadata vectors and data buckets, and this mapping relationship is stored using the hash map. For example, the metadata may be used as the key of the hash map, and the data bucket in which the metadata is located may be stored as the value of the hash map.
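One way to picture the key/value arrangement is the following Python sketch, in which a `dict` plays the role of the hash map. It is an illustration under assumptions: the function name `build_bucket_map` is hypothetical, each piece of metadata is represented by a string identifier, and the hash value itself is used as the bucket identifier.

```python
def build_bucket_map(metadata_hashes):
    """Return (hash_map, buckets) for a dict of metadata id -> hash value.

    hash_map: metadata id (key) -> bucket id (value).
    buckets:  bucket id -> list of metadata ids stored in that bucket.
    """
    hash_map = {}
    buckets = {}
    for metadata_id, hash_value in metadata_hashes.items():
        bucket_id = hash_value  # assumption: one bucket per hash value
        buckets.setdefault(bucket_id, []).append(metadata_id)
        hash_map[metadata_id] = bucket_id
    return hash_map, buckets


# Hypothetical metadata identifiers with already computed hash values.
hash_map, buckets = build_bucket_map(
    {"user_id": "111000", "order_no": "111010", "usr_id": "111000"}
)
target_bucket = buckets[hash_map["usr_id"]]
```

Looking up a piece of metadata is then a single dictionary access followed by reading the members of its bucket.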
In the implementation process, after the metadata vectors are classified, the metadata vectors classified into the same category are data with certain similarity, so that before consistency analysis, data without relevance are filtered to a great extent, more accurate data are provided for further consistency analysis, and accuracy and efficiency of data analysis are improved.
Optionally, in an embodiment of the present application, performing similarity calculation on metadata to be analyzed and a target metadata vector in a target data bucket to obtain a data analysis result, where the method includes: obtaining a target metadata vector in a target data bucket; performing similarity calculation on the metadata to be analyzed and each target metadata vector to obtain similarity data corresponding to each target metadata vector; and sorting the similarity data, and obtaining a data analysis result based on the similarity sorting result.
In a specific implementation, when data analysis is performed on the metadata to be analyzed, the metadata to be analyzed is used to look up, in the hash map, the target data bucket in which it is located, and the target metadata vectors in the target data bucket are then obtained. Similarity calculation is performed between the metadata to be analyzed and each target metadata vector to obtain the similarity data corresponding to each target metadata vector. Similarity calculation is a method for measuring how similar two objects are, and may use Euclidean distance, cosine similarity, the Pearson correlation coefficient, Manhattan distance, or the like; the embodiment of the present application is not limited in this respect.
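Three of the measures named above can be sketched in Python with the standard library only. The function names are hypothetical, the inputs are assumed to be equal-length numeric tuples, and returning 0.0 for a zero-length vector in the cosine case is a convention chosen here, not something the embodiment specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (smaller is more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (smaller is more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Note the direction of each measure: a larger cosine similarity means more similar, whereas for the two distances a smaller value means more similar, which matters when the results are later sorted.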
After the similarity data between the metadata to be analyzed and each target metadata vector is obtained, the similarity data is sorted, and the data analysis result is obtained based on the similarity sorting result. For example, a preset number of top-ranked entries in the similarity sorting result are taken as the consistency analysis result of the metadata to be analyzed, or are used for similar data recommendation.
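The sorting and top-ranked selection can be sketched as follows. The function name `rank_top_n` and the example scores are hypothetical, and the sketch assumes that a higher score means more similar (as with cosine similarity); for a distance measure the sort order would be ascending instead.

```python
def rank_top_n(similarities, n):
    """Return the n (name, score) pairs with the highest scores.

    similarities: dict mapping a target vector name to its similarity score.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical similarity data for three target metadata vectors.
top_two = rank_top_n({"A": 0.2, "B": 0.9, "C": 0.5}, n=2)
```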
In the implementation process, similarity calculation is performed on the metadata to be analyzed and the target metadata vector in the target data bucket, and comparison of one metadata with all metadata is not needed, so that the operation amount and the labor cost are reduced, and the data analysis efficiency is improved.
In an alternative embodiment, metadata is collected by a metadata collection tool. In accordance with metadata management standards, the attributes of the metadata include, but are not limited to: field code, field name, data type, object class, property, etc. The attributes of the metadata are combined to form the metadata features, and the combination may be done by text splicing. Vector conversion is then performed on the metadata features using the word2vec technique, so that each piece of metadata corresponds to exactly one vector.
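The text-splicing step can be illustrated with a small Python sketch. The dictionary keys, the space separator, and the function name `splice_features` are assumptions made for illustration; the word2vec conversion itself is not shown, since the embodiment does not fix a particular library or model.

```python
# Hypothetical attribute names, in the order they are spliced.
FIELDS = ("field_code", "field_name", "data_type", "object_class", "property")


def splice_features(record, fields=FIELDS, separator=" "):
    """Join the available attribute values of one metadata record into text."""
    parts = [str(record[name]) for name in fields if record.get(name)]
    return separator.join(parts)


record = {
    "field_code": "F001",
    "field_name": "user_id",
    "data_type": "varchar",
    "object_class": "user",
}
feature_text = splice_features(record)  # "F001 user_id varchar user"
```

Missing or empty attributes are skipped, so each record still yields a single feature string that can be passed to the vector conversion step.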
At least two hyperplanes are constructed, each hyperplane also being a two-dimensional embedding. The hyperplanes are determined based on the number of metadata vectors and may be constructed randomly; for example, a number is randomly selected from a range as a slope or an intercept, and the hyperplane is constructed based on that slope or intercept.
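As a hedged sketch of random construction, the following Python draws each hyperplane as a tuple of random coordinates. This substitutes a common random-projection convention (Gaussian coordinates, one tuple per hyperplane) for the slope/intercept selection mentioned above, so it should be read as one possible variant rather than the embodiment's procedure; the function name `random_hyperplanes` and the `seed` parameter are hypothetical.

```python
import random


def random_hyperplanes(count, dim, seed=None):
    """Return `count` hyperplanes, each a tuple of `dim` random coordinates."""
    if count < 2:
        raise ValueError("at least two hyperplanes are required")
    rng = random.Random(seed)  # a fixed seed makes the construction repeatable
    return [
        tuple(rng.gauss(0.0, 1.0) for _ in range(dim))
        for _ in range(count)
    ]


planes = random_hyperplanes(count=6, dim=3, seed=42)
```

Fixing the seed matters in practice: the same hyperplanes must be used when building the buckets and when hashing later, otherwise the hash values are not comparable.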
A vector dot product is calculated between each of the metadata vectors V1-Vn and each of the m hyperplanes to obtain the dot product results. If a dot product result is greater than 0, it is recorded as 1, representing a similarity of 1; if a dot product result is not greater than 0, it is recorded as 0, representing a similarity of 0.
The dot product results corresponding to each metadata vector are ordered according to the preset hyperplane sequence to obtain the hash value corresponding to that metadata vector. The metadata vectors with identical hash values are placed into one category to obtain the vector classification result.
The metadata vectors of the same category are stored in the same data bucket, which yields the association between metadata vectors and data buckets, and the mapping relationship between the metadata corresponding to the metadata vectors and the data buckets is stored using the hash map.
When consistency analysis needs to be performed on the metadata to be analyzed, the target data bucket in which the metadata to be analyzed is located is looked up in the hash map, and the target metadata vectors in the target data bucket are then obtained. Similarity calculation is performed between the metadata to be analyzed and each target metadata vector to obtain the similarity data corresponding to each target metadata vector. The similarity data is sorted, and the N top-ranked target metadata vectors are taken as the consistency analysis result of the metadata to be analyzed or used for similar data recommendation.
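The steps of this alternative embodiment can be put together in one compact Python sketch. It is a sketch under stated assumptions, not a definitive implementation: the function names, the three fixed hyperplanes, and the example vectors are hypothetical; cosine similarity is chosen from the measures the description permits; and excluding the queried metadata from its own result list is a design choice made here.

```python
import math


def signature(vector, hyperplanes):
    """Hash value: one 0/1 digit per hyperplane, in the preset order."""
    return "".join(
        "1" if sum(a * b for a, b in zip(vector, plane)) > 0 else "0"
        for plane in hyperplanes
    )


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def build_index(vectors, hyperplanes):
    """Classify vectors by hash value and record metadata -> bucket."""
    hash_map, buckets = {}, {}
    for name, vector in vectors.items():
        bucket_id = signature(vector, hyperplanes)
        hash_map[name] = bucket_id
        buckets.setdefault(bucket_id, []).append(name)
    return hash_map, buckets


def analyze(name, vectors, hash_map, buckets, top_n=3):
    """Rank the other members of the target bucket by similarity to `name`."""
    candidates = [other for other in buckets[hash_map[name]] if other != name]
    scored = [(other, cosine(vectors[name], vectors[other])) for other in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical hyperplanes and metadata vectors.
planes = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
vectors = {"a": (1, 1, 1), "b": (2, 2, 2.5), "c": (1, 2, 1), "d": (-1, 1, 1)}
hash_map, buckets = build_index(vectors, planes)
ranking = analyze("a", vectors, hash_map, buckets)
```

In this example `d` falls into a different bucket than `a`, so it is never compared with `a`; only `b` and `c` are scored, which is where the reduction in computation comes from.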
Please refer to fig. 3, which illustrates a schematic structure diagram of a metadata analysis apparatus according to an embodiment of the present application; the embodiment of the application provides a metadata analysis device 200, which comprises:
An obtaining module 210, configured to obtain metadata to be analyzed;
The query data bucket module 220 is configured to query, based on a preset hash map, a target data bucket where the metadata to be analyzed is located; the hash map stores the mapping relationship between metadata and data buckets; the mapping relationship is obtained by classifying the metadata vectors corresponding to the metadata and storing the metadata of the same class into the same data bucket;
And the analysis module 230 is configured to perform similarity calculation on metadata to be analyzed and a target metadata vector in a target data bucket, so as to obtain a data analysis result.
Optionally, in the embodiment of the present application, the metadata analysis device further includes a data mapping module, configured to perform vector conversion on metadata collected in advance, to obtain metadata vectors corresponding to each metadata; obtaining at least two pre-constructed hyperplanes; classifying the metadata vector by using a hyperplane to obtain a vector classification result; based on the vector classification result, the mapping relation between the metadata and the data bucket is stored by utilizing the hash map.
Optionally, in the metadata analysis device of the embodiment of the present application, the data mapping module is further configured to perform dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector; and to classify the metadata vectors based on the hash values corresponding to the metadata vectors to obtain a vector classification result.
Optionally, in the metadata analysis device of the embodiment of the present application, the data mapping module is further configured to perform dot product calculation on the metadata vector and each hyperplane respectively using a vector dot product formula, so as to obtain a dot product result of the metadata vector and each hyperplane; and to order the dot product results corresponding to the metadata vector according to a preset hyperplane sequence to obtain the hash value corresponding to the metadata vector.
Optionally, in the metadata analysis device of the embodiment of the present application, the metadata vector includes a vector abscissa, a vector ordinate, and a vector vertical coordinate; the hyperplane includes a hyperplane abscissa, a hyperplane ordinate, and a hyperplane vertical coordinate; the vector dot product formula includes:
Vn·Sn=X1·X2+Y1·Y2+Z1·Z2
Where Vn is the metadata vector; Sn is the hyperplane; Vn·Sn is the dot product result; X1 is the vector abscissa; Y1 is the vector ordinate; Z1 is the vector vertical coordinate; X2 is the hyperplane abscissa; Y2 is the hyperplane ordinate; Z2 is the hyperplane vertical coordinate.
Optionally, in the metadata analysis device of the embodiment of the present application, the data mapping module is further configured to store metadata vectors of the same class into the same data bucket based on the vector classification result; and to store the mapping relationship between the metadata corresponding to the metadata vectors and the data buckets using the hash map.
Optionally, in the metadata analysis device of the embodiment of the present application, the analysis module 230 is specifically configured to obtain the target metadata vectors in the target data bucket; perform similarity calculation on the metadata to be analyzed and each target metadata vector to obtain similarity data corresponding to each target metadata vector; and sort the similarity data and obtain the data analysis result based on the similarity sorting result.
It should be understood that the apparatus corresponds to the above metadata analysis method embodiment and is capable of executing the steps involved in that method embodiment. For the specific functions of the apparatus, reference may be made to the description above; detailed descriptions are omitted here as appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in memory in the form of software or firmware, or built into an operating system (OS) of the apparatus.
Please refer to fig. 4, which illustrates a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 300 provided in an embodiment of the present application includes: a processor 310 and a memory 320, the memory 320 storing machine-readable instructions executable by the processor 310, which when executed by the processor 310 perform the method as described above.
The embodiment of the application also provides a storage medium, wherein a computer program is stored on the storage medium, and the computer program is executed by a processor to execute the method.
The storage medium may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as static random access memory (Static Random Access Memory, SRAM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), programmable read-only memory (Programmable Read-Only Memory, PROM), read-only memory (Read-Only Memory, ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
In the embodiments of the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The foregoing description is merely an optional implementation of the embodiment of the present application, but the scope of the embodiment of the present application is not limited thereto, and any person skilled in the art may easily think about changes or substitutions within the technical scope of the embodiment of the present application, and the changes or substitutions are covered by the scope of the embodiment of the present application.

Claims (10)

1. A method of metadata analysis, comprising:
Acquiring metadata to be analyzed;
Querying, based on a preset hash map, a target data bucket where the metadata to be analyzed is located; the hash map stores the mapping relationship between metadata and data buckets; the mapping relationship is obtained by classifying the metadata vectors corresponding to the metadata and storing the metadata vectors of the same class into the same data bucket;
and performing similarity calculation on the metadata to be analyzed and the target metadata vector in the target data bucket to obtain a data analysis result.
2. The method of claim 1, wherein prior to querying the target data bucket in which the metadata to be analyzed is located based on the preset hash map, the method further comprises:
vector conversion is carried out on the metadata acquired in advance, and the metadata vector corresponding to each piece of metadata is obtained;
obtaining at least two pre-constructed hyperplanes;
classifying the metadata vector by utilizing the hyperplane to obtain a vector classification result;
And storing the mapping relation between the metadata and the data bucket by utilizing the hash map based on the vector classification result.
3. The method of claim 2, wherein classifying the metadata vector using the hyperplane to obtain a vector classification result comprises:
performing dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector;
And classifying the metadata vector based on the hash value corresponding to the metadata vector to obtain the vector classification result.
4. The method according to claim 2, wherein performing dot product calculation on each metadata vector and each hyperplane to obtain a hash value corresponding to each metadata vector includes:
Respectively carrying out dot product calculation on the metadata vector and each hyperplane by using a vector dot product formula to obtain a dot product result of the metadata vector and each hyperplane;
And sequencing the dot product results corresponding to the metadata vectors according to the preset hyperplane sequence to obtain the hash values corresponding to the metadata vectors.
5. The method of claim 4, wherein the metadata vector comprises a vector abscissa, a vector ordinate, and a vector vertical coordinate; the hyperplane comprises a hyperplane abscissa, a hyperplane ordinate, and a hyperplane vertical coordinate; the vector dot product formula includes:
Vn·Sn=X1·X2+Y1·Y2+Z1·Z2
Where Vn is the metadata vector; Sn is the hyperplane; Vn·Sn is the dot product result; X1 is the vector abscissa; Y1 is the vector ordinate; Z1 is the vector vertical coordinate; X2 is the hyperplane abscissa; Y2 is the hyperplane ordinate; Z2 is the hyperplane vertical coordinate.
6. The method of claim 2, wherein storing the mapping relationship of the metadata and the data bucket using the hash map based on the vector classification result comprises:
Storing the metadata vectors of the same category into the same data bucket based on the vector classification result;
and storing the mapping relation between the metadata corresponding to the metadata vector and the data bucket by using the hash map.
7. The method according to any one of claims 1-6, wherein the performing similarity calculation on the metadata to be analyzed and the target metadata vector in the target data bucket to obtain a data analysis result includes:
Obtaining the target metadata vector in the target data bucket;
Performing similarity calculation on the metadata to be analyzed and each target metadata vector to obtain similarity data corresponding to each target metadata vector;
And sequencing the similarity data, and obtaining the data analysis result based on the similarity sequencing result.
8. A metadata analysis apparatus, comprising:
The acquisition module is used for acquiring metadata to be analyzed;
The query data bucket module is used for querying, based on a preset hash map, a target data bucket where the metadata to be analyzed is located; the hash map stores the mapping relationship between metadata and data buckets; the mapping relationship is obtained by classifying metadata vectors corresponding to the metadata and storing the metadata of the same class into the same data bucket;
And the analysis module is used for carrying out similarity calculation on the metadata to be analyzed and the target metadata vector in the target data bucket to obtain a data analysis result.
9. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor to perform the method of any one of claims 1 to 7 when executed by the processor.
10. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the method according to any of claims 1 to 7.
CN202410033602.0A 2024-01-09 2024-01-09 Metadata analysis method and device, electronic equipment and storage medium Pending CN118069637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410033602.0A CN118069637A (en) 2024-01-09 2024-01-09 Metadata analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410033602.0A CN118069637A (en) 2024-01-09 2024-01-09 Metadata analysis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118069637A true CN118069637A (en) 2024-05-24

Family

ID=91104692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410033602.0A Pending CN118069637A (en) 2024-01-09 2024-01-09 Metadata analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118069637A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination