CN110795524A - Main data mapping processing method and device, computer equipment and storage medium - Google Patents
Main data mapping processing method and device, computer equipment and storage medium
- Publication number
- CN110795524A
- Authority
- CN
- China
- Prior art keywords
- data
- target data
- similarity
- text information
- identification information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a master data mapping processing method and device, computer equipment, and a storage medium. The method comprises: obtaining target data to be mapped, wherein the target data comprises first attribute files; calculating first similarity values between a plurality of first attribute files and a plurality of second attribute files of preset reference data, wherein the first attribute files are mapped one to one with the second attribute files, so as to obtain a plurality of first similarity values; generating, from the plurality of first similarity values, a total similarity value representing the degree of coincidence between the target data and the reference data; and mapping the target data according to the total similarity value. Because similarity is calculated separately for each attribute file before the total similarity is calculated, the resulting total similarity value is more objective, manual interference is eliminated, and mapping is more convenient and faster.
Description
Technical Field
The application relates to the technical field of enterprise informatization data processing, in particular to a main data mapping processing method and device, computer equipment and a storage medium.
Background
Master data are the high-value core business entities that can be shared across business units within an enterprise, and are the enterprise's key data, for example: personnel, products, suppliers, and materials. Master data management helps an enterprise establish a single view of its master data and share that data.
Master data management integrates the master data of each business system of the enterprise and then performs data governance. An important technical means of data governance is master data mapping, whose purpose is to find two or more duplicate or suspected-duplicate records, screen and modify them, and establish a cross-reference relationship with the standard master data, thereby improving the quality of the shared master data.
Existing master data mapping techniques mainly fall into two types: first, writing SQL that relies on database capabilities, removing duplicate data by using 'leave', 'LIKE', or a specific function in a WHERE clause, and, after manual comparison, updating the mapping relationship directly with further SQL; second, manually deduplicating and comparing records with tools such as Excel, establishing a mapping relationship with the standard master data, and then importing the data directly into the system.
The two conventional schemes have the following defects: 1) they overlook that judging data similarity combines business and technology and requires both technical and business means; typically, after duplicates are removed by technical means, business personnel still need to audit the result to decide whether deduplication or modification is required; 2) using only traditional database capabilities, it is difficult to find suspected-duplicate master data records that are worded differently but mean the same thing, such as supplier addresses: "Shenyang City" and "Shenyang City, Liaoning Province" are the same address; 3) determining whether two master data records are similar sometimes requires a comprehensive comparison of the contents of multiple attribute fields rather than a single attribute.
Disclosure of Invention
Based on the above problems, the present application discloses a method, an apparatus, a computer device and a storage medium for main data mapping processing, which employ a computer to perform objective, accurate and fast similarity identification and data mapping on multiple data and multiple attribute files.
According to a first aspect, an embodiment of the present application provides a master data mapping processing method, including:
acquiring target data to be mapped, wherein the target data comprises a first attribute file;
calculating first similarity values of the first attribute files and second attribute files of preset reference data, wherein the number of the first attribute files is multiple, the number of the second attribute files is multiple, and the first attribute files and the second attribute files are mapped one by one so as to obtain the first similarity values;
generating a total similarity value representing the coincidence degree of the target data and the reference data according to the plurality of first similarity values;
and mapping the target data according to the total similarity value.
Optionally, the first attribute file includes first identification information and first text information, the first identification information and the first text information are mapped one to one, the second attribute file includes second identification information and second text information, the second identification information and the second text information are mapped one to one, where the first identification information is a type parameter representing the target data, and the second identification information is a type parameter representing the reference data; the method for calculating the first similarity value between the first attribute file and the second attribute file of the preset reference data comprises the following steps:
extracting first text information and second text information which are respectively mapped by first identification information and second identification information with the same type parameters;
and comparing the extracted first text information with the extracted second text information to obtain the first similarity value.
Optionally, the method for comparing the extracted first text information with the extracted second text information to obtain the first similarity value includes:
calling a rule database, and searching and determining a comparison rule matched with the first identification information in the rule database;
and calculating a first similarity value of the first text information and the second text information according to the comparison rule.
Optionally, the method for generating an overall similarity value representing the degree of coincidence between the target data and the reference data according to the plurality of first similarity values includes:
acquiring a weight value mapped by the first identification information;
multiplying the weight value mapped by the first identification information with the corresponding first similarity value to obtain a second similarity value;
and adding the second similarity values corresponding to all the first identification information contained in the target data to obtain the total similarity value.
Optionally, the comparison rule includes an equality algorithm, where the equality algorithm determines whether the first text information and the second text information are identical.
Optionally, the comparison rule includes a similarity algorithm, where the similarity algorithm is to determine a probability of similarity between the first text information and the second text information.
Optionally, when there are a plurality of target data, the method for mapping the target data according to the total similarity value includes:
sorting the target data by the magnitude of their total similarity values;
according to the sorting result, extracting the target data whose total similarity value is greater than or equal to a preset threshold value to generate a similar data list;
and selecting one or more target data from the similar data list and mapping the target data and the reference data in association.
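The sorting and threshold steps above can be illustrated with a minimal Python sketch. The `Candidate` record, the function name `build_similar_list`, the codes, the scores, and the threshold of 0.75 are hypothetical and chosen only for illustration; the application does not prescribe a data structure or a threshold value.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One piece of target data with its already-computed total similarity (hypothetical shape)."""
    code: str
    total_similarity: float


def build_similar_list(candidates, threshold):
    """Sort by total similarity (highest first) and keep entries at or above the threshold."""
    ranked = sorted(candidates, key=lambda c: c.total_similarity, reverse=True)
    return [c for c in ranked if c.total_similarity >= threshold]


# Illustrative values only.
candidates = [
    Candidate("AS123", 0.92),
    Candidate("AS456", 0.40),
    Candidate("AS789", 0.75),
]
similar = build_similar_list(candidates, threshold=0.75)
```

A user would then pick one or more entries from `similar` to associate with the reference data; that selection step is left manual here, as in the text.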
In another aspect, the present application discloses a master data mapping processing apparatus, including:
an acquisition module: configured to perform obtaining target data to be mapped, wherein the target data comprises a first property file;
a first calculation module: configured to perform calculating first similarity values of the first attribute file and a second attribute file of preset reference data, wherein the first attribute files and the second attribute files are mapped one by one, so as to obtain the first similarity values;
a second calculation module: configured to perform generating an overall similarity value characterizing a degree of coincidence of the target data with the reference data from a plurality of the first similarity values;
an execution module: is configured to perform a mapping process on the target data according to the total similarity value.
Optionally, the first attribute file includes first identification information and first text information, the first identification information and the first text information are mapped one to one, the second attribute file includes second identification information and second text information, the second identification information and the second text information are mapped one to one, where the first identification information is a type parameter representing the target data, and the second identification information is a type parameter representing the reference data; the first computing module includes:
an extraction module: configured to perform extracting first text information and second text information to which first identification information and second identification information having the same type parameter are respectively mapped;
a first comparison module: configured to perform a comparison of the extracted first and second text information to obtain the first similarity value.
Optionally, the first comparison module includes:
a rule matching module: configured to perform calling a rule database, and searching the rule database to determine a comparison rule matching the first identification information;
a first calculation submodule: configured to perform calculating a first similarity value of the first text information and the second text information according to the comparison rule.
Optionally, the second computing module includes:
a weight acquisition module: configured to perform obtaining a weight value to which the first identification information is mapped;
a product module: configured to perform multiplying the weight value mapped by the first identification information by the corresponding first similarity value to obtain a second similarity value;
a second calculation submodule: configured to perform adding the second similarity values corresponding to all first identification information contained in the target data to obtain the total similarity value.
Optionally, the comparison rule includes an equality algorithm, where the equality algorithm determines whether the first text information and the second text information are identical.
Optionally, the comparison rule includes a similarity algorithm, where the similarity algorithm is to determine a probability of similarity between the first text information and the second text information.
Optionally, when there are a plurality of target data, the executing module includes:
a sorting module: configured to perform sorting the target data by the magnitude of their total similarity values;
a list generation module: configured to perform extracting, according to the sorting result, the target data whose total similarity value is greater than or equal to a preset threshold value to generate a similar data list;
a mapping module: is configured to perform a selection of one or more of the target data from the similar data list to associate a mapping with the reference data.
Embodiments of the present application also provide, according to the third aspect, a computer device, which includes a memory and a processor, where the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, cause the processor to execute the steps of the main data mapping processing method.
Embodiments of the present application also provide, according to a fourth aspect, a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the master data mapping processing method described above.
The beneficial effects of the embodiments of the present application are as follows: the application discloses a master data mapping processing method and device, computer equipment, and a storage medium. By obtaining target data to be mapped, the attribute file information in the target data can be identified and compared with the attribute file information in the reference data to calculate a similarity value, which assists the user in data mapping and makes master data mapping more convenient and faster. Because different attribute files have different characteristics, similarity is calculated separately for each attribute file before the total similarity is calculated, so that the resulting total similarity value is more objective and manual interference is eliminated.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a master data mapping processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for calculating a first similarity value according to an embodiment of the present application;
FIG. 3 shows a specific embodiment of various attribute files of the present application;
FIG. 4 is a diagram illustrating a method for obtaining a first similarity value according to text information according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a method for calculating a total similarity value according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a total similarity calculation process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a target master data mapping processing method according to an embodiment of the present application;
FIG. 8 is a schematic overall flow chart of target master data mapping according to an embodiment of the present application;
FIG. 9 is a data audit display interface according to an embodiment of the present application;
FIG. 10 is a data similarity and mapping report display interface according to an embodiment of the present application;
FIG. 11 is a block diagram of a master data mapping processing apparatus according to an embodiment of the present application;
FIG. 12 is a block diagram of the basic structure of a computer device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
With the development of computers, many tasks that were originally completed manually are now performed by computers, which carry out operations and information processing according to specified rules and therefore have a low error rate and high speed. Based on this characteristic of computers, the present application discloses a master data mapping processing method which, referring to FIG. 1, specifically includes:
S1000, obtaining target data to be mapped, wherein the target data comprises a first attribute file;
S2000, calculating first similarity values of the first attribute files and second attribute files of preset reference data, wherein there are a plurality of first attribute files and a plurality of second attribute files, and the first attribute files and the second attribute files are mapped one by one to obtain the first similarity values;
In the present application, the target data to be mapped is any data that needs to be compared for similarity with, and mapped to, reference data. The reference data is master data: a standard information entry format entered in advance for each product. It comprises a plurality of attribute files, each of which includes identification information and text information mapped one to one. The entry rule of the target data is generally the same as that of the reference data, so the target data also includes a plurality of attribute files, each with identification information and corresponding text information. For the purpose of distinction, in the present application an attribute file of the target data is referred to as a first attribute file, the identification information under the first attribute file is referred to as first identification information, and the text information mapped by the first identification information is referred to as first text information; an attribute file of the reference data is referred to as a second attribute file, the identification information under the second attribute file is referred to as second identification information, and the text information mapped by the second identification information is referred to as second text information.
In an embodiment, referring to fig. 2, the first attribute file includes first identification information and first text information, the first identification information and the first text information are mapped one to one, the second attribute file includes second identification information and second text information, and the second identification information and the second text information are mapped one to one, where the first identification information is a type parameter representing the target data, and the second identification information is a type parameter representing the reference data; the method for calculating the first similarity value between the first attribute file and the second attribute file of the preset reference data comprises the following steps:
S2100, extracting first text information and second text information respectively mapped by first identification information and second identification information with the same type parameters;
S2200, comparing the extracted first text information with the second text information to obtain the first similarity value.
In the present application, an application scenario of the master data mapping processing may be as follows: for a plurality of pieces of target data to be mapped, the first identification information in each piece of target data is compared in turn with the second identification information in the reference data; where the two are the same, the first text information of the target data is compared with the second text information of the reference data to obtain a first similarity. Take master data mapping in the management of medicine data as an example. Medicine management data usually records items such as the name, code, model, manufacturer, and production date of a medicine. These types of information are called identification information, the specific content under each piece of identification information is called text information, and the identification information together with its text information is called an attribute file. One piece of target data to be mapped contains a plurality of types, and specific information is mapped under each type. Whether the target data and the reference data are the same or similar is judged from the similarity between the first text information and the second text information under each shared type, and this judgment serves as the basis for deciding whether a mapping relationship can be established between the target data and the reference data.
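As a minimal illustration of steps S2100 and S2200, the sketch below pairs the text information of target data and reference data by shared identification information. It assumes, purely for illustration, that each record is held as a Python dictionary mapping identification information to text information; the function name `pair_text_by_identifier` and the sample values are hypothetical.

```python
def pair_text_by_identifier(target, reference):
    """Return {identifier: (first_text, second_text)} for identifiers present in both records."""
    return {
        key: (target[key], reference[key])
        for key in target
        if key in reference
    }


# Illustrative records: identification information -> text information.
target = {"name": "penicillin", "model": "X-1", "complaints": "3"}
reference = {"name": "benzylpenicillin", "model": "X-1", "manufacturer": "ACME"}

pairs = pair_text_by_identifier(target, reference)
```

Only the identifiers found in both records ("name" and "model" here) are paired; each pair would then be passed to a comparison rule to obtain a first similarity value.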
In an embodiment, different identification information differs in importance and in recognition rule when judging whether the target data is similar to the reference data. For example, referring to FIG. 3, which shows a specific application, when the identification information of an attribute file is "other information" and the text information mapped by it records only information irrelevant to distinguishing the product in substance, such as complaint content and the number of complaints, the target data to which the attribute file belongs cannot be considered different from the reference data even if that text information differs; this type of attribute file may be referred to as an invalid attribute file. Attribute files other than invalid attribute files are valid attribute files, that is, data that influence the mapping judgment, such as the code, product name, model, manufacturer name, manufacturer address, and affiliated business system. When the model information in the target data differs from the model information in the reference data, the two can be regarded as different data even if the product name, the manufacturer, and other information are the same. When the text information under the other attribute files in the target data is the same as that under the same attribute files in the reference data and only the product names differ, the target data and the reference data may be the same or different. Taking drug management as an example, a drug may have a scientific name, a Chinese name, and an English name; "penicillin", for instance, also has the English name "benzylpenicillin", so "penicillin" and "benzylpenicillin" in the product name actually describe the same thing.
Where only the drug name and the code differ and the text information under the other identification information is the same, the two records are associated data and should be mapped. For example, the target data in FIG. 3, whose code is AS123, matches the reference data with standard code MD12345 in product model, manufacturer address, and other information, so the two can be regarded as the same data and have a mapping relationship. For unrelated names, such as "penicillin" and "amoxicillin", the names denote different products and no mapping relationship can be established. In addition, to facilitate commodity management, a manufacturer assigns different codes to different products during production, for example A123 for a penicillin product and B123 for an amoxicillin product; therefore, when the codes are found to differ, the records can be regarded as different data even without examining the other attribute files.
Based on the difference between the attribute files, in the application, the similarity judgment needs to be performed on the text information under each corresponding identification information in the target data and the reference data respectively, so as to identify whether the target data and the reference data are the same or similar as a whole, and the similarity of the text information corresponding to a single attribute file is called as a first similarity.
In an embodiment, referring to FIG. 4, the method for comparing the extracted first text information and the extracted second text information to obtain the first similarity value includes:
S2210, calling a rule database, and searching and determining a comparison rule corresponding to the first identification information in the rule database;
S2220, calculating a first similarity value of the first text information and the second text information according to the comparison rule.
Different first identification information has different characteristics. For some first identification information, the first text information can be regarded as the same as the second text information only if the two are completely identical; for other first identification information, the first text information can be regarded as the same as the second text information through matching of near-synonyms or associated words even if the two are not identical. For example, data under attribute files such as "model" and "code" is judged by a rule that checks whether the text information is completely identical: if so, the two are the same; otherwise, they are different. When the first text information under the first identification information "product name" is identified, a similarity rule can be adopted instead: the first text information under the product name is compared with the similar names in a preset name database to judge whether the two are the same. When the first text information under the first identification information "manufacturer address" is identified, note that addresses are divided into levels such as country, province, city, district (county), town, village, and group. The higher the level, the more likely two addresses are to share it; the lower the level, the less likely. Consequently, when address information at a lower level is the same, the addresses are more likely to be the same.
However, the address information that is entered may be incomplete: for example, only the urban district is entered, or only the town is entered without the corresponding urban district, even though the town, the urban district, and the province are related. A keyword recognition method can therefore be used to determine whether the data under this item is the same. For example, the entered address is checked for keywords such as province, city, district (county), county, town, village, group, road, and street. When such keywords are present, it is determined whether the word before each keyword is the same as that in the reference data. If it is, it is then determined whether the keyword is a road or a street. If the keyword is not a road or a street, the entered address area is too broad to be recognized; if the keyword is a road or a street and the text information before the keyword is the same, the addresses can be determined to be the same.
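The keyword recognition described above might be sketched as follows. This is a simplified illustration under several assumptions that are not in the application: addresses are written in English and split on whitespace, each place name is a single word directly before its keyword, and a mismatch on any shared keyword yields "different". The names `LEVEL_KEYWORDS`, `FINE_KEYWORDS`, `keyword_parts`, and `compare_addresses` are hypothetical.

```python
# Hypothetical keyword sets; the application lists province, city, district (county),
# county, town, village, group, road and street as examples.
LEVEL_KEYWORDS = {"province", "city", "district", "county", "town", "village", "road", "street"}
FINE_KEYWORDS = {"road", "street"}


def keyword_parts(address):
    """Map each level keyword to the word before it, e.g. 'Shenyang City' -> {'city': 'shenyang'}."""
    words = address.lower().replace(",", " ").split()
    return {w: words[i - 1] for i, w in enumerate(words) if w in LEVEL_KEYWORDS and i > 0}


def compare_addresses(first, second):
    """Return 'same', 'different', or 'undetermined' for two address strings."""
    a, b = keyword_parts(first), keyword_parts(second)
    shared = a.keys() & b.keys()
    # Assumption: any shared keyword with a different preceding word means different addresses.
    if any(a[k] != b[k] for k in shared):
        return "different"
    # Matching at road or street level is treated as the same address.
    if any(k in FINE_KEYWORDS for k in shared):
        return "same"
    # Only coarse levels matched (or nothing matched): the area is too broad to decide.
    return "undetermined"
```

For instance, "Shenyang City" and "Liaoning Province Shenyang City" agree only at city level, so this sketch reports them as undetermined rather than the same; a real rule would also need to handle multi-word place names and Chinese-language addresses.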
Therefore, different first identification information has different calculation rules when calculating the first similarity value, and to calculate the similarity of the target data more objectively, different comparison calculation strategies, that is, different calculation rules, need to be adopted according to different identification information. In the application, a rule database is established, each identification information is mapped to different similarity comparison rules, when a first similarity value needs to be calculated, the rule database is called first, and then the comparison rules corresponding to the attribute files are matched in the rule database.
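One way to picture the rule database is as a lookup table from identification information to a comparison function. The sketch below is illustrative only: the table `RULES`, the function names, and the choice of Python's `difflib.SequenceMatcher` as the similarity algorithm are assumptions, since the application does not name a specific similarity measure; falling back to exact matching for unknown identifiers is likewise an assumption.

```python
from difflib import SequenceMatcher


def exact_match(first_text, second_text):
    """Equality rule: 1.0 when the texts are identical, otherwise 0.0."""
    return 1.0 if first_text == second_text else 0.0


def fuzzy_match(first_text, second_text):
    """Similarity rule (assumed measure): character-level ratio in the range 0.0 to 1.0."""
    return SequenceMatcher(None, first_text, second_text).ratio()


# Hypothetical in-memory stand-in for the rule database.
RULES = {
    "code": exact_match,
    "model": exact_match,
    "name": fuzzy_match,
    "address": fuzzy_match,
}


def first_similarity(identifier, first_text, second_text):
    """Look up the comparison rule for an identifier and apply it to the two texts."""
    rule = RULES.get(identifier, exact_match)  # assumption: default to exact matching
    return rule(first_text, second_text)
```

Keeping the rules in a table means a new attribute type can be supported by adding one entry rather than changing the comparison code.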
S3000, generating a total similarity value representing the coincidence degree of the target data and the reference data according to the plurality of first similarity values;
Since the target data has a plurality of attribute files, the total similarity value can be obtained by directly adding the first similarity values calculated for each identification information in step S2000.
In the process of determining the total similarity value of the whole target data, in order to represent the total similarity value more objectively, weights are applied according to the characteristics of each identification information and its importance within the whole target data. Referring to FIG. 5, the method for generating a total similarity value representing the degree of coincidence between the target data and the reference data according to the plurality of first similarity values includes:
S3100, acquiring a weight value mapped by the first identification information;
S3200, multiplying the weight value mapped by the first identification information by the corresponding first similarity value to obtain a second similarity value;
S3300, adding the second similarity values corresponding to all the first identification information contained in the target data to obtain the total similarity value.
A weight value is set for each piece of first identification information such that the weight values of all first identification information in the target data sum to 1. The first similarity value of each piece of first identification information is obtained through its matched comparison rule and multiplied by the corresponding weight value to obtain a second similarity value; all second similarity values are then added to obtain the total similarity value of the target data to be mapped. For example, the comparison rules include a similarity algorithm and a congruent (exact-match) algorithm. Referring to fig. 6, suppose that parsing a certain target data yields the first identification information attribute 1, attribute 2, ..., attribute n, and that each is assigned a comparison rule and a weight value: attribute 1 uses the similarity algorithm with weight value B1, attribute 2 uses the congruent algorithm with weight value B2, and attribute n uses the similarity algorithm with weight value Bn. If the first similarity values calculated under these rules are A1, A2, ..., An, and B1 + B2 + ... + Bn = 1, then the total similarity is A1 × B1 + A2 × B2 + ... + An × Bn.
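The weighted sum in steps S3100 to S3300 can be sketched in a few lines. The function name `total_similarity`, the dictionary-based inputs, and the validation checks are assumptions made for illustration; the description only requires that the weights sum to 1 and that each first similarity value is multiplied by its weight before summation.

```python
import math


def total_similarity(first_similarities, weights):
    """Return the sum of A_i * B_i over all identification information.

    first_similarities: {identification: first similarity value A_i}
    weights:            {identification: weight value B_i}, summing to 1
    """
    if set(first_similarities) != set(weights):
        raise ValueError("every attribute needs exactly one weight")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weight values must sum to 1")
    # Second similarity value = weight * first similarity; total = their sum.
    return sum(weights[k] * first_similarities[k] for k in weights)
```

With illustrative values A = (0.9, 1.0, 0.5) and B = (0.5, 0.3, 0.2), the total is 0.45 + 0.30 + 0.10 = 0.85.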
The comparison rule disclosed above includes a congruent (exact-match) algorithm, which judges whether the text information in the attribute file is completely identical to the text information of the corresponding attribute file of the reference data, outputting 1 if they are identical and 0 otherwise. The congruent algorithm can be implemented directly with the "=" operator in the WHERE clause of a database SQL statement, or with an ordinary program.
The comparison rule also includes a similarity algorithm, which judges how similar the text information in the attribute file is to the text information of the corresponding attribute file of the reference data; its output is a numerical value between 0 and 1. The similarity algorithm includes, but is not limited to, the cosine similarity algorithm, and big-data techniques may be used if the data volume is large. In one embodiment that uses the cosine similarity algorithm, the text information is extracted, all words are listed, word segmentation and vectorization are performed, and a cosine function is then used to measure the similarity of the two texts.
Text word segmentation can use open-source tools for Java or Python such as IKAnalyzer, Jcseg, and Jieba. Word segmentation vectorization converts the segmented text into a form that a computer can process. Several schemes exist, such as vectorization in units of characters or words and vectorization in units of sentences. Text vectorization methods in units of characters or words include the set-of-words model, the bag-of-words model, n-gram, TF-IDF, word2vec, and similar algorithms; vectorization methods in units of sentences include LSA, NMF, pLSA, LDA, and similar algorithms.
Taking TF-IDF (Term Frequency-Inverse Document Frequency) as an example, it consists of two parts, TF and IDF. TF is the term frequency: term-frequency vectorization counts how often each word occurs in the text. IDF is the inverse document frequency: a word that appears in almost all texts has a high term frequency but is less important than some words with a low term frequency, so the IDF reflects the importance of a word and corrects a word feature value that is expressed by term frequency alone. It is therefore reasonable to quantify a word as (term frequency × weight of the word), and the calculation rule is TF-IDF(x) = TF(x) × IDF(x), where x is the character or word being counted, TF(x) is its term frequency, and IDF(x) is its weight.
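A small sketch of the rule TF-IDF(x) = TF(x) × IDF(x) follows. The description does not fix a formula for IDF(x), so the smoothed logarithmic form used here, the relative-frequency definition of TF, and the names `tf_idf`, `tokens`, and `corpus` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(tokens, corpus):
    """Return {word: TF(word) * IDF(word)} for one segmented text.

    tokens: list of words of the text being vectorized
    corpus: list of segmented texts (each a list of words) used for the IDF
    """
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / len(tokens)  # relative term frequency in this text
        df = sum(1 for doc in corpus if word in doc)  # texts containing word
        # Smoothed inverse document frequency (one common variant, assumed).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[word] = tf * idf
    return weights
```

A word that occurs in every text receives the lowest weight, while a word that is rare across the corpus is weighted up, which matches the correction the IDF term is meant to provide.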
After word segmentation vectorization, two term-frequency vectors are obtained: the vectorized value X of the first text information and the vectorized value Y of the second text information of the reference data. The cosine value is calculated as follows:

$$\cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}}\,\sqrt{\sum_{i=1}^{n} Y_i^{2}}}$$

where n is the number of words in the first text information, Xi represents the vectorized value of the ith word in the first text information, and Yi represents the vectorized value of the ith word in the second text information in the reference data.
The calculated cos value lies in the range [-1, 1]. When the value is less than 0, it is taken as 0, so the similarity lies in the range [0, 1]: the higher the cos value, the more similar the two texts, and the lower the value, the lower the similarity.
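The cosine measure and the clamping of negative values to 0 can be sketched as below. The function name `cosine_similarity` is hypothetical, and returning 0 when either vector is all zeros is an assumption added to avoid division by zero; the description does not address that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y, clamped to [0, 1]."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        return 0.0  # assumed convention for an all-zero vector
    # Negative cosine values are treated as 0, as described above.
    return max(0.0, dot / norm)
```

Identical vectors score approximately 1, orthogonal vectors score 0, and vectors pointing in opposite directions are clamped to 0.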
And S4000, mapping the target data according to the total similarity value.
After the total similarity value of the target data to be mapped is obtained through step S3000, the mapping process may be performed on the relevant data. The total similarity value is a numerical value for judging whether contents of two or more data are the same or similar, the higher the total similarity value is, the more similar the data between the two are, and the lower the total similarity value is, the larger the difference between the two groups of data is.
The mapping process can be understood as a data auditing process. Data can be audited either manually or by computer batch processing, and the mode is chosen according to the quantity of target data. For example, when many target data have total similarity values greater than or equal to a preset threshold, manual audit mapping is inefficient, so computer batch audit mapping can be used instead. In one embodiment, computer batch audit mapping marks all target data whose total similarity value is greater than or equal to the preset threshold and automatically maps them one by one with the reference data. The preset threshold is the minimum total similarity value at which the target data and the reference data need to be mapped. For example, with the preset threshold set to 98%, a total similarity value greater than or equal to 98% indicates high similarity, and the next mapping step is performed; a total similarity value below 98% indicates that the similarity is not high, that the related parameter information of the two data differs considerably, and that the two are not the same or similar data, so the mapping process may be skipped. When the user chooses not to perform batch audit mapping, the user further chooses whether to map manually: with manual mapping, the user modifies the reference data according to the audit results and associates the mapping by hand; without manual mapping, the computer maps the target data according to the obtained total similarity value.
Referring to fig. 7, the method for mapping the target data according to the total similarity value includes:
S4100, sorting the total similarity values of the target data;
S4200, extracting the target data with the total similarity value greater than or equal to a preset threshold value according to the sorting result to generate a similar data list;
S4300, selecting one or more pieces of target data from the similar data list to be mapped with the reference data in an associated manner.
The total similarity values of all acquired target data are sorted, and a preset threshold is set to separate the data that needs mapping from the data that is not to be mapped. In an embodiment, the target data whose total similarity value is greater than or equal to the preset threshold are listed to generate a similar data list, and only the target data in this list are displayed during mapping. This reduces the amount of data and the workload of subsequent mapping and makes the mapping interface simpler. Because the computer sorts the target data that meets the preset threshold, a user can conveniently select the data to be mapped according to its total similarity value and establish the mapping relationship.
In an embodiment, the total similarity of the target data is calculated through a WEB page. Referring to fig. 8, the overall flow of the mapping method for the target data includes:
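Steps S4100 and S4200 amount to a sort followed by a threshold filter, which might look like the sketch below. The function name `similar_data_list`, the record identifiers, the descending sort order, and the default threshold of 0.98 (taken from the 98% example above) are illustrative assumptions.

```python
def similar_data_list(total_similarities, threshold=0.98):
    """Return (record_id, score) pairs at or above the threshold, best first.

    total_similarities: {record_id: total similarity value in [0, 1]}
    """
    # S4100: sort by total similarity value (descending order is assumed).
    ranked = sorted(
        total_similarities.items(), key=lambda item: item[1], reverse=True
    )
    # S4200: keep only the records that reach the preset threshold.
    return [(record_id, score) for record_id, score in ranked if score >= threshold]
```

The resulting list is what a user or a batch job would then pick from in step S4300 to establish the mapping with the reference data.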
S4310, start: starting to execute the mapping operation on the target data;
S4320, setting a condition: judging whether the total similarity value is greater than or equal to a certain threshold; if the total similarity value is less than the threshold, ending the process directly and entering S4370; if the total similarity value is greater than or equal to the threshold, entering S4330;
S4330, determining whether batch audit mapping is required; if so, entering step S4340, otherwise entering step S4350;
S4340, establishing mapping relations in batch for all target data greater than or equal to the threshold;
S4350, selecting a judgment mode: judging whether manual judgment is needed; if manual judgment is needed, entering step S4351, otherwise entering step S4352;
S4352, sorting the total similarity values;
S4353, modifying and mapping the target data according to the sorted order of the total similarity values;
S4360, if the audited data is correct, entering S4370; if not, returning to S4310 for mapping and screening again;
S4370, auditing, and generating a similarity and mapping report;
S4380, ending the process.
When manual mapping is selected, the specific step is as follows:
S4351, the user searches for the target data that needs to be modified and mapped, modifies and maps it, and on completion proceeds to S4360.
In a specific embodiment, referring to fig. 9, the display interface on which a user audits data shows the first identification information "standard code", "product name", "model", "vendor address", and "other information"; the corresponding data under each, such as "cc", "a", "b", "c", and "ac", are the first text information. Among the unapproved data, the computer automatically matches all target data with a total similarity value greater than 80%, and all of these can be mapped with one click by selecting the "audit pass" button. The display interface also discloses an example view for the case without manual judgment: the computer matches results according to the corresponding algorithm, sorts them by total similarity value, lists the corresponding codes, and displays the matching degree, product name, model, manufacturer address, and other information for each code so that the user can check them and map manually. Alternatively, with "self matching", the user searches for the related data entirely by hand to map and match it.
When the audit is completed, the generated similarity and mapping report is as shown in fig. 10. The exemplary report includes a file name, an upload date, an upload count, an upload failure count, the matching rules and their weight information, and the count for each matching degree together with its distribution chart, bar chart, or tree diagram. The data can also be exported to an EXCEL table through "download report EXCEL".
It should be noted that the generated similarity and mapping report includes various forms such as a text report, a graphic report, a table report, and the like, so as to visually display the related similarity processing result and mapping result.
On the other hand, referring to fig. 11, the present application discloses a master data mapping processing apparatus, including:
an acquisition module: configured to obtain target data to be mapped, wherein the target data comprises a first attribute file;
a first calculation module: configured to calculate first similarity values of the first attribute files and second attribute files of preset reference data, wherein there are a plurality of first attribute files and a plurality of second attribute files, and the first attribute files are mapped one by one with the second attribute files to obtain a plurality of first similarity values;
a second calculation module: configured to generate, from the plurality of first similarity values, a total similarity value characterizing the degree of coincidence of the target data with the reference data;
an execution module: configured to perform a mapping process on the target data according to the total similarity value.
Optionally, the first attribute file includes first identification information and first text information, the first identification information and the first text information are mapped one to one, the second attribute file includes second identification information and second text information, the second identification information and the second text information are mapped one to one, where the first identification information is a type parameter representing the target data, and the second identification information is a type parameter representing the reference data; the first computing module includes:
an extraction module: configured to perform extracting first text information and second text information to which first identification information and second identification information having the same type parameter are respectively mapped;
a first comparison module: configured to perform a comparison of the extracted first and second text information to obtain the first similarity value.
Optionally, the first comparison module includes:
a rule matching module: configured to call a rule database and search the rule database to determine a comparison rule matched with the first identification information;
a first calculation submodule: configured to calculate a first similarity value of the first text information and the second text information according to the comparison rule.
Optionally, the second computing module includes:
a weight acquisition module: configured to perform obtaining a weight value to which the first identification information is mapped;
a product module: configured to multiply the weight value mapped by the first identification information by the corresponding first similarity value to obtain a second similarity value;
a second calculation submodule: configured to add the second similarity values corresponding to all first identification information contained in the target data to obtain the total similarity value.
Optionally, the comparison rule includes a congruent (exact-match) algorithm, where the congruent algorithm determines whether the first text information and the second text information are identical.
Optionally, the comparison rule includes a similarity algorithm, where the similarity algorithm is to determine a probability of similarity between the first text information and the second text information.
Optionally, when there are a plurality of target data, the execution module includes:
a sorting module: configured to sort the total similarity values of the target data by magnitude;
a list generation module: configured to extract, according to the sorting result, the target data with the total similarity value greater than or equal to a preset threshold value to generate a similar data list;
a mapping module: is configured to perform a selection of one or more of the target data from the similar data list to associate a mapping with the reference data.
Since the master data mapping processing apparatus corresponds one-to-one to the master data mapping processing method, its implementation principle is the same as that of the method, and details are not repeated here.
FIG. 12 is a block diagram of a basic structure of a computer device according to an embodiment of the present invention.
The computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions can enable the processor to realize a main data mapping processing method when being executed by the processor. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform a master data mapping processing method. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The computer equipment receives the state information of the prompt behavior sent by the associated client, namely whether the associated terminal starts the prompt or not and whether the borrower closes the prompt task or not. And the relevant terminal can execute corresponding operation according to the preset instruction by verifying whether the task condition is achieved or not, so that the relevant terminal can be effectively supervised. Meanwhile, when the prompt information state is different from the preset state instruction, the server side controls the associated terminal to ring continuously so as to prevent the problem that the prompt task of the associated terminal is automatically terminated after being executed for a period of time.
The present invention also provides a storage medium storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the main data mapping processing method according to any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A master data mapping processing method is characterized by comprising the following steps:
acquiring target data to be mapped, wherein the target data comprises a first attribute file;
calculating first similarity values of the first attribute files and second attribute files of preset reference data, wherein the number of the first attribute files is multiple, the number of the second attribute files is multiple, and the first attribute files and the second attribute files are mapped one by one so as to obtain the first similarity values;
generating a total similarity value representing the coincidence degree of the target data and the reference data according to the plurality of first similarity values;
and mapping the target data according to the total similarity value.
2. The data mapping method according to claim 1, wherein the first attribute file includes first identification information and first text information, the first identification information is mapped with the first text information one by one, the second attribute file includes second identification information and second text information, the second identification information is mapped with the second text information one by one, wherein the first identification information is a type parameter representing the target data, and the second identification information is a type parameter representing the reference data; the method for calculating the first similarity value between the first attribute file and the second attribute file of the preset reference data comprises the following steps:
extracting first text information and second text information which are respectively mapped by first identification information and second identification information with the same type parameters;
and comparing the extracted first text information with the extracted second text information to obtain the first similarity value.
3. The method of claim 2, wherein the comparing the extracted first text message and the extracted second text message to obtain the first similarity value comprises:
calling a rule database, and searching and determining a comparison rule matched with the first identification information in the rule database;
and calculating a first similarity value of the first text information and the second text information according to the comparison rule.
4. The master data mapping processing method according to claim 3, wherein the method of generating a total similarity value representing a degree of coincidence between the target data and the reference data according to the plurality of first similarity values includes:
acquiring a weight value mapped by the first identification information;
multiplying the weight value mapped by the first identification information with the corresponding first similarity value to obtain a second similarity value;
and adding the second similarity values corresponding to all the first identification information contained in the target data to obtain the total similarity value.
5. The master data mapping processing method according to claim 3, wherein the comparison rule includes a congruent (exact-match) algorithm, wherein the congruent algorithm is to determine whether the first text information and the second text information are identical.
6. The method according to any one of claims 3 to 5, wherein the comparison rule includes a similarity algorithm, wherein the similarity algorithm is a method of determining a probability of similarity between the first text information and the second text information.
7. The method according to claim 1, wherein when there are a plurality of target data, the method for mapping the target data according to the total similarity value includes:
sorting the sizes of the total similarity values of the plurality of target data;
according to the sorting result, extracting the target data of which the total similarity value is greater than or equal to a preset threshold value to generate a similar data list;
and selecting one or more target data from the similar data list and mapping the target data and the reference data in association.
8. A data mapping processing apparatus, comprising:
an acquisition module: configured to perform obtaining target data to be mapped, wherein the target data comprises a plurality of first attribute files;
a first calculation module: configured to calculate a first similarity value of each first attribute file with a second attribute file of preset reference data;
a second calculation module: configured to generate, according to the first similarity values, a total similarity value characterizing the degree of coincidence of the target data with the reference data;
an execution module: is configured to perform a mapping process on the target data according to the total similarity value.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the master data mapping processing method according to any of claims 1 to 7.
10. A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the master data mapping processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911052600.1A CN110795524B (en) | 2019-10-31 | 2019-10-31 | Main data mapping processing method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110795524A (en) | 2020-02-14 |
CN110795524B (en) | 2022-07-05 |
Family
ID=69442356
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201911052600.1A | Active | CN110795524B (en) | 2019-10-31 | 2019-10-31 | Main data mapping processing method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110795524B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1780955A1 (en) * | 2005-10-28 | 2007-05-02 | Siemens Aktiengesellschaft | Monitoring method and apparatus of processing of a data stream with high rate/flow |
US20160283906A1 (en) * | 2006-01-13 | 2016-09-29 | CareerBuilder, LLC | Method and system for matching data sets of non-standard formats |
CN102521386A (en) * | 2011-12-22 | 2012-06-27 | 清华大学 | Method for grouping space metadata based on cluster storage |
CN105787933A (en) * | 2016-02-19 | 2016-07-20 | 武汉理工大学 | Water front three-dimensional reconstruction apparatus and method based on multi-view point cloud registration |
CN107203529A (en) * | 2016-03-16 | 2017-09-26 | 中国移动通信集团河北有限公司 | Multi-service correlation analysis method and device based on metadata graph structural similarity |
CN107766881A (en) * | 2017-09-30 | 2018-03-06 | 中国地质大学(武汉) | A kind of method for searching based on fundamental classifier, equipment and storage device |
CN110147487A (en) * | 2017-10-17 | 2019-08-20 | 阿里巴巴集团控股有限公司 | A kind of method and system, the processing equipment of determining object temperature |
CN108737399A (en) * | 2018-05-09 | 2018-11-02 | 桂林电子科技大学 | A kind of Snort alert data polymerizations based on footmark random read take |
CN110377558A (en) * | 2019-06-14 | 2019-10-25 | 平安科技(深圳)有限公司 | Document searching method, device, computer equipment and storage medium |
CN110362601A (en) * | 2019-06-19 | 2019-10-22 | 平安国际智慧城市科技股份有限公司 | Mapping method, device, equipment and the storage medium of metadata standard |
Non-Patent Citations (1)
Title |
---|
Yang Hui et al.: "An editing-rule mining algorithm based on input samples and master data", Computer Systems & Applications (《计算机系统应用》) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807940A (en) * | 2020-06-17 | 2021-12-17 | 马上消费金融股份有限公司 | Information processing and fraud identification method, device, equipment and storage medium |
CN113807940B (en) * | 2020-06-17 | 2024-04-12 | 马上消费金融股份有限公司 | Information processing and fraud recognition method, device, equipment and storage medium |
CN112183574A (en) * | 2020-08-21 | 2021-01-05 | 深圳市银之杰科技股份有限公司 | File authentication and comparison method and device, terminal and storage medium |
CN112183574B (en) * | 2020-08-21 | 2024-05-28 | 深圳市银之杰科技股份有限公司 | File authentication and fake comparison method and device, terminal and storage medium |
CN112001451A (en) * | 2020-08-27 | 2020-11-27 | 上海擎感智能科技有限公司 | Data redundancy processing method, system, medium and device |
CN112506917A (en) * | 2020-10-30 | 2021-03-16 | 福建亿能达信息技术股份有限公司 | Dictionary mapping method, device, system, equipment and medium for main data |
CN112506917B (en) * | 2020-10-30 | 2022-05-10 | 福建亿能达信息技术股份有限公司 | Dictionary mapping method, device, system, equipment and medium for main data |
CN113065088A (en) * | 2021-03-29 | 2021-07-02 | 重庆富民银行股份有限公司 | Data preprocessing method based on feature scaling |
CN115470198A (en) * | 2022-08-11 | 2022-12-13 | 北京百度网讯科技有限公司 | Database information processing method and device, electronic equipment and storage medium |
CN115470198B (en) * | 2022-08-11 | 2023-09-22 | 北京百度网讯科技有限公司 | Information processing method and device of database, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110795524B (en) | 2022-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110795524B (en) | Main data mapping processing method and device, computer equipment and storage medium | |
US10747762B2 (en) | Automatic generation of sub-queries | |
CN111767057B (en) | Data processing method and device | |
US20080162455A1 (en) | Determination of document similarity | |
CN103425687A (en) | Retrieval method and system based on queries | |
CN104679646B (en) | A kind of method and apparatus for detecting SQL code defect | |
CN103262076A (en) | Analytical data processing | |
CN111553151A (en) | Question recommendation method and device based on field similarity calculation and server | |
US20230205755A1 (en) | Methods and systems for improved search for data loss prevention | |
CN111125116B (en) | Method and system for positioning code field in service table and corresponding code table | |
CN109408643B (en) | Fund similarity calculation method, system, computer equipment and storage medium | |
Dakrory et al. | Automated ETL testing on the data quality of a data warehouse | |
CN113722352B (en) | Intelligent data verification method, system and storage medium for price reporting scheme | |
CN113434542B (en) | Data relationship identification method and device, electronic equipment and storage medium | |
CN115422371A (en) | Software test knowledge graph-based retrieval method | |
CN111191430B (en) | Automatic table building method and device, computer equipment and storage medium | |
CN113360517A (en) | Data processing method and device, electronic equipment and storage medium | |
CN117971873A (en) | Method and device for generating Structured Query Language (SQL) and electronic equipment | |
CN117077668A (en) | Risk image display method, apparatus, computer device, and readable storage medium | |
TWI785724B (en) | Method for creating data warehouse, electronic device, and storage medium | |
CN111143356A (en) | Report retrieval method and device | |
CN110941952A (en) | Method and device for perfecting audit analysis model | |
CN115062023A (en) | Wide table optimization method and device, electronic equipment and computer readable storage medium | |
US9208224B2 (en) | Business content hierarchy | |
CN114860759A (en) | Data processing method, device and equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 801-2, Floor 8, Building 3, No. 22 Ronghua Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing. Applicant after: Wanghai Kangxin (Beijing) Technology Co., Ltd. Address before: Room 07, Room 2, Building B, 12 Hongda North Road, Daxing District, Beijing. Applicant before: Beijing Neusoft Wang Hai Technology Co., Ltd. |
GR01 | Patent grant | ||