CN113836131B

CN113836131B - Big data cleaning method and device, computer equipment and storage medium

Info

Publication number: CN113836131B
Application number: CN202111151699.8A
Authority: CN
Inventors: 吴智炜
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2024-02-02
Anticipated expiration: 2041-09-29
Also published as: CN113836131A

Abstract

The application discloses a big data cleaning method, a big data cleaning device, computer equipment and a storage medium, and belongs to the technical field of big data. The method comprises: configuring a mapping relation between service data types and cleaning rules to generate a cleaning rule matching table; constructing a distributed cluster comprising a plurality of sub-servers, each of which processes data of one service type; distributing the cleaning rules to the corresponding sub-servers according to the service data types; determining a target service data type corresponding to original data; searching for a target sub-server corresponding to the target service data type; and cleaning the original data at the target sub-server to obtain cleaning data. In addition, the application relates to blockchain technology, and the original data may be stored in a blockchain. By constructing the distributed cluster, the application realizes automatic cleaning of different types of service data, has strong universality and adaptability, and facilitates unified management of data cleaning rules.

Description

Big data cleaning method and device, computer equipment and storage medium

Technical Field

The application belongs to the technical field of big data, and particularly relates to a big data cleaning method, a big data cleaning device, computer equipment and a storage medium.

Background

Data cleaning is the final procedure for finding and correcting identifiable errors in a data file, including checking data consistency and handling invalid and missing values. Unlike questionnaire review, the cleaning of entered data is typically done by a computer rather than manually.

At present, in data cleaning scenarios, the data volume, data validity period or data management rules of data generated by different business departments or different business scenarios may be completely different. Existing data cleaning schemes generally need to develop corresponding data cleaning rules separately for each business requirement. Such schemes consume considerable manpower and material resources in the development stage of the data cleaning rules and cannot reuse shared data cleaning rules, which wastes development resources and also hinders the management of the data cleaning rules.

Disclosure of Invention

The embodiment of the application aims to provide a big data cleaning method, a big data cleaning device, computer equipment and a storage medium, so as to solve the technical problems that existing big data cleaning schemes cannot reuse common data cleaning rules, which wastes development resources, and that the data cleaning rules are difficult to manage.

In order to solve the above technical problems, the embodiments of the present application provide a big data cleaning method, which adopts the following technical scheme:

a big data cleaning method comprising:

creating a preset number of cleaning rules, configuring a mapping relation between service data types and the cleaning rules, and generating a cleaning rule matching table;

constructing a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server respectively and correspondingly processes one service type data;

uploading the cleaning rule matching table to the distributed cluster, and distributing the cleaning rule to a corresponding sub-server according to the service data type;

receiving original data and determining a target service data type corresponding to the original data;

searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by utilizing the target sub-server to obtain cleaning data.

Further, the step of receiving the original data and determining the target service data type corresponding to the original data specifically includes:

receiving original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data;

and extracting keywords of the field to be cleaned, and determining a target data type corresponding to the original data based on the keywords.

Further, the step of receiving the original data, importing the original data into the distributed cluster, and determining a field to be cleaned in the original data specifically includes:

acquiring a demand document corresponding to the original data, wherein specific requirements for data cleaning are recorded in the demand document;

identifying the data structure of the original data to obtain the structure information of the original data;

dividing the original data based on the structure information to obtain a plurality of data fields;

and carrying out semantic recognition on each data field, and obtaining a field to be cleaned in the original data based on the semantic recognition and the requirement document.

Further, the step of extracting the keywords of the field to be cleaned and determining the target data type corresponding to the original data based on the keywords specifically includes:

carrying out keyword recognition on the fields to be cleaned to obtain keywords of all the fields to be cleaned;

integrating the extracted keywords to generate a keyword combination of the original data;

and determining the type represented by the keyword combination as the target service data type corresponding to the original data.

Further, the step of integrating the extracted keywords to generate a keyword combination of the original data specifically includes:

calculating the weight of each keyword based on a preset TF-IDF algorithm;

sequencing the weights of all the keywords to obtain a keyword weight sequence;

and combining the keywords based on the keyword weight sequence to generate the keyword combination of the original data.

Further, the step of searching the target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaning data specifically includes:

acquiring a service data label corresponding to each sub-server;

in the distributed cluster, comparing the target service data type with the service data labels corresponding to each sub-server one by one;

and determining a target sub-server corresponding to the original data according to the comparison result, and performing data cleaning on the original data through the target sub-server to obtain cleaning data.

Further, the step of performing data cleaning on the original data by the target sub-server to obtain cleaning data specifically includes:

formatting the original data to obtain formatted data;

detecting repeated data in the formatted data, and cleaning the repeated data to obtain duplicate removal data;

and detecting error data in the duplicate removal data, and cleaning the error data to obtain cleaning data.

In order to solve the above technical problems, the embodiment of the present application further provides a big data cleaning device, which adopts the following technical scheme:

a big data cleaning device comprising:

the rule configuration module is used for creating a preset number of cleaning rules, configuring the mapping relation between the service data types and the cleaning rules and generating a cleaning rule matching table;

the cluster construction module is used for constructing a distributed cluster, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server respectively and correspondingly processes one service type data;

the rule distribution module is used for uploading the cleaning rule matching table to the distributed cluster and distributing the cleaning rule to the corresponding sub-server according to the service data type;

the data preprocessing module is used for receiving the original data and determining a target service data type corresponding to the original data;

and the data cleaning module is used for searching a target sub-server corresponding to the target service data type in the distributed cluster, and utilizing the target sub-server to perform data cleaning on the original data to obtain cleaning data.

In order to solve the above technical problems, the embodiments of the present application further provide a computer device, which adopts the following technical schemes:

a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the big data cleaning method as described above.

In order to solve the above technical problems, embodiments of the present application further provide a computer readable storage medium, which adopts the following technical solutions:

a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor perform the steps of the big data cleaning method as described above.

Compared with the prior art, the embodiment of the application has the following main beneficial effects:

The application discloses a big data cleaning method, a big data cleaning device, computer equipment and a storage medium, and belongs to the technical field of big data. The method comprises the steps of constructing a distributed cluster, and configuring corresponding data cleaning rules for all sub-servers in the distributed cluster according to service data types, wherein each sub-server is configured with the data cleaning rules corresponding to the service data types. When service data cleaning is needed, the server distributes the service data to be cleaned to corresponding sub-servers in the distributed cluster for processing by identifying the data type of the service data to be cleaned and according to the data type of the service data to be cleaned. The method and the device have the advantages that the distributed clusters are built to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of the development stage of the data cleaning rules can be effectively reduced, the multiplexing rate of the public data cleaning rules is improved, and the unified management of the data cleaning rules is facilitated.

Drawings

To describe the solution of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings from them without inventive effort.

FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 illustrates a flow chart of one embodiment of a big data cleansing method according to the present application;

FIG. 3 shows a schematic structural view of one embodiment of a big data cleaning device according to the present application;

FIG. 4 shows a schematic structural diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.

As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.

The server 105 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal devices 101, 102, 103, and may be a stand-alone server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.

It should be noted that, the big data cleaning method provided in the embodiments of the present application is generally executed by a server, and accordingly, the big data cleaning device is generally disposed in the server.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow chart of one embodiment of a big data cleaning method according to the present application is shown. The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions. The big data cleaning method comprises the following steps:

S201, creating a preset number of cleaning rules, configuring a mapping relation between service data types and the cleaning rules, and generating a cleaning rule matching table.

Specifically, before the distributed cluster is built, a preset number of cleaning rules are created through a server, wherein the cleaning rules comprise formatting rules, deduplication rules, correction rules and the like. A mapping relation between the service data types and the cleaning rules is then defined according to service scenario requirements. For example, cleaning text data requires at least word segmentation, formatting and deduplication, so at least the corresponding word segmentation rules, formatting rules and deduplication rules need to be configured for text data. Finally, the mapping relations between all the service data types and the cleaning rules are integrated to generate the cleaning rule matching table.

In this embodiment, the cleaning rule matching table is generated by creating a cleaning rule and configuring a mapping relationship between service data types and the cleaning rule, so as to facilitate unified management of data cleaning rules.
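
As a minimal sketch of what such a matching table might look like, the Python snippet below maps service data types to ordered lists of rule names. The rule identifiers, the type names and the `build_matching_table` helper are illustrative assumptions; the application does not prescribe a concrete data structure.

```python
from typing import Dict, List

# Hypothetical rule identifiers; the application names word segmentation,
# formatting, deduplication and correction rules only by category.
CLEANING_RULES = ["segment", "format", "deduplicate", "correct"]


def build_matching_table(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Validate a type -> rules mapping and return it as the matching table."""
    table = {}
    for data_type, rules in mapping.items():
        unknown = [r for r in rules if r not in CLEANING_RULES]
        if unknown:
            raise ValueError(f"unknown rules for {data_type}: {unknown}")
        table[data_type] = list(rules)
    return table


# Shared rules ("format", "deduplicate") are defined once and referenced
# by both service data types.
matching_table = build_matching_table({
    "text": ["segment", "format", "deduplicate"],
    "numerical": ["format", "deduplicate", "correct"],
})
```

Because every type refers to rules by name, a shared rule is written once and reused, which is the kind of rule reuse the application aims at.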

S202, a distributed cluster is constructed, wherein the distributed cluster comprises a plurality of sub-servers, and each sub-server respectively and correspondingly processes one service type data.

The distributed cluster in the application can be built on a Spark cluster architecture, which comprises the open source distributed storage system Tachyon, the open source distributed resource management framework Mesos, the resource manager YARN, the massively parallel query engine BlinkDB, and the like. Tachyon is a memory-based distributed file system that lets tasks share data conveniently and reduces the load on the JVM during computation; Mesos is a cluster manager that uses the coordination service ZooKeeper to achieve cluster fault tolerance; BlinkDB is a massively parallel query engine that improves query response time by trading off data accuracy.

Specifically, the server builds a distributed cluster for data cleaning based on the Spark cluster architecture, wherein the distributed cluster comprises a plurality of sub-servers, each of which processes data of one service type.

S203, uploading the cleaning rule matching table to the distributed cluster, and distributing the cleaning rule to the corresponding sub-server according to the service data type.

Specifically, after the server completes the basic construction of the distributed cluster, the server uploads the cleaning rule matching table to the distributed cluster, and distributes the corresponding cleaning rule on the cleaning rule matching table to the corresponding sub-server according to the service data type. For example, in a specific embodiment of the present application, the cleansing rule corresponding to the text data is assigned to the sub-server a, and the cleansing rule corresponding to the numerical data is assigned to the sub-server B. In a more specific embodiment of the present application, the service data is policy service data, a cleaning rule corresponding to text data in the policy service data is allocated to the sub-server A1, and a cleaning rule corresponding to numerical data in the policy service data is allocated to the sub-server B1.

In the embodiment, the distributed cluster is constructed, and corresponding data cleaning rules are configured for each sub-server in the distributed cluster according to the service data types, so that different types of service data cleaning is realized, and the method has strong universality and adaptability.
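
The distribution step can be sketched as follows. The `SubServer` class and the `distribute_rules` function are hypothetical stand-ins for cluster nodes and for the server's assignment logic, and the type names follow the policy service example above; a real deployment would push the rules to remote nodes rather than to in-memory objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubServer:
    """Hypothetical stand-in for one node of the distributed cluster."""
    name: str
    tag: str = ""  # service data type handled by this node, set on assignment
    rules: List[str] = field(default_factory=list)


def distribute_rules(table: Dict[str, List[str]],
                     servers: List[SubServer]) -> Dict[str, SubServer]:
    """Assign the rules of each service data type to exactly one sub-server."""
    if len(servers) < len(table):
        raise ValueError("need at least one sub-server per service data type")
    assignment = {}
    for server, (data_type, rules) in zip(servers, table.items()):
        server.tag = data_type
        server.rules = list(rules)
        assignment[data_type] = server
    return assignment


table = {
    "policy service data-text data": ["segment", "format", "deduplicate"],
    "policy service data-numerical data": ["format", "deduplicate", "correct"],
}
assignment = distribute_rules(table, [SubServer("A1"), SubServer("B1")])
```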

S204, receiving the original data and determining the target service data type corresponding to the original data.

Specifically, when a data cleaning requirement exists, the server receives a data cleaning instruction and receives the original data and the requirement document uploaded by the client, wherein the requirement document records the specific requirements of data cleaning. The server imports the original data uploaded by the client into the distributed cluster, determines in advance the target service data type corresponding to the original data, searches the distributed cluster for the sub-server corresponding to that service data type, and imports the original data into that sub-server for data cleaning.

In this embodiment, the electronic device (such as the server shown in FIG. 1) on which the big data cleaning method runs may receive the data cleaning instruction through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, ZigBee connections, UWB (ultra wideband) connections, and other wireless connection means now known or later developed.

S205, searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by utilizing the target sub-server to obtain cleaning data.

Specifically, after the server assigns the cleaning rule to the corresponding sub-server according to the service data type, a corresponding service data tag is generated for each sub-server, for example, in the above embodiment, the service data tag corresponding to the sub-server A1 is "policy service data-text data", and the service data tag corresponding to the sub-server B1 is "policy service data-numerical data". When the service data is cleaned, after determining a target service data type corresponding to original data, the server compares the target service data type with service data labels corresponding to each sub-server one by one, when the target service data type is matched with the service data label corresponding to one of the sub-servers, the sub-server is used as a target sub-server, and the original data is input into the target sub-server for data cleaning, so that cleaning data is obtained.
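
The one-by-one comparison of the target service data type with the service data labels can be sketched as below. The `SERVER_TAGS` registry and the `find_target_sub_server` function are assumptions made for illustration; the label strings are taken from the example above.

```python
from typing import Dict, Optional

# Hypothetical registry: sub-server name -> service data label.
SERVER_TAGS: Dict[str, str] = {
    "A1": "policy service data-text data",
    "B1": "policy service data-numerical data",
}


def find_target_sub_server(target_type: str,
                           server_tags: Dict[str, str]) -> Optional[str]:
    """Compare the target type with each label in turn; return the first match."""
    for server_name, tag in server_tags.items():
        if tag == target_type:
            return server_name
    return None  # no sub-server handles this service data type


target = find_target_sub_server("policy service data-numerical data", SERVER_TAGS)
```

The loop mirrors the sequential comparison described in the text. If the number of sub-servers were large, an index keyed by label would avoid the linear scan, but that is a design choice outside what the application states.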

In the above embodiment, the present application configures a corresponding data cleaning rule for each sub-server in the distributed cluster by constructing the distributed cluster and according to the service data type, where each sub-server configures a data cleaning rule corresponding to the service data type. When service data cleaning is needed, the server distributes the service data to be cleaned to corresponding sub-servers in the distributed cluster for processing by identifying the data type of the service data to be cleaned and according to the data type of the service data to be cleaned. The method and the device have the advantages that the distributed clusters are built to realize automatic cleaning of different types of service data, the universality and the adaptability are high, the consumption of the development stage of the data cleaning rules can be effectively reduced, the multiplexing rate of the public data cleaning rules is improved, and the unified management of the data cleaning rules is facilitated.

Specifically, after receiving the original data, the server imports the original data into the distributed cluster, performs field segmentation on the original data in the distributed cluster, determines the fields to be cleaned among the segmented fields according to the requirement document, extracts keywords of the fields to be cleaned, and performs semantic analysis on the extracted keywords to determine the target data type corresponding to the original data.

Specifically, the server identifies the data structure of the original data to obtain structure information of the original data, and performs field segmentation on the original data based on the structure information to obtain a plurality of data fields. For example, if the data structure of the original data is a multi-segment structure, the original data is segmented according to the distribution of its segments. The data fields obtained after segmentation comprise fields to be cleaned and data fields that do not need to be cleaned. Finally, semantic recognition is carried out on each data field, and the fields to be cleaned in the original data are determined based on the semantic recognition result and the requirement document.

In the above embodiment, the application acquires the structure information of the original data and divides the original data into a plurality of standard data fields according to that structure information, so that subsequent semantic recognition can obtain the fields to be cleaned in the original data.
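
A simplified sketch of field segmentation followed by selection of the fields to be cleaned is given below. It assumes, purely for illustration, that a record is a string of `name=value` segments separated by semicolons and that the requirement document has been reduced to a set of field names; plain name matching stands in for the semantic recognition step, which the application does not specify in detail.

```python
from typing import Dict, List, Set


def split_fields(raw: str, delimiter: str = ";") -> Dict[str, str]:
    """Split a multi-segment record of 'name=value' segments into fields."""
    fields = {}
    for segment in raw.split(delimiter):
        if "=" not in segment:
            continue  # skip empty or malformed segments
        name, value = segment.split("=", 1)
        fields[name.strip()] = value.strip()
    return fields


def fields_to_clean(fields: Dict[str, str], required: Set[str]) -> List[str]:
    """Keep only the fields that the requirement document asks to clean.

    Plain name matching stands in for the semantic recognition step.
    """
    return [name for name in fields if name.lower() in required]


raw_record = "Holder= Alice ;Premium=1,200.00;Notes=n/a;"
fields = split_fields(raw_record)
to_clean = fields_to_clean(fields, required={"holder", "premium"})
```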

Specifically, the server performs keyword recognition on the fields to be cleaned to obtain keywords of all the fields to be cleaned, wherein the keyword recognition can be realized by adopting OCR field scanning. After the keyword extraction is completed, the server calculates the weight of each keyword, integrates the keywords based on the calculated weight, generates a keyword combination of the original data, and determines the target service data type corresponding to the original data based on the keyword combination.

In the above embodiment, the present application obtains the keywords of all the fields to be cleaned by scanning with the OCR fields, and determines the target service data type corresponding to the original data by calculating the weights of the keywords and according to the weights of the keywords in the original data.

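Further, the step of integrating the extracted keywords to generate a keyword combination of the original data specifically includes:

calculating the weight of each keyword based on a preset TF-IDF algorithm;

sequencing the weights of all the keywords to obtain a keyword weight sequence;

and combining the keywords based on the keyword weight sequence to generate the keyword combination of the original data.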

Specifically, the server calculates the weight of each keyword based on a preset TF-IDF algorithm, performs descending order arrangement on the weights of all keywords to obtain a keyword weight sequence, selects keywords with top ranking from the keyword weight sequence according to the requirement of the requirement document, and combines the keywords with top ranking to obtain the keyword combination of the original data.
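
The sorting and selection described here can be sketched in a few lines. The weights in the example are invented values, and `top_n` is a hypothetical parameter standing in for whatever the requirement document specifies.

```python
from typing import Dict, List


def keyword_combination(weights: Dict[str, float], top_n: int) -> List[str]:
    """Sort keywords by weight in descending order and keep the top_n."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [keyword for keyword, _ in ranked[:top_n]]


# Hypothetical weights, for example produced by a TF-IDF computation.
weights = {"policy": 0.12, "premium": 0.31, "holder": 0.27, "date": 0.05}
combo = keyword_combination(weights, top_n=2)
```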

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. Search engines often apply various forms of TF-IDF weighting as a measure or rating of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use rating methods based on link analysis to determine the order in which documents appear in search results.

The method for calculating the weight of each keyword based on the preset TF-IDF algorithm specifically comprises the following steps:

calculating word frequency of the keywords and calculating inverse document frequency of the keywords;

and calculating the weight of the keyword based on the word frequency of the keyword and the inverse document frequency of the keyword.

The method for calculating the word frequency of the keywords and the inverse document frequency of the keywords specifically comprises the following steps:

determining a field to be cleaned where the keyword is located, and obtaining a target field;

counting the occurrence times of the keywords in the target field to obtain a first word segmentation number, and counting the sum of the occurrence times of all the keywords in each field to be cleaned to obtain a second word segmentation number;

calculating word frequency of the keyword based on the first word segmentation number and the second word segmentation number;

counting the number of target fields to obtain the number of first documents, and counting the total number of fields to be cleaned to obtain the number of second documents;

the inverse document frequency of the keyword is calculated based on the first document number and the second document number.

Specifically, the word frequency TF is calculated as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $tf_{i,j}$ is the word frequency of keyword $t_i$, $n_{i,j}$ is the number of occurrences of keyword $t_i$ in a field to be cleaned $d_j$, and $\sum_{k} n_{k,j}$ is the sum of the numbers of occurrences of all keywords in the field to be cleaned $d_j$.

The inverse document frequency IDF is calculated as follows:

$$idf_{i} = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}$$

where $idf_i$ is the inverse document frequency of keyword $t_i$, $|D|$ is the total number of fields to be cleaned, and $|\{ j : t_i \in d_j \}|$ is the number of fields to be cleaned that contain keyword $t_i$.
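
The word frequency and inverse document frequency steps described above can be sketched in Python as follows. The sketch treats each field to be cleaned as one document, uses the natural logarithm, and takes the product of the two quantities as the keyword weight; the function name `tf_idf` and the sample keywords are assumptions made for illustration.

```python
import math
from typing import Dict, List


def tf_idf(fields: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each field, a mapping keyword -> TF-IDF weight.

    Each field to be cleaned is treated as one document and is given
    as a list of already extracted keywords.
    """
    total_fields = len(fields)
    # Number of fields that contain each keyword at least once.
    containing: Dict[str, int] = {}
    for keywords in fields:
        for kw in set(keywords):
            containing[kw] = containing.get(kw, 0) + 1

    result = []
    for keywords in fields:
        weights = {}
        for kw in set(keywords):
            tf = keywords.count(kw) / len(keywords)
            idf = math.log(total_fields / containing[kw])
            weights[kw] = tf * idf
        result.append(weights)
    return result


fields = [
    ["policy", "premium", "premium"],
    ["policy", "holder"],
    ["policy", "date"],
]
weights = tf_idf(fields)
```

In this example "policy" occurs in every field, so its inverse document frequency and therefore its weight is zero, while "premium" occurs twice in a single field and receives the largest weight there.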

In this embodiment, the weights of the keywords are calculated and arranged in descending order, and the top-ranked keywords are selected from the ordering result and combined. In this way the important keywords in the original data are combined together, so that the target service data type corresponding to the original data is determined more accurately.

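Further, the step of searching the target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaning data specifically includes:

acquiring a service data label corresponding to each sub-server;

in the distributed cluster, comparing the target service data type with the service data labels corresponding to each sub-server one by one;

and determining a target sub-server corresponding to the original data according to the comparison result, and performing data cleaning on the original data through the target sub-server to obtain cleaning data.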

Specifically, after the server distributes the cleaning rule to the corresponding sub-server according to the service data type, a corresponding service data label is generated for each sub-server, when the service data is cleaned, the server determines the target service data type corresponding to the original data, compares the target service data type with the service data label corresponding to each sub-server one by one, and when the target service data type is matched with the service data label corresponding to one of the sub-servers, the sub-server is used as the target sub-server, and the original data is input to the target sub-server for data cleaning, so that the cleaning data is obtained.

In this embodiment, when service data cleaning is performed, the server compares the target service data type with the service data tag corresponding to each sub-server one by one, and inputs the original data to the corresponding sub-server according to the comparison result to perform data cleaning.

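Further, the step of performing data cleaning on the original data by the target sub-server to obtain cleaning data specifically includes:

formatting the original data to obtain formatted data;

detecting repeated data in the formatted data, and cleaning the repeated data to obtain duplicate removal data;

and detecting error data in the duplicate removal data, and cleaning the error data to obtain cleaning data.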

Specifically, the server formats the original data by calling the data formatting rule stored in the target sub-server, so that the original data is normalized into standard data meeting the requirements. The server then calls the data deduplication rule to remove redundant repeated data from the formatted data, obtaining the duplicate removal data. Finally, the server retrieves the error data present in the duplicate removal data and calls the data correction rule to clean the erroneous content, generating the final cleaning data.

In this embodiment, the server selects a corresponding data cleaning rule according to the cleaning requirement on the requirement document, and cleans the original data through the data cleaning rule to obtain cleaning data.
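The three-stage order of formatting, deduplication, and error correction might look like the sketch below. The concrete rules (trimming whitespace, lower-casing field names, and treating a non-numeric `age` as erroneous) are assumptions chosen for illustration; the application leaves the content of the rules to the cleaning rule matching table.

```python
def format_record(record):
    """Formatting rule (assumed): normalize field names and trim values."""
    return {key.strip().lower(): str(value).strip() for key, value in record.items()}


def deduplicate(records):
    """Deduplication rule: keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


def correct_errors(records):
    """Correction rule (assumed): blank out an 'age' that is not a non-negative integer."""
    corrected = []
    for record in records:
        fixed = dict(record)
        if "age" in fixed and not fixed["age"].isdigit():
            fixed["age"] = None
        corrected.append(fixed)
    return corrected


def clean(raw_records):
    """Run formatting, then deduplication, then error correction."""
    formatted = [format_record(record) for record in raw_records]
    deduplicated = deduplicate(formatted)
    return correct_errors(deduplicated)
```

Formatting runs first so that records differing only in whitespace or letter case collapse into one during deduplication.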

The application discloses a big data cleaning method, which belongs to the technical field of big data. A distributed cluster is constructed, and corresponding data cleaning rules are configured for the sub-servers in the distributed cluster according to service data types, so that each sub-server holds the data cleaning rules for its service data type. When service data needs to be cleaned, the server identifies the data type of the service data to be cleaned and, according to that type, distributes the data to the corresponding sub-server in the distributed cluster for processing. By constructing the distributed cluster, the application cleans different types of service data automatically and has strong universality and adaptability; it can effectively reduce the effort spent developing data cleaning rules, increases the reuse rate of common data cleaning rules, and facilitates unified management of the data cleaning rules.

It is emphasized that to further ensure the privacy and security of the original data, the original data may also be stored in a blockchain node.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
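The cryptographic linking of data blocks can be illustrated with a minimal hash chain. This sketch covers only the linking and the validity check; it omits consensus, point-to-point transmission, and every other part of a real blockchain platform, and the block layout and function names are assumptions.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # assumed placeholder for the first block


def make_block(records, prev_hash):
    """Create a block whose hash covers its records and the previous hash."""
    payload = json.dumps({"records": records, "prev_hash": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"records": records, "prev_hash": prev_hash, "hash": digest}


def verify_chain(chain):
    """Check that every block is unmodified and points to its predecessor."""
    prev_hash = GENESIS_PREV_HASH
    for block in chain:
        expected = make_block(block["records"], block["prev_hash"])["hash"]
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True
```

Because each hash depends on the previous one, altering the original data stored in any block invalidates that block and every block after it.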

Those skilled in the art will appreciate that all or part of the processes of the methods of the embodiments described above may be implemented by computer readable instructions stored on a computer readable storage medium, which, when executed, may comprise the processes of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or it may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.

With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a big data cleaning device, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be applied to various electronic devices specifically.

As shown in fig. 3, the big data cleaning device according to the present embodiment includes:

the rule configuration module 301 is configured to create a preset number of cleaning rules, configure a mapping relationship between service data types and the cleaning rules, and generate a cleaning rule matching table;

the cluster construction module 302 is configured to construct a distributed cluster, where the distributed cluster includes a plurality of sub-servers, and each sub-server correspondingly processes data of one service type;

a rule distribution module 303, configured to upload the cleaning rule matching table to the distributed cluster, and distribute the cleaning rule to a corresponding sub-server according to the service data type;

the data preprocessing module 304 is configured to receive original data and determine a target service data type corresponding to the original data;

and the data cleaning module 305 is configured to search a target sub-server corresponding to the target service data type in the distributed cluster, and perform data cleaning on the original data by using the target sub-server to obtain cleaning data.
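How the rule configuration, cluster construction, and rule distribution modules fit together can be sketched with an in-memory stand-in. The rule names, the two service data types, and the dictionary used to represent the cluster are hypothetical; they only illustrate a matching table that maps each service data type to its cleaning rules.

```python
# Hypothetical cleaning rules created by the rule configuration module.
CLEANING_RULES = {
    "trim": lambda rows: [row.strip() for row in rows],
    "lowercase": lambda rows: [row.lower() for row in rows],
    "dedupe": lambda rows: list(dict.fromkeys(rows)),  # keeps first occurrence
}

# Cleaning rule matching table: service data type -> ordered rule names.
MATCHING_TABLE = {
    "insurance": ["trim", "dedupe"],
    "loan": ["trim", "lowercase", "dedupe"],
}


def build_cluster(matching_table, rules):
    """Distribute rules so that each sub-server holds those of one service type."""
    cluster = {}
    for service_type, rule_names in matching_table.items():
        cluster[service_type] = [rules[name] for name in rule_names]
    return cluster


def clean_on_cluster(cluster, service_type, rows):
    """Clean the rows with the rules held by the matching sub-server."""
    for rule in cluster[service_type]:
        rows = rule(rows)
    return rows
```

Because both service data types reference the shared `trim` and `dedupe` rules by name, a common rule is written once and reused, which corresponds to the reuse benefit the application describes.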

Further, the data preprocessing module 304 specifically includes:

the field identification sub-module is used for receiving the original data, importing the original data into the distributed cluster and determining a field to be cleaned in the original data;

and the keyword recognition sub-module is used for extracting keywords of the field to be cleaned and determining a target data type corresponding to the original data based on the keywords.

Further, the field identification submodule specifically includes:

a required document obtaining unit, configured to obtain a required document corresponding to the original data, where a specific requirement for data cleaning is recorded in the required document;

the structure identification unit is used for identifying the data structure of the original data and obtaining the structure information of the original data;

the field segmentation unit is used for segmenting the original data based on the structure information to obtain a plurality of data fields;

the semantic recognition unit is used for carrying out semantic recognition on each data field and obtaining a field to be cleaned in the original data based on the semantic recognition and the required document.
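The units of the field identification sub-module can be approximated as below for delimited text. This is a deliberately simplified sketch: the delimiter guess stands in for structure identification, and a small alias table stands in for semantic recognition, which the application does not detail. All names and the alias table are assumptions.

```python
import csv
import io

# Hypothetical alias table standing in for semantic recognition of field names.
FIELD_ALIASES = {"tel": "phone", "mobile": "phone", "mail": "email", "e-mail": "email"}


def identify_structure(raw_text):
    """Structure identification (assumed): guess the delimiter from the header line."""
    first_line = raw_text.splitlines()[0]
    delimiter = max([",", "\t", ";"], key=first_line.count)
    return {"delimiter": delimiter}


def split_fields(raw_text, structure):
    """Field segmentation: split the original data into named data fields."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=structure["delimiter"])
    rows = list(reader)
    return {name: [row[name] for row in rows] for name in (reader.fieldnames or [])}


def fields_to_clean(columns, required_meanings):
    """Select fields whose recognized meaning is listed in the requirement document."""
    selected = []
    for name in columns:
        key = name.strip().lower()
        meaning = FIELD_ALIASES.get(key, key)
        if meaning in required_meanings:
            selected.append(name)
    return selected
```

Here `required_meanings` plays the role of the cleaning requirements recorded in the requirement document, reduced to a set of field meanings.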

Further, the keyword recognition submodule specifically includes:

the keyword recognition unit is used for recognizing keywords of the field to be cleaned and obtaining keywords of all the fields to be cleaned;

the keyword combination unit is used for integrating the extracted keywords and generating keyword combinations of the original data;

and the service type judging unit is used for determining the type represented by the keyword combination and taking it as the target service data type corresponding to the original data.

Further, the keyword combination unit specifically includes:

a weight calculating subunit, configured to calculate a weight of each keyword based on a preset TF-IDF algorithm;

the weight sorting subunit is used for sorting the weights of all the keywords to obtain a keyword weight sequence;

and the keyword combination subunit is used for combining the keywords based on the keyword weight sequence and generating the keyword combination of the original data.

Further, the data cleansing module 305 specifically includes:

the service tag acquisition sub-module is used for acquiring the service data tag corresponding to each sub-server;

the tag comparison sub-module is used for comparing the target service data type with the service data tags corresponding to each sub-server one by one in the distributed cluster;

and the data cleaning sub-module is used for determining a target sub-server corresponding to the original data according to the comparison result, and carrying out data cleaning on the original data through the target sub-server to obtain cleaning data.

Further, the data cleaning submodule specifically includes:

the formatting unit is used for formatting the original data to obtain formatted data;

the first cleaning unit is used for detecting repeated data in the formatted data and cleaning the repeated data to obtain duplicate removal data;

and the second cleaning unit is used for detecting error data in the duplicate removal data and cleaning the error data to obtain cleaning data.

The application discloses a big data cleaning device, which belongs to the technical field of big data. As with the method described above, the device constructs a distributed cluster in which each sub-server is configured with the data cleaning rules for one service data type, identifies the data type of the service data to be cleaned, and distributes that data to the corresponding sub-server for processing. It therefore offers the same advantages: automatic cleaning of different types of service data, strong universality and adaptability, less effort spent developing data cleaning rules, a higher reuse rate of common data cleaning rules, and unified management of the data cleaning rules.

To solve the above technical problems, an embodiment of the application also provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of the computer device according to this embodiment.

The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It should be noted that only the computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculations and/or information processing in accordance with predetermined or stored instructions, the hardware of which includes, but is not limited to, microprocessors, Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), embedded devices, and the like.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store the operating system and the various application software installed on the computer device 4, such as the computer readable instructions of the big data cleaning method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as computer readable instructions for executing the big data cleansing method.

The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.

The application discloses a computer device, which belongs to the technical field of big data. As with the method described above, a distributed cluster is constructed in which each sub-server is configured with the data cleaning rules for one service data type, the data type of the service data to be cleaned is identified, and that data is distributed to the corresponding sub-server for processing. The computer device therefore offers the same advantages: automatic cleaning of different types of service data, strong universality and adaptability, less effort spent developing data cleaning rules, a higher reuse rate of common data cleaning rules, and unified management of the data cleaning rules.

The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the big data cleansing method as described above.

The application discloses a storage medium, which belongs to the technical field of big data. As with the method described above, a distributed cluster is constructed in which each sub-server is configured with the data cleaning rules for one service data type, the data type of the service data to be cleaned is identified, and that data is distributed to the corresponding sub-server for processing. The storage medium therefore offers the same advantages: automatic cleaning of different types of service data, strong universality and adaptability, less effort spent developing data cleaning rules, a higher reuse rate of common data cleaning rules, and unified management of the data cleaning rules.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.

The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It is apparent that the embodiments described above are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application but do not limit its patent scope. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood more thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, likewise falls within the protection scope of the application.

Claims

1. A big data cleansing method, comprising:

searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by utilizing the target sub-server to obtain cleaning data;

the step of searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by using the target sub-server to obtain cleaning data specifically comprises the following steps:

acquiring a service data label corresponding to each sub-server;

2. The data cleansing method according to claim 1, wherein the step of receiving the original data and determining the target service data type corresponding to the original data specifically comprises:

3. The data cleansing method according to claim 2, wherein the steps of receiving the original data, importing the original data into the distributed cluster, and determining a field to be cleansed in the original data specifically include:

4. The method for cleaning data according to claim 2, wherein the step of extracting the keywords of the field to be cleaned and determining the target data type corresponding to the original data based on the keywords specifically comprises:

5. The method for cleaning data as recited in claim 4, wherein the step of integrating the extracted keywords to generate a keyword combination of the original data comprises:

calculating the weight of each keyword based on a preset TF-IDF algorithm;

sequencing the weights of all the keywords to obtain a keyword weight sequence;

6. The method for cleaning data according to claim 1, wherein the step of cleaning the raw data by the target sub-server to obtain cleaning data specifically comprises:

formatting the original data to obtain formatted data;

7. A big data cleaning device, comprising:

the data cleaning module is used for searching a target sub-server corresponding to the target service data type in the distributed cluster, and performing data cleaning on the original data by utilizing the target sub-server to obtain cleaning data;

the data cleaning module specifically comprises:

8. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the big data cleaning method of any of claims 1 to 6.

9. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the big data cleaning method according to any of claims 1 to 6.