CN109299183A

CN109299183A - Data processing method, device, terminal device and storage medium

Info

Publication number: CN109299183A
Application number: CN201811386474.9A
Authority: CN
Inventors: 火莽; 火一莽; 许山川; 王生玉
Original assignee: Qinghai Public Security Bureau; Beijing Ruian Technology Co Ltd
Current assignee: Qinghai Public Security Bureau; Beijing Ruian Technology Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2019-02-01

Abstract

The invention discloses a data processing method, device, terminal device and storage medium. The method comprises: obtaining two or more pieces of raw data, each comprising field information and data content; converting the raw data into a field identifier and extracting a primary key from the field information; if the field identifier of the raw data has no intersection with the field identifiers in the raw database, writing the raw data into the raw database; performing standardization preprocessing on the raw data written into the raw database to obtain cleaned data; and, if the primary key of the cleaned data has no intersection with the primary keys in the cleaned database, writing the cleaned data into the cleaned database. During data processing, the invention does not need to compare the primary keys of all data in the raw database with the primary keys in the cleaned database, thereby improving data processing efficiency.

Description

Data processing method, device, terminal device and storage medium

Technical field

Embodiments of the present invention relate to data processing technology, and in particular to a data processing method, device, terminal device and storage medium.

Background

ETL is short for Extract-Transform-Load, that is, data extraction, transformation and loading. ETL is an important part of building a data warehouse: data extracted from heterogeneous data sources is cleaned and transformed, then loaded into the target data warehouse (the cleaned database), where it serves as the basis for online analytical processing and data mining.

In conventional ETL work, when a large volume of data arrives in the source data warehouse, the data is read directly from the source data warehouse, passed through the cleaning and transformation stages, and then inserted or updated into the target data warehouse. However, when large volumes of data must be updated in scenarios that demand timeliness, this approach easily becomes a bottleneck, so that data in the source data warehouse cannot be updated into the target data warehouse in time.

Summary of the invention

In view of this, the present invention provides a data processing method, device, terminal device and storage medium to improve data processing efficiency.

In a first aspect, an embodiment of the present invention provides a data processing method, comprising:

obtaining two or more pieces of raw data, the raw data comprising field information and data content;

converting the raw data into a field identifier, and extracting a primary key from the field information;

if the field identifier of the raw data has no intersection with the field identifiers in the raw database, writing the raw data into the raw database;

performing standardization preprocessing on the raw data written into the raw database to obtain cleaned data;

if the primary key of the cleaned data has no intersection with the primary keys in the cleaned database, writing the cleaned data into the cleaned database.

Further, before obtaining the two or more pieces of raw data, the method further comprises:

obtaining a raw data file, and determining the format of the raw data file;

if the raw data file is a JSON file, parsing the raw data file to obtain raw data in JSON format.

Further, before performing standardization preprocessing on the raw data written into the raw database, the method further comprises:

obtaining the write time at which the raw data is written into the raw database, or the creation time of the raw data file;

using the write time or the creation time as the batch identifier of the raw data.

Further, performing standardization preprocessing on the raw data written into the raw database comprises:

querying the batch identifiers of the raw data in the raw database;

performing standardization preprocessing on the raw data corresponding to the newest batch identifier.

Further, after writing the cleaned data into the cleaned database, the method further comprises:

obtaining the time of the most recent push of cleaned data;

pushing the cleaned data whose time is later than the most recent push time and earlier than the current system time to the associated application platform.

Further, converting the raw data into a field identifier is specifically:

converting the raw data into a corresponding hash value.

In a second aspect, an embodiment of the present invention further provides a data processing device, comprising:

a first obtaining module, configured to obtain two or more pieces of raw data, the raw data comprising field information and data content;

a conversion and extraction module, configured to convert the raw data into a field identifier and to extract a primary key from the field information;

a first writing module, configured to write the raw data into the raw database if the field identifier of the raw data has no intersection with the field identifiers in the raw database;

a preprocessing module, configured to perform standardization preprocessing on the raw data written into the raw database to obtain cleaned data;

a second writing module, configured to write the cleaned data into the cleaned database if the primary key of the cleaned data has no intersection with the primary keys in the cleaned database.

Further, the data processing device further comprises:

a format determination module, configured to obtain a raw data file and determine the format of the raw data file before the two or more pieces of raw data are obtained;

a parsing module, configured to parse the raw data file, if it is a JSON file, to obtain raw data in JSON format.

Further, the data processing device further comprises:

a second obtaining module, configured to obtain, before standardization preprocessing is performed on the raw data written into the raw database, the write time at which the raw data is written into the raw database or the creation time of the raw data file;

a determining module, configured to use the write time or the creation time as the batch identifier of the raw data.

Further, the preprocessing module comprises:

a query unit, configured to query the batch identifiers of the raw data in the raw database;

a preprocessing unit, configured to perform standardization preprocessing on the raw data corresponding to the newest batch identifier.

Further, the data processing device further comprises:

a third obtaining module, configured to obtain the time of the most recent push of cleaned data after the cleaned data is written into the cleaned database;

a pushing module, configured to push the cleaned data whose time is later than the most recent push time and earlier than the current system time to the associated application platform.

Further, when converting the raw data into a field identifier, the conversion and extraction module is specifically configured to:

convert the raw data into a corresponding hash value.

In a third aspect, an embodiment of the present invention further provides a terminal device, comprising: a memory and one or more processors;

the memory is configured to store one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method described in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the data processing method described in the first aspect.

The present invention obtains two or more pieces of raw data comprising field information and data content; converts the raw data into a field identifier and extracts a primary key from the field information; writes the raw data into the raw database if its field identifier has no intersection with the field identifiers in the raw database; performs standardization preprocessing on the raw data written into the raw database to obtain cleaned data; and writes the cleaned data into the cleaned database if its primary key has no intersection with the primary keys in the cleaned database. During data processing there is no need to compare the primary keys of all data in the raw database with the primary keys in the cleaned database, which improves data processing efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of a data processing method provided by Embodiment one of the present invention;

Fig. 2 is a schematic diagram of writing raw data into the raw database, provided by Embodiment one of the present invention;

Fig. 3 is a schematic diagram of writing cleaned data into the cleaned database, provided by Embodiment one of the present invention;

Fig. 4 is a flowchart of a data processing method provided by Embodiment two of the present invention;

Fig. 5 is a schematic diagram of a data processing procedure provided by Embodiment two of the present invention;

Fig. 6 is a schematic diagram of determining the batch identifier, provided by Embodiment two of the present invention;

Fig. 7 is a structural block diagram of a data processing system provided by Embodiment two of the present invention;

Fig. 8 is a flowchart of a data processing method provided by Embodiment three of the present invention;

Fig. 9 is a schematic diagram of the component connections for data processing, provided by Embodiment three of the present invention;

Fig. 10 is a structural block diagram of a data processing device provided by Embodiment four of the present invention;

Fig. 11 is a structural schematic diagram of a terminal device provided by Embodiment five of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

It should be noted that, in all embodiments of the data processing method in this solution, the ETL data processing is implemented by configuring components in the Kettle tool. Kettle is an ETL tool that allows data from different databases to be managed, and it provides a graphical user environment in which to describe the data processing to be performed.

Embodiment one

Fig. 1 is a flowchart of a data processing method provided by Embodiment one of the present invention. The data processing method provided in this embodiment may be executed by a terminal device, which may be implemented in software and/or hardware and may consist of a single physical entity or of two or more physical entities. In this embodiment the terminal device is a server that processes the raw data in order to provide performance support for the application platform associated with the server.

With reference to Fig. 1, the data processing method specifically comprises the following steps:

S110: Obtain two or more pieces of raw data.

The raw data comprises field information and data content.

In this embodiment, raw data can be understood as the different data obtained from heterogeneous data sources, where heterogeneous data sources are data held in different database management systems. The server may obtain raw data from different data sources at the same time. The raw data contains multiple data tables, and each data table contains field information and data content: the field information is the name of each field in the table, and the data content is the data stored under each field name, so each field name corresponds to its own data content. Raw data is data that has not yet undergone standardization preprocessing such as field filtering and format conversion.

S120: Convert the raw data into a field identifier, and extract a primary key from the field information.

The field identifier indicates whether the data content of the file containing the raw data has been modified: if a modification occurs, the field identifier changes; if not, the field identifier stays the same. For example, the field identifier may be a hash value or an MD5 (MD5 Message-Digest Algorithm) value. A hash value is the output of a hash function, which compresses a message or data into a fixed-format digest so that the amount of data to compare becomes smaller. MD5 is a widely used cryptographic hash function that produces a 128-bit hash value and is used to check that transmitted information is complete and consistent. In this embodiment, the raw data is converted into a field identifier in order to detect whether the data content of the file containing the raw data has been modified without comparing the data content item by item, which speeds up detection of changes in the raw data.
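As a minimal sketch of the idea above, the following Python snippet computes an MD5 digest over one record. The helper name `field_identifier`, the sample records, and the choice to serialize with sorted keys are assumptions made for illustration; the source does not specify how the raw data is serialized before hashing.

```python
import hashlib
import json


def field_identifier(record: dict) -> str:
    """Return an MD5 hex digest of a record (hypothetical helper).

    Keys are sorted so that identical content always serializes to the
    same bytes; this canonicalization is an assumption, not from the source.
    """
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


original = {"id": 1, "name": "Zhang", "city": "Xining"}
modified = {"id": 1, "name": "Zhang", "city": "Beijing"}

# Identical content yields the same identifier, regardless of key order.
same = field_identifier(original) == field_identifier(
    {"city": "Xining", "name": "Zhang", "id": 1}
)
# Any change to the content yields a different identifier.
changed = field_identifier(original) != field_identifier(modified)

print(same, changed)
print(len(field_identifier(original)))  # 32 hex characters = 128 bits
```

Comparing two 32-character digests replaces a field-by-field comparison of the record content.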

It should be noted that the raw data contains multiple data tables, and each data table contains multiple records. To uniquely identify a record in a data table, one or more fields can be extracted from the table to serve as its primary key, so that lookups in the table can be accelerated by using the primary key.
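The snippet below sketches how one or more fields might be combined into a primary key and used to index a table. The field names `dept` and `emp_no`, the sample rows, and the helper `extract_primary_key` are hypothetical.

```python
def extract_primary_key(record: dict, key_fields: tuple) -> tuple:
    """Return the values of the key fields as a tuple (hypothetical helper)."""
    return tuple(record[name] for name in key_fields)


# Hypothetical table: "dept" and "emp_no" together identify one record.
table = [
    {"dept": "A", "emp_no": 1, "name": "Li"},
    {"dept": "A", "emp_no": 2, "name": "Wang"},
    {"dept": "B", "emp_no": 1, "name": "Zhao"},
]
key_fields = ("dept", "emp_no")

# Index the table by primary key so a lookup does not scan every record.
index = {extract_primary_key(row, key_fields): row for row in table}

print(index[("B", 1)]["name"])  # Zhao
```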

S130: If the field identifier of the raw data has no intersection with the field identifiers in the raw database, write the raw data into the raw database.

The raw database can be understood as the database that stores raw data. In this embodiment, the file containing each piece of raw data has a unique field identifier stored in the raw database. If the field identifier of a piece of raw data is identical to a field identifier in the raw database, the data content of that raw data has not been modified, that is, it duplicates data content already in the raw database, and the raw data does not need to be written into the raw database. Conversely, if the field identifier of the raw data has no intersection with the field identifiers in the raw database, the data content of the raw data differs from that in the raw database, and the raw data is written into the raw database to provide data support for the application platform associated with the server. Fig. 2 is a schematic diagram of writing raw data into the raw database, provided by Embodiment one of the present invention. With reference to Fig. 2, suppose four pieces of raw data are obtained. Each is converted into its field identifier: field identifier 1, field identifier 2, field identifier 3 and field identifier 4. These four field identifiers are then compared with the field identifiers in the raw database. Field identifier 1 and field identifier 3 are found to exist already in the raw database, so the raw data corresponding to field identifier 1 and field identifier 3 is deleted, and only the raw data corresponding to field identifier 2 and field identifier 4 is written into the raw database.
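The Fig. 2 scenario can be sketched with a set difference over identifiers. Here the raw database is modeled as an in-memory dictionary keyed by field identifier; the names `fid1` to `fid4` and the function `write_new_records` are placeholders, not part of the source.

```python
def write_new_records(incoming: dict, raw_db: dict) -> list:
    """Write records whose field identifier is absent from raw_db.

    Both arguments map field identifier -> record. Returns the sorted
    identifiers that were written.
    """
    new_ids = incoming.keys() - raw_db.keys()  # identifiers not yet stored
    for ident in new_ids:
        raw_db[ident] = incoming[ident]
    return sorted(new_ids)


# Field identifiers 1 and 3 already exist in the raw database.
raw_db = {"fid1": "record 1", "fid3": "record 3"}
incoming = {
    "fid1": "record 1",
    "fid2": "record 2",
    "fid3": "record 3",
    "fid4": "record 4",
}

written = write_new_records(incoming, raw_db)
print(written)  # ['fid2', 'fid4']
```

In a real deployment the comparison would run against a database index rather than a dictionary, but the membership test is the same.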

It should be noted that this solution decides whether to write raw data into the raw database by comparing field identifiers rather than primary keys, so that raw data can be traced. When raw data needs to be traced back to a historical version, a comparison based on primary keys would fail: if the primary key of the raw data is unchanged but the data content of other fields has changed, the changed content would not be written into the raw database, the raw data would be incomplete, and the corresponding historical version could not be found. With the field-identifier comparison used in this solution, any modification to the data content of the raw data changes its field identifier, so the raw data is written into the raw database. This guarantees the integrity of the raw data and allows the corresponding historical version records and raw data to be found.
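The difference between the two comparison strategies can be illustrated with a single modified record. This is a sketch under assumed data: the `fingerprint` helper uses MD5 over sorted-key JSON, which is one possible realization of the field identifier.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """MD5 digest of the whole record (one possible field identifier)."""
    return hashlib.md5(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()


stored = [{"id": 1, "city": "Xining"}]    # version already in the raw database
incoming = {"id": 1, "city": "Beijing"}   # same primary key, modified content

stored_keys = {row["id"] for row in stored}
stored_fingerprints = {fingerprint(row) for row in stored}

# Primary-key comparison treats the new version as a duplicate and drops it.
skipped_by_key = incoming["id"] in stored_keys
# Field-identifier comparison sees a new fingerprint and keeps the version.
skipped_by_fingerprint = fingerprint(incoming) in stored_fingerprints

print(skipped_by_key, skipped_by_fingerprint)  # True False
```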

S140: Perform standardization preprocessing on the raw data written into the raw database to obtain cleaned data.

Standardization preprocessing comprises a series of steps such as field filtering, format conversion and data validation. In this embodiment, after the raw data is written into the raw database, the field information in the raw data is filtered, the data content of the remaining fields is converted into a preset format, and the converted data content is validated to obtain the cleaned data. Field filtering removes from the raw data any field information that does not match the fields preset in the raw database. For example, if the raw data contains a field D but field D is not among the preset fields of the raw database, the data content of field D is filtered out when the rest of the raw data is written into the raw database. The filtered raw data is then converted into the preset format. For example, suppose the preset data length threshold is 300 and the preset row-count threshold is 100; if the raw data has a data length of 200 and 200 rows, it must be split so that each part is less than or equal to both the data length threshold and the row-count threshold. Finally, the format-converted raw data is validated to guarantee that the data is legal. Specifically, the character strings in the format-converted raw data are checked and illegal strings are filtered out, yielding the cleaned data.
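The three stages can be sketched as small functions. The allowed field set, the list of forbidden substrings, and all function names are assumptions for illustration; only the row-count threshold of 100 comes from the example above, and the data length threshold is not modeled.

```python
ALLOWED_FIELDS = {"id", "name", "city"}  # hypothetical preset fields
MAX_ROWS = 100                           # row-count threshold from the example
FORBIDDEN = ("<script", "\x00")          # hypothetical illegal substrings


def filter_fields(row: dict) -> dict:
    """Field filtering: drop fields that are not preset (such as field D)."""
    return {k: v for k, v in row.items() if k in ALLOWED_FIELDS}


def split_rows(rows: list, max_rows: int = MAX_ROWS) -> list:
    """Format conversion: split rows into chunks of at most max_rows."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


def is_legal(row: dict) -> bool:
    """Data validation: reject rows containing a forbidden substring."""
    return not any(bad in str(v) for v in row.values() for bad in FORBIDDEN)


def standardize(rows: list) -> list:
    filtered = [filter_fields(row) for row in rows]
    return [[row for row in chunk if is_legal(row)]
            for chunk in split_rows(filtered)]


# 200 rows with an extra field D; row 150 carries an illegal string.
rows = [{"id": i, "name": f"n{i}", "D": "dropped"} for i in range(200)]
rows[150]["name"] = "<script>alert(1)</script>"

chunks = standardize(rows)
print([len(chunk) for chunk in chunks])  # [100, 99]
print("D" in chunks[0][0])               # False
```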

S150: If the primary key of the cleaned data has no intersection with the primary keys in the cleaned database, write the cleaned data into the cleaned database.

The cleaned database can be understood as the data warehouse that stores cleaned data, that is, the target data warehouse. In this embodiment, to improve write efficiency, the primary key of the cleaned data is compared directly with the primary keys in the cleaned database. If the primary key of the cleaned data does not duplicate any primary key in the cleaned database, the cleaned data obtained in step S140 is not yet stored there and must be written into the cleaned database to update it. Conversely, if the primary key duplicates one in the cleaned database, the cleaned data obtained in step S140 is already stored there and can simply be filtered out.

Fig. 3 is a schematic diagram of writing cleaned data into the cleaned database, provided by Embodiment one of the present invention. Fig. 3 builds on Fig. 2 to illustrate this process. With reference to Fig. 3, after the raw data corresponding to field identifier 2 and field identifier 4 is written into the raw database, it undergoes standardization preprocessing to obtain cleaned data. The corresponding primary keys, primary key 2 and primary key 4, are then looked up and compared with the primary keys in the cleaned database. Primary key 4 is found to exist already in the cleaned database, so the cleaned data corresponding to primary key 4 is deleted and only the cleaned data corresponding to primary key 2 is written into the cleaned database. Field identifier 2 and primary key 2 correspond to the same raw data, as do field identifier 4 and primary key 4.
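The Fig. 3 scenario can be sketched as follows, with the cleaned database modeled as a dictionary keyed by primary key. The function `write_cleaned`, the key field name `id`, and the sample rows are hypothetical.

```python
def write_cleaned(cleaned_rows: list, cleaned_db: dict, key_field: str = "id") -> list:
    """Write rows whose primary key is not yet in cleaned_db.

    Returns the primary keys that were written, in input order.
    """
    written = []
    for row in cleaned_rows:
        key = row[key_field]
        if key in cleaned_db:
            continue  # duplicate primary key: filter the row out
        cleaned_db[key] = row
        written.append(key)
    return written


# Primary key 4 already exists in the cleaned database.
cleaned_db = {4: {"id": 4, "name": "existing"}}
batch = [{"id": 2, "name": "new"}, {"id": 4, "name": "duplicate"}]

written_keys = write_cleaned(batch, cleaned_db)
print(written_keys)           # [2]
print(cleaned_db[4]["name"])  # existing
```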

In the technical solution of this embodiment, two or more pieces of raw data comprising field information and data content are obtained; the raw data is converted into a field identifier and a primary key is extracted from the field information; if the field identifier of the raw data has no intersection with the field identifiers in the raw database, the raw data is written into the raw database; the raw data written into the raw database undergoes standardization preprocessing to obtain cleaned data; and if the primary key of the cleaned data has no intersection with the primary keys in the cleaned database, the cleaned data is written into the cleaned database. During data processing there is no need to compare the primary keys of all data in the raw database with the primary keys in the cleaned database, which improves data processing efficiency.

Embodiment two

Fig. 4 is a flowchart of a data processing method provided by Embodiment two of the present invention. This embodiment further details the data processing method on the basis of the above embodiment. Fig. 5 is a schematic diagram of a data processing procedure provided by Embodiment two of the present invention. For ease of explanation, only five raw data files are obtained in this embodiment: raw data file 1, raw data file 2, raw data file 3, raw data file 4 and raw data file 5, as shown in Fig. 5.

With reference to Fig. 4, the data processing method specifically comprises the following steps:

S201: Obtain a raw data file and determine its format.

A raw data file can be understood as a file that stores raw data. In this embodiment, raw data matching a preset data format is obtained through Kettle, the ETL tool. That is, after the raw data files are obtained from heterogeneous data sources, the data format of the raw data in each file is checked to identify the raw data and raw data files that match the preset data format. The raw data in a raw data file may be in EXCEL, JSON, text or another data format; no limitation is imposed here. It should be noted that the format of a raw data file is the same as the data format of its raw data. For example, a raw data file generally takes the form of an ordinary folder, and each document stored in it has the same format as the raw data it holds. Each raw data file may thus contain multiple documents that store raw data. For example, if a raw data file contains three documents and the raw data in all three is in EXCEL format, then the three documents are in EXCEL format.

S202: If the raw data file is a JSON file, parse the raw data file to obtain raw data in JSON format.

In this embodiment, the preset data format for raw data in a raw data file is JSON. That is, after a raw data file is obtained, its format is determined, and if it is a JSON file it is parsed to obtain raw data in JSON format. The preset data format may of course be set to another format according to business requirements. As shown in Fig. 5, the raw data files are parsed to obtain raw data in JSON format: raw data 1, raw data 2, raw data 3, raw data 4 and raw data 5.
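Steps S201 and S202 can be sketched in plain Python as a format check followed by parsing. The embodiments implement this with Kettle components, so the code below is only an illustrative stand-in; the use of the file extension as the format test, the function `load_raw_records`, and the file names are assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_raw_records(path: Path) -> list:
    """Parse a file only if it is a JSON file; otherwise return no records.

    Using the file extension as the format test is an assumption.
    """
    if path.suffix.lower() != ".json":
        return []
    data = json.loads(path.read_text(encoding="utf-8"))
    return data if isinstance(data, list) else [data]


with tempfile.TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "raw_1.json"
    json_file.write_text(
        '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]', encoding="utf-8"
    )
    excel_file = Path(tmp) / "raw_2.xlsx"
    excel_file.write_bytes(b"not parsed")

    json_records = load_raw_records(json_file)  # parsed
    skipped = load_raw_records(excel_file)      # format mismatch, ignored

print(len(json_records), skipped)  # 2 []
```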

S203: Obtain two or more pieces of raw data.

The raw data comprises field information and data content. In this embodiment, since there are five raw data files, five pieces of raw data are obtained.

S204: Convert the raw data into a field identifier, and extract a primary key from the field information.

In this embodiment, each piece of raw data is converted into a corresponding field identifier, which may be a hash value or an MD5 value. At the same time, the corresponding primary key is extracted from the field information of each piece of raw data.

S205: If the field identifier of the raw data has no intersection with the field identifiers in the raw database, write the raw data into the raw database.

Specifically, it is determined whether the field identifier of each piece of raw data intersects with the field identifiers in the raw database. As shown in Fig. 5, field identifier 2 already exists in the raw database, so the raw data corresponding to field identifier 2 is deleted, which deduplicates the data in the raw database. The raw data corresponding to field identifier 1, field identifier 3, field identifier 4 and field identifier 5 is written into the raw database.

S206: Obtain the write time at which the raw data is written into the raw database, or the creation time of the raw data file.

The write time is the system time at which the raw data is written into the raw database. The creation time is the system time at which the raw data is assembled into a raw data file. In this embodiment, the creation time of a raw data file is earlier than the time at which its raw data is written into the raw database. A raw data file has already been created by the time it is obtained from a heterogeneous data source, so, to simplify recording the creation time, the system time at which the raw data file is obtained from the heterogeneous data source may be used as its creation time. After a raw data file is obtained, its raw data is converted into field identifiers and the field identifiers are compared; raw data that passes the deduplication filter rule is then written into the raw database, and the system time of that write is used as the write time. The deduplication filter rule is the rule that filters raw data by comparing its field identifier with the field identifiers in the raw database.

S207: Use the write time or the creation time as the batch identifier of the raw data.

The batch identifier indicates the order in which raw data was updated in the raw database. In this embodiment, so that the newest raw data can be found quickly in the raw database, a batch identifier is set for each piece of raw data written into the raw database. Either the write time at which the raw data was written into the raw database or the creation time of the raw data file may be used as the batch identifier. To make the update order easy to read from the batch identifier, the batch identifier may be a numeric representation of the time. For example, if the creation time of the raw data file is 16:28 on November 9, 2018, the batch identifier is 201811091628; likewise, if the raw data is written into the raw database at 18:06 on November 9, 2018, the batch identifier is 201811091806.
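The numeric batch identifiers in the example above correspond to a year-month-day-hour-minute layout. A minimal sketch, assuming that layout and a hypothetical helper name `batch_identifier`:

```python
from datetime import datetime


def batch_identifier(moment: datetime) -> str:
    """Format a timestamp as YYYYMMDDHHMM, matching the example values."""
    return moment.strftime("%Y%m%d%H%M")


created = datetime(2018, 11, 9, 16, 28)  # creation time of the raw data file
written = datetime(2018, 11, 9, 18, 6)   # write time into the raw database

print(batch_identifier(created))  # 201811091628
print(batch_identifier(written))  # 201811091806
```

Because the layout is fixed-width, comparing two identifiers as strings or as integers gives the same chronological order.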

Fig. 6 is a schematic diagram of determining the batch identifier, provided by Embodiment two of the present invention. Suppose data is counted at 1-minute intervals over a total of 15 minutes. As shown in Fig. 6, the creation time of raw data file 5 is in the 2nd minute; that of raw data file 1 is the 6th second of the 7th minute; that of raw data file 3 is the 30th second of the 9th minute; and that of raw data file 4 is the 18th second of the 12th minute (solid arrows in Fig. 6). The write time at which these four raw data files are written into the raw database is the 30th second of the 15th minute (dashed arrow in Fig. 6).

Of course, so that batch identifiers can be managed uniformly, the kind of time used for the batch identifier must be fixed. If the write time is used as the batch identifier, the batch identifiers of all raw data must be based on the write time; likewise, if the creation time of the raw data file is used, the batch identifiers of all raw data must be based on the creation time. Write times and creation times must not be mixed.

S208: Query the batch identifiers of the raw data in the raw database.

In this embodiment, when raw data is written into the raw database, its batch identifier may also be written into the raw database so that the raw data can later be found quickly by batch identifier. A data query statement may be used to look up batch identifiers in the raw database, for example a query written in Structured Query Language (SQL) against a database such as Oracle; no limitation is imposed here, and the choice may be made according to business requirements. To make batch identifiers easier to query, they may be written into a temporary data table created in advance and stored in the raw database. The temporary data table can also record the relationship between raw data and batch identifiers, so that the raw data in the raw database can be obtained directly from a batch identifier.
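A batch-identifier lookup of this kind can be sketched with SQL. The example uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the raw database; the table name `raw_data`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table relating each piece of raw data to its batch identifier.
conn.execute(
    "CREATE TABLE raw_data (field_id TEXT PRIMARY KEY, batch_id TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?)",
    [
        ("fid0", "201811091628", "older record"),
        ("fid1", "201811091806", "record 1"),
        ("fid3", "201811091806", "record 3"),
    ],
)

# List the batch identifiers stored in the raw database, oldest first.
batches = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT batch_id FROM raw_data ORDER BY batch_id"
    )
]

# Fetch the raw data belonging to one batch identifier.
members = [
    row[0]
    for row in conn.execute(
        "SELECT field_id FROM raw_data WHERE batch_id = ? ORDER BY field_id",
        ("201811091806",),
    )
]
conn.close()

print(batches)  # ['201811091628', '201811091806']
print(members)  # ['fid1', 'fid3']
```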

S209, pretreatment is standardized to the corresponding initial data of newest batch identification, obtains cleaning data.

In this embodiment, to speed up standardization of the raw data in the raw database, only the raw data corresponding to the newest batch identifier in the raw database needs to be standardized. That is, a data query statement obtains the newest batch identifier currently in the raw database together with the raw data corresponding to it; after a series of standardization preprocessing steps such as field filtering, format conversion, and data verification, the filtered, cleaned data is obtained. The detailed process of standardizing the raw data is described in the embodiments above and is not repeated here. As shown in Fig. 5, if the batch identifiers of the raw data corresponding to field identifier 1, field identifier 3, field identifier 4, and field identifier 5 are the newest in the raw database, the raw data corresponding to these four field identifiers is standardized, yielding cleaned data 1, cleaned data 3, cleaned data 4, and cleaned data 5.
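The three kinds of preprocessing named above (field filtering, format conversion, data verification) might look like the sketch below. The concrete rules are not given in this section, so the required field names, the whitespace trimming, and the rule that an empty field fails verification are all illustrative assumptions.

```python
from typing import Optional

# Hypothetical list of fields that survive field filtering.
REQUIRED = ("person_id", "full_name")


def standardize(record: dict, keep=REQUIRED) -> Optional[dict]:
    """Return the cleaned record, or None if it fails verification."""
    # Field filtering: keep only the configured fields.
    filtered = {k: record.get(k) for k in keep}
    # Format conversion (assumed rule): trim surrounding whitespace on strings.
    converted = {
        k: v.strip() if isinstance(v, str) else v for k, v in filtered.items()
    }
    # Data verification (assumed rule): every kept field must be non-empty.
    if any(v is None or v == "" for v in converted.values()):
        return None
    return converted


cleaned = standardize({"person_id": " 7 ", "full_name": "Li", "extra": 1})
rejected = standardize({"person_id": "7"})
```

Applying this function only to the records of the newest batch is what keeps the cleaning step from reprocessing the whole raw database.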

It should be noted that this embodiment uses the write time at which the raw data is written to the raw database as the batch identifier of the raw data, so the batch identifiers corresponding to field identifier 1, field identifier 3, field identifier 4, and field identifier 5 are identical.

S210: If the primary key of the cleaned data has no intersection with the primary keys in the cleaning database, write the cleaned data to the cleaning database.

As shown in Fig. 5, primary key 3, which corresponds to cleaned data 3, duplicates a primary key already in the cleaning database; cleaned data 3 is therefore deleted, and only cleaned data 1, cleaned data 4, and cleaned data 5 are written to the cleaning database.
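The Fig. 5 example can be reproduced with a small sketch in which the cleaning database is modeled as a plain dictionary keyed by primary key. The function name `write_cleaned` and the key values `pk1` to `pk5` are hypothetical.

```python
def write_cleaned(cleaned: dict, cleaning_db: dict) -> list:
    """Write rows whose primary key is not yet present; return the keys written."""
    written = []
    for pk, row in cleaned.items():
        if pk in cleaning_db:
            # Primary key already exists: discard this cleaned record.
            continue
        cleaning_db[pk] = row
        written.append(pk)
    return written


# Cleaning database that already contains primary key 3.
cleaning_db = {"pk3": {"value": "existing"}}
# Cleaned data 1, 3, 4, and 5 from the newest batch.
batch = {
    "pk1": {"value": "cleaned 1"},
    "pk3": {"value": "cleaned 3"},
    "pk4": {"value": "cleaned 4"},
    "pk5": {"value": "cleaned 5"},
}
written_keys = write_cleaned(batch, cleaning_db)
```

Only the primary keys of the newest batch are checked against the cleaning database, which is the source of the efficiency gain the embodiment describes.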

On the basis of the embodiments above, the technical solution of this embodiment performs a format check on the raw data file to obtain raw data in JSON data format, standardizes the raw data corresponding to the newest batch identifier in the raw database to obtain cleaned data, and writes the cleaned data to the cleaning database when the primary key of the cleaned data has no intersection with the primary keys in the cleaning database. Only raw data in the preset format is acquired, and only the raw data corresponding to the newest batch identifier is standardized, which simplifies the data processing procedure and thereby increases data processing speed.

On the basis of the embodiments above, to update the data in the application platform associated with the server in a timely manner, the method further includes, after step S210:

S211: Obtain the most recent push time of the cleaned data.

The most recent push time can be understood as the time at which cleaned data in the cleaning database was most recently pushed to the application platform associated with the server. In this embodiment, a data query statement may be used directly to query the most recent push time. Specifically, each time cleaned data is sent to the associated application platform, that push time is recorded and stored in a preset temporary time table for later retrieval. When the most recent push time is needed, it can be obtained directly by an SQL query statement in a database such as Oracle.

S212: Push the cleaned data whose time is later than the most recent push time and earlier than the current system time to the associated application platform.

In this embodiment, when the application platform associated with the server is open and in use, the cleaned data updated in the cleaning database should be sent to that application platform promptly. To do so, the most recent push time of the cleaned data is obtained and the current system time is read, all cleaned data later than the most recent push time and earlier than the current system time is collected, and all of that cleaned data is pushed to the associated application platform by a data communication mode. The data communication mode may be a wireless network, a wired network, or the like, and is not limited here. The application platform may be an application program installed on a client, and the client may be a device such as a desktop computer, a laptop, or a smartphone.
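The time window used in S211 and S212 can be sketched as a simple filter. The per-record `updated_at` timestamp, the function name `select_for_push`, and the sample times are assumptions; this section does not say which timestamp on the cleaned data is compared.

```python
from datetime import datetime


def select_for_push(rows: list, last_push: datetime, now: datetime) -> list:
    """Keep rows strictly later than last_push and strictly earlier than now."""
    return [r for r in rows if last_push < r["updated_at"] < now]


last_push = datetime(2018, 11, 20, 10, 0, 0)
now = datetime(2018, 11, 20, 10, 15, 0)
rows = [
    {"pk": "pk1", "updated_at": datetime(2018, 11, 20, 9, 59, 0)},   # already pushed
    {"pk": "pk4", "updated_at": datetime(2018, 11, 20, 10, 5, 0)},   # inside window
    {"pk": "pk5", "updated_at": datetime(2018, 11, 20, 10, 14, 59)}, # inside window
    {"pk": "pk6", "updated_at": datetime(2018, 11, 20, 10, 15, 0)},  # not earlier than now
]
to_push = [r["pk"] for r in select_for_push(rows, last_push, now)]
```

After a successful push, the value of `now` would become the next run's most recent push time, so that consecutive windows do not overlap.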

Fig. 7 is a structural block diagram of a data processing system according to Embodiment 2 of the present invention. As shown in Fig. 7, the data processing system includes a server 310 and an application platform 320. The server 310 is used to obtain raw data and process it to obtain cleaned data; the application platform 320 may be a desktop computer, a smartphone, or a laptop. The server 310 pushes the cleaned data to the application platform 320, and the application platform 320 updates the data in its own database according to the cleaned data.

Of course, success and failure checks may be added at the key stages of data processing. For example, the span from writing the raw data to the raw database through pushing the cleaned data to the associated application platform may be treated as the key stage of the data processing procedure. When an error is detected during data processing, the error information is sent directly to the relevant developers through the mail component in Kettle, so that the developers can monitor the data processing procedure in real time.

Embodiment three

Fig. 8 is a flow chart of a data processing method according to Embodiment 3 of the present invention. On the basis of the embodiments above, this embodiment illustrates the data processing procedure using the components in Kettle. With reference to Fig. 8, the data processing method specifically includes the following steps:

S410: Set the start time.

The components of Kettle include task modules. Each task module may contain multiple processes, and the processes may run in parallel. Each process may in turn contain multiple components that run serially. Fig. 9 is a schematic diagram of the component connections for data processing according to Embodiment 3 of the present invention. In this embodiment, the data processing procedure can be regarded as one process that contains the multiple components shown in Fig. 9, each of which performs a different data processing step. As shown in Fig. 9, component 510 is a start component used to set the time policy for starting a task, for example a fixed schedule or a time interval. A fixed schedule means that a task is started at a set time; a time interval means that a task is started once every period of time. One task can be understood as one data processing procedure.

S420: Count the number of raw data files and determine whether it is 0; if it is 0, execute step S470; if it is not 0, execute step S430.

In this embodiment, after a click that triggers the start component is received, the raw data files are obtained and their format is checked; if a raw data file is a JSON file, the number of raw data files is counted. If the number of raw data files is not 0, step S430 is executed; if the number is 0, step S470 is executed. Step S420 may be implemented by component 520 shown in Fig. 9.

S430: Write the raw data to the raw database.

In this embodiment, the raw data file is parsed to obtain raw data in JSON data format. The field information and data content of the raw data are then obtained, the raw data is converted into a field identifier, and a primary key is extracted from the field information. It is then determined whether the field identifier corresponding to the raw data has an intersection with the field identifiers in the raw database. If there is no intersection, the raw data does not yet exist in the raw database, and the raw data is written to the raw database. This step may be implemented through component 530 shown in Fig. 9, which is the raw database.
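The intersection check on field identifiers can be sketched as below, with the raw database modeled as a dictionary keyed by field identifier. The document states elsewhere that the field identifier may be a hash value of the raw data; the choice of SHA-256 over a canonical JSON serialization, and the function names, are assumptions made for this sketch.

```python
import hashlib
import json


def field_id(record: dict) -> str:
    """Hash a record into a field identifier (assumed: SHA-256 of sorted JSON)."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_new_raw(records: list, raw_db: dict) -> int:
    """Write only records whose field identifier is not already in raw_db."""
    written = 0
    for rec in records:
        fid = field_id(rec)
        if fid in raw_db:
            # Same identifier already stored: the raw data is a duplicate.
            continue
        raw_db[fid] = rec
        written += 1
    return written


raw_db = {}
first = write_new_raw([{"a": 1}, {"a": 2}], raw_db)
second = write_new_raw([{"a": 1}, {"a": 3}], raw_db)  # {"a": 1} is skipped
```

Sorting the keys before hashing makes the identifier independent of field order, so two records with the same content map to the same identifier.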

It should be noted that, for the process of writing the raw data in a raw data file to the raw database, reference may be made to the prior-art process of inserting a JSON file into a database through the components in Kettle. Specifically, the process may include: JSON parsing, getting variables, field selection, replacing NULL values, adding a checksum column, and getting system information. JSON parsing parses the raw data file to obtain raw data in JSON data format. The variable values set in the variable component are then obtained, and the required fields are renamed and screened through field selection. Null values in the raw data are replaced with the string NULL, so that a combined primary key can be set. The fields used for the hash value are then determined, and system information such as the system time is obtained.
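The sequence of steps listed above can be approximated outside Kettle by the following sketch. It does not use any Kettle API; the field mapping `FIELD_MAP`, the output column names `checksum` and `load_time`, and the use of SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical field selection: source field name -> renamed target field.
FIELD_MAP = {"id": "person_id", "name": "full_name"}


def prepare(raw_json: str, now=None) -> list:
    """Mimic: JSON parsing, field selection, NULL replacement, checksum, system info."""
    rows = json.loads(raw_json)                      # JSON parsing
    load_time = (now or datetime.now(timezone.utc)).isoformat()
    out = []
    for row in rows:
        # Field selection: rename and keep only the required fields.
        selected = {new: row.get(old) for old, new in FIELD_MAP.items()}
        # Replace null values with the string "NULL" so keys can be combined.
        selected = {k: "NULL" if v is None else str(v) for k, v in selected.items()}
        # Checksum column computed over the selected fields in a fixed order.
        payload = "|".join(selected[k] for k in sorted(selected))
        selected["checksum"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # System information: here only the load time.
        selected["load_time"] = load_time
        out.append(selected)
    return out


prepared = prepare(
    '[{"id": 1, "name": null, "junk": "x"}]',
    now=datetime(2018, 11, 20, tzinfo=timezone.utc),
)
```

Replacing nulls before computing the checksum matters because a missing value would otherwise make the combined key ambiguous or undefined.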

S440: Write the cleaned data to the cleaning database.

In this embodiment, after the raw data is written to the raw database, the batch identifier corresponding to the raw data is queried, and a series of standardization preprocessing steps such as field filtering, format conversion, and data verification is performed on the raw data corresponding to the newest batch identifier to obtain cleaned data. The primary key of the cleaned data is then compared with the primary keys in the cleaning database; if the primary key of the cleaned data does not exist among the primary keys in the cleaning database, the cleaned data is written to the cleaning database. Component 540 shown in Fig. 9 is the cleaning database. That is, this step is carried out in the course of obtaining the raw data corresponding to the newest batch identifier from component 530 and writing the result to component 540.

S450: Count the amount of newly added or updated cleaned data in the cleaning database, and determine whether it is 0.

In this embodiment, the data in the cleaning database is queried to determine the amount of newly added or updated cleaned data in the cleaning database. If the amount of newly added or updated cleaned data is 0, step S470 is executed; if it is not 0, step S460 is executed. This step is implemented by component 550 shown in Fig. 9.

S460: Push the newly added or updated cleaned data to the associated application platform.

In this embodiment, after cleaned data in the cleaning database has been updated or added, and when the application platform associated with the server is open, the newly added or updated cleaned data is automatically pushed to the associated application platform to update the data in the application platform. This step is implemented by component 560 shown in Fig. 9.

S470: Exit the workflow.

In this embodiment, if the number of raw data files is 0, no new raw data has been obtained, and the workflow is exited directly. Likewise, if the amount of newly added or updated cleaned data in the cleaning database is 0, there is no newly added or updated cleaned data in the cleaning database, and the workflow is exited directly. This step is implemented by component 570 shown in Fig. 9.

S480: Send an error message to the relevant developers.

In this embodiment, to ensure that developers can promptly identify where an error occurred in the data processing procedure, success and failure checks are added at the key stages of the procedure, for example from step S420 to step S460. When an error occurs during data processing, an error message is sent to the relevant developers.
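The control flow of steps S420 to S480 can be summarized in a sketch in which each Kettle component is replaced by a plain callable. The function `run_once`, its parameter names, and the returned status strings are hypothetical; scheduling (S410) is left to whatever starts the function.

```python
from typing import Callable


def run_once(
    list_raw_files: Callable[[], list],
    write_raw: Callable[[list], None],
    clean_and_write: Callable[[], list],
    push: Callable[[list], None],
    notify: Callable[[str], None],
) -> str:
    """Run one task: S420 to S460, exiting early (S470) or reporting errors (S480)."""
    try:
        files = list_raw_files()              # S420: count raw data files
        if len(files) == 0:
            return "exit: no raw files"       # S470
        write_raw(files)                      # S430: write to the raw database
        changed = clean_and_write()           # S440: clean and write
        if len(changed) == 0:                 # S450: anything added or updated?
            return "exit: nothing to push"    # S470
        push(changed)                         # S460: push to the application platform
        return "pushed"
    except Exception as exc:
        # S480: any failure in S420 to S460 is reported to the developers.
        notify(f"data processing failed: {exc}")
        return "error"


messages = []
pushed = []
status = run_once(
    list_raw_files=lambda: ["a.json"],
    write_raw=lambda files: None,
    clean_and_write=lambda: ["pk1"],
    push=pushed.extend,
    notify=messages.append,
)
```

Wrapping the whole span in one error handler mirrors the description's choice to treat S420 through S460 as the monitored key stage.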

With the technical solution of this embodiment, the primary keys corresponding to all data in the raw database do not have to be compared with the primary keys in the cleaning database during data processing, which improves data processing efficiency.

Embodiment four

Fig. 10 is a structural block diagram of a data processing device according to Embodiment 4 of the present invention. The data processing device of this embodiment is configured in a server. With reference to Fig. 10, the data processing device includes: a first acquisition module 610, a conversion and extraction module 620, a first writing module 630, a preprocessing module 640, and a second writing module 650.

The first acquisition module 610 is configured to obtain two or more pieces of raw data, the raw data including field information and data content;

The conversion and extraction module 620 is configured to convert the raw data into a field identifier and extract a primary key from the field information;

The first writing module 630 is configured to write the raw data to the raw database if the field identifier corresponding to the raw data has no intersection with the field identifiers in the raw database;

The preprocessing module 640 is configured to perform standardization preprocessing on the raw data written to the raw database to obtain cleaned data;

The second writing module 650 is configured to write the cleaned data to the cleaning database if the primary key of the cleaned data has no intersection with the primary keys in the cleaning database.

The technical solution provided in this embodiment obtains two or more pieces of raw data including field information and data content; converts the raw data into a field identifier and extracts a primary key from the field information; writes the raw data to the raw database if the field identifier corresponding to the raw data has no intersection with the field identifiers in the raw database; performs standardization preprocessing on the raw data written to the raw database to obtain cleaned data; and writes the cleaned data to the cleaning database if the primary key of the cleaned data has no intersection with the primary keys in the cleaning database. During data processing, the primary keys corresponding to all data in the raw database do not have to be compared with the primary keys in the cleaning database, which improves data processing efficiency.

On the basis of the embodiments above, the data processing device further includes:

A format judgment module, configured to obtain a raw data file before the two or more pieces of raw data are obtained, and to perform a format check on the raw data file;

A parsing module, configured to parse the raw data file if it is a JSON file, to obtain raw data in JSON data format.

On the basis of the embodiments above, the data processing device further includes:

A second acquisition module, configured to obtain, before the raw data written to the raw database is standardized, the write time at which the raw data is written to the raw database or the creation time of the raw data file;

A determining module, configured to use the write time or the creation time as the batch identifier of the raw data.

On the basis of the embodiments above, the preprocessing module 640 includes:

A query unit, configured to query the batch identifier corresponding to the raw data in the raw database;

A third acquisition module, configured to obtain the most recent push time of the cleaned data after the cleaned data is written to the cleaning database;

A pushing module, configured to push the cleaned data that is later than the most recent push time and earlier than the current system time to the associated application platform.

On the basis of the embodiments above, when the raw data is converted into a field identifier, the conversion and extraction module is specifically configured to:

Convert the raw data into a corresponding hash value.

The data processing device described above can execute the data processing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method it executes.

Embodiment five

Fig. 11 is a structural schematic diagram of a terminal device according to Embodiment 5 of the present invention. With reference to Fig. 11, the terminal device includes: a processor 710, a memory 720, an input device 730, and an output device 740. The terminal device may have one or more processors 710; Fig. 11 takes one processor 710 as an example. The terminal device may have one or more memories 720; Fig. 11 takes one memory 720 as an example. The processor 710, memory 720, input device 730, and output device 740 of the terminal device may be connected by a bus or in other ways; Fig. 11 takes connection by a bus as an example. In this embodiment, the terminal device may be a server.

The memory 720, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the device described in any embodiment of the present invention (for example, the first acquisition module 610, the conversion and extraction module 620, the first writing module 630, the preprocessing module 640, and the second writing module 650 in the data processing device). The memory 720 may mainly include a program storage area and a data storage area; the program storage area can store the operating system and the application programs required for at least one function, and the data storage area can store data created through use of the device. In addition, the memory 720 may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 720 may further include memory located remotely from the processor 710, and such remote memory may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The input device 730 can be used to receive input numeric or character information and to generate key signal input related to the user settings and function control of the device. The output device 740 may include audio equipment such as a loudspeaker. It should be noted that the specific composition of the input device 730 and the output device 740 may be set according to actual conditions.

By running the software programs, instructions, and modules stored in the memory 720, the processor 710 executes the various functional applications and data processing of the device, that is, implements the data processing method described above.

The terminal device provided above can be used to execute the data processing method provided by any of the embodiments above, and has the corresponding functions and beneficial effects.

Embodiment six

Embodiment 6 of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a data processing method, the method comprising:

Obtaining two or more pieces of raw data, the raw data including field information and data content;

Converting the raw data into a field identifier, and extracting a primary key from the field information;

If the field identifier corresponding to the raw data has no intersection with the field identifiers in the raw database, writing the raw data to the raw database;

Performing standardization preprocessing on the raw data written to the raw database to obtain cleaned data;

If the primary key of the cleaned data has no intersection with the primary keys in the cleaning database, writing the cleaned data to the cleaning database.

Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the data processing method operations described above; they can also perform the related operations in the data processing method provided by any embodiment of the present invention, with the corresponding functions and beneficial effects.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and can of course also be implemented by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk, or optical disc, and includes instructions that cause a computer device (which may be a robot, a personal computer, a server, a network device, or the like) to execute the data processing method described in any embodiment of the present invention.

It is worth noting that the units and modules included in the data processing device above are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.

It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the embodiments above, multiple steps or methods can be implemented with software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Note that the above describes only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the embodiments above, it is not limited to those embodiments; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims

1. A data processing method, characterized by comprising:

If the field identifier corresponding to the raw data has no intersection with the field identifiers in a raw database, writing the raw data to the raw database;

If the primary key of the cleaned data has no intersection with the primary keys in a cleaning database, writing the cleaned data to the cleaning database.

2. The data processing method according to claim 1, characterized in that, before the obtaining of two or more pieces of raw data, the method further comprises:

If the raw data file is a JSON file, parsing the raw data file to obtain raw data in JSON data format.

3. The data processing method according to claim 2, characterized in that, before the performing of standardization preprocessing on the raw data written to the raw database, the method further comprises:

4. The data processing method according to claim 3, characterized in that the performing of standardization preprocessing on the raw data written to the raw database comprises:

5. The data processing method according to claim 1, characterized in that, after the writing of the cleaned data to the cleaning database, the method further comprises:

Obtaining the most recent push time of the cleaned data;

Pushing the cleaned data that is later than the most recent push time and earlier than the current system time to an associated application platform.

6. The data processing method according to claim 1, characterized in that the converting of the raw data into a field identifier is specifically:

Converting the raw data into a corresponding hash value.

7. A data processing device, characterized by comprising:

A first acquisition module, configured to obtain two or more pieces of raw data, the raw data including field information and data content;

A conversion and extraction module, configured to convert the raw data into a field identifier and extract a primary key from the field information;

A first writing module, configured to write the raw data to a raw database if the field identifier corresponding to the raw data has no intersection with the field identifiers in the raw database;

A preprocessing module, configured to perform standardization preprocessing on the raw data written to the raw database to obtain cleaned data;

A second writing module, configured to write the cleaned data to a cleaning database if the primary key of the cleaned data has no intersection with the primary keys in the cleaning database.

8. The data processing device according to claim 7, characterized in that the device further comprises:

A format judgment module, configured to obtain a raw data file and perform a format check on the raw data file;

A parsing module, configured to parse the raw data file if it is a JSON file, to obtain raw data in JSON data format.

9. A terminal device, characterized by comprising: a memory and one or more processors;

The memory being configured to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method according to any one of claims 1 to 6.

10. A storage medium containing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a computer processor, are used to execute the data processing method according to any one of claims 1 to 6.