CN116126872B - Correlation method, device and computer readable medium for real-time dimension table
- Publication number
- CN116126872B (application number CN202310410695.XA)
- Authority
- CN
- China
- Prior art keywords
- real
- data
- time
- hbase
- cache database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Signal Processing For Digital Recording And Reproducing (AREA)
Abstract
The embodiment of the invention discloses an association method, a device and a computer readable medium for a real-time dimension table, wherein the method is applied to a first device. One embodiment at least comprises: first, reading a real-time data stream of a target service, and processing the real-time data stream to obtain a plurality of processing windows; second, for any ID number in a processing window: acquiring the real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if it exists, reading the association value corresponding to the ID number from the cache database; if not, reading the association value corresponding to the ID number from the HBase backup library. By setting up the cache database, the time needed to associate a real-time data stream at the millions-per-second level with the dimension table can be reduced from the minute level to the second level, which reduces data delay and improves the timeliness of associating the real-time data stream with the dimension table.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method and a device for associating real-time dimension tables and a computer readable medium.
Background
When building their own real-time data warehouses and real-time metrics, enterprises need to integrate multidimensional metrics, and the data behind those metrics is often spread across multiple tables of a relational business database or across message queues. Full-volume real-time scenarios can often only be used for statistics and cannot provide real-time services. When a stream is associated with another stream, the traffic of the real-time data stream being associated with the real-time dimension table is too large, which delays the association of real-time dimension table data, so the timeliness required by the business cannot be met.
For example: in practical applications, the data of the real-time dimension table is generally placed in the HBase main library, and real-time data is generally associated with the real-time dimension table by querying the HBase library directly. If a real-time data stream at the millions-per-second level is associated directly against the HBase main library, performing the association for 1 min of the stream usually takes hours; this causes severe data association delay and undermines the timeliness required by the business.
Therefore, when real-time dimension table association is performed on a real-time data stream with a relatively large flow, an effective and rapid association method is needed to solve the problem of data association delay in the prior art.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the invention provides a method and a device for associating a real-time dimension table and a computer readable medium, which can rapidly and accurately associate millions of real-time data streams to the real-time dimension table, and improve timeliness of associating the real-time data streams with the dimension table.
According to a first aspect of an embodiment of the present invention, there is provided an association method for a real-time dimension table, which is applied to a first device; the method comprises the following steps: reading a real-time data stream of a target service; processing the real-time data stream to obtain a plurality of processing windows; wherein each processing window has a corresponding ID attribute; the processing window comprises a plurality of ID numbers with the same ID attribute; for any ID number in the processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if so, reading an association value corresponding to the ID number from the cache database; and if not, reading the association value corresponding to the ID number from the HBase backup library.
Optionally, processing the real-time data stream to obtain a plurality of processing windows comprises: for any real-time data in the real-time data stream: performing row-column conversion on the real-time data to obtain a plurality of ID numbers; each ID number has a corresponding ID attribute; and based on a preset time window, carrying out window aggregation on all ID numbers in the real-time data stream according to the ID attribute to generate a plurality of processing windows.
Optionally, determining the cache database and the HBase backup library corresponding to the real-time dimension table includes: reading data from the real-time dimension table, and filtering the read data based on preset conditions to obtain target data; writing the target data into a locally corresponding cache area to generate a first trigger instruction; generating a cache database corresponding to the real-time dimension table based on the first trigger instruction; and updating the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table.
Optionally, updating the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table includes: based on an HBase main library, monitoring the cache database; if the monitoring result indicates that the cache database has data different from the HBase main library, writing the updated data in the cache database into the HBase main library; and synchronously updating the HBase backup library based on the data updating result of the HBase main library to obtain the HBase backup library corresponding to the real-time dimension table.
Optionally, the method further comprises: based on the first trigger instruction, monitoring updated data in the cache database; and if the monitoring result indicates that the storage time of the updated data in the cache database is longer than the preset time, clearing the updated data from the cache database.
Optionally, the method further comprises: generating a second trigger instruction based on the clearing operation of the updated data in the cache database; and reading data from the current real-time dimension table based on the second trigger instruction, and updating the data of the cache database based on a reading result.
Optionally, the cache database includes update data stored in a preset time and a common data table.
Optionally, the method further comprises: and writing the association value into a distributed database.
According to a second aspect of the embodiment of the present invention, there is also provided an association apparatus for a real-time dimension table; the device comprises: a reading module, used for reading the real-time data stream of the target service; a processing module, used for processing the real-time data stream to obtain a plurality of processing windows; wherein each processing window has a corresponding ID attribute; the processing window comprises a plurality of ID numbers with the same ID attribute; and an association module, used for, for any ID number in the processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if so, reading an association value corresponding to the ID number from the cache database; and if not, reading the association value corresponding to the ID number from the HBase backup library.
According to a third aspect of embodiments of the present invention, there is also provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect.
The embodiment of the invention provides an association method for a real-time dimension table, which is applied to a first device; the method comprises the following steps: first, reading a real-time data stream of a target service; processing the real-time data stream to obtain a plurality of processing windows; wherein each processing window has a corresponding ID attribute; the processing window comprises a plurality of ID numbers with the same ID attribute; second, for any ID number in the processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if so, reading an association value corresponding to the ID number from the cache database; and if not, reading the association value corresponding to the ID number from the HBase backup library. By setting up the cache database, the time needed to associate a real-time data stream at the millions-per-second level with the dimension table can be reduced from the minute level to the second level, which reduces data delay and improves the timeliness of associating such a real-time data stream with the dimension table.
Drawings
Some specific embodiments of the invention will be described in detail hereinafter by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts or portions. It will be appreciated by those skilled in the art that the drawings are not necessarily drawn to scale. In the accompanying drawings:
FIG. 1 is a flow chart of a method for associating real-time dimension tables according to an embodiment of the present invention;
FIG. 2 is a flow chart of processing a real-time data stream according to an embodiment of the invention;
FIG. 3 is a flow chart of determining a cache database and an HBase backup library corresponding to a real-time dimension table according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for associating real-time dimension tables according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an association device for real-time dimension tables according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions according to the embodiments of the present invention will be clearly described in the following with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic flow chart of a method for associating real-time dimension tables according to an embodiment of the invention.
An association method for a real-time dimension table is applied to a first device; the method at least comprises the following steps:
S101, reading a real-time data stream of a target service;
S102, processing the real-time data stream to obtain a plurality of processing windows; wherein each processing window has a corresponding ID attribute; the processing window comprises a plurality of ID numbers with the same ID attribute;
S103, for any ID number in a processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if the ID number exists, reading an association value corresponding to the ID number from the cache database; if not, reading the association value corresponding to the ID number from the HBase backup library.
In S101 to S102, first, a real-time data stream of a target service is read through Flink (Apache Flink, an open source stream processing framework); wherein the real-time data stream comprises several pieces of real-time data; here, "several" indicates a data volume in the millions or above. Then, for any real-time data in the real-time data stream: semantic analysis is carried out on the real-time data, and a plurality of ID numbers corresponding to the user request are extracted based on the semantic analysis result; the ID attribute corresponding to each ID number is obtained; thus, each piece of real-time data includes several ID attributes. Finally, the real-time data stream is divided into a plurality of processing windows according to a preset time window based on the ID attribute.
For example: for a real-time data stream at the millions level, semantic analysis is carried out on the real-time data stream to obtain two ID numbers, namely a name ID number and an order ID number; the ID attributes corresponding to the two ID numbers are acquired respectively, wherein the ID attribute corresponding to the name ID number is the user ID, and the ID attribute corresponding to the order ID number is the content ID; finally, based on the user ID and the content ID, the ID numbers in the real-time data stream are window-aggregated according to the ID attribute, and the aggregated ID numbers are divided into processing windows according to a time window of 1 min. Because each ID attribute has a corresponding processing window, dividing the real-time data stream by time window yields two processing windows with different ID attributes.
In S103, the data read from the real-time dimension table by Flink is stored in either the cache database or the HBase main library; wherein the HBase backup library is a backup of the HBase main library.
It should be noted that the data stored in the cache database is updated to the HBase main library based on a trigger at a preset time.
For example: a first real-time dimension table corresponding to the user ID and a second real-time dimension table corresponding to the content ID are acquired; the first real-time dimension table stores the correspondence between user IDs and identity card numbers, and the second real-time dimension table stores the correspondence between content IDs and purchase lists. The data that Flink reads from the first real-time dimension table and the data it reads from the second real-time dimension table are stored in different areas of the cache, yielding a first cache database and a second cache database. The HBase main library is updated based on the update data of the first cache database and, at the same time, based on the update data of the second cache database; thus the update data of both cache databases is finally stored in the HBase main library.
For any name ID number in the first processing window: Flink first judges whether the name ID number exists in the first cache database; if so, the identity card number corresponding to the name ID number is read from the first cache database; if not, this indicates that the real-time dimension table data stored in the first cache database has already been updated to the HBase main library, and the identity card number corresponding to the name ID number is read from the HBase backup library.
For any order ID number in the second processing window: Flink first judges whether the order ID number exists in the second cache database; if so, the purchase list corresponding to the order ID number is read from the second cache database; if not, this indicates that the real-time dimension table data stored in the second cache database has already been updated to the HBase main library, and the purchase list corresponding to the order ID number is read from the HBase backup library.
Here, the real-time data stream refers to a data stream at the millions-per-second level that is updated in real time. The way the real-time data stream is processed is not limited: it may be processed according to the method described above, or processed using Flink.
In the embodiment, when the real-time data stream is associated, the association is first attempted against the cache database; if the ID number does not exist in the cache database, the association is made against the HBase backup library. By placing a cache database in front of the HBase main library, the association load of the real-time data stream is shared and the data association speed is improved; the time needed to associate a real-time data stream at the millions-per-second level with the dimension table can therefore be reduced from the minute level to the second level, which reduces data delay and improves the timeliness of associating the real-time data stream with the dimension table.
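The cache-first lookup described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: plain Python dictionaries stand in for the cache database and the HBase backup library, and the function name `lookup_association` and the sample keys and values are hypothetical.

```python
from typing import Dict, Optional


def lookup_association(id_number: str,
                       cache_db: Dict[str, str],
                       hbase_backup: Dict[str, str]) -> Optional[str]:
    """Return the association value for id_number, trying the cache first."""
    if id_number in cache_db:
        # Cache hit: the value is served without touching HBase.
        return cache_db[id_number]
    # Cache miss: fall back to the read-only backup library.
    return hbase_backup.get(id_number)


# Hypothetical sample data: dictionaries stand in for the real stores.
cache_db = {"u1": "id-card-001"}
hbase_backup = {"u1": "id-card-001", "u2": "id-card-002"}

results = {i: lookup_association(i, cache_db, hbase_backup)
           for i in ("u1", "u2", "u3")}
```

In this sketch an ID number missing from both stores simply yields `None`; the source does not say how such a case is handled.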
Fig. 2 is a schematic flow chart of processing a real-time data stream according to an embodiment of the invention.
In a preferred embodiment, the real-time data stream is processed to obtain a plurality of processing windows; at least comprises the following steps:
S201, for any real-time data in the real-time data stream: performing row-column conversion on the real-time data to obtain a plurality of ID numbers; each ID number has a corresponding ID attribute;
S202, based on a preset time window, carrying out window aggregation on all ID numbers in the real-time data stream according to the ID attribute to generate a plurality of processing windows.
Specifically, row-column conversion is performed on the real-time data using Flink to obtain a plurality of ID numbers; for any one of the several ID attributes: window aggregation is carried out, according to a preset time window, on all ID numbers in the real-time data stream that correspond to the ID attribute, generating a processing window; the number of processing windows generated follows from the number of ID attributes.
For example: row-column conversion is performed on the real-time data using Flink to obtain a name ID number and an order ID number; wherein the "name ID number" corresponds to the user ID and the "order ID number" corresponds to the content ID; according to a preset time window, all ID numbers in the real-time data stream are window-aggregated by ID attribute, generating a first processing window corresponding to the user ID and a second processing window corresponding to the content ID; the first processing window contains all ID numbers corresponding to the user ID, and the second processing window contains all ID numbers corresponding to the content ID.
In the embodiment, row-column conversion is first performed on the read real-time data through Flink to obtain a plurality of ID numbers; then, based on a preset time window, window aggregation is carried out on all ID numbers in the real-time data stream according to the ID attribute, generating a processing window for each ID attribute; all ID numbers in the real-time data stream are thereby divided into processing windows with different ID attributes according to the ID attribute and the time window. The ID numbers can therefore be associated with the corresponding real-time dimension table based on the ID attribute, which improves the timeliness of associating the real-time data stream with the dimension table.
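As a rough illustration of S201 and S202, the sketch below turns each wide record into one row per ID number (the row-column conversion) and then groups the ID numbers by ID attribute and by a fixed 1 min window. It uses plain Python rather than Flink; the field names `ts`, `name_id` and `order_id`, the mapping `ID_ATTRIBUTE_OF_FIELD` and the sample records are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60  # the 1 min time window used in the example above

# Hypothetical mapping from a record field to its ID attribute.
ID_ATTRIBUTE_OF_FIELD = {"name_id": "user ID", "order_id": "content ID"}


def explode(record: dict) -> List[Tuple[int, str, str]]:
    """Row-column conversion: one (timestamp, ID attribute, ID number) row per ID field."""
    return [(record["ts"], attr, record[field])
            for field, attr in ID_ATTRIBUTE_OF_FIELD.items()
            if field in record]


def build_windows(records: List[dict]) -> Dict[Tuple[str, int], List[str]]:
    """Group ID numbers by (ID attribute, window start time in seconds)."""
    windows = defaultdict(list)
    for record in records:
        for ts, attr, id_number in explode(record):
            window_start = ts - ts % WINDOW_SECONDS
            windows[(attr, window_start)].append(id_number)
    return dict(windows)


# Hypothetical sample stream: timestamps are seconds since an arbitrary origin.
records = [
    {"ts": 5, "name_id": "n1", "order_id": "o1"},
    {"ts": 42, "name_id": "n2", "order_id": "o2"},
    {"ts": 61, "name_id": "n3", "order_id": "o3"},
]
windows = build_windows(records)
```

Each key of `windows` identifies one processing window, so every ID number in a window shares the same ID attribute and can be looked up against the same real-time dimension table.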
Fig. 3 is a schematic flow chart of determining a cache database and an HBase backup library corresponding to a real-time dimension table according to an embodiment of the present invention.
In a preferred embodiment, determining the cache database and the HBase backup library corresponding to the real-time dimension table at least includes the following steps:
S301, reading data from the real-time dimension table, and filtering the read data based on preset conditions to obtain target data;
S302, writing the target data into a locally corresponding cache area to generate a first trigger instruction;
S303, generating a cache database corresponding to the real-time dimension table based on the first trigger instruction;
S304, updating the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table.
Specifically, updating the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table includes: based on the HBase main library, monitoring the cache database; if the monitoring result indicates that the cache database has data different from the HBase main library, writing the updated data in the cache database into the HBase main library; and, based on the data updating result of the HBase main library, synchronously updating the HBase backup library to obtain the HBase backup library corresponding to the real-time dimension table.
For example: first, Flink reads data from the real-time dimension tables and filters the read data according to preset conditions, so that non-compliant data is deleted and first target data and second target data are obtained; second, Flink stores the first target data and the second target data in a local first cache area and a local second cache area, respectively, so that a first cache database corresponding to the first target data and a second cache database corresponding to the second target data are obtained; then, based on the HBase main library, the HBase Proxy monitors the update data of the first cache database; if the monitoring result indicates that the first cache database has data different from the HBase main library, the updated data in the first cache database is written into the HBase main library; if the monitoring result indicates that the first cache database does not have data different from the HBase main library, this indicates that the updated data in the first cache database has already been written into the HBase main library. At the same time, the HBase Proxy monitors the update data of the second cache database; if the monitoring result indicates that the second cache database has data different from the HBase main library, the updated data in the second cache database is written into the HBase main library; if the monitoring result indicates that the second cache database does not have data different from the HBase main library, this indicates that the updated data in the second cache database has already been written into the HBase main library; finally, the data consistency of the HBase main library and the HBase backup library is ensured through the BinLog log.
According to the embodiment, the data consistency of the cache database and the HBase main library is monitored through the HBase Proxy, so that updated data in the cache database can be written to the HBase main library in time, which further ensures the data consistency of the cache database and the HBase main library. By setting up the HBase backup library, the embodiment ensures the safety of real-time dimension table data storage and keeps the data of the HBase main library and the HBase backup library consistent, which facilitates the association of real-time data streams.
In addition, in the embodiment, update data is written into the HBase main library, and data is read from the HBase standby library when data association is performed; therefore, the two processes of reading data and writing data are separated, the pressure of processing the data by the server is reduced, and the efficiency of associating the real-time data flow with the dimension table is improved.
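The write path and the read/write separation can be illustrated with a small sketch. It is an assumption-laden simplification: dictionaries stand in for the cache database, the HBase main library and the HBase backup library, the list returned by `sync_cache_to_main` stands in for the log used to keep main and backup consistent, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def sync_cache_to_main(cache_db: Dict[str, str],
                       main_db: Dict[str, str]) -> List[Tuple[str, str]]:
    """Write cache entries that differ from the main library; return the change log."""
    change_log = []
    for key, value in cache_db.items():
        if main_db.get(key) != value:
            main_db[key] = value          # writes go to the main library only
            change_log.append((key, value))
    return change_log


def replay_on_backup(change_log: List[Tuple[str, str]],
                     backup_db: Dict[str, str]) -> None:
    """Apply the main library's changes to the backup, which serves the reads."""
    for key, value in change_log:
        backup_db[key] = value


# Hypothetical sample data.
cache_db = {"u1": "card-A2", "u3": "card-C"}
main_db = {"u1": "card-A", "u2": "card-B"}
backup_db = dict(main_db)

log = sync_cache_to_main(cache_db, main_db)
replay_on_backup(log, backup_db)
second_pass = sync_cache_to_main(cache_db, main_db)  # nothing left to write
```

After the first pass the cache holds no data that differs from the main library, so the second pass produces an empty change log; this corresponds to the case where the monitoring result shows the updated data has already been written.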
In a preferred embodiment, the method further comprises: based on the first trigger instruction, monitoring updated data in the cache database; and if the monitoring result indicates that the storage time of the updated data in the cache database is longer than the preset time, clearing the updated data from the cache database.
For example: when the Flink writes the target data into the locally corresponding cache area, the Flink starts to record the storage time of the target data in the cache area; and when the storage time is longer than the preset time, clearing the update data in the cache database.
Here, the preset time is determined according to an actual service scenario; for example: the preset time is 24 hours.
Therefore, in the embodiment, the storage time of the update data in the cache database is set, and the update data in the cache database is deleted when the storage time of the update data is longer than the preset time, so that the outdated update data is prevented from wasting the cache area, the utilization rate of the cache database is improved, and the timeliness of the real-time data stream association dimension table is improved.
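A minimal sketch of the time-based clearing follows, assuming each cache entry records the time at which it was written. The class name `ExpiringCache`, its methods and the sample entries are hypothetical; timestamps are passed in explicitly so the behaviour is deterministic.

```python
from typing import Dict, List, Optional, Tuple

PRESET_SECONDS = 24 * 60 * 60  # the 24 h preset time from the example


class ExpiringCache:
    """Cache whose entries are cleared once stored longer than ttl_seconds."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, key: str, value: str, now: float) -> None:
        # Record the write time so the storage time can be measured later.
        self._entries[key] = (value, now)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        return entry[0] if entry else None

    def clear_expired(self, now: float) -> List[str]:
        """Remove entries stored longer than the preset time; return their keys."""
        expired = [k for k, (_, written_at) in self._entries.items()
                   if now - written_at > self.ttl]
        for k in expired:
            del self._entries[k]
        return expired


cache = ExpiringCache(PRESET_SECONDS)
cache.put("u1", "card-A", now=0.0)
cache.put("u2", "card-B", now=3600.0)
removed = cache.clear_expired(now=PRESET_SECONDS + 1.0)
```

Only `u1` has been stored for more than 24 h at the time of the check, so it is the only entry cleared.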
In a preferred embodiment, the method further comprises: generating a second trigger instruction based on the clearing operation of the updated data in the cache database; and reading data from the current real-time dimension table based on the second trigger instruction, and updating the data of the cache database based on a reading result.
Specifically, when the first device receives an instruction indicating that update data in the cache database has been cleared, the first device controls Flink to read data from the current real-time dimension table, filter the read data, and write the filtered data into the cache database. For example: if the update data stored in the cache database is cleared once every 24 h, Flink performs a read operation on the real-time dimension table once every 24 h. The cache database can therefore refresh its stored real-time dimension table data at the preset interval, which improves the timeliness of real-time data association.
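The clear-then-reload cycle can be sketched as below. This is an illustration under stated assumptions: `read_dimension_table` and `keep` are hypothetical callables standing in for the Flink read of the current real-time dimension table and for the preset filter conditions, and the filter shown (dropping empty values) is invented for the example.

```python
from typing import Callable, Dict


def refresh_cache(cache_db: Dict[str, str],
                  read_dimension_table: Callable[[], Dict[str, str]],
                  keep: Callable[[str, str], bool]) -> int:
    """Clear the cache, then reload filtered data from the current dimension table."""
    cache_db.clear()  # the clearing operation that acts as the second trigger
    for key, value in read_dimension_table().items():
        if keep(key, value):  # preset filter condition
            cache_db[key] = value
    return len(cache_db)


# Hypothetical sample data: the dimension table has changed since the last load.
dimension_table = {"u1": "card-A2", "u2": "", "u3": "card-C"}
cache_db = {"u1": "card-A"}

loaded = refresh_cache(cache_db,
                       read_dimension_table=lambda: dimension_table,
                       keep=lambda key, value: value != "")
```

After the refresh the cache reflects the current dimension table rather than the stale entry it held before the clear.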
In a preferred embodiment, the cache database includes update data stored for a preset time and a common data table.
Specifically, if the common data table contains data for which update data exists, the data in the common data table is updated based on the update data; if it does not, the common data table does not need to be updated. When data in the cache database is deleted, the update data stored for the preset time is deleted based on the preset time, whereas the data in the common data table is deleted based on the update operation of the common data table rather than the preset time. That is, deletion in the common data table is triggered by the corresponding update data, not by the preset time.
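One possible reading of the two deletion rules is sketched below; the source leaves the details open, so the class `TwoPartCache`, its fields and its methods are hypothetical. Update data carries a write time and expires after the preset time, while an entry in the common data table is only replaced when a matching update arrives.

```python
from typing import Dict, Optional, Tuple


class TwoPartCache:
    """Cache with a common data table plus update data that expires over time."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.common: Dict[str, str] = {}                 # no time-based expiry
        self.updates: Dict[str, Tuple[str, float]] = {}  # value and write time

    def apply_update(self, key: str, value: str, now: float) -> None:
        self.updates[key] = (value, now)
        if key in self.common:
            # The old common value is dropped because an update arrived,
            # not because any preset time elapsed.
            self.common[key] = value

    def expire(self, now: float) -> None:
        """Clear update data stored longer than the preset time."""
        self.updates = {k: (v, t) for k, (v, t) in self.updates.items()
                        if now - t <= self.ttl}

    def get(self, key: str) -> Optional[str]:
        if key in self.updates:
            return self.updates[key][0]
        return self.common.get(key)


cache = TwoPartCache(ttl=10)
cache.common = {"k1": "old"}
cache.apply_update("k1", "new", now=0)
cache.apply_update("k2", "x", now=0)
cache.expire(now=11)
```

Once the preset time has passed, both pieces of update data are gone, but the common data table still holds the refreshed value for `k1`.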
In a preferred embodiment, the method further comprises: and writing the association value into a distributed database.
Flink writes the association value into a distributed database. Here, the distributed database refers to a distributed database other than the one storing the real-time dimension tables.
The method described above for this embodiment will be described in detail with reference to specific applications.
Reading a real-time data stream of a target service; for any real-time data in the real-time data stream: performing row-column conversion on the real-time data to obtain a plurality of ID numbers; wherein each ID number has a corresponding ID attribute; and based on a preset time window, carrying out window aggregation on all ID numbers in the real-time data stream according to the ID attribute to generate a plurality of processing windows.
For any ID number in the processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if so, reading an association value corresponding to the ID number from the cache database; if not, reading the association value corresponding to the ID number from the HBase backup library; the cache database comprises update data stored for a preset time and a common data table; when the common data table includes update data, the common data table is updated based on the update of the corresponding update data.
And writing the association value into a distributed database.
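The per-ID lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: plain Python dictionaries and a list stand in for the cache database, the HBase backup library, and the distributed database, and the function names `lookup_association` and `associate_window` are hypothetical.

```python
def lookup_association(id_number, cache_db, hbase_backup):
    """Return (association value, source) for one ID number.

    The cache database is checked first; the HBase backup library is
    consulted only when the ID number is absent from the cache.
    """
    if id_number in cache_db:
        return cache_db[id_number], "cache"
    return hbase_backup.get(id_number), "hbase_backup"


def associate_window(id_numbers, cache_db, hbase_backup, sink):
    """Associate every ID number in one processing window and append results to the sink."""
    for id_number in id_numbers:
        value, source = lookup_association(id_number, cache_db, hbase_backup)
        # The sink is a stand-in for the distributed database write.
        sink.append({"id": id_number, "value": value, "source": source})
    return sink
```

Recording the source of each value is not part of the described method; it is included here only to make the fallback path visible.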
Determining the cache database and the HBase backup library corresponding to the real-time dimension table comprises: reading data from the real-time dimension table, and filtering the read data based on preset conditions to obtain target data; writing the target data into a locally corresponding cache area to generate a first trigger instruction; generating a cache database corresponding to the real-time dimension table based on the first trigger instruction; monitoring the cache database based on an HBase main library; if the monitoring result indicates that the cache database contains data different from the HBase main library, writing the update data in the cache database into the HBase main library; and synchronously updating the HBase backup library based on the data update result of the HBase main library to obtain the HBase backup library corresponding to the real-time dimension table.
Based on the first trigger instruction, the update data in the cache database is monitored; if the monitoring result indicates that the storage time of the update data in the cache database is longer than the preset time, the update data is cleared from the cache database. A second trigger instruction is generated based on the clearing operation on the update data in the cache database; based on the second trigger instruction, data is read from the current real-time dimension table, and the data of the cache database is updated based on the reading result.
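The expire-then-reload cycle described above can be sketched as follows. This is an illustrative simplification, not the patented implementation: the whole cache is treated as one batch of update data with a single load timestamp, a dictionary stands in for the cache database, and the names `ExpiringCache`, `load_dimension_table`, `keep`, and `clock` are hypothetical.

```python
import time


class ExpiringCache:
    """Dict-backed stand-in for a cache database with a preset storage time."""

    def __init__(self, load_dimension_table, keep, ttl_seconds, clock=time.monotonic):
        self._load = load_dimension_table  # callable returning {id_number: value}
        self._keep = keep                  # preset filter condition: keep(id_number, value) -> bool
        self._ttl = ttl_seconds            # preset time
        self._clock = clock                # injectable clock, useful for testing
        self._data = {}
        self._loaded_at = None
        self.reload()

    def reload(self):
        """Read the current dimension table, filter it, and write it into the cache."""
        rows = self._load()
        self._data = {k: v for k, v in rows.items() if self._keep(k, v)}
        self._loaded_at = self._clock()

    def expire_if_due(self):
        """Clear data stored longer than the preset time, then reload."""
        if self._clock() - self._loaded_at > self._ttl:
            self._data.clear()  # clearing operation
            self.reload()       # plays the role of the second trigger instruction
            return True
        return False

    def get(self, id_number):
        self.expire_if_due()
        return self._data.get(id_number)
```

With a 24-hour preset time, `ttl_seconds` would be `24 * 3600`; a change in the dimension table becomes visible to lookups only after the next clear and reload.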
Fig. 4 is a schematic flow chart of a method for associating real-time dimension tables according to an embodiment of the present invention.
Flink reads the real-time data and performs row-to-column conversion on it to obtain two ID numbers, namely a name ID number and an order ID number. Flink then determines that the ID attribute of the name ID number is user ID and that the ID attribute of the order ID number is content ID. Next, based on the user ID, Flink performs window aggregation on the real-time data stream using a 1-minute time window to obtain a first processing window corresponding to the user ID; at the same time, based on the content ID, Flink performs window aggregation on the real-time data stream using a 1-minute time window to obtain a second processing window corresponding to the content ID.
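The row-to-column conversion and 1-minute window aggregation in this example can be sketched in plain Python. This is only an illustration of the grouping logic, not Flink code: the record fields `name_id`, `order_id`, and `ts` (an integer timestamp in seconds), the mapping `ID_FIELDS`, and both function names are assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical mapping from a record field to its ID attribute.
ID_FIELDS = {"name_id": "user", "order_id": "content"}


def explode_ids(record):
    """Row-to-column conversion: one record becomes one (attribute, id, ts) tuple per ID field."""
    return [
        (attribute, record[field], record["ts"])
        for field, attribute in ID_FIELDS.items()
        if field in record
    ]


def window_aggregate(records, window_seconds=60):
    """Group ID numbers by (ID attribute, window start) using fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for record in records:
        for attribute, id_number, ts in explode_ids(record):
            window_start = ts - ts % window_seconds
            windows[(attribute, window_start)].append(id_number)
    return dict(windows)
```

Each key of the returned dictionary corresponds to one processing window: all ID numbers under a key share the same ID attribute and fall in the same 1-minute interval.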
A first real-time dimension table corresponding to the user ID and a second real-time dimension table corresponding to the content ID are acquired; the first real-time dimension table stores the correspondence between name ID numbers and identity card numbers, and the second real-time dimension table stores the correspondence between order ID numbers and purchase lists. Flink reads the data of the first real-time dimension table, processes it, and writes it into a user Redis cache; at the same time, Flink reads the data of the second real-time dimension table, processes it, and writes it into a content Redis cache. Flink also monitors the storage time of the update data in the user Redis cache and the content Redis cache respectively; if the monitoring result indicates that the storage time has reached the preset time, the update data of the corresponding Redis cache is cleared.
The HBase Proxy monitors the user Redis cache and the content Redis cache respectively. If the monitoring result indicates that the user Redis cache contains data that differs from the HBase main library, the HBase main library is updated based on the update data of the user Redis cache, and the HBase backup library is updated synchronously. If the monitoring result indicates that the content Redis cache contains data that differs from the HBase main library, the HBase main library is updated based on the update data of the content Redis cache, and the HBase backup library is updated synchronously.
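The monitoring step performed by the HBase Proxy can be sketched as a comparison between the cache and the main library, followed by a mirrored write. This is a minimal sketch under stated assumptions: dictionaries stand in for the Redis cache, the HBase main library, and the HBase backup library, and the function name `sync_cache_to_hbase` is hypothetical.

```python
_MISSING = object()  # sentinel so that a key absent from the main library always counts as different


def sync_cache_to_hbase(cache_db, hbase_main, hbase_backup):
    """Write cache entries that differ from the main library, then mirror them to the backup.

    Returns the dictionary of entries that were written.
    """
    changed = {
        key: value
        for key, value in cache_db.items()
        if hbase_main.get(key, _MISSING) != value
    }
    if changed:
        hbase_main.update(changed)    # update the HBase main library first
        hbase_backup.update(changed)  # then bring the backup library in line
    return changed
```

Calling the function once per cache (user and content) reproduces the two branches described above; a second call with no new cache changes writes nothing.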
For any name ID number in the first processing window: whether the name ID number exists in the user Redis cache is judged; if so, the identity card number corresponding to the name ID number is read from the user Redis cache; if not, the identity card number corresponding to the name ID number is read from the HBase backup library. For any order ID number in the second processing window: whether the order ID number exists in the content Redis cache is judged; if so, the purchase list corresponding to the order ID number is read from the content Redis cache; if not, the purchase list corresponding to the order ID number is read from the HBase backup library. Finally, the identity card numbers and/or purchase lists are output through a Sink and written into a distributed database.
In this embodiment, peak shaving and valley filling is performed during real-time computation. For example, the operation in which the real-time data stream reads data from the cache database or the HBase backup library and the operation in which Flink writes data to the cache database are staggered in time. This prevents an avalanche on the first device and improves the timeliness of associating the real-time data stream with the dimension table.
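The text does not say how the two operations are staggered. One possible reading, sketched below purely as an assumption, is to offset each cache refresh by half a window so that refresh writes never land on the instants at which windows fire and trigger lookups. The function name `staggered_schedule` and its parameters are hypothetical.

```python
def staggered_schedule(window_seconds, refresh_every_windows, horizon_seconds):
    """Return (lookup_times, refresh_times) in seconds from job start.

    Lookups fire at window boundaries; refreshes are shifted by half a
    window, so the two lists never share a timestamp when
    window_seconds >= 2.
    """
    offset = window_seconds // 2
    lookup_times = list(range(window_seconds, horizon_seconds + 1, window_seconds))
    refresh_period = window_seconds * refresh_every_windows
    refresh_times = list(range(offset, horizon_seconds + 1, refresh_period))
    return lookup_times, refresh_times
```

Because the refresh period is a whole number of windows, every refresh time stays at the same offset inside its window, which is what keeps the two schedules disjoint.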
Fig. 5 is a schematic structural diagram of an association device for real-time dimension tables according to an embodiment of the present invention.
An association device for a real-time dimension table is applied to a first device. The apparatus 500 includes: a reading module 501, configured to read a real-time data stream of a target service; a processing module 502, configured to process the real-time data stream to obtain a plurality of processing windows, wherein each processing window has a corresponding ID attribute and comprises a plurality of ID numbers with the same ID attribute; and an association module 503, configured to, for any ID number in the processing window: acquire a real-time dimension table corresponding to the ID attribute of the ID number; determine a cache database and an HBase backup library corresponding to the real-time dimension table; judge whether the ID number exists in the cache database; if so, read an association value corresponding to the ID number from the cache database; and if not, read the association value corresponding to the ID number from the HBase backup library.
In a preferred embodiment, the processing module comprises: a row-to-column conversion unit, configured to, for any real-time data in the real-time data stream, perform row-to-column conversion on the real-time data to obtain a plurality of ID numbers, each ID number having a corresponding ID attribute; and a window aggregation unit, configured to aggregate all IDs in the real-time data stream according to the ID attribute based on a preset time window to generate a plurality of processing windows.
In a preferred embodiment, the association module comprises: a filtering unit, configured to read data from the real-time dimension table and filter the read data based on preset conditions to obtain target data; a first generation unit, configured to write the target data into a locally corresponding cache area and generate a first trigger instruction; a second generation unit, configured to generate a cache database corresponding to the real-time dimension table based on the first trigger instruction; and an obtaining unit, configured to update the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table.
In a preferred embodiment, the obtaining unit comprises: a monitoring subunit, configured to monitor the cache database based on the HBase main library; a writing subunit, configured to write the update data in the cache database into the HBase main library if the monitoring result indicates that the cache database contains data different from the HBase main library; and an updating subunit, configured to synchronously update the HBase backup library based on the data update result of the HBase main library to obtain the HBase backup library corresponding to the real-time dimension table.
In a preferred embodiment, the device further comprises: the monitoring module is used for monitoring the update data in the cache database based on the first trigger instruction; and the clearing module is used for clearing the updated data from the cache database if the monitoring result represents that the storage time of the updated data in the cache database is longer than the preset time.
In a preferred embodiment, the device further comprises: the generating module is used for generating a second trigger instruction based on the clearing operation of the update data in the cache database; and the updating module is used for reading data from the current real-time dimension table based on the second trigger instruction and updating the data of the cache database based on a reading result.
In a preferred embodiment, the cache database includes update data stored in a preset time and a common data table; when the common data table includes update data, the common data table is updated based on the update of the corresponding update data.
In a preferred embodiment, the device further comprises: and the writing module is used for writing the association value into the distributed database.
The device can execute the association method for the real-time dimension table provided by the embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, refer to the association method for real-time dimension tables provided in an embodiment of the present invention.
The present invention also provides an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instruction from the memory and execute the instruction to implement the association method for real-time dimension tables according to the present invention.
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a method according to various embodiments of the present application described in the "exemplary methods" section of the present specification.
The computer program product may write program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a method according to embodiments of the present application described in the above-mentioned "exemplary method" section of the present application.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to" and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. An association method for a real-time dimension table is characterized by being applied to first equipment; the method comprises the following steps:
reading a real-time data stream of a target service;
processing the real-time data stream to obtain a plurality of processing windows; wherein each processing window has a corresponding ID attribute; the processing window comprises a plurality of ID numbers with the same ID attribute;
for any ID number in the processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if so, reading an association value corresponding to the ID number from the cache database; and if not, reading the association value corresponding to the ID number from the HBase backup library.
2. The method according to claim 1, wherein processing the real-time data stream to obtain a plurality of processing windows comprises:
for any real-time data in the real-time data stream: performing row-to-column conversion on the real-time data to obtain a plurality of ID numbers; each ID number has a corresponding ID attribute;
and based on a preset time window, carrying out window aggregation on all IDs in the real-time data stream according to the ID attribute to generate a plurality of processing windows.
3. The method of claim 1, wherein determining the cache database and the HBase backup library corresponding to the real-time dimension table comprises:
reading data from the real-time dimension table, and filtering the read data based on preset conditions to obtain target data;
writing the target data into a locally corresponding cache area to generate a first trigger instruction;
generating a cache database corresponding to the real-time dimension table based on the first trigger instruction;
and updating the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table.
4. The method of claim 3, wherein updating the HBase backup library based on the cache database to obtain the HBase backup library corresponding to the real-time dimension table comprises:
based on an HBase main library, monitoring the cache database;
if the monitoring result indicates that the cache database contains data different from the HBase main library, writing the update data in the cache database into the HBase main library;
and synchronously updating the HBase backup library based on the data updating result of the HBase main library to obtain the HBase backup library corresponding to the real-time dimension table.
5. A method according to claim 3, further comprising:
based on the first trigger instruction, monitoring updated data in the cache database;
and if the monitoring result indicates that the storage time of the updated data in the cache database is longer than the preset time, clearing the updated data from the cache database.
6. The method as recited in claim 5, further comprising:
generating a second trigger instruction based on the clearing operation of the updated data in the cache database;
and reading data from the current real-time dimension table based on the second trigger instruction, and updating the data of the cache database based on a reading result.
7. The method of claim 1, wherein the cache database comprises update data stored for a preset time and a common data table;
when the common data table includes update data, the common data table is updated based on the update of the corresponding update data.
8. The method as recited in claim 1, further comprising:
and writing the association value into a distributed database.
9. An association device for a real-time dimension table is characterized by being applied to first equipment; the device comprises:
the reading module is used for reading the real-time data stream of the target service;
the processing module is used for processing the real-time data stream to obtain a plurality of processing windows; wherein each processing window has a corresponding ID attribute; the processing window comprises a plurality of ID numbers with the same ID attribute;
the association module is used for, for any ID number in the processing window: acquiring a real-time dimension table corresponding to the ID attribute of the ID number; determining a cache database and an HBase backup library corresponding to the real-time dimension table; judging whether the ID number exists in the cache database; if so, reading an association value corresponding to the ID number from the cache database; and if not, reading the association value corresponding to the ID number from the HBase backup library.
10. A computer readable medium, characterized in that a computer program is stored thereon, which program, when being executed by a processor, implements the method according to any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310410695.XA CN116126872B (en) | 2023-04-18 | 2023-04-18 | Correlation method, device and computer readable medium for real-time dimension table |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310410695.XA CN116126872B (en) | 2023-04-18 | 2023-04-18 | Correlation method, device and computer readable medium for real-time dimension table |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116126872A CN116126872A (en) | 2023-05-16 |
CN116126872B true CN116126872B (en) | 2023-06-23 |
Family
ID=86304922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310410695.XA Active CN116126872B (en) | 2023-04-18 | 2023-04-18 | Correlation method, device and computer readable medium for real-time dimension table |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116126872B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009004231A2 (en) * | 2007-06-13 | 2009-01-08 | Compario | Information sorting method |
CN112306700A (en) * | 2019-07-23 | 2021-02-02 | 北京京东尚科信息技术有限公司 | Abnormal RPC request diagnosis method and device |
CN112765166A (en) * | 2021-01-06 | 2021-05-07 | 深圳市欢太科技有限公司 | Data processing method, device and computer readable storage medium |
CN113609374A (en) * | 2021-02-05 | 2021-11-05 | 腾讯科技(深圳)有限公司 | Data processing method, device and equipment based on content push and storage medium |
CN113761018A (en) * | 2021-02-24 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Data processing method, device, equipment and storage medium |
CN113760988A (en) * | 2021-02-04 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method, device, equipment and storage medium for associating and processing unbounded stream data |
CN114510486A (en) * | 2022-02-22 | 2022-05-17 | 北京达佳互联信息技术有限公司 | Dimension table data processing method and device, electronic equipment and storage medium |
CN115048372A (en) * | 2022-04-12 | 2022-09-13 | 北京贝壳时代网络科技有限公司 | Multi-stream data association method and association device |
CN115185967A (en) * | 2022-07-06 | 2022-10-14 | 北京字跳网络技术有限公司 | Data processing method and device, electronic equipment and storage medium |
CN115328958A (en) * | 2022-08-25 | 2022-11-11 | 中国电信股份有限公司 | Data association method and device, computer storage medium and electronic equipment |
CN115587118A (en) * | 2022-09-26 | 2023-01-10 | 新华三技术有限公司 | Task data dimension table association processing method and device and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9558225B2 (en) * | 2013-12-16 | 2017-01-31 | Sybase, Inc. | Event stream processor |
Non-Patent Citations (2)
Title |
---|
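Architecture and key technologies of the Yunhai big data all-in-one machine; 张东; 亓开元; 吴楠; 辛国茂; 刘正伟; 颜秉珩; 郭锋; Journal of Computer Research and Development (计算机研究与发展), No. 02; pp. 148-163 * |
Exploration of big data file storage strategies; 屈美娟; 付良廷; Technology Innovation and Application (科技创新与应用), No. 12; pp. 146-147 * |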
Also Published As
Publication number | Publication date |
---|---|
CN116126872A (en) | 2023-05-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||