CN113901007A - Distributed caching method for massive small files for AI training - Google Patents
- Publication number
- CN113901007A (application CN202111220222.0A)
- Authority
- CN
- China
- Prior art keywords
- chunk
- training
- cache
- reading
- small file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0866—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
- G06F12/0871—Allocation or management of cache space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1021—Hit rate improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1041—Resource optimization
- G06F2212/1044—Space efficiency improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/15—Use in a specific computing environment
- G06F2212/154—Networked environment
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an AI-training-oriented distributed caching method for massive small files, which realizes high-performance distributed caching of massive small files. First, small files are merged into chunks according to a rule that fits the Batch Size characteristic of AI training. Second, the cache state of the chunks is analyzed and a double-layer shuffle operation is performed on the small-file sequence. Finally, when AI training reads data, repeated I/O is served by Local Cache short-circuit reads, and asynchronous group pre-reading is started at the moment of a Local Cache short-circuit read. By using the cache efficiently, the invention solves the problem of random reads over massive small files in AI training, significantly improves the data access rate and the cache hit rate in AI-training scenarios, and reduces the iteration time of AI training.
Description
Technical Field
The invention provides an AI-training-oriented distributed caching method for massive small files in smart cities, aiming to realize high-performance distributed caching of massive small files for scenarios such as face recognition, video query, and intelligent storage.
Background
In recent years, with the rapid development of the economy and of science and technology worldwide, AI technology has been applied at large scale in the security field, strongly supporting the construction of safe smart cities. At the same time, scenarios such as face recognition and cross-border pedestrian tracking in safe smart cities pose challenges to AI technology. The number of files required by an AI training task is typically on the order of millions or even tens of millions; for example, the Google OpenImages dataset contains 9 million images and the Tencent ML-Images dataset contains approximately 17.69 million images. Billion-scale collections of small files generally need to be cached and then trained in a distributed fashion with large AI network models such as ResNet50 and ResNet101. Large-scale AI training over massive small files is therefore an important trend in AI development, and a high-performance distributed caching scheme for massive small files in AI training is an important problem to be solved.
UC Berkeley proposed the open-source distributed cache system Alluxio in 2018. The system supports block storage of large files for data storage and, for data access, uses a tiered cache of MEM, SSD, and HDD according to access frequency, with a cache replacement policy applied independently at each tier. However, the system defaults to the traditional LRU replacement policy; because of the random-access characteristic of AI training, the accessed data has no hot spots, so a traditional cache replacement policy cannot improve the cache hit rate and is not suited to the random-access scenario of AI training.
SenseTime proposed DIESEL, a master-slave architecture system. It provides a scheme for storing massive small files by merging them into data blocks, and uses an in-group shuffle to approximate the random access of AI training. However, the system still suffers cache misses when an AI task accesses a data block for the first time, so it has to wait for long disk I/O, which degrades the data access rate.
Disclosure of Invention
The aim of the invention is to overcome the problems of low random data access speed and low cache hit rate in AI training, and to provide an AI-training-oriented distributed caching method for massive small files.
The invention comprises the following steps:
Step 1: create a Local Cache with key-value storage at the client, and create an Alluxio Cache on the distributed storage devices.
Step 2: in the data storage stage of AI training, merge the dataset into chunks according to a rule that fits the Batch Size characteristic.
Step 3: before an AI training iteration, analyze the cache state of the chunks and generate the iteration's traversal sequence by applying a double-layer shuffle (a chunk shuffle plus an in-group small-file shuffle) to the small-file sequences in the chunks.
Step 4: during an AI training iteration, serve repeated I/O with Local Cache short-circuit reads, and meanwhile pre-read subsequent chunks into the Alluxio Cache using an asynchronous group pre-reading method.
The invention has the beneficial effects that:
1. In the data storage stage, the method merges files into chunks according to a rule that fits the Batch Size characteristic, i.e., small files are merged in groups of 2^N, and the merged chunk structure is better suited to AI data reading.
2. Analyzing the cache state of the chunks and applying a double-layer shuffle of chunks and of the small files within groups solves the random-access problem; chunk data already in the cache is used first in each iteration, so the cache is used efficiently.
3. During AI training iterations, repeated I/O is served by Local Cache short-circuit reads, while subsequent chunks are pre-read into the Alluxio Cache by an asynchronous group pre-reading method, which raises the cache hit rate and improves the data access rate.
Drawings
FIG. 1 is a flowchart of small file merging;
FIG. 2 is a flow chart of a double-layer shuffle;
FIG. 3 is a flowchart of an AI training iteration.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The invention comprises the following steps:
Step 1: create a Local Cache with key-value storage at the client, and create an Alluxio Cache on the distributed storage devices.
Alluxio Cache: the Alluxio Cache supports storage in block units and mainly stores the merged chunk data; when an application accesses a chunk that is not in the Alluxio Cache, the chunk is fetched from the underlying storage and stored in the Alluxio Cache for subsequent accesses.
Local Cache: the Local Cache is a key-value store located at the client and mainly holds all small files parsed out of chunks; when a chunk is taken from the Alluxio Cache, the small files in that chunk are parsed out and stored in the Local Cache.
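As a minimal illustration (not the patent's implementation), the two tiers can be modelled with plain in-process dictionaries; the real Alluxio Cache is a distributed block store:

```python
# Two cache tiers from step 1, modelled as plain dictionaries for illustration only.

# Alluxio cache: distributed block cache keyed by chunk id; a chunk that is
# not present is fetched from the underlying storage and inserted here.
alluxio_cache = {}   # chunk_id -> chunk payload (the merged small files)

# Local Cache: client-side key-value store holding the small files parsed
# out of a chunk once that chunk has been taken from the Alluxio cache.
local_cache = {}     # small_file_name -> bytes
```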
Step 2: in the data storage stage of AI training, merge the dataset into chunks according to a rule that fits the Batch Size characteristic.
AI training requires preparing and storing a dataset. The invention merges the small files into chunks in the data storage stage so that the cache system can manage and store the small files more effectively; the specific steps are as follows.
As shown in FIG. 1, all small-file sequences are first obtained and an initial shuffle is performed on them, ensuring that the small files are evenly distributed across the chunks.
The small files are then merged into chunks according to a rule that fits the Batch Size characteristic. When AI training traverses the data, it usually reads data in units of batches, and the batch size is typically set to 16, 32, 64, 128, and so on, because GPUs perform better with power-of-two batches. Based on this characteristic, the small files are merged into chunks of 2^N files, where N can be defined by the user. Assuming the Batch Size is set to 2^M, three cases arise in data access: (1) if M = N, one batch of data consists of the small files of exactly one chunk; (2) if M > N, one batch of data consists of the small files of k chunks, where k = 2^(M-N); (3) if M < N, k batches of data consist of the small files of one chunk, where k = 2^(N-M). This merging rule therefore matches the data access of AI training well in every case.
Finally, a mapping file from small files to chunks is stored at the client; when an application initiates an I/O request for a small file, the mapping information converts it into an I/O request for the corresponding chunk.
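As an illustration of this merging rule, the following Python sketch (function and variable names are illustrative assumptions, not taken from the patent) performs the initial shuffle, merges the shuffled files into chunks of 2^N files, and builds the small-file-to-chunk mapping kept at the client:

```python
import random

def merge_into_chunks(file_names, n=5, seed=0):
    """Merge small files into chunks of 2**n files (power-of-two rule).

    Returns the chunk list and the small-file -> chunk-id mapping that the
    client keeps so a small-file I/O request can be turned into a chunk request.
    """
    chunk_size = 2 ** n
    files = list(file_names)
    random.Random(seed).shuffle(files)   # initial shuffle: spread files evenly
    chunks = [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]
    mapping = {f: cid for cid, chunk in enumerate(chunks) for f in chunk}
    return chunks, mapping

# Example: with batch size 2**M and M > N, one batch spans 2**(M-N) chunks.
chunks, file_to_chunk = merge_into_chunks([f"img_{i}.jpg" for i in range(1000)], n=5)
```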
Step 3: before an AI training iteration, analyze the cache state of the chunks and generate the iteration's traversal sequence by applying a double-layer shuffle (a chunk shuffle plus an in-group small-file shuffle) to the small-file sequences in the chunks.
After the small files are merged into chunks, each chunk contains a fixed sequence of small files. Before an AI training iteration, the data generally needs to be shuffled to satisfy the randomness of data traversal. To better fit the random-access pattern of AI training, the invention designs a reliable shuffle strategy.
First, based on the cache state of the chunks, the chunks are divided into two sequences: the chunks already in the cache (sequence 1 in FIG. 2) and the chunks not in the cache (sequence 2 in FIG. 2). Sequence 1 and sequence 2 then each undergo a double-layer shuffle.
First-layer shuffle: the first layer is a chunk shuffle, which shuffles the order of the chunks within each sequence. Taking sequence 2 in FIG. 2 as an example, after the first-layer shuffle the order of the small files inside each chunk is unchanged, while the order of the chunks in the sequence is changed.
Second-layer shuffle: the second layer is an in-group small-file shuffle. The chunks in each of the two sequences are grouped, several chunks per group; all the small files in each group are collected and shuffled within the group. Taking sequence 2 in FIG. 2 as an example, the chunks are divided into group 2 and group 3; the small files in group 2 and group 3 are fully shuffled within their groups, but the groups themselves remain in order.
Finally, the two small-file sequences are concatenated, with the cached sequence first, to form the file-access sequence of this iteration of AI training. The access sequence is thus reasonably randomized, data already in the cache is placed at the front of the access sequence, and the cached data is used efficiently.
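The double-layer shuffle can be sketched as follows, as a hedged illustration with assumed names: chunks are split by cache state, the chunk order is shuffled within each sequence, the chunks of each sequence are grouped and the small files are shuffled within each group, and the cached sequence is placed first.

```python
import random

def two_layer_shuffle(chunks, cached_chunk_ids, group_size=4, seed=None):
    """Generate one iteration's small-file traversal order.

    chunks: list of lists of file names (list index = chunk id).
    cached_chunk_ids: set of chunk ids currently held in the Alluxio cache.
    """
    rng = random.Random(seed)
    cached = [c for i, c in enumerate(chunks) if i in cached_chunk_ids]
    uncached = [c for i, c in enumerate(chunks) if i not in cached_chunk_ids]

    order = []
    for sequence in (cached, uncached):      # cached sequence is placed first
        rng.shuffle(sequence)                # layer 1: shuffle chunk order
        for g in range(0, len(sequence), group_size):
            group_files = [f for chunk in sequence[g:g + group_size] for f in chunk]
            rng.shuffle(group_files)         # layer 2: shuffle files within a group
            order.extend(group_files)        # groups themselves remain in order
    return order
```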
Step 4: during an AI training iteration, serve repeated I/O with Local Cache short-circuit reads, and meanwhile pre-read subsequent chunks into the Alluxio Cache using an asynchronous group pre-reading method.
Once the traversal sequence for AI training has been prepared, data traversal can begin. In the invention, during AI training iterations, repeated I/O is served by Local Cache short-circuit reads while subsequent chunks are pre-read into the Alluxio Cache by the asynchronous group pre-reading method, thereby improving the cache hit rate and the data access rate.
Step 1: when a small file of a group is accessed, check whether the small file exists in the Local Cache; if not, go to step 2, otherwise go to step 3.
Step 2: convert the I/O request for the small file into an I/O request for its chunk. This request fetches the chunk from the underlying storage system and stores it in the Alluxio cache. The chunk is then parsed, the target small file is extracted, and at the same time all small files in the current chunk are stored in the Local Cache. Then go to step 5.
Step 3: if the target small file exists in the Local Cache, the chunk corresponding to this small file has already been accessed, so the request is a repeated I/O. The client intercepts the I/O request for the small file and directly takes the target small file out of the Local Cache via a Local Cache short-circuit read, then goes to step 4.
Step 4: to avoid exhausting the Local Cache, delete the current file from the Local Cache after it has been read from the Local Cache, then go to step 5.
Step 5: check whether all chunks in the group have been accessed. For example, in FIG. 3, after chunk c in group 1 is accessed, all small files in chunk a, chunk b, and chunk c are stored in the Local Cache, and all subsequent accesses to that group are Local Cache short-circuit reads. At that point, asynchronous group pre-reading is adopted: during the subsequent accesses, an I/O request for the chunks of the next group is issued to the underlying storage system, and those chunks are pre-read into the Alluxio cache. Meanwhile, control returns to step 1 to continue accessing subsequent files.
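A simplified, single-threaded sketch of this read path is shown below; `read_small_file` and `fetch_chunk` are assumed names (not from the patent), `fetch_chunk(chunk_id)` stands in for the underlying storage system and is assumed to return the chunk already parsed into small files, dicts stand in for the two caches, and the asynchronous group pre-read is modelled with a background thread.

```python
import threading

def read_small_file(name, file_to_chunk, local_cache, alluxio_cache,
                    fetch_chunk, num_chunks, group_size=4):
    """Serve one small-file read, following steps 1-5 of the iteration flow.

    fetch_chunk(chunk_id) is an assumed stand-in for the underlying storage
    system; it returns a dict {small_file_name: bytes} for the whole chunk.
    """
    cid = file_to_chunk[name]

    if name in local_cache:
        # Steps 3-4: repeated I/O -> Local Cache short-circuit read, then
        # delete the entry so the Local Cache is not exhausted.
        data = local_cache.pop(name)
    else:
        # Step 2: first access of this chunk -> fetch it into the Alluxio
        # cache, then unpack all of its small files into the Local Cache.
        if cid not in alluxio_cache:
            alluxio_cache[cid] = fetch_chunk(cid)
        local_cache.update(alluxio_cache[cid])
        data = local_cache.pop(name)

    # Step 5: once every chunk of the current group is in the Alluxio cache,
    # asynchronously pre-read the chunks of the next group.
    group_start = (cid // group_size) * group_size
    current_group = range(group_start, min(group_start + group_size, num_chunks))
    next_group = [c for c in range(group_start + group_size,
                                   min(group_start + 2 * group_size, num_chunks))
                  if c not in alluxio_cache]
    if next_group and all(c in alluxio_cache for c in current_group):
        def _prefetch(chunk_ids=tuple(next_group)):
            for c in chunk_ids:
                if c not in alluxio_cache:
                    alluxio_cache[c] = fetch_chunk(c)
        threading.Thread(target=_prefetch, daemon=True).start()

    return data
```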
With this method, during AI training iterations repeated I/O is served by Local Cache short-circuit reads, so the application does not have to issue repeated I/O requests for the same chunk, which reduces the number of I/O requests and accelerates data access. Meanwhile, subsequent chunks are pre-read into the Alluxio cache by the asynchronous group pre-reading method, which raises the cache hit rate and speeds up cache responses.
Claims (6)
1. A distributed caching method for massive small files for AI training is characterized by comprising the following steps:
step 1: creating a Local Cache with key-value storage at a client, and creating an Alluxio Cache on distributed storage devices;
step 2: in the data storage stage of AI training, merging the dataset into chunks according to a rule that fits the Batch Size characteristic;
step 3: before an AI training iteration, analyzing the cache state of the chunks and generating the iteration's traversal sequence by applying a double-layer shuffle operation consisting of a chunk shuffle and an in-group small-file shuffle;
step 4: during an AI training iteration, serving repeated I/O with Local Cache short-circuit reads while pre-reading subsequent chunks into the Alluxio Cache using an asynchronous group pre-reading method.
2. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein merging the dataset into chunks based on the Batch Size characteristic in step 2 means merging the small files into chunks according to a 2^N (power-of-two) merging rule.
3. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein analyzing the cache state of the chunks in step 3 means dividing the chunks, according to their cache state at the start of the shuffle operation, into two sequences, namely the sequence of chunks in the cache and the sequence of chunks not in the cache, and performing the double-layer shuffle operation on the two sequences respectively.
4. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein the double-layer shuffle operation of a chunk shuffle plus an in-group small-file shuffle applied in step 3 to the small-file sequences in the chunks specifically comprises:
the first-layer shuffle operation is a chunk shuffle, namely obtaining all chunks in a sequence and shuffling them as a whole within the sequence;
the second-layer shuffle operation is an in-group small-file shuffle, namely grouping all chunks in the sequence, obtaining the small-file sequence of each group, and shuffling the small files as a whole within the group.
5. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein the repeated I/O in step 4 means: if the target file of the current I/O request already exists in the Local Cache, the I/O request is a repeated I/O; and the Local Cache short-circuit read means: intercepting the I/O request initiated at the client, taking the target small file out of the Local Cache, and then deleting that file from the Local Cache.
6. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein pre-reading subsequent chunks into the Alluxio cache by the asynchronous group pre-reading method in step 4 specifically comprises: judging whether all chunks of the current group have been stored in the Local Cache; if so, subsequent accesses will all be Local Cache short-circuit reads, so asynchronous group pre-reading is adopted: during the subsequent accesses, an I/O request for the chunks of the next group is initiated to the underlying storage system, and those chunks are pre-read into the Alluxio Cache.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111220222.0A CN113901007A (en) | 2021-10-20 | 2021-10-20 | Distributed caching method for massive small files for AI training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111220222.0A CN113901007A (en) | 2021-10-20 | 2021-10-20 | Distributed caching method for massive small files for AI training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113901007A (en) | 2022-01-07
Family
ID=79192744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111220222.0A Pending CN113901007A (en) | 2021-10-20 | 2021-10-20 | Distributed caching method for massive small files for AI training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901007A (en) |
- 2021-10-20: CN CN202111220222.0A patent/CN113901007A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104794182A (en) * | 2015-04-10 | 2015-07-22 | 中国科学院计算技术研究所 | Small file asynchronous pre-reading device and method for parallel network file system |
CN110598467A (en) * | 2019-07-31 | 2019-12-20 | 北京大学 | Memory data block integrity checking method |
CN110515920A (en) * | 2019-08-30 | 2019-11-29 | 北京浪潮数据技术有限公司 | A kind of mass small documents access method and system based on Hadoop |
CN112465046A (en) * | 2020-12-03 | 2021-03-09 | 苏州浪潮智能科技有限公司 | Method, system, equipment and medium for artificial intelligence training of mass small files |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116185308A (en) * | 2023-04-25 | 2023-05-30 | 山东英信计算机技术有限公司 | Data set processing method, device, equipment, medium and model training system |
CN116185308B (en) * | 2023-04-25 | 2023-08-04 | 山东英信计算机技术有限公司 | Data set processing method, device, equipment, medium and model training system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |