CN113901007A - Distributed caching method for massive small files for AI training - Google Patents
- Publication number
- CN113901007A (application CN202111220222.0A)
- Authority
- CN
- China
- Prior art keywords
- chunk
- training
- cache
- reading
- small file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0866—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
- G06F12/0871—Allocation or management of cache space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1021—Hit rate improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1041—Resource optimization
- G06F2212/1044—Space efficiency improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/15—Use in a specific computing environment
- G06F2212/154—Networked environment
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an AI-training-oriented distributed caching method for massive small files, which realizes high-performance distributed caching of massive small files. First, small files are merged into chunks according to a rule that fits the Batch Size characteristic of AI training. Second, the cache state of the chunks is analyzed and a double-layer shuffle operation is performed on the small-file sequence. Finally, when AI training reads data, repeated I/O is served by Local Cache short-circuit reads, and asynchronous group pre-reading is started at the moment of a Local Cache short-circuit read. By using the cache efficiently, the invention solves the problem of random reads over massive small files in AI training, significantly improves the data access rate and the cache hit rate in AI-training scenarios, and reduces the iteration time of AI training.
Description
Technical Field
The invention provides an AI-training-oriented distributed caching method for massive small files in smart cities, aiming to realize high-performance distributed caching of massive small files for scenarios such as face recognition, video query, and intelligent storage.
Background
In recent years, with the rapid development of the economy and of science and technology worldwide, AI technology has been applied at large scale in the security field, strongly supporting the construction of safe smart cities. At the same time, scenarios such as face recognition and cross-border pedestrian tracking in safe smart cities pose challenges to AI technology. The number of files required by an AI training task is typically on the order of millions or even tens of millions; for example, the Google OpenImages dataset contains 9 million images and the Tencent ML-Images dataset contains approximately 17.69 million images. Billion-scale collections of small files generally need to be cached and then trained in a distributed fashion with large AI network models such as ResNet50 and ResNet101. Large-scale AI training over massive small files is therefore an important trend in AI development, and a high-performance distributed caching scheme for massive small files in AI training is an important problem to be solved.
UC Berkeley proposed the open-source distributed cache system Alluxio in 2018. The system supports block storage of large files for data storage and, for data access, uses a tiered cache of MEM, SSD, and HDD according to access frequency, with a cache replacement policy applied independently at each tier. However, the system defaults to the traditional LRU replacement policy; because of the random-access characteristic of AI training, the accessed data has no hot spots, so a traditional cache replacement policy cannot improve the cache hit rate and is not suited to the random-access scenario of AI training.
SenseTime proposed DIESEL, a master-slave architecture system. It provides a scheme for storing massive small files by merging them into data blocks, and uses an in-group shuffle to approximate the random access of AI training. However, the system still suffers cache misses when an AI task accesses a data block for the first time, so it has to wait for long disk I/O, which degrades the data access rate.
Disclosure of Invention
The aim of the invention is to overcome the problems of low random data access speed and low cache hit rate in AI training, and to provide an AI-training-oriented distributed caching method for massive small files.
The invention comprises the following steps:
Step 1: create a Local Cache with key-value storage at the client, and create an Alluxio Cache on the distributed storage devices.
Step 2: in the data storage stage of AI training, merge the dataset into chunks according to a rule that fits the Batch Size characteristic.
Step 3: before an AI training iteration, analyze the cache state of the chunks and generate the iteration's traversal sequence by applying a double-layer shuffle (a chunk shuffle plus an in-group small-file shuffle) to the small-file sequences in the chunks.
Step 4: during an AI training iteration, serve repeated I/O with Local Cache short-circuit reads, and meanwhile pre-read subsequent chunks into the Alluxio Cache using an asynchronous group pre-reading method.
The invention has the beneficial effects that:
1. In the data storage stage, the method merges files into chunks according to a rule that fits the Batch Size characteristic, i.e., small files are merged in groups of 2^N, and the merged chunk structure is better suited to AI data reading.
2. Analyzing the cache state of the chunks and applying a double-layer shuffle of chunks and of the small files within groups solves the random-access problem; chunk data already in the cache is used first in each iteration, so the cache is used efficiently.
3. During AI training iterations, repeated I/O is served by Local Cache short-circuit reads, while subsequent chunks are pre-read into the Alluxio Cache by an asynchronous group pre-reading method, which raises the cache hit rate and improves the data access rate.
Drawings
FIG. 1 is a flowchart of small file merging;
FIG. 2 is a flow chart of a double-layer shuffle;
FIG. 3 is a flowchart of an AI training iteration.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The invention comprises the following steps:
Step 1: create a Local Cache with key-value storage at the client, and create an Alluxio Cache on the distributed storage devices.
Alluxio Cache: the Alluxio Cache supports storage in block units and mainly stores the merged chunk data; when an application accesses a chunk that is not in the Alluxio Cache, the chunk is fetched from the underlying storage and stored in the Alluxio Cache for subsequent accesses.
Local Cache: the Local Cache is a key-value store located at the client and mainly holds all small files parsed out of chunks; when a chunk is taken from the Alluxio Cache, the small files in that chunk are parsed out and stored in the Local Cache.
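As a minimal illustration (not the patent's implementation), the two tiers can be modelled with plain in-process dictionaries; the real Alluxio Cache is a distributed block store:

```python
# Two cache tiers from step 1, modelled as plain dictionaries for illustration only.

# Alluxio cache: distributed block cache keyed by chunk id; a chunk that is
# not present is fetched from the underlying storage and inserted here.
alluxio_cache = {}   # chunk_id -> chunk payload (the merged small files)

# Local Cache: client-side key-value store holding the small files parsed
# out of a chunk once that chunk has been taken from the Alluxio cache.
local_cache = {}     # small_file_name -> bytes
```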
Step 2: in the data storage stage of AI training, merge the dataset into chunks according to a rule that fits the Batch Size characteristic.
AI training requires preparing and storing a dataset. The invention merges the small files into chunks in the data storage stage so that the cache system can manage and store the small files more effectively; the specific steps are as follows.
As shown in FIG. 1, all small-file sequences are first obtained and an initial shuffle is performed on them, ensuring that the small files are evenly distributed across the chunks.
The small files are then merged into chunks according to a rule that fits the Batch Size characteristic. When AI training traverses the data, it usually reads data in units of batches, and the batch size is typically set to 16, 32, 64, 128, and so on, because GPUs perform better with power-of-two batches. Based on this characteristic, the small files are merged into chunks of 2^N files, where N can be defined by the user. Assuming the Batch Size is set to 2^M, three cases arise in data access: (1) if M = N, one batch of data consists of the small files of exactly one chunk; (2) if M > N, one batch of data consists of the small files of k chunks, where k = 2^(M-N); (3) if M < N, k batches of data consist of the small files of one chunk, where k = 2^(N-M). This merging rule therefore matches the data access of AI training well in every case.
Finally, a mapping file from small files to chunks is stored at the client; when an application initiates an I/O request for a small file, the mapping information converts it into an I/O request for the corresponding chunk.
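As an illustration of this merging rule, the following Python sketch (function and variable names are illustrative assumptions, not taken from the patent) performs the initial shuffle, merges the shuffled files into chunks of 2^N files, and builds the small-file-to-chunk mapping kept at the client:

```python
import random

def merge_into_chunks(file_names, n=5, seed=0):
    """Merge small files into chunks of 2**n files (power-of-two rule).

    Returns the chunk list and the small-file -> chunk-id mapping that the
    client keeps so a small-file I/O request can be turned into a chunk request.
    """
    chunk_size = 2 ** n
    files = list(file_names)
    random.Random(seed).shuffle(files)   # initial shuffle: spread files evenly
    chunks = [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]
    mapping = {f: cid for cid, chunk in enumerate(chunks) for f in chunk}
    return chunks, mapping

# Example: with batch size 2**M and M > N, one batch spans 2**(M-N) chunks.
chunks, file_to_chunk = merge_into_chunks([f"img_{i}.jpg" for i in range(1000)], n=5)
```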
Step 3: before an AI training iteration, analyze the cache state of the chunks and generate the iteration's traversal sequence by applying a double-layer shuffle (a chunk shuffle plus an in-group small-file shuffle) to the small-file sequences in the chunks.
After the small files are merged into chunks, each chunk contains a fixed sequence of small files. Before an AI training iteration, the data generally needs to be shuffled to satisfy the randomness of data traversal. To better fit the random-access pattern of AI training, the invention designs a reliable shuffle strategy.
First, based on the cache state of the chunks, the chunks are divided into two sequences: the chunks already in the cache (sequence 1 in FIG. 2) and the chunks not in the cache (sequence 2 in FIG. 2). Sequence 1 and sequence 2 then each undergo a double-layer shuffle.
First-layer shuffle: the first layer is a chunk shuffle, which shuffles the order of the chunks within each sequence. Taking sequence 2 in FIG. 2 as an example, after the first-layer shuffle the order of the small files inside each chunk is unchanged, while the order of the chunks in the sequence is changed.
Second-layer shuffle: the second layer is an in-group small-file shuffle. The chunks in each of the two sequences are grouped, several chunks per group; all the small files in each group are collected and shuffled within the group. Taking sequence 2 in FIG. 2 as an example, the chunks are divided into group 2 and group 3; the small files in group 2 and group 3 are fully shuffled within their groups, but the groups themselves remain in order.
Finally, the two small-file sequences are concatenated, with the cached sequence first, to form the file-access sequence of this iteration of AI training. The access sequence is thus reasonably randomized, data already in the cache is placed at the front of the access sequence, and the cached data is used efficiently.
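The double-layer shuffle can be sketched as follows, as a hedged illustration with assumed names: chunks are split by cache state, the chunk order is shuffled within each sequence, the chunks of each sequence are grouped and the small files are shuffled within each group, and the cached sequence is placed first.

```python
import random

def two_layer_shuffle(chunks, cached_chunk_ids, group_size=4, seed=None):
    """Generate one iteration's small-file traversal order.

    chunks: list of lists of file names (list index = chunk id).
    cached_chunk_ids: set of chunk ids currently held in the Alluxio cache.
    """
    rng = random.Random(seed)
    cached = [c for i, c in enumerate(chunks) if i in cached_chunk_ids]
    uncached = [c for i, c in enumerate(chunks) if i not in cached_chunk_ids]

    order = []
    for sequence in (cached, uncached):      # cached sequence is placed first
        rng.shuffle(sequence)                # layer 1: shuffle chunk order
        for g in range(0, len(sequence), group_size):
            group_files = [f for chunk in sequence[g:g + group_size] for f in chunk]
            rng.shuffle(group_files)         # layer 2: shuffle files within a group
            order.extend(group_files)        # groups themselves remain in order
    return order
```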
Step 4: during an AI training iteration, serve repeated I/O with Local Cache short-circuit reads, and meanwhile pre-read subsequent chunks into the Alluxio Cache using an asynchronous group pre-reading method.
Once the traversal sequence for AI training has been prepared, data traversal can begin. In the invention, during AI training iterations, repeated I/O is served by Local Cache short-circuit reads while subsequent chunks are pre-read into the Alluxio Cache by the asynchronous group pre-reading method, thereby improving the cache hit rate and the data access rate.
Step 1: when a small file of a group is accessed, check whether the small file exists in the Local Cache; if not, go to step 2, otherwise go to step 3.
Step 2: convert the I/O request for the small file into an I/O request for its chunk. This request fetches the chunk from the underlying storage system and stores it in the Alluxio cache. The chunk is then parsed, the target small file is extracted, and at the same time all small files in the current chunk are stored in the Local Cache. Then go to step 5.
Step 3: if the target small file exists in the Local Cache, the chunk corresponding to this small file has already been accessed, so the request is a repeated I/O. The client intercepts the I/O request for the small file and directly takes the target small file out of the Local Cache via a Local Cache short-circuit read, then goes to step 4.
Step 4: to avoid exhausting the Local Cache, delete the current file from the Local Cache after it has been read from the Local Cache, then go to step 5.
Step 5: check whether all chunks in the group have been accessed. For example, in FIG. 3, after chunk c in group 1 is accessed, all small files in chunk a, chunk b, and chunk c are stored in the Local Cache, and all subsequent accesses to that group are Local Cache short-circuit reads. At that point, asynchronous group pre-reading is adopted: during the subsequent accesses, an I/O request for the chunks of the next group is issued to the underlying storage system, and those chunks are pre-read into the Alluxio cache. Meanwhile, control returns to step 1 to continue accessing subsequent files.
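A simplified, single-threaded sketch of this read path is shown below; `read_small_file` and `fetch_chunk` are assumed names (not from the patent), `fetch_chunk(chunk_id)` stands in for the underlying storage system and is assumed to return the chunk already parsed into small files, dicts stand in for the two caches, and the asynchronous group pre-read is modelled with a background thread.

```python
import threading

def read_small_file(name, file_to_chunk, local_cache, alluxio_cache,
                    fetch_chunk, num_chunks, group_size=4):
    """Serve one small-file read, following steps 1-5 of the iteration flow.

    fetch_chunk(chunk_id) is an assumed stand-in for the underlying storage
    system; it returns a dict {small_file_name: bytes} for the whole chunk.
    """
    cid = file_to_chunk[name]

    if name in local_cache:
        # Steps 3-4: repeated I/O -> Local Cache short-circuit read, then
        # delete the entry so the Local Cache is not exhausted.
        data = local_cache.pop(name)
    else:
        # Step 2: first access of this chunk -> fetch it into the Alluxio
        # cache, then unpack all of its small files into the Local Cache.
        if cid not in alluxio_cache:
            alluxio_cache[cid] = fetch_chunk(cid)
        local_cache.update(alluxio_cache[cid])
        data = local_cache.pop(name)

    # Step 5: once every chunk of the current group is in the Alluxio cache,
    # asynchronously pre-read the chunks of the next group.
    group_start = (cid // group_size) * group_size
    current_group = range(group_start, min(group_start + group_size, num_chunks))
    next_group = [c for c in range(group_start + group_size,
                                   min(group_start + 2 * group_size, num_chunks))
                  if c not in alluxio_cache]
    if next_group and all(c in alluxio_cache for c in current_group):
        def _prefetch(chunk_ids=tuple(next_group)):
            for c in chunk_ids:
                if c not in alluxio_cache:
                    alluxio_cache[c] = fetch_chunk(c)
        threading.Thread(target=_prefetch, daemon=True).start()

    return data
```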
With this method, during AI training iterations repeated I/O is served by Local Cache short-circuit reads, so the application does not have to issue repeated I/O requests for the same chunk, which reduces the number of I/O requests and accelerates data access. Meanwhile, subsequent chunks are pre-read into the Alluxio cache by the asynchronous group pre-reading method, which raises the cache hit rate and speeds up cache responses.
Claims (6)
1. A distributed caching method for massive small files for AI training is characterized by comprising the following steps:
step 1: creating a Local Cache with key-value storage at a client, and creating an Alluxio Cache on distributed storage devices;
step 2: in the data storage stage of AI training, merging the dataset into chunks according to a rule that fits the Batch Size characteristic;
step 3: before an AI training iteration, analyzing the cache state of the chunks and generating the iteration's traversal sequence by applying a double-layer shuffle operation consisting of a chunk shuffle and an in-group small-file shuffle;
step 4: during an AI training iteration, serving repeated I/O with Local Cache short-circuit reads while pre-reading subsequent chunks into the Alluxio Cache using an asynchronous group pre-reading method.
2. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein merging the dataset into chunks based on the Batch Size characteristic in step 2 means merging the small files into chunks according to a 2^N (power-of-two) merging rule.
3. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein analyzing the cache state of the chunks in step 3 means dividing the chunks, according to their cache state at the start of the shuffle operation, into two sequences, namely the sequence of chunks in the cache and the sequence of chunks not in the cache, and performing the double-layer shuffle operation on the two sequences respectively.
4. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein the double-layer shuffle operation of a chunk shuffle plus an in-group small-file shuffle applied in step 3 to the small-file sequences in the chunks specifically comprises:
the first-layer shuffle operation is a chunk shuffle, namely obtaining all chunks in a sequence and shuffling them as a whole within the sequence;
the second-layer shuffle operation is an in-group small-file shuffle, namely grouping all chunks in the sequence, obtaining the small-file sequence of each group, and shuffling the small files as a whole within the group.
5. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein the repeated I/O in step 4 means: if the target file of the current I/O request already exists in the Local Cache, the I/O request is a repeated I/O; and the Local Cache short-circuit read means: intercepting the I/O request initiated at the client, taking the target small file out of the Local Cache, and then deleting that file from the Local Cache.
6. The AI-training-oriented distributed caching method for massive small files according to claim 1, wherein pre-reading subsequent chunks into the Alluxio cache by the asynchronous group pre-reading method in step 4 specifically comprises: judging whether all chunks of the current group have been stored in the Local Cache; if so, subsequent accesses will all be Local Cache short-circuit reads, so asynchronous group pre-reading is adopted: during the subsequent accesses, an I/O request for the chunks of the next group is initiated to the underlying storage system, and those chunks are pre-read into the Alluxio Cache.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111220222.0A CN113901007A (en) | 2021-10-20 | 2021-10-20 | Distributed caching method for massive small files for AI training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111220222.0A CN113901007A (en) | 2021-10-20 | 2021-10-20 | Distributed caching method for massive small files for AI training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113901007A (en) | 2022-01-07
Family
ID=79192744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111220222.0A Pending CN113901007A (en) | 2021-10-20 | 2021-10-20 | Distributed caching method for massive small files for AI training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901007A (en) |
- 2021-10-20: CN CN202111220222.0A patent/CN113901007A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104794182A (en) * | 2015-04-10 | 2015-07-22 | 中国科学院计算技术研究所 | Small file asynchronous pre-reading device and method for parallel network file system |
CN110598467A (en) * | 2019-07-31 | 2019-12-20 | 北京大学 | Memory data block integrity checking method |
CN110515920A (en) * | 2019-08-30 | 2019-11-29 | 北京浪潮数据技术有限公司 | A kind of mass small documents access method and system based on Hadoop |
CN112465046A (en) * | 2020-12-03 | 2021-03-09 | 苏州浪潮智能科技有限公司 | Method, system, equipment and medium for artificial intelligence training of mass small files |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116185308A (en) * | 2023-04-25 | 2023-05-30 | 山东英信计算机技术有限公司 | Data set processing method, device, equipment, medium and model training system |
CN116185308B (en) * | 2023-04-25 | 2023-08-04 | 山东英信计算机技术有限公司 | Data set processing method, device, equipment, medium and model training system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |