CN113901007A - Distributed caching method for massive small files for AI training - Google Patents


Info

Publication number
CN113901007A
Authority
CN
China
Prior art keywords
chunk
training
cache
reading
small file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111220222.0A
Other languages
Chinese (zh)
Inventor
路锦
曾艳
赵乃良
张纪林
袁俊峰
万健
张雪容
沈鸿辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111220222.0A
Publication of CN113901007A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871 Allocation or management of cache space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10 Providing a specific technical effect
    • G06F2212/1016 Performance improvement
    • G06F2212/1021 Hit rate improvement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10 Providing a specific technical effect
    • G06F2212/1041 Resource optimization
    • G06F2212/1044 Space efficiency improvement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/15 Use in a specific computing environment
    • G06F2212/154 Networked environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed caching method for massive small files oriented to AI training, which realizes high-performance distributed caching of massive small files. First, small files are merged into chunks according to a rule that fits the Batch Size characteristic of AI training; second, the cache state of the chunks is analyzed and a double-layer shuffle operation is applied to the small-file sequence; finally, when AI training reads data, repeated I/O is served by Local Cache short-circuit reads, and asynchronous group pre-reading is started during these short-circuit reads. By using the cache efficiently, the invention solves the problem of randomly reading massive small files in AI training, significantly improves the data access rate and cache hit rate in AI-training scenarios, and reduces AI training iteration time.

Description

Distributed caching method for massive small files for AI training
Technical Field
The invention provides a distributed caching method for massive small files oriented to AI training in smart cities, aiming to realize high-performance distributed caching of massive small files for scenarios such as face recognition, video query and intelligent storage.
Background
In recent years, with the rapid development of the economy and of science and technology worldwide, AI technology has been applied to the security field on a large scale, powerfully assisting the development of safe smart cities. At the same time, scenarios in a safe smart city such as face recognition and cross-border pedestrian tracking pose challenges to AI technology. The number of files required for an AI training task is typically on the order of millions or even tens of millions; for example, the Google OpenImages dataset contains about 9 million images and the Tencent ML-Images dataset contains approximately 17.69 million images. Massive small files, up to billions in scale, generally need to be cached and then used for distributed training of large AI network models such as ResNet50 and ResNet101. Therefore, large-scale AI training over massive small files is an important trend in AI development, and a high-performance distributed caching scheme for massive small files in AI training is an important problem to be solved.
UC Berkeley proposed the open-source distributed cache system Alluxio in 2018. The system supports block storage of large files for data storage; for data access it uses tiered caching across MEM, SSD and HDD according to access frequency, with an independent cache replacement policy at each tier. However, the system defaults to the traditional LRU replacement policy, and because of the random-access characteristic of AI training, the accessed data has no hot spots, so traditional cache replacement policies cannot improve the cache hit rate and are not suitable for the random-access scenario of AI training.
SenseTime proposed DIESEL, a system with a master-slave architecture. It stores massive small files by merging them into data blocks and provides an in-group shuffle in place of fully random access during AI training, but the system still suffers cache misses when an AI task accesses a data block for the first time, so it must wait for long disk I/O and the data access rate is affected.
Disclosure of Invention
The aim of the invention is to overcome the problems existing in AI training, namely low random data access speed and low cache hit rate, by providing a distributed caching method for massive small files oriented to AI training.
The invention comprises the following steps:
Step 1: create a Local Cache with key-value storage at the client, and create an Alluxio cache on the distributed storage devices.
Step 2: in the data storage stage of AI training, merge the data set into chunks based on a rule fitting the Batch Size characteristic.
Step 3: before an AI training iteration, analyze the cache state of the chunks and generate the iteration's traversal order by applying a double-layer shuffle (chunk shuffle + in-group small-file shuffle) to the small-file sequence.
Step 4: during AI training iterations, serve repeated I/O with Local Cache short-circuit reads, while pre-reading subsequent chunks into the Alluxio cache with an asynchronous group pre-reading method.
The invention has the beneficial effects that:
1. In the data storage stage, files are merged into chunks according to a rule fitting the Batch Size characteristic, i.e. small files are merged in groups of 2 to the power N, and the merged chunk structure is better suited to AI data reading.
2. The cache state of each chunk is analyzed, and a double-layer shuffle of chunks and of small files within groups addresses the random-access problem; chunk data already in the cache is used preferentially in each iteration, so the cache is used efficiently.
3. During AI training iterations, repeated I/O is served by Local Cache short-circuit reads, while subsequent chunks are pre-read into the Alluxio cache by an asynchronous group pre-reading method, increasing the cache hit rate and improving the data access rate.
Drawings
FIG. 1 is a flowchart of small file merging;
FIG. 2 is a flow chart of a double-layer shuffle;
FIG. 3 is an AI training iteration flow diagram.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The invention comprises the following steps:
step 1, creating a Local Cache and an Alluxio Cache.
A Local Cache with key-value storage is created at the client, and an Alluxio cache is created on the distributed storage devices.
Alluxio cache: the Alluxio cache supports storage in block units and mainly stores the merged chunk data. When an application accesses a chunk that is not in the Alluxio cache, the chunk is fetched from the underlying storage and kept in the Alluxio cache for subsequent accesses.
Local Cache: the Local Cache is a key-value store located at the client and mainly stores the small files parsed out of chunks. When a chunk is fetched from the Alluxio cache, the small files in that chunk are parsed and stored in the Local Cache.
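For illustration, a minimal sketch of the client-side key-value Local Cache follows (Python; all names are hypothetical, and the Alluxio cache itself, which is managed by Alluxio, is not modeled here):

```python
class LocalCache:
    """Client-side key-value cache: small-file name -> file bytes.

    Files parsed out of a chunk are inserted here; a file is removed
    after it is served once (see the short-circuit read in step 4).
    """

    def __init__(self):
        self._store = {}

    def contains(self, filename):
        return filename in self._store

    def put_many(self, files):
        # files: dict of {small-file name: bytes} parsed from one chunk
        self._store.update(files)

    def take(self, filename):
        # Short-circuit read: return the file and evict it immediately
        # so the Local Cache is not exhausted.
        return self._store.pop(filename)
```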
Step 2: in the data storage stage of AI training, merge the data set into chunks based on a rule fitting the Batch Size characteristic.
AI training needs a prepared and stored data set. The invention merges the small files into chunks in the data storage stage so that the cache system can better manage and store them; the specific steps are as follows.
As shown in FIG. 1, all the small-file names are first obtained and given an initial shuffle, ensuring that the small files are uniformly distributed across chunks.
The small files are then merged into chunks according to a rule that fits the Batch Size characteristic. When AI training traverses the data, it reads in units of batches, and the batch size is typically set to 16, 32, 64, 128 and so on, because GPUs perform better on power-of-two batches. Based on this characteristic, the small files are merged into chunks of 2 to the power N files, where N is user-defined. Assuming the Batch Size is 2 to the power M, three cases arise in data access: (1) if M = N, one batch consists of the small files of exactly one chunk; (2) if M > N, one batch consists of the small files of k chunks, where k = 2^(M-N); (3) if M < N, k batches consist of the small files of one chunk, where k = 2^(N-M). The merging rule therefore matches the data access pattern of AI training well in all common cases.
Finally, a mapping file from small files to chunks is stored at the client; when the application initiates an I/O request for a small file, the mapping information converts it into an I/O request for the corresponding chunk.
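A minimal sketch of this merging step follows (Python; the function name and parameters are hypothetical, and a chunk is represented simply as a list of file names rather than a serialized on-disk object):

```python
import random

def merge_into_chunks(filenames, n=5, seed=0):
    """Merge small files into chunks of 2**n files each (power-of-two rule).

    Returns (chunks, mapping): chunks is a list of small-file-name lists,
    and mapping maps each small-file name to (chunk_id, position_in_chunk).
    """
    chunk_size = 2 ** n
    files = list(filenames)
    random.Random(seed).shuffle(files)   # initial shuffle: uniform distribution over chunks

    chunks, mapping = [], {}
    for start in range(0, len(files), chunk_size):
        members = files[start:start + chunk_size]
        chunk_id = len(chunks)
        chunks.append(members)
        for pos, name in enumerate(members):
            mapping[name] = (chunk_id, pos)
    return chunks, mapping
```

With a Batch Size of 2 to the power M, a batch then corresponds to exactly one chunk when M = N, to 2^(M-N) chunks when M > N, and 2^(N-M) batches share one chunk when M < N, matching the three cases above.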
Step 3: before an AI training iteration, analyze the cache state of the chunks and generate the iteration's traversal order by applying a double-layer shuffle (chunk shuffle + in-group small-file shuffle) to the small-file sequence.
After the small files are merged into chunks, each chunk contains a fixed sequence of small files, and before an AI training iteration the data generally needs to be shuffled to satisfy the randomness of data traversal. To better fit the random-access pattern of AI training, the invention designs a reliable shuffle strategy.
First, based on the cache state of the chunks, the chunks are divided into two sequences: the chunks located in the cache (sequence 1 in FIG. 2) and the chunks not located in the cache (sequence 2 in FIG. 2). A double-layer shuffle is then applied to sequence 1 and sequence 2.
First-layer shuffle: the first layer is a chunk shuffle. The chunks in each sequence are shuffled. Taking sequence 2 in FIG. 2 as an example, after the first-layer shuffle the order of the small files inside each chunk is unchanged, while the order of the chunks in the sequence is changed.
Second-layer shuffle: the second layer is an in-group small-file shuffle. The chunks in each of the two sequences are grouped, several chunks per group; all the small-file names in each group are collected and the small files are shuffled within the group. Taking sequence 2 in FIG. 2 as an example, the chunks are divided into group 2 and group 3; the small files in groups 2 and 3 are fully shuffled within their groups, but the groups themselves remain in order.
Finally, the two small-file sequences are concatenated, with the cached sequence in front, to form the file access order of this AI training iteration. The access order is thus reasonably randomized, data already in the cache is placed at the front of the access order, and the existing cached data is used efficiently.
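A minimal sketch of the double-layer shuffle follows (Python; names are hypothetical, and group_size, the number of chunks per group, is an assumed tunable that the description does not fix):

```python
import random

def two_layer_shuffle(all_chunks, cached_chunk_ids, group_size=4, rng=None):
    """Build one iteration's small-file access order.

    all_chunks: list of small-file-name lists (index = chunk id).
    cached_chunk_ids: set of chunk ids currently in the Alluxio cache.
    """
    rng = rng or random.Random()

    # Split chunks by cache state: cached sequence first, uncached second.
    seq1 = [cid for cid in range(len(all_chunks)) if cid in cached_chunk_ids]
    seq2 = [cid for cid in range(len(all_chunks)) if cid not in cached_chunk_ids]

    order = []
    for seq in (seq1, seq2):
        rng.shuffle(seq)                           # layer 1: chunk shuffle
        for i in range(0, len(seq), group_size):   # group several chunks together
            group = seq[i:i + group_size]
            files = [f for cid in group for f in all_chunks[cid]]
            rng.shuffle(files)                     # layer 2: in-group small-file shuffle
            order.extend(files)
    return order
```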
Step 4: during AI training iterations, serve repeated I/O with Local Cache short-circuit reads, while pre-reading subsequent chunks into the Alluxio cache with an asynchronous group pre-reading method.
Once the traversal order for AI training is prepared, data traversal can begin. In the invention, during AI training iterations, repeated I/O is served by Local Cache short-circuit reads while subsequent chunks are pre-read into the Alluxio cache by the asynchronous group pre-reading method, improving the cache hit rate and the data access rate.
Step 1: when a small file of a group is accessed, judge whether the file exists in the Local Cache; if not, execute step 2, otherwise execute step 3.
Step 2: convert the I/O request for the small file into an I/O request for its chunk. This request fetches the chunk from the underlying storage system and stores it in the Alluxio cache. The chunk is then parsed, the target small file is taken out of it, and all the small files in the chunk are stored in the Local Cache. Then jump to step 5.
Step 3: if the target small file exists in the Local Cache, the chunk corresponding to the small file has already been accessed, so this access is judged to be repeated I/O. The client intercepts the I/O request issued for the small file and takes the target small file directly out of the Local Cache by a Local Cache short-circuit read, then executes step 4.
Step 4: to avoid exhausting the Local Cache, after a small file has been read from the Local Cache the file is deleted from the Local Cache; then jump to step 5.
Step 5: judge whether all chunks in the current group have been accessed. For example, after chunk c in group 1 is accessed in FIG. 3, all the small files of chunk a, chunk b and chunk c are stored in the Local Cache, so all subsequent accesses in this group are Local Cache short-circuit reads. Asynchronous group pre-reading is therefore adopted: during the subsequent accesses, an I/O request for the chunks of the next group is issued to the underlying storage system, and those chunks are pre-read into the Alluxio cache. At the same time, jump back to step 1 to continue accessing the subsequent files.
With this method, during AI training iterations repeated I/O is served by Local Cache short-circuit reads, so the application does not repeatedly issue I/O requests for the same chunk; the number of I/O requests is reduced and data access is accelerated. Meanwhile, subsequent chunks are pre-read into the Alluxio cache by the asynchronous group pre-reading method, improving the cache hit rate and speeding up cache responses.
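A minimal sketch of this read path follows (Python; it reuses the LocalCache and mapping sketches above, and fetch_chunk is a hypothetical stand-in for reading a chunk through the Alluxio cache rather than a real Alluxio API call; on a miss it is assumed to pull the chunk from the underlying store into Alluxio and return {small-file name: bytes}):

```python
from concurrent.futures import ThreadPoolExecutor

class ChunkReader:
    """Illustrative per-iteration read path: short-circuit reads + async group pre-read."""

    def __init__(self, fetch_chunk, mapping, groups, local_cache):
        self.fetch_chunk = fetch_chunk   # chunk_id -> {small-file name: bytes}, via Alluxio
        self.mapping = mapping           # small-file name -> (chunk_id, position)
        self.groups = groups             # chunk ids of each group, in access order
        self.local_cache = local_cache
        self.touched = [set() for _ in groups]   # chunks of each group seen so far
        self.prefetched = set()                  # groups whose successor was already pre-read
        self.pool = ThreadPoolExecutor(max_workers=1)

    def read(self, filename, group_idx):
        chunk_id, _ = self.mapping[filename]
        if not self.local_cache.contains(filename):
            # Not a repeated I/O: convert the small-file request into a chunk
            # request, parse the chunk, and keep all of its files locally.
            self.local_cache.put_many(self.fetch_chunk(chunk_id))
        data = self.local_cache.take(filename)   # short-circuit read, then evict

        # Once every chunk of the current group has been touched, the rest of
        # the group is served from the Local Cache, so pre-read the next
        # group's chunks into the Alluxio cache in the background.
        self.touched[group_idx].add(chunk_id)
        if (group_idx not in self.prefetched
                and self.touched[group_idx] == set(self.groups[group_idx])
                and group_idx + 1 < len(self.groups)):
            self.prefetched.add(group_idx)
            for next_chunk in self.groups[group_idx + 1]:
                self.pool.submit(self.fetch_chunk, next_chunk)  # warms the Alluxio cache only
        return data
```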

Claims (6)

1. A distributed caching method for massive small files oriented to AI training, characterized by comprising the following steps:
step 1, creating a Local Cache with key-value storage at a client, and creating an Alluxio cache on distributed storage devices;
step 2, in the data storage stage of AI training, merging the data set into chunks based on a rule fitting the Batch Size characteristic;
step 3, before an AI training iteration, analyzing the cache state of the chunks, and generating the iteration's traversal order by applying a double-layer shuffle operation of chunk shuffle + in-group small-file shuffle to the small-file sequence;
step 4, during AI training iterations, serving repeated I/O with Local Cache short-circuit reads and, at the same time, pre-reading subsequent chunks into the Alluxio cache with an asynchronous group pre-reading method.
2. The AI-training-oriented distributed caching method for massive small files according to claim 1, characterized in that: merging the data set into chunks based on the Batch Size characteristic in step 2 means merging the small files into chunks according to a merging rule of 2 to the power N.
3. The AI-training-oriented distributed caching method for massive small files according to claim 1, characterized in that: analyzing the cache state of the chunks in step 3 means dividing the chunks, before the shuffle operation starts, into two sequences according to their cache state, namely the chunk sequence located in the cache and the chunk sequence not located in the cache, and performing the double-layer shuffle operation on the two sequences separately.
4. The AI-training-oriented distributed caching method for massive small files according to claim 1, characterized in that: the double-layer shuffle operation of chunk shuffle + in-group small-file shuffle applied to the small-file sequence in step 3 specifically comprises:
the first-layer shuffle operation is a chunk shuffle, i.e. all chunks in a sequence are obtained and shuffled as a whole within the sequence;
the second-layer shuffle operation is an in-group small-file shuffle, i.e. all chunks in a sequence are grouped, the small-file sequence of each group is obtained, and the small files are shuffled as a whole within each group.
5. The AI-training-oriented distributed caching method for massive small files according to claim 1, characterized in that: the repeated I/O in step 4 means: if the target file accessed by the current I/O request already exists in the Local Cache, the I/O request is a repeated I/O; the Local Cache short-circuit read means: the I/O request initiated at the client is intercepted, the target small file is taken out of the Local Cache, and the file is then deleted from the Local Cache.
6. The AI-training-oriented distributed caching method for massive small files according to claim 1, characterized in that: pre-reading subsequent chunks into the Alluxio cache with the asynchronous group pre-reading method in step 4 specifically comprises: judging whether all chunks of the current group are stored in the Local Cache; if so, subsequent accesses are all Local Cache short-circuit reads, so asynchronous group pre-reading is adopted: during the subsequent accesses, an I/O request for the chunks of the next group is initiated to the underlying storage system, and those chunks are pre-read into the Alluxio cache.
CN202111220222.0A (priority date 2021-10-20, filing date 2021-10-20) Distributed caching method for massive small files for AI training, status: Pending, published as CN113901007A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111220222.0A CN113901007A (en) 2021-10-20 2021-10-20 Distributed caching method for massive small files for AI training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111220222.0A CN113901007A (en) 2021-10-20 2021-10-20 Distributed caching method for massive small files for AI training

Publications (1)

Publication Number Publication Date
CN113901007A 2022-01-07

Family

ID=79192744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111220222.0A Pending CN113901007A (en) 2021-10-20 2021-10-20 Distributed caching method for massive small files for AI training

Country Status (1)

Country Link
CN (1) CN113901007A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794182A (en) * 2015-04-10 2015-07-22 中国科学院计算技术研究所 Small file asynchronous pre-reading device and method for parallel network file system
CN110598467A (en) * 2019-07-31 2019-12-20 北京大学 Memory data block integrity checking method
CN110515920A (en) * 2019-08-30 2019-11-29 北京浪潮数据技术有限公司 A kind of mass small documents access method and system based on Hadoop
CN112465046A (en) * 2020-12-03 2021-03-09 苏州浪潮智能科技有限公司 Method, system, equipment and medium for artificial intelligence training of mass small files

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185308A (en) * 2023-04-25 2023-05-30 山东英信计算机技术有限公司 Data set processing method, device, equipment, medium and model training system
CN116185308B (en) * 2023-04-25 2023-08-04 山东英信计算机技术有限公司 Data set processing method, device, equipment, medium and model training system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination