CN112306675A - Data processing method, related device and computer readable storage medium - Google Patents
- Publication number
- CN112306675A (application number CN202011084416.8A)
- Authority
- CN
- China
- Prior art keywords
- memory
- memory blocks
- multiplexing
- blocks
- operator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
Abstract
The invention discloses a data processing method, a related device and a computer-readable storage medium in the field of data processing. The data processing method comprises the following steps: a memory multiplexing client cuts the output memory of an operator and establishes a correspondence between the cut memory blocks and the memory block before cutting; the memory multiplexing client sends a memory multiplexing request to a memory multiplexing server; the memory multiplexing client receives a response message from the memory multiplexing server; and the memory multiplexing client sets one or more offsets of the operator's output memory according to the correspondence between the cut memory blocks and the memory block before cutting and the relative offsets of the memory blocks in the response message. By dividing the input and output caches of operators into smaller memory requirements, memory holes arising during memory multiplexing can be filled more easily, and the total memory required for the input and output caches of model operators during deep learning is reduced.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a data processing method, a related device, and a computer-readable storage medium capable of reducing edge computing memory usage.
Background
With the rise of artificial intelligence in recent years, many deep learning frameworks (such as TensorFlow, PyTorch, Caffe and the like) have emerged, new deep learning models appear continuously, and the application of deep learning in various industries has brought great convenience to people's lives, for example license plate recognition and online translation between languages.
A deep learning neural network model is shown in fig. 1: the leftmost layer is the input layer, the rightmost layer is the output layer, and the layers in between can be understood as computing nodes, which together form a layered hierarchy. The more nodes in each layer and the deeper the hierarchy, the more complex the deep learning logic.
The data that a computing node (also called an operator) needs as input must be stored somewhere; these buffers are called the operator's input caches. After the operator finishes computing it must output data, and the caches that store the output data are called its output caches. The number of input and output caches is determined by the operator's function definition: for addition, for example, adding two matrices requires two input caches, and adding three matrices requires three. The size of each cache is determined by the amount of input data and the data type.
Because a large number of computing nodes participate in the computation, the amount of data to be processed is large, and the storage required by the whole computation is correspondingly large, usually on the order of GB, which places strict requirements on the computing system. To adapt to this, Google introduced TensorFlow Lite for deep learning on mobile phones, which reduces the stored model size and the memory required during computation through weight quantization (representing 32-bit weights with 8 bits). There is therefore a need to multiplex the input and output memory of deep learning models in order to reduce the total memory requirement.
Meaning of memory multiplexing: the same memory block serves as the input/output memory of multiple operators. Because operators execute sequentially (several parallel sequential execution streams may exist), a memory block can be used as the input/output cache of operators whose life cycles do not overlap, which constitutes memory multiplexing.
Apache MXNet is a deep learning framework designed for efficiency and flexibility. It provides a simple and effective heuristic that first determines the number of memory blocks required and then calculates the size of each memory block, as follows.
The core idea: variables (input and output buffers) are allowed to share memory when their life cycles do not overlap. Each block of memory is given a reference counter and a color (the same color indicates the same block of memory); when the reference counter reaches 0, the block can be recycled into the memory block resource pool for use by other operators.
Referring to fig. 2, the process of determining the number of memory blocks required is as follows. A is the input operator and requires separate memory (its output reuses its input). The output buffer of each operator is the input buffer of the following operator, so the whole calculation only needs to consider the allocation of operator output buffers.
Step 1: operator B needs a block of output buffer for operators C and F to input buffer, the memory pool is empty, so a block of memory (red) needs to be newly allocated, and the reference count is 2 (the reference count needs to be used as the input buffer of operators C and F);
step 2: executing an operator C, wherein the operator C needs a block of output cache, a memory pool is empty, a green memory is allocated, the reference count is 1, and meanwhile, the reference count of the red memory block is reduced by one (the execution of the operator C is finished);
step 3: executing an operator F, wherein the operator F needs a block of output cache, the memory pool is empty at the moment, a blue memory is allocated, the reference count is 1, the reference count of the red memory block is subtracted by 1 after the operator F finishes executing, and the reference count is 0 at the moment, so that the red memory block is put into the memory block pool;
step 4: executing an operator E, needing to output a cache, wherein a red memory is arranged in a cache pool, so that a new memory block is not required to be applied, after the execution of the operator E is finished, the reference count of an output memory (green) block of the operator C is reduced by one, the count is equal to 0, and the operator E is placed in a memory block pool;
step 5: and executing the operator G, wherein the output memory of the operator G can multiplex the input memory (namely the red memory block) according to the characteristic of the operator, a new memory block does not need to be allocated from the memory pool, and after the execution of the operator G is finished, the reference count of the output cache block (blue) of the operator F is reduced by one.
Thus, not counting the initial input memory block, inference over the entire model requires 3 memory blocks: red, green and blue.
The size of each memory block is then determined: each colored block is sized to the maximum of the output caches of the operators assigned to it, and the sum of all blocks is the total memory required during operator execution.
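The C++ sketch below illustrates the reference-counting heuristic and block sizing described above. It is a simplified illustration, not MXNet's actual implementation; the Node structure, the function names, and the assumption that operators are visited in topological order are all illustrative.

```cpp
// Minimal sketch of the reference-counting reuse heuristic described above.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node {
    std::size_t outputBytes;      // size of this operator's output buffer
    std::vector<int> consumers;   // indices of operators that read this output
};

// Returns, for each node, the "color" (shared block id) of its output buffer,
// and fills blockSizes with the final size of each shared block.
std::vector<int> assignBlocks(const std::vector<Node>& topoOrder,
                              std::vector<std::size_t>& blockSizes) {
    std::vector<int> color(topoOrder.size(), -1);
    std::vector<int> refCount(topoOrder.size(), 0);
    std::vector<int> freePool;    // recycled block ids
    blockSizes.clear();

    for (std::size_t i = 0; i < topoOrder.size(); ++i) {
        // Allocate an output block: reuse one from the pool if possible.
        int blk;
        if (!freePool.empty()) {
            blk = freePool.back();
            freePool.pop_back();
        } else {
            blk = static_cast<int>(blockSizes.size());
            blockSizes.push_back(0);
        }
        color[i] = blk;
        // Each block ends up as large as the largest buffer mapped onto it.
        blockSizes[blk] = std::max(blockSizes[blk], topoOrder[i].outputBytes);
        refCount[i] = static_cast<int>(topoOrder[i].consumers.size());

        // This operator has consumed its inputs: decrement the producers'
        // counts and recycle blocks whose count reaches zero.
        for (std::size_t p = 0; p < i; ++p) {
            for (int c : topoOrder[p].consumers) {
                if (c == static_cast<int>(i) && --refCount[p] == 0)
                    freePool.push_back(color[p]);
            }
        }
    }
    return color;
}
```

The total memory required by this heuristic is simply the sum of the entries of blockSizes.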
From the above it can be seen that a 1M gap remains between the green and red blocks, and that the life cycles of the blue and purple blocks do not overlap, so part of the storage space is obviously wasted.
Disclosure of Invention
To solve the problem of high memory occupancy during existing data processing, the invention provides a method and related equipment capable of reducing memory occupancy during edge computing.
In order to achieve the above object, a first aspect of the present invention provides a data processing method, including:
the memory multiplexing client cuts the output memory of the operator and establishes the corresponding relation between the cut memory blocks and the memory blocks before cutting;
the memory multiplexing client sends a memory multiplexing request to a memory multiplexing server; the memory multiplexing request comprises a memory block list to be multiplexed;
the memory multiplexing client receives a response message of the memory multiplexing server; wherein the response message includes the relative offset of the memory blocks to be arranged;
and the memory multiplexing client sets one or more offsets of the operator output memory according to the corresponding relation between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and sets the offsets and sizes of one or more memory blocks of the suffix operator.
Optionally, the memory block list includes the size of the memory blocks to be arranged after cutting, the identifier of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relationship of the memory blocks to be arranged.
Optionally, the memory multiplexing request further includes a message type and a request identifier; the response message further includes a message type and a request identification.
In the data processing method, optionally, the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server placing the memory blocks in the memory block list according to the information in the memory block list.
In a second aspect, the present invention provides a memory multiplexing client, including:
the first processing unit is used for cutting the output memory of the operator and establishing the corresponding relation between the plurality of cut memory blocks and the memory blocks before cutting;
the first sending unit is used for sending a memory multiplexing request to the memory multiplexing server; the memory multiplexing request comprises a memory block list to be multiplexed;
a first receiving unit, configured to receive a response message of the memory multiplexing server; wherein the response message includes the relative offset of the memory blocks to be arranged;
and the second processing unit is used for setting one or more offsets of the memory output by the operator according to the corresponding relation between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and simultaneously setting the offsets and the sizes of one or more memory blocks of the postfix operator.
In the above memory multiplexing client, optionally, the memory block list includes the size of the memory blocks to be arranged after cutting, the identifier of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relationship of the memory blocks to be arranged.
In the foregoing memory multiplexing client, optionally, the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server placing the memory blocks in the memory block list according to the information in the memory block list.
In a third aspect, the present invention provides a memory multiplexing server, including:
the second receiving unit is used for receiving the memory multiplexing request sent by the memory multiplexing client; the memory multiplexing request comprises a memory block list to be multiplexed;
the third processing unit is configured to lay the memory blocks in the memory block list according to the information in the memory block list, and ensure that the memory blocks are not overlapped so as to determine the relative offsets of the memory blocks to be laid;
and a second sending unit, configured to send a response message, where the response message includes a relative offset of the memory blocks to be arranged.
In the above memory multiplexing server, optionally, the memory block list includes the size of the memory blocks to be arranged after cutting, the identifier of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relationship of the memory blocks to be arranged.
In the above memory multiplexing server, optionally, the memory multiplexing client is configured to cut an output memory of an operator, and establish a correspondence between a plurality of cut memory blocks and a memory block before cutting.
In the above memory multiplexing server, optionally, the memory multiplexing client is further configured to set one or more offsets of the memory output by the operator according to the correspondence between the memory blocks after the cutting and the memory blocks before the cutting and the relative offsets of the memory blocks in the response message, and set the offsets and sizes of the one or more memory blocks of the suffix operator at the same time.
Compared with the prior art, the invention has the following beneficial effects: by dividing the input and output caches of operators into smaller memory requirements, memory holes arising during memory multiplexing can be filled more easily, and the total memory required for the input and output caches of model operators during deep learning is reduced. Compared with the situation before division, there are fewer holes during memory multiplexing, memory multiplexing is more efficient, less total memory is required, less data is exchanged between the cache and the DDR main memory, the CPU waits less, and CPU utilization is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a model diagram of a deep neural network;
FIG. 2 is a flow diagram of existing deep neural network data processing;
FIG. 3 is a flow chart of a data processing method in the present invention;
FIG. 4 is a schematic diagram of memory multiplexing in the present invention;
FIG. 5 is a schematic diagram of a position constraint relationship;
FIGS. 6 and 7 are schematic diagrams of the Concat operator data processing;
FIG. 8 is a schematic diagram of a non-overlapping algorithm;
FIG. 9 is a schematic illustration of an address mapping;
FIGS. 10 and 11 are data processing flow diagrams of the edge computing device when in an offline mode;
FIG. 12 is a flow chart of the data processing method of the present invention applied in the APP of the MEP or MEC or in the dedicated hardware;
FIG. 13 is a schematic diagram of the slicing of the output memory of an operator;
FIG. 14 is a flow chart of the slicing of the output memory of an operator;
FIG. 15 is a diagram illustrating the layout of the memory after cutting according to the present invention;
FIG. 16 is a flow chart of a data processing method in the present invention applied to a mobile device;
FIG. 17 is a flow chart of a data processing method in the present invention in an application (cloud) server;
FIG. 18 is a block diagram of a memory multiplexing client in the present invention;
fig. 19 is a structural diagram of the memory multiplexing server in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 3, the present embodiment provides a data processing method, including the following steps:
step1, a memory multiplexing client (hereinafter referred to as a client) cuts an output memory of an operator, cuts a cuttable memory according to a minimum cuttable memory block, and simultaneously establishes a corresponding relation between a plurality of cut memory blocks and a memory block before cutting; outputting a memory block list to be multiplexed, wherein the memory block list to be multiplexed comprises the size of the memory blocks to be arranged after cutting, the identification of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged and the position constraint relation of the memory blocks to be arranged;
for example, as shown in fig. 4, operator B fetches a 1K × 1K matrix of 8-bit data from the input buffer each time, that is, 1MB of data per fetch, so the required 4MB buffer can be divided into 4 memory blocks to be arranged, each of 1MB. Cache B (before cutting) corresponds to 4 memories after cutting: cache B1, cache B2, cache B3, and cache B4.
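A minimal C++ sketch of this cutting step is given below, assuming a fixed minimum cut size and string block identifiers (both assumptions for illustration; the patent does not prescribe a data structure):

```cpp
// Split an operator's output buffer into minimum-size chunks and record the
// post-cut -> pre-cut correspondence (here "cache B" split into B1..B4).
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct CutBlock {
    std::string id;     // e.g. "B1"
    std::size_t size;   // bytes
};

std::vector<CutBlock> cutBuffer(const std::string& bufferId,
                                std::size_t totalBytes,
                                std::size_t minCutBytes,
                                std::map<std::string, std::string>& cutToOriginal) {
    std::vector<CutBlock> blocks;
    std::size_t remaining = totalBytes;
    int index = 1;
    while (remaining > 0) {
        std::size_t sz = std::min(remaining, minCutBytes);
        CutBlock blk{bufferId + std::to_string(index++), sz};
        cutToOriginal[blk.id] = bufferId;   // e.g. "B1" -> "B"
        blocks.push_back(blk);
        remaining -= sz;
    }
    return blocks;
}

// Example: a 4 MB output buffer cut at 1 MB granularity -> B1..B4, all mapped to "B".
// std::map<std::string, std::string> m;
// auto blocks = cutBuffer("B", 4u << 20, 1u << 20, m);
```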
It should be noted that, as shown in fig. 5, a position constraint relationship means that certain memory blocks must be stored contiguously. This usually arises when model operators are optimized. For example, the Concat operator combines the data of 2 or more memory blocks into one large memory block; if the inputs are stored contiguously, the Concat operator can be optimized away, saving time during deep learning inference.
As shown in fig. 6, the Concat operator would normally read the data in cache A and cache B and write it into a new cache C without changing the relative positions of the data. If cache A and cache B are simply stored contiguously, as shown in fig. 7, cache C need not be allocated at all, and a large amount of reading and writing is avoided, which improves efficiency.
Step 2: the client sends a memory multiplexing request to a memory multiplexing server (hereinafter "the server"). The request contains the list of memory blocks to be multiplexed, a message type and a request identifier; the memory block list is mandatory and the rest are optional. Each list entry includes the memory block ID, the memory block size, the life cycle start and end values, and position constraints with respect to other memory blocks.
It should be understood that the memory block identifier, request identifier and message type may be numeric, text, or numeric plus text.
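One possible C++ representation of the request described above and of the response it will receive (Step 3 below) is sketched here; the struct and field names are illustrative assumptions, and only the information carried (mandatory block list, optional message type and request identifier) follows the text.

```cpp
// Assumed in-memory layout of the memory multiplexing request and response.
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct MemBlockItem {
    std::string id;                       // memory block identifier, e.g. "B1"
    std::size_t size;                     // bytes
    int lifeBegin;                        // first operator index using the block
    int lifeEnd;                          // last operator index using the block
    std::vector<std::string> mustFollow;  // position constraints: blocks that must
                                          // be stored contiguously before this one
};

struct MemReuseRequest {
    std::vector<MemBlockItem> blocks;     // mandatory
    std::optional<int> messageType;       // optional
    std::optional<std::string> requestId; // optional
};

struct PlacedBlock {
    std::string id;
    std::size_t offset;                   // relative offset assigned by the server
    std::size_t size;
};

struct MemReuseResponse {
    std::vector<PlacedBlock> placements;  // mandatory
    std::optional<int> messageType;
    std::optional<std::string> requestId;
};
```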
Step 3: after receiving the request, the server determines the offset and size of each memory block in the memory block list using a non-overlapping algorithm, and then sends a response message to the client. The response message contains the relative offsets of the memory blocks to be arranged (mandatory), the message type and the message identifier.
The non-overlapping algorithm is illustrated in fig. 8: block 1 is memory that has already been placed, and block 2 may be placed at any of the four positions A/B/C/D. At positions A and C the two blocks do not overlap in memory space, and at positions B and D they do not overlap in life cycle.
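A minimal sketch of one possible placement strategy consistent with fig. 8 is given below: a new block may share address space only with blocks whose life cycles do not overlap, otherwise their address ranges must be disjoint. The first-fit search is an assumption for illustration, not necessarily the server's actual algorithm.

```cpp
// First-fit placement under the non-overlap rule of fig. 8.
#include <cstddef>
#include <vector>

struct Placed {
    std::size_t offset, size;
    int lifeBegin, lifeEnd;
};

static bool livesOverlap(const Placed& a, int begin, int end) {
    return !(end < a.lifeBegin || begin > a.lifeEnd);
}

static bool addrsOverlap(const Placed& a, std::size_t off, std::size_t size) {
    return !(off + size <= a.offset || a.offset + a.size <= off);
}

// Smallest offset at which a block of `size` bytes, alive over
// [lifeBegin, lifeEnd], fits without conflicting with placed blocks.
std::size_t placeBlock(const std::vector<Placed>& placed,
                       std::size_t size, int lifeBegin, int lifeEnd) {
    std::size_t offset = 0;
    bool moved = true;
    while (moved) {
        moved = false;
        for (const Placed& p : placed) {
            if (livesOverlap(p, lifeBegin, lifeEnd) && addrsOverlap(p, offset, size)) {
                offset = p.offset + p.size;   // push past the conflicting block
                moved = true;
            }
        }
    }
    return offset;
}
```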
Step 4: after receiving the response message, the client sets one or more offsets of the operator's output memory, referring to the correspondence between the cut memory blocks and the memory block before cutting and to the offsets of the memory blocks in the response message, and at the same time sets the offsets and sizes of one or more memory blocks of the suffix operator.
An example: the response message contains the offsets of 4 memory blocks {{B1, 0x00000000}, {B2, 0x00100000}, {B3, 0x00200000}, {B4, 0x00300000}}. The client looks up the correspondence and finds that these blocks belong to cache B, each partitioned block is 1MB, cache B corresponds to the first output of operator A, and the successor of that first output is the second input of operator B.
First setting method: the first output of operator A and the second input of operator B are each given the following 4 offsets and sizes {{1, 0x00000000, 0x100000}, {2, 0x00100000, 0x100000}, {3, 0x00200000, 0x100000}, {4, 0x00300000, 0x100000}}.
Second setting method: since the 4 memory blocks are contiguous, only one offset and size need to be set. The first output of operator A and the second input of operator B are each given a single offset and size {{1, 0x00000000, 0x400000}}. If the partitioned memory blocks are not laid out at sequentially increasing, contiguous addresses, a read/write address mapping table must be established; the read/write address is examined and an appropriate offset value is added to it, as shown below.
Suppose the data in the server's response is {{1, 0x00800000, 0x100000}, {2, 0x00700000, 0x100000}, {3, 0x00600000, 0x100000}, {4, 0x00a00000, 0x100000}}.
(5) The operator reads and writes data. Because reads and writes are spread across several memory blocks, address translation is needed when the operator accesses data, which is easy to achieve. Taking the above 4 partitioned memory blocks as an example, the translation relationship shown in fig. 9 is established: when the access offset is greater than or equal to 0 and less than 0x100000, it maps to the block starting at 0x00800000; when it is greater than or equal to 0x100000 and less than 0x200000, it maps to the block starting at 0x00700000; when it is greater than or equal to 0x200000 and less than 0x300000, it maps to the block starting at 0x00600000; and when it is greater than or equal to 0x300000 and less than 0x400000, it maps to the storage starting at 0x00a00000.
Here, adding the offset value is an operation with very small overhead, negligible compared with matrix multiplication and addition operations.
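A minimal sketch of such a read/write address translation table follows, using the four placements from the example response above; the names and table layout are illustrative assumptions.

```cpp
// Translate a logical offset inside the operator's buffer to a physical address
// when the buffer is spread over non-contiguous blocks.
#include <cstddef>
#include <vector>

struct Segment {
    std::size_t logicalBase;   // start of this slice in the logical buffer
    std::size_t physicalBase;  // where the slice actually lives
    std::size_t size;
};

std::size_t translate(const std::vector<Segment>& table, std::size_t logical) {
    for (const Segment& s : table) {
        if (logical >= s.logicalBase && logical < s.logicalBase + s.size)
            return s.physicalBase + (logical - s.logicalBase);
    }
    return static_cast<std::size_t>(-1);   // out of range
}

// Example table for blocks placed at 0x00800000, 0x00700000, 0x00600000, 0x00a00000:
// { {0x000000, 0x00800000, 0x100000}, {0x100000, 0x00700000, 0x100000},
//   {0x200000, 0x00600000, 0x100000}, {0x300000, 0x00a00000, 0x100000} }
```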
Referring to fig. 10, in practical use, an offline execution model may be generated according to the method of the present invention in advance outside the edge computing device (the relative offsets of the multiple memory blocks of the respective input and output memories are already calculated and stored in the offline model file).
Referring to fig. 11, the size and number of partitioned memory blocks may be set according to the data type, the number of accesses the operator needs, and the amount of data involved in one basic operation. The memory multiplexing client obtains guidance for reasonable partitioning from the operator concerned, partitions the memory blocks accordingly, and then calculates the relative offsets using the memory multiplexing algorithm.
In the above data processing method, the input and output caches of operators are divided into smaller memory requirements, so memory holes arising during memory multiplexing can be filled more easily, and the total memory required for the input and output caches of model operators during deep learning is reduced. Compared with the situation before division, there are fewer holes during memory multiplexing, memory multiplexing is more efficient, less total memory is required, less data is exchanged between the cache and the DDR main memory, the CPU waits less, and CPU utilization is improved.
The above is the principle of the data processing method in the present application; the method is further described below with reference to specific applications. It should be noted that the client and the server in the present invention may each be any of the following: a cloud server, a virtual machine, a container, a process, a thread, a function or a code block. The method can be implemented on personal mobile devices, computing clouds, and multi-access edge computing devices, or an offline computing model can be generated on one device and executed on a second device, with the input/output caches of the model's operators partitioned and the memory multiplexing algorithm applied when the offline model is generated.
Case 1: the invention illustrated in multi-access edge computing.
This embodiment uses the invention in an APP of an MEP or MEC (Multi-access Edge Computing) host, or in dedicated hardware.
As shown in fig. 12, the client and the server in the APP can be two threads or two functions in one thread, which are described below.
Referring to fig. 13, when function A is called, the input memory is cut; assume the cutting granularity is 1MB. In function A, storage space must be allocated for 4 memory blocks: M1 (2MB), M2 (1MB), M3 (1MB) and M4 (2MB), where M2, M3 and M4 must be stored contiguously and the offset of M2 must be larger than that of M3.
Referring to FIG. 14, assume also that M1 is the output of operator A, the second input of operator B.
Step 101: function A partitions M1 into M1_1 (1MB) and M1_2 (1MB), so the M1 memory block corresponds to M1_1 and M1_2, and constructs the request message in the function parameters; the memory block list is shown in table 1:
TABLE 1
Step 102: function A calls function B. The memory block list may be stored in a Vector or a List in C++ (for the contents see table 1); the optional parameter request identifier is omitted and the optional parameter message type is set to request memory multiplexing (represented by an enumerated value or the like).
Step 103: function B places the memory blocks according to the memory block list in the arguments, ensures that the blocks do not overlap, and returns the placement result in MemResVec. One possible placement result for the above memory block list is shown on the right side of fig. 15.
It can be seen that M1_1 and M1_2 are not contiguous and memory block M2 is placed between them; this layout requires 6MB of memory in total. If the M1 memory is not partitioned, the placement result may be as shown on the left side of the figure, requiring 7MB of memory.
The result returned for the right-hand layout is {{M4, 0x00000000, 0x00200000}, {M3, 0x00200000, 0x00100000}, {M1_1, 0x00200000, 0x00100000}, {M2, 0x00300000, 0x00200000}, {M1_2, 0x00500000, 0x00100000}}.
Step 104: after the call to function B returns, function A refers to the correspondence between the memory before and after division and sets the offsets and sizes of the several memory blocks of operator A's output cache and of operator B's second input cache. For simplicity, only the output buffer of operator A and the second input buffer of operator B are described here.
The output buffer of operator A is set to {{out, 1, 0x00200000, 0x00100000}, {out, 1, 0x00500000, 0x00100000}} and the second input buffer of operator B is set to {{in, 2, 0x00200000, 0x00100000}, {in, 2, 0x00500000, 0x00100000}}, where in/out indicates whether the entry is an input or an output of the operator and the second number indicates which input or output of the operator it is. After the operator is configured, during execution it writes data to, or reads data from, the specified memory block according to the data's offset and the corresponding offset value.
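A minimal sketch of how Step 104 might build these descriptor lists from the placement result and the pre-/post-cut correspondence is given below; the descriptor layout {direction, port, offset, size} mirrors the example above, and the names are illustrative assumptions. The same construction applies to steps 204 and 304 in the cases that follow.

```cpp
// Build the offset/size descriptors for every operator buffer that referenced
// the original (pre-cut) block, e.g. M1 -> {M1_1, M1_2}.
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct BufferDesc {
    std::string direction;   // "in" or "out"
    int port;                // which input/output of the operator
    std::size_t offset;
    std::size_t size;
};

std::vector<BufferDesc> buildDescriptors(
        const std::string& direction, int port,
        const std::vector<std::string>& cutIds,   // e.g. {"M1_1", "M1_2"}
        const std::map<std::string, std::pair<std::size_t, std::size_t>>& placement) {
    std::vector<BufferDesc> descs;
    for (const std::string& id : cutIds) {
        auto it = placement.find(id);             // {offset, size} from the response
        if (it != placement.end())
            descs.push_back({direction, port, it->second.first, it->second.second});
    }
    return descs;
}

// e.g. buildDescriptors("out", 1, {"M1_1","M1_2"}, placement) for operator A's output,
//      buildDescriptors("in",  2, {"M1_1","M1_2"}, placement) for operator B's 2nd input.
```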
Case 2: the invention illustrated on a mobile device.
The invention can be used in consumer products such as notebook computers, tablets or mobile phones.
Referring to FIG. 16, the client and server in the consumer product can be two threads or two functions in one thread, which are described below.
Referring to fig. 13, when thread A is called, the input memory is cut; assume the cutting granularity is 1MB. In thread A, storage space must be allocated for 4 memory blocks: M1 (2MB), M2 (1MB), M3 (1MB) and M4 (2MB), where M2, M3 and M4 must be stored contiguously and the offset of M2 must be larger than that of M3.
Referring to FIG. 14, assume also that M1 is the output of operator A, the second input of operator B.
Step 201: thread A partitions M1 into M1_1 (1MB) and M1_2 (1MB), so the M1 memory block corresponds to M1_1 and M1_2, and sends a memory multiplexing request message through inter-thread communication; the memory block list is shown in table 2:
TABLE 2
Step 202: thread A passes the request message to thread B through inter-thread communication. The memory block list may be stored in a Vector or a List in C++ (for the contents see table 2); the optional parameter request identifier is omitted and the optional parameter message type is set to request memory multiplexing (represented by an enumerated value or the like). The structured data is serialized into a byte stream for sending, and the receiving side deserializes it to recover the original data.
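A minimal sketch of one possible serialization of the memory block list into a byte stream, as mentioned above, is given here; the wire format (little-endian, length-prefixed fields) is an assumption for illustration, since the patent does not prescribe one.

```cpp
// Flatten the block list into a byte stream before handing it to the peer.
#include <cstdint>
#include <string>
#include <vector>

struct BlockRecord {
    std::string id;
    uint64_t size;
    uint32_t lifeBegin;
    uint32_t lifeEnd;
};

static void putU64(std::vector<uint8_t>& out, uint64_t v) {
    for (int i = 0; i < 8; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

std::vector<uint8_t> serialize(const std::vector<BlockRecord>& blocks) {
    std::vector<uint8_t> out;
    putU64(out, blocks.size());
    for (const BlockRecord& b : blocks) {
        putU64(out, b.id.size());
        out.insert(out.end(), b.id.begin(), b.id.end());  // raw id bytes
        putU64(out, b.size);
        putU64(out, b.lifeBegin);
        putU64(out, b.lifeEnd);
    }
    return out;   // the receiver walks the stream in the same order to deserialize
}
```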
Step 203: thread B places the memory blocks according to the memory block list in the message, ensures that the blocks do not overlap, serializes the placement result with MemResVec, and sends it to thread A through inter-thread communication. One possible placement result for the above memory block list is shown on the right side of fig. 15.
It can be seen that M1_1 and M1_2 are not contiguous and memory block M2 is placed between them; this layout requires 6MB of memory in total. If the M1 memory is not partitioned, the placement result may be as shown on the left side of the figure, requiring 7MB of memory.
The result returned for the right-hand layout is {{M4, 0x00000000, 0x00200000}, {M3, 0x00200000, 0x00100000}, {M1_1, 0x00200000, 0x00100000}, {M2, 0x00300000, 0x00200000}, {M1_2, 0x00500000, 0x00100000}}.
Step 204: after thread A receives thread B's response message, it refers to the correspondence between the memory before and after division and sets the offsets and sizes of the several memory blocks of operator A's output cache and of operator B's second input cache. For simplicity, only the output buffer of operator A and the second input buffer of operator B are described here.
The output buffer of operator A is set to {{out, 1, 0x00200000, 0x00100000}, {out, 1, 0x00500000, 0x00100000}} and the second input buffer of operator B is set to {{in, 2, 0x00200000, 0x00100000}, {in, 2, 0x00500000, 0x00100000}}, where in/out indicates whether the entry is an input or an output of the operator and the second number indicates which input or output of the operator it is. After the operator is configured, during execution it writes data to, or reads data from, the specified memory block according to the data's offset and the corresponding offset value.
Case 3: the invention illustrated on a (cloud) server.
The invention is used on (cloud) servers, including virtual machines, containers, or non-virtualized servers.
Referring to FIG. 17, the client and server can be two threads (processes) or two functions in one thread, illustrated as two processes.
Referring to fig. 13, when process A is called, the input memory is cut; assume the cutting granularity is 1MB. In process A, storage space must be allocated for 4 memory blocks: M1 (2MB), M2 (1MB), M3 (1MB) and M4 (2MB), where M2, M3 and M4 must be stored contiguously and the offset of M2 must be larger than that of M3.
As shown in fig. 14, assume that M1 is the output of operator a and the second input of operator B.
Step 301: process A partitions M1 into M1_1 (1MB) and M1_2 (1MB), so the M1 memory block corresponds to M1_1 and M1_2, and process A sends a memory multiplexing request message through inter-process communication; the memory block list is shown in table 3:
TABLE 3
Step 302: process A passes the request message to process B through inter-process communication. The memory block list may be stored in a Vector or a List in C++ (for the contents see table 3); the optional parameter request identifier is omitted and the optional parameter message type is set to request memory multiplexing (represented by an enumerated value or the like). The structured data is serialized into a byte stream for sending, and the receiving side deserializes it to recover the original data.
Step 303: process B places the memory blocks according to the memory block list in the message, ensures that the blocks do not overlap, serializes the placement result with MemResVec, and sends it to process A through inter-process communication. One possible placement result for the above memory block list is shown on the right side of fig. 15.
It can be seen that M1_1 and M1_2 are not contiguous and memory block M2 is placed between them; this layout requires 6MB of memory in total. If the M1 memory is not partitioned, the placement result may be as shown on the left side of the figure, requiring 7MB of memory.
The result returned for the right-hand layout is {{M4, 0x00000000, 0x00200000}, {M3, 0x00200000, 0x00100000}, {M1_1, 0x00200000, 0x00100000}, {M2, 0x00300000, 0x00200000}, {M1_2, 0x00500000, 0x00100000}}.
Step 304: after process A receives process B's response message, it refers to the correspondence between the memory before and after division and sets the offsets and sizes of the several memory blocks of operator A's output cache and of operator B's second input cache. For simplicity, only the output buffer of operator A and the second input buffer of operator B are described here.
The output buffer of operator A is set to {{out, 1, 0x00200000, 0x00100000}, {out, 1, 0x00500000, 0x00100000}} and the second input buffer of operator B is set to {{in, 2, 0x00200000, 0x00100000}, {in, 2, 0x00500000, 0x00100000}}, where in/out indicates whether the entry is an input or an output of the operator and the second number indicates which input or output of the operator it is. After the operator is configured, during execution it writes data to, or reads data from, the specified memory block according to the data's offset and the corresponding offset value.
In the data processing method of these embodiments, the input and output caches of operators are divided into smaller memory requirements, so memory holes arising during memory multiplexing can be filled more easily, and the total memory required for the input and output caches of model operators during deep learning is reduced. Compared with the situation before division, there are fewer holes during memory multiplexing, memory multiplexing is more efficient, less total memory is required, less data is exchanged between the cache and the DDR main memory, the CPU waits less, and CPU utilization is improved.
In some embodiments, the present invention further provides a memory multiplexing client, as shown in fig. 18, where the memory multiplexing client includes:
the first processing unit 101 is configured to cut an output memory of an operator, and establish a correspondence between a plurality of cut memory blocks and a memory block before cutting; the specific processing procedure has already been described in detail in step1 of the data processing method, and is not described herein again.
A first sending unit 102, configured to send a memory multiplexing request to a memory multiplexing server; the memory multiplexing request comprises a memory block list to be multiplexed; the memory block list includes the size of the memory blocks to be arranged after cutting, the identification of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relationship of the memory blocks to be arranged. The specific processing procedure has already been described in detail in step2 of the data processing method, and is not described herein again.
A first receiving unit 103, configured to receive a response message of the memory multiplexing server; wherein the response message includes the relative offset of the memory blocks to be arranged; and the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server arranging the memory blocks in the memory block list and ensuring that the memory blocks are not overlapped. The specific processing procedure has already been described in detail in step3 of the data processing method, and is not described herein again.
The second processing unit 104 is configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after the cutting and the memory blocks before the cutting and the relative offsets of the memory blocks in the response message, and set the offsets and sizes of one or more memory blocks of the suffix operator at the same time. The specific processing procedure has already been described in detail in step4 of the data processing method, and is not described herein again.
In some other embodiments, the present invention provides a memory multiplexing server, as shown in fig. 19, including:
a second receiving unit 201, configured to receive a memory multiplexing request sent by a memory multiplexing client; the memory multiplexing request comprises a memory block list to be multiplexed; the memory block list includes the size of the memory blocks to be arranged after cutting, the identification of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relationship of the memory blocks to be arranged.
A third processing unit 202, configured to lay the memory blocks in the memory block list according to the information in the memory block list, and ensure that the memory blocks are not overlapped, thereby determining a relative offset of the memory blocks to be laid; the specific data processing procedure has already been described in detail in step 3 of the data processing method, and is not described herein again.
A second sending unit 203, configured to send a response message, where the response message includes a relative offset of the memory blocks to be arranged.
In addition, the memory multiplexing client is used for cutting the output memory of the operator and establishing the corresponding relation between the plurality of cut memory blocks and the memory blocks before cutting.
In addition, the memory multiplexing client is further configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after the cutting and the memory blocks before the cutting and the relative offsets of the memory blocks in the response message, and set the offsets and sizes of the one or more memory blocks of the suffix operator at the same time.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium may store a program, and when the program is executed, the program includes some or all of the steps of any one of the data processing methods described in the above method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
An exemplary flow of the data processing method according to an embodiment of the present invention has been described above with reference to the accompanying drawings. It should be noted that the numerous details included in the above description are merely exemplary of the invention and are not limiting of the invention. In other embodiments of the invention, the method may have more, fewer, or different steps, and the order, inclusion, function, etc. of the steps may be different from that described and illustrated.
Claims (12)
1. A data processing method, comprising:
the memory multiplexing client cuts the output memory of the operator and establishes the corresponding relation between the cut memory blocks and the memory blocks before cutting;
the memory multiplexing client sends a memory multiplexing request to a memory multiplexing server; the memory multiplexing request comprises a memory block list to be multiplexed;
the memory multiplexing client receives a response message of the memory multiplexing server; wherein the response message includes the relative offset of the memory blocks to be arranged;
and the memory multiplexing client sets one or more offsets of the operator output memory according to the corresponding relation between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and sets the offsets and sizes of one or more memory blocks of the suffix operator.
2. The data processing method of claim 1, wherein: the memory block list includes the size of the memory blocks to be arranged after cutting, the identification of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relation of the memory blocks to be arranged.
3. The data processing method of claim 1, wherein: and the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server according to the information in the memory block list to arrange the memory blocks in the memory block list.
4. The data processing method of claim 1, wherein: the memory multiplexing request also comprises a message type and a request identifier; the response message further includes a message type and a request identification.
5. A memory multiplexing client, comprising:
the first processing unit is used for cutting the output memory of the operator and establishing the corresponding relation between the plurality of cut memory blocks and the memory blocks before cutting;
the first sending unit is used for sending a memory multiplexing request to the memory multiplexing server; the memory multiplexing request comprises a memory block list to be multiplexed;
a first receiving unit, configured to receive a response message of the memory multiplexing server; wherein the response message includes the relative offset of the memory blocks to be arranged;
and the second processing unit is used for setting one or more offsets of the memory output by the operator according to the corresponding relation between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and simultaneously setting the offsets and the sizes of one or more memory blocks of the postfix operator.
6. The memory multiplexing client of claim 5, wherein: the memory block list includes the size of the memory blocks to be arranged after cutting, the identification of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relation of the memory blocks to be arranged.
7. The memory multiplexing client of claim 5, wherein: and the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server according to the information in the memory block list to arrange the memory blocks in the memory block list.
8. A memory multiplexing server, comprising:
the second receiving unit is used for receiving the memory multiplexing request sent by the memory multiplexing client; the memory multiplexing request comprises a memory block list to be multiplexed;
the third processing unit is configured to lay the memory blocks in the memory block list according to the information in the memory block list, and ensure that the memory blocks are not overlapped so as to determine the relative offsets of the memory blocks to be laid;
and a second sending unit, configured to send a response message, where the response message includes a relative offset of the memory blocks to be arranged.
9. The memory multiplexing server of claim 6, wherein: the memory block list includes the size of the memory blocks to be arranged after cutting, the identification of the memory blocks to be arranged, the life starting and ending time of the memory blocks to be arranged, and the position constraint relation of the memory blocks to be arranged.
10. The memory multiplexing server of claim 6, wherein: the memory multiplexing client is used for cutting the output memory of the operator and establishing the corresponding relation between the plurality of cut memory blocks and the memory blocks before cutting.
11. The memory multiplexing server of claim 8, wherein: the memory multiplexing client is further configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and set the offsets and sizes of the one or more memory blocks of the suffix operator at the same time.
12. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of a data processing method according to any one of claims 1 to 3 for reducing edge computing memory footprint.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011084416.8A CN112306675B (en) | 2020-10-12 | 2020-10-12 | Data processing method, related device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011084416.8A CN112306675B (en) | 2020-10-12 | 2020-10-12 | Data processing method, related device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112306675A true CN112306675A (en) | 2021-02-02 |
CN112306675B CN112306675B (en) | 2024-06-04 |
Family
ID=74488411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011084416.8A Active CN112306675B (en) | 2020-10-12 | 2020-10-12 | Data processing method, related device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112306675B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113608881A (en) * | 2021-10-09 | 2021-11-05 | 腾讯科技(深圳)有限公司 | Memory allocation method, device, equipment, readable storage medium and program product |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103309767A (en) * | 2012-03-08 | 2013-09-18 | 阿里巴巴集团控股有限公司 | Method and device for processing client log |
CN108780656A (en) * | 2016-03-10 | 2018-11-09 | 美光科技公司 | Device and method for logic/memory device |
US20190278707A1 (en) * | 2018-03-12 | 2019-09-12 | Beijing Horizon Information Technology Co., Ltd. | Methods and Apparatus For Using Circular Addressing in Convolutional Operation |
CN110597616A (en) * | 2018-06-13 | 2019-12-20 | 华为技术有限公司 | Memory allocation method and device for neural network |
CN110766135A (en) * | 2019-10-15 | 2020-02-07 | 北京芯启科技有限公司 | Method for storing required data when optimizing operation function of neural network in any depth |
CN111105018A (en) * | 2019-10-21 | 2020-05-05 | 深圳云天励飞技术有限公司 | Data processing method and device |
CN111401532A (en) * | 2020-04-28 | 2020-07-10 | 南京宁麒智能计算芯片研究院有限公司 | Convolutional neural network reasoning accelerator and acceleration method |
-
2020
- 2020-10-12 CN CN202011084416.8A patent/CN112306675B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103309767A (en) * | 2012-03-08 | 2013-09-18 | 阿里巴巴集团控股有限公司 | Method and device for processing client log |
CN108780656A (en) * | 2016-03-10 | 2018-11-09 | 美光科技公司 | Device and method for logic/memory device |
US20190278707A1 (en) * | 2018-03-12 | 2019-09-12 | Beijing Horizon Information Technology Co., Ltd. | Methods and Apparatus For Using Circular Addressing in Convolutional Operation |
CN110597616A (en) * | 2018-06-13 | 2019-12-20 | 华为技术有限公司 | Memory allocation method and device for neural network |
CN110766135A (en) * | 2019-10-15 | 2020-02-07 | 北京芯启科技有限公司 | Method for storing required data when optimizing operation function of neural network in any depth |
CN111105018A (en) * | 2019-10-21 | 2020-05-05 | 深圳云天励飞技术有限公司 | Data processing method and device |
CN111401532A (en) * | 2020-04-28 | 2020-07-10 | 南京宁麒智能计算芯片研究院有限公司 | Convolutional neural network reasoning accelerator and acceleration method |
Non-Patent Citations (1)
Title |
---|
ZHANG Chuan et al.: "DNA Computing for Combinational Logic", Science China (SCIENTIA SINICA), vol. 49, no. 7, pages 819 - 837 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113608881A (en) * | 2021-10-09 | 2021-11-05 | 腾讯科技(深圳)有限公司 | Memory allocation method, device, equipment, readable storage medium and program product |
CN113608881B (en) * | 2021-10-09 | 2022-02-25 | 腾讯科技(深圳)有限公司 | Memory allocation method, device, equipment, readable storage medium and program product |
Also Published As
Publication number | Publication date |
---|---|
CN112306675B (en) | 2024-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107437110B (en) | Block convolution optimization method and device of convolutional neural network | |
US11018979B2 (en) | System and method for network slicing for service-oriented networks | |
US10209908B2 (en) | Optimization of in-memory data grid placement | |
US9298760B1 (en) | Method for shard assignment in a large-scale data processing job | |
CN103347055B (en) | Task processing system in cloud computing platform, Apparatus and method for | |
US20160359668A1 (en) | Virtual machine placement optimization with generalized organizational scenarios | |
CN108491263A (en) | Data processing method, data processing equipment, terminal and readable storage medium storing program for executing | |
CN109447253B (en) | Video memory allocation method and device, computing equipment and computer storage medium | |
CN107729138B (en) | Method and device for analyzing high-performance distributed vector space data | |
CN111984400A (en) | Memory allocation method and device of neural network | |
US20190377606A1 (en) | Smart accelerator allocation and reclamation for deep learning jobs in a computing cluster | |
WO2017000645A1 (en) | Method and apparatus for allocating host resource | |
CN116991560B (en) | Parallel scheduling method, device, equipment and storage medium for language model | |
CN112433812A (en) | Method, system, equipment and computer medium for virtual machine cross-cluster migration | |
KR102326586B1 (en) | Method and apparatus for processing large-scale distributed matrix product | |
CN112306675A (en) | Data processing method, related device and computer readable storage medium | |
CN115167992A (en) | Task processing method, system, device, server, medium, and program product | |
EP4057142A1 (en) | Job scheduling method and job scheduling apparatus | |
EP4012573A1 (en) | Graph reconstruction method and apparatus | |
US20170364809A1 (en) | Parallelization techniques for variable selection and predictive models generation and its applications | |
CN112912849B (en) | Graph data-based calculation operation scheduling method, system, computer readable medium and equipment | |
CN115061825B (en) | Heterogeneous computing system and method for private computing, private data and federal learning | |
CN114995770B (en) | Data processing method, device, equipment, system and readable storage medium | |
CN107025099B (en) | Asynchronous graph calculation implementation method and system based on double-queue model | |
CN112905223A (en) | Method, device and equipment for generating upgrade package |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address |
Address after: No. 9 Mozhou East Road, Nanjing City, Jiangsu Province, 211111
Patentee after: Zijinshan Laboratory
Country or region after: China
Address before: No. 9 Mozhou East Road, Jiangning Economic Development Zone, Jiangning District, Nanjing City, Jiangsu Province
Patentee before: Purple Mountain Laboratories
Country or region before: China