CN112532251A - Data processing method and device - Google Patents

Publication number
CN112532251A
Authority
CN
China
Prior art keywords
gradient
data
quantization
matrix
codebook
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910891200.3A
Other languages
Chinese (zh)
Inventor
郑尚策
董永汉
于璠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201910891200.3A
Publication of CN112532251A

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059 Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/55 Compression Theory, e.g. compression of random number, repeated compression

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a data processing method that can be used in the field of artificial intelligence (AI). The method comprises the following steps: a terminal device divides gradient data of a model to be trained into gradient segments and performs quantization conversion on the gradient segments according to a codebook, wherein the number of elements in the quantized representation obtained by the quantization conversion is less than the number of elements in the gradient segment; the terminal device then sends the quantized representation to a cloud device. Because the number of elements in the quantized representation is less than the number of elements in the gradient segment, the compression ratio of the gradient data is improved compared with element-wise quantization. Because the compression ratio is improved, the communication overhead when the terminal device transmits the quantized representation to the cloud device is also reduced.

Description

Data processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and device.
Background
With the growth of hardware computing power and the emergence of big data in recent years, deep learning models have been developed and applied at an unprecedented pace. A deep learning model needs to be trained before it is applied. Federated learning is a model training scheme in which a cloud device trains a model jointly with a plurality of terminal devices while addressing the privacy problem. Federated learning comprises multiple rounds of training; in each round, a number of terminal devices are selected according to a certain rule, which to some extent alleviates the problem that the amount of sample data on each end side is small. During federated learning, user privacy data is used locally on the end side and does not need to be uploaded to the cloud side, so the privacy disclosure problem can be addressed. The gradients obtained by training on each end side are uploaded to the cloud side, the cloud side aggregates the gradient data uploaded by the end sides, and the aggregated gradient data is then delivered to the end sides.
Because a large amount of gradient data is transmitted between the end side and the cloud side, federated learning faces a severe communication bottleneck that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a data processing method, which can improve the compression ratio of gradient data and effectively reduce the communication overhead of gradient data transmission. The embodiment of the application also provides corresponding equipment.
A first aspect of the present application provides a data processing method, which may be applied to an end-side device, such as a terminal device. The method may include the following steps: obtaining A pieces of gradient data of a model to be trained, wherein A is a positive integer. For each of B pieces of gradient data (each referred to as first gradient data), the following is performed: dividing the gradient data into at least two gradient segments, wherein the gradient data comprises d elements, the B pieces of gradient data are contained in the A pieces of gradient data, B is a positive integer less than or equal to A, each gradient segment comprises d' elements, d' is a positive integer greater than 2, and d is evenly divisible by d'; and performing quantization conversion on each of the at least two gradient segments by using a codebook of the gradient data (the first codebook) to obtain a quantized representation corresponding to each gradient segment, wherein the number of elements in the quantized representation is less than d', the codebook is a first matrix of d' rows and m columns or a second matrix of m rows and d' columns, m is a positive integer greater than 2, and m is greater than or equal to d'. B quantization sets of the B pieces of gradient data are then sent to a cloud device, wherein a first quantization set of the first gradient data includes the quantized representation of each of the gradient segments.
It should be noted that, for convenience of reference, "first gradient data" is used to refer to any one of the B pieces of gradient data. The model to be trained and the source of the A pieces of gradient data are not limited in this application.
Here, "sending the B quantization sets corresponding to the B pieces of gradient data" may mean sending them together after quantization conversion of all B pieces of gradient data is completed, or sending them in multiple batches, each batch being sent once quantization conversion of some (including one) of the B pieces of gradient data is completed.
In the first aspect, the model to be trained may contain A operators associated with weights, each operator may have corresponding gradient data, each piece of gradient data may include at least one element, and each element may represent a gradient. The B pieces of gradient data may be some or all of the A pieces of gradient data. For example, when B = A, each of the A pieces of gradient data undergoes gradient segment division and quantization conversion; when B < A, only some of the A pieces undergo gradient segment division and quantization conversion, and the remaining (A-B) pieces do not. When the gradient segments are divided, d' can be determined according to d such that d is evenly divisible by d'. For example, when d is 1200 and d' is 16, d/d' is 75, which means that the first gradient data can be divided into 75 gradient segments. Each first gradient data corresponds to one codebook; d' may differ between different gradient data, and d' and m may also differ between different codebooks. The condition m ≥ d' allows the first matrix or the second matrix to have full rank, so that a column or row with higher matching precision can be found for each gradient segment. In the model training process, it is usually specified in advance whether the first matrix or the second matrix is used. Because the number of elements in the quantized representation is less than d', the amount of quantized data is further reduced on the basis of quantization compression, and the compression ratio is therefore improved. In addition, because the compression ratio is improved, the communication overhead when the terminal device transmits the quantized representation to the cloud device is also reduced.
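The segment division described above can be sketched as follows. This is a minimal illustration, assuming the gradient data is held as a flat Python list; the function name `split_into_segments` and the placeholder gradient values are hypothetical and do not come from the application.

```python
def split_into_segments(gradient, seg_len):
    """Split a flat list of d gradient elements into d / d' segments of d' elements."""
    d = len(gradient)
    if seg_len <= 0 or d % seg_len != 0:
        raise ValueError("d must be evenly divisible by d'")
    return [gradient[i:i + seg_len] for i in range(0, d, seg_len)]


# Example from the text: d = 1200 and d' = 16 give 75 gradient segments.
gradient = [0.001 * i for i in range(1200)]   # placeholder values, not real gradients
segments = split_into_segments(gradient, 16)
print(len(segments), len(segments[0]))        # prints "75 16"
```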
In one possible implementation manner of the first aspect, when A > B, the method may further include:
transmitting the (A-B) pieces of gradient data, among the A pieces of gradient data, other than the B pieces of gradient data to the cloud device.
In this possible implementation, the data volume of the (A-B) pieces of gradient data may be small, and the quantization conversion process also consumes computational resources. From the perspective of overall benefit, the (A-B) pieces of gradient data may be transmitted directly without compression, or may be compressed by another compression method and then transmitted.
In a possible implementation manner of the first aspect, the method may further include:
determining, according to the data volume of each piece of gradient data, that B pieces of gradient data among the A pieces of gradient data need to be quantized;
for each first gradient data in the B pieces of gradient data, determining d' and m according to the number of elements d of the first gradient data, wherein d' is 2 to the power p, m is 2 to the power q, p and q are both positive integers, and q is greater than or equal to p;
randomly generating, according to a random seed, the first matrix of d' rows and m columns or the second matrix of m rows and d' columns for the first gradient data, and scaling each Gaussian vector in the first matrix or the second matrix to unit norm to obtain the first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
In this possible implementation, a data volume threshold for quantization conversion may be determined from the perspective of overall benefit. For example, if the threshold is set to 512 elements, gradient data with 512 or more elements requires quantization conversion, and gradient data with fewer than 512 elements does not. The terminal device only needs to generate a codebook for gradient data that requires quantization conversion. This reduces the computational cost of generating codebooks and improves the overall trade-off between computational cost and communication cost. To ensure that the cloud device can correctly recover the gradient data, the random seed used by the terminal device to generate the codebook is the same as that used by the cloud device. The random seed may be a numerical value, and the codebook of the gradient data may be randomly generated by inputting the random seed into a random function.
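The threshold test and the seeded codebook generation can be sketched as follows. This is an illustrative sketch only: the function names, the seed value 1234, and the choice of Python's `random.Random` as the "random function" are assumptions, and the codebook is stored in the second-matrix layout (m rows, each a unit-norm Gaussian vector of d' elements).

```python
import math
import random

QUANT_THRESHOLD = 512  # example threshold from the text, in elements


def needs_quantization(num_elements, threshold=QUANT_THRESHOLD):
    """Gradient data with at least `threshold` elements is quantized."""
    return num_elements >= threshold


def make_codebook(seg_len, num_codes, seed):
    """Return num_codes Gaussian vectors of length seg_len, each scaled to unit norm."""
    rng = random.Random(seed)  # assumption: both sides use the same generator and seed
    codebook = []
    for _ in range(num_codes):
        vec = [rng.gauss(0.0, 1.0) for _ in range(seg_len)]
        norm = math.sqrt(sum(x * x for x in vec))
        codebook.append([x / norm for x in vec])
    return codebook


# d' = 2**4 = 16 and m = 2**6 = 64, so q >= p and m >= d'.
terminal_codebook = make_codebook(16, 64, seed=1234)
cloud_codebook = make_codebook(16, 64, seed=1234)
print(terminal_codebook == cloud_codebook)  # prints "True": same seed, same codebook
```

Because both sides derive the codebook from the shared seed, only the seed (or nothing, if it is agreed in advance) has to be exchanged instead of the d' × m matrix.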
In a possible implementation manner of the first aspect, the method may further include:
and receiving a codebook corresponding to each of the B gradient data sent by the cloud device, wherein the codebook corresponding to each of the B gradient data includes the first codebook.
In this possible implementation manner, the terminal device may not be required to generate the codebook by itself, and the codebook may be generated by the cloud device and then issued to the terminal device, so that the calculation overhead of the terminal device may be reduced.
In a possible implementation manner of the first aspect, the step of performing quantization conversion on each of the at least two gradient segments by using the first codebook of the first gradient data to obtain a quantized representation corresponding to each of the at least two gradient segments may include:
determining a target code for a first gradient segment from the first codebook, wherein the target code is a column in the first matrix or a row in the second matrix, and the first gradient segment is any one of the at least two gradient segments;
determining a pseudo-norm and a code index of the target code;
determining the pseudo-norm and the code index as the quantized representation of the first gradient segment.
In this possible implementation, the pseudo-norm is so named in contrast to the norm: it is not calculated by the mathematical vector-norm formula but is derived from the codebook and the gradient segment. The code index is the index of a column of the first matrix, for example, index(5) indicates the fifth column; or it is the index of a row of the second matrix, for example, index(6) indicates the sixth row. If the pseudo-norm is denoted by u and the code index by index(c), the quantized representation may be (u, index(c)). In this possible implementation, a gradient segment is converted into only two elements, the pseudo-norm and the code index, so the data volume is greatly reduced, the compression ratio of the gradient data is improved, and the communication overhead for transmitting the gradient data is reduced.
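As a rough illustration of the saving, the following sketch counts the values sent before and after quantization when every gradient segment of d' elements is replaced by a two-element pair (pseudo-norm, code index). The function name is hypothetical, and the count is in elements rather than bits, since the bit width of each value is not specified here.

```python
def element_compression_ratio(d, seg_len, elems_per_repr=2):
    """Ratio of original element count to element count after quantization."""
    if d % seg_len != 0:
        raise ValueError("d must be evenly divisible by d'")
    num_segments = d // seg_len
    return d / (num_segments * elems_per_repr)


# d = 1200, d' = 16: 75 segments, each sent as 2 elements, so 1200 -> 150 elements.
ratio = element_compression_ratio(1200, 16)
print(ratio)  # prints "8.0"
```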
In a possible implementation manner of the first aspect, the step of determining a target code for a first gradient segment from the first codebook may include:
determining a vector norm of the first gradient segment;
and if the vector norm is equal to 0, determining any column in the first matrix or any row in the second matrix as the target code.
In this possible implementation, the vector norm is the square root of the sum of the squares of the elements of the first gradient segment. If the vector norm is equal to 0, every element is 0, so it suffices to select an arbitrary row or column so that the output has the same format as any other quantized representation; for example, the first column or the first row may be selected.
In a possible implementation manner of the first aspect, the method may further include:
if the vector norm is not equal to 0, determining a first coefficient vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the first coefficient vector comprises m elements;
normalizing each element in the first coefficient vector to obtain a second coefficient vector, wherein the second coefficient vector comprises m normalized elements;
and applying a roulette selection strategy to the m normalized elements, and determining the i-th column in the first matrix or the i-th row in the second matrix, corresponding to the i-th element, as the target code, wherein the i-th element is the element selected by the roulette selection strategy.
In this possible implementation, if the first codebook is denoted by C, the first gradient segment by g, and the first coefficient vector by p, then $p = C^{T}(CC^{T})^{-1}g$, where $C^{T}$ denotes the transpose of C. The second coefficient vector is denoted by $\hat{p}$ and includes m normalized elements. Element normalization can be understood as:
$\hat{p}_b = |p_b| / \|p\|_1$
where $\hat{p}_b$ denotes the b-th of the m normalized elements, $|p_b|$ denotes the absolute value of the b-th element of the first coefficient vector p, and $\|p\|_1$ denotes the sum of the absolute values of the m elements of p. Under the roulette selection strategy, each normalized element is treated as a probability; the probabilities differ in size and sum to 1. An element with a larger probability is more likely to be selected, an element with a smaller probability is less likely to be selected, and the final result is the element on which the roulette wheel stops. If $\hat{p}_b$ is selected, the code of the column or row corresponding to the b-th element is selected as the target code. Likewise, if the i-th normalized element is selected, the code of the column or row corresponding to the i-th element is selected as the target code.
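The roulette-based selection can be sketched in plain Python as follows. The coefficient vector and the normalization follow the formulas above; everything else is an assumption made for illustration: the function names, the small Gaussian-elimination solver used in place of an explicit matrix inverse, the 0-based code index (the text counts from 1), and in particular the pseudo-norm, which is taken here as the L1 norm of p carrying the sign of the selected coefficient because this passage does not give its formula.

```python
import math
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def coefficient_vector(codes, g):
    """First coefficient vector p = C^T (C C^T)^{-1} g; each code is one column of C."""
    n = len(g)
    gram = [[sum(c[a] * c[b] for c in codes) for b in range(n)] for a in range(n)]
    y = solve(gram, g)  # y = (C C^T)^{-1} g
    return [sum(ci * yi for ci, yi in zip(c, y)) for c in codes]


def roulette_quantize(codes, g, rng):
    """Return (pseudo_norm, code_index) for one gradient segment."""
    if math.sqrt(sum(x * x for x in g)) == 0.0:
        return (0.0, 0)  # zero segment: any code will do, so take the first
    p = coefficient_vector(codes, g)
    l1 = sum(abs(x) for x in p)
    probs = [abs(x) / l1 for x in p]  # second coefficient vector, sums to 1
    idx = rng.choices(range(len(codes)), weights=probs, k=1)[0]
    # ASSUMPTION: pseudo-norm = ||p||_1 with the sign of the selected coefficient.
    return (math.copysign(l1, p[idx]), idx)


rng = random.Random(0)
codes = []  # m = 8 unit-norm Gaussian codes of d' = 4 elements
for _ in range(8):
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(x * x for x in v))
    codes.append([x / norm for x in v])
g = [0.5, -1.0, 0.25, 2.0]
p = coefficient_vector(codes, g)
pseudo_norm, index = roulette_quantize(codes, g, rng)
```

With p computed this way, the weighted sum of the codes reproduces g exactly (C p = g), so under the assumed pseudo-norm the randomly selected term u·c_i equals g in expectation.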
In a possible implementation manner of the first aspect, the step of determining a target code for a first gradient segment from the first codebook may include:
determining a projection vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the projection vector comprises m elements;
determining the i-th column in the first matrix or the i-th row in the second matrix, corresponding to the i-th element, as the target code, wherein the absolute value of the i-th element is the largest among the absolute values of the m elements.
In this possible implementation, if the first codebook is denoted by C, the first gradient segment by g, and the projection vector by p, then $p = C^{T}g$, where $C^{T}$ denotes the transpose of C. The projection vector p comprises m elements, and $|p_i|$ denotes the absolute value of the i-th of the m elements. If i is 5, the 5th column in the first matrix or the 5th row in the second matrix is selected as the target code.
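The maximum-projection selection can be sketched as follows. The projection p = C^T g and the arg-max over absolute values follow the text; the function name, the hand-made codebook, the 0-based index, and the use of the selected signed projection value as the pseudo-norm are assumptions made for illustration.

```python
def max_projection_quantize(codes, g):
    """Return (pseudo_norm, code_index) using the code with the largest |projection|."""
    # p = C^T g: one dot product per code (each code is a column of C).
    p = [sum(ci * gi for ci, gi in zip(code, g)) for code in codes]
    idx = max(range(len(p)), key=lambda k: abs(p[k]))
    # ASSUMPTION: the pseudo-norm is the signed projection onto the selected code.
    return (p[idx], idx)


# Hand-made unit-norm codes with d' = 3 and m = 4 (illustrative values only).
codes = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
]
g = [3.0, 4.0, 0.0]
pseudo_norm, index = max_projection_quantize(codes, g)
print(index)  # prints "3": g is parallel to the last code
```

Unlike the roulette strategy, this choice is deterministic: the same segment and codebook always yield the same code index.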
In a possible implementation manner of the first aspect, the method may further include:
receiving respective quantization sets of B gradient aggregation data corresponding to the B gradient data sent by the cloud device, where each quantization set of gradient aggregation data includes a quantization representation corresponding to each gradient segment of the gradient aggregation data.
In this possible implementation manner, after the gradient aggregation is performed, the cloud device may also perform quantization conversion on the gradient aggregation data obtained after aggregation by using the quantization conversion manner of the terminal device, so that the communication overhead between the cloud device and the terminal device may be further reduced.
A second aspect of the present application provides a data processing method, which may be applied to a cloud-side device, for example, a cloud device; the cloud device may be a server, another device, or a virtual resource. The method may include the following steps: receiving B quantization sets corresponding to B pieces of gradient data sent by a terminal device, wherein each quantization set comprises a quantized representation corresponding to each gradient segment of the corresponding gradient data; for each first quantization set in the B quantization sets, performing inverse quantization on each quantized representation in the first quantization set by using a first codebook corresponding to the first quantization set to obtain a first gradient segment corresponding to each quantized representation, wherein the first gradient segment comprises d' elements, the number of elements in the quantized representation is less than d', the first codebook is a first matrix of d' rows and m columns or a second matrix of m rows and d' columns, d' and m are positive integers greater than 2, and m is greater than or equal to d'; splicing the first gradient segments corresponding to the quantized representations in the first quantization set to obtain first gradient data corresponding to the first quantization set, wherein the first gradient data comprises d elements, d is a positive integer, and d is evenly divisible by d'; performing gradient aggregation on N pieces of first gradient data of a first weight to obtain gradient aggregation data of the first weight, wherein the N pieces of first gradient data of the first weight correspond to N terminal devices, and N is an integer greater than 1; and sending the gradient aggregation data of each of the B weights to the terminal device.
It should be noted that, for convenience of reference, a "first quantization set" may be used to refer to any one of the B quantization sets. The "first gradient data" refers to any one of the B gradient data.
Here, "receiving B quantization sets corresponding to B pieces of gradient data" may mean receiving them all at once or receiving them in multiple batches.
Here, "sending the gradient aggregation data of each of the B weights" may mean sending all of the gradient aggregation data together, or sending it in multiple batches, each batch being sent once some (including one) of the gradient aggregation data is ready, without waiting for all of it to be ready.
In the second aspect, after receiving the quantized representation, the cloud device performs inverse quantization, that is, quantization recovery, using the same codebook that the terminal device used for quantization conversion, and then performs gradient aggregation. Gradient aggregation means that the cloud device aggregates the gradients of the same weight received from multiple terminal devices. For example, if 500 gradients are received for weight a, the 500 gradients may be summed and averaged, and the average is used as the aggregated gradient of weight a. After the cloud device sends the gradient aggregation data to the terminal device, the terminal device can update each weight, so that the model to be trained converges further. The weight may be updated by subtracting the gradient aggregation data calculated in the current round from the weight of the current round; the updated weight is used for the next round of training. In the second aspect, because the cloud device receives from the terminal device the quantized representation obtained by quantization conversion, and the number of elements in the quantized representation is less than the number d' of elements in each gradient segment, the gradient data can be received with only a small communication overhead.
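The aggregation and weight update described above can be sketched as follows, assuming each device's recovered gradient for a weight is a flat list of equal length. The function names and the numeric values are hypothetical; the update is the plain subtraction stated in the text, with no learning rate applied.

```python
def aggregate(gradients):
    """Element-wise average of the gradients received from N terminal devices."""
    n = len(gradients)
    return [sum(column) / n for column in zip(*gradients)]


def update_weights(weights, aggregated):
    """Subtract the aggregated gradient from the current weights."""
    return [w - g for w, g in zip(weights, aggregated)]


# Three devices report gradients for the same two-element weight.
device_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
aggregated = aggregate(device_gradients)
new_weights = update_weights([10.0, 10.0], aggregated)
print(aggregated, new_weights)  # prints "[3.0, 4.0] [7.0, 6.0]"
```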
In one possible implementation manner of the second aspect, the method may further include:
for each first gradient data in the B pieces of gradient data, randomly generating, according to a random seed, a first matrix of d' rows and m columns or a second matrix of m rows and d' columns for the first gradient data, and scaling each Gaussian vector in the first matrix or the second matrix to unit norm to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
In this possible implementation, the random seed may be a numerical value, and the codebook of gradient data may be randomly generated by inputting the random seed into the random function.
In one possible implementation manner of the second aspect, the method may further include:
and sending the codebook corresponding to the B gradient data to the terminal equipment.
In this possible implementation manner, the cloud device sends the generated codebook to the terminal device, so that the calculation overhead of the terminal device can be reduced.
In a possible implementation manner of the second aspect, the step of performing inverse quantization on each quantized representation in the first quantization set by using the first codebook corresponding to the first quantization set to obtain a first gradient segment corresponding to each quantized representation may include:
for each first quantized representation in the first quantization set, determining a target code of the first quantized representation from the first codebook according to the code index in the first quantized representation;
and restoring the first gradient segment corresponding to the first quantized representation according to the pseudo-norm in the first quantized representation and the target code.
In this possible implementation manner, the cloud device may look up the target code in the first codebook according to the code index, and may then recover the corresponding gradient segment by multiplying each element of the target code by the pseudo-norm.
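Inverse quantization followed by splicing can be sketched as follows. The sketch assumes each quantized representation is a (pseudo-norm, code index) tuple with a 0-based index and that the codebook is a list of codes; the function name and the sample values are hypothetical.

```python
def dequantize(quant_set, codes):
    """Recover gradient data by scaling each target code and splicing the segments."""
    gradient = []
    for pseudo_norm, idx in quant_set:
        target_code = codes[idx]                         # look up the code by index
        gradient.extend(pseudo_norm * x for x in target_code)
    return gradient


# Hand-made unit-norm codes with d' = 3 and m = 4 (illustrative values only).
codes = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
]
quant_set = [(5.0, 3), (2.0, 0)]      # two segments, so d = 2 * d' = 6
recovered = dequantize(quant_set, codes)
print(len(recovered))                 # prints "6"
```

Here the first pair recovers approximately (3, 4, 0) and the second recovers (2, 0, 0); in general the recovered segment is an approximation of the original, since the quantization is lossy.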
In one possible implementation manner of the second aspect, the method may further include:
respectively carrying out quantization conversion on the gradient aggregation data of the B weights to obtain respective quantization sets of the B gradient aggregation data;
the sending, to the terminal device, gradient aggregation data of each of the B weights includes:
and sending respective quantization sets of the B gradient aggregation data to the terminal device, wherein each quantization set of the gradient aggregation data comprises a corresponding quantization representation of each gradient segment of the gradient aggregation data.
In this possible implementation manner, the cloud device may also perform quantization conversion on the gradient aggregation data to be sent to the terminal device, which is similar to that performed by the terminal device on the gradient data, so that the communication overhead between the cloud device and the terminal device may be further reduced.
A third aspect of the present application provides a terminal device having a function of implementing the method according to the first aspect or any one of the possible implementation manners of the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, such as: the device comprises a receiving unit, a processing unit and a sending unit.
A fourth aspect of the present application provides a cloud device having a function of implementing the method of the second aspect or any one of the possible implementation manners of the second aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, such as: a receiving unit, a processing unit and a sending unit. The cloud device in the present application may be any computer device deployed on a network side or a cloud side.
A fifth aspect of the present application provides a terminal device, which includes at least one processor, a memory, a transceiver, and computer executable instructions stored in the memory and executable on the processor, wherein when the computer executable instructions are executed by the processor, the processor performs the method according to the first aspect or any one of the possible implementation manners of the first aspect.
A sixth aspect of the present application provides a cloud device comprising at least one processor, a memory, a communication port, and computer executable instructions stored in the memory and executable on the processor, wherein when the computer executable instructions are executed by the processor, the processor performs the method according to the second aspect or any one of the possible implementation manners of the second aspect.
A seventh aspect of the present application provides a computer-readable storage medium storing one or more computer-executable instructions that, when executed by a processor, perform a method according to the first aspect or any one of the possible implementations of the first aspect.
An eighth aspect of the present application provides a computer-readable storage medium storing one or more computer-executable instructions that, when executed by a processor, perform the method according to the second aspect or any one of the possible implementation manners of the second aspect.
A ninth aspect of the present application provides a computer program product (or computer program) storing one or more computer executable instructions, which when executed by the processor, perform the method of the first aspect or any one of the possible implementations of the first aspect.
A tenth aspect of the present application provides a computer program product storing one or more computer executable instructions that, when executed by a processor, perform the method according to the second aspect or any one of the possible implementation manners of the second aspect.
An eleventh aspect of the present application provides a chip system, where the chip system includes a processor, configured to support a terminal device to implement the functions recited in the first aspect or any one of the possible implementation manners of the first aspect. In one possible design, the system-on-chip may further include a memory, which stores program instructions and data necessary for the terminal device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
A twelfth aspect of the present application provides a chip system, where the chip system includes a processor configured to support a cloud device in implementing the functions recited in the second aspect or any one of the possible implementation manners of the second aspect. In one possible design, the chip system may further include a memory for storing program instructions and data necessary for the cloud device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
For the technical effects of the third, fifth, seventh, ninth, and eleventh aspects or any one of their possible implementation manners, reference may be made to the technical effects of the first aspect or the different possible implementation manners of the first aspect, and details are not described here again.
For the technical effects of the fourth, sixth, eighth, tenth, and twelfth aspects or any one of their possible implementation manners, reference may be made to the technical effects of the second aspect or the different possible implementation manners of the second aspect, and details are not described here again.
The gradient data to be transmitted are divided into gradient segments, then quantization conversion is carried out according to a codebook, so that quantization expression is obtained, and the number of elements in the quantization expression is smaller than that of the elements in each gradient segment. Therefore, the quantity of quantized data is further reduced on the basis of quantization compression, and the compression ratio is improved. In addition, because the compression ratio of the quantized representation is increased, the communication overhead when the terminal device transmits the quantized representation to the cloud device is also reduced.
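The effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each gradient segment is replaced by the index of its nearest codeword (Euclidean distance), so that one element stands in for a whole segment; the function name, the codebook values, and the nearest-codeword rule are hypothetical and are not taken from the present application.

```python
def quantize_segment(segment, codebook):
    """Return the index of the codeword closest to `segment`.

    Assumption: nearest-codeword (vector quantization) lookup with squared
    Euclidean distance; the application itself may use a different rule.
    """
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(codebook)),
               key=lambda k: squared_distance(segment, codebook[k]))


# Hypothetical codebook with three codewords of length d' = 4.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
]

# Three gradient segments, each with d' = 4 elements.
segments = [
    [0.9, 1.1, 1.0, 0.8],
    [-0.1, 0.0, 0.1, 0.0],
    [-1.2, -0.9, -1.0, -1.1],
]

# Each segment of 4 elements is represented by a single index.
indices = [quantize_segment(s, codebook) for s in segments]
```

In this sketch 12 gradient elements are reduced to 3 indices, which is the kind of reduction in transmitted data that the paragraph above refers to.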
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an application environment according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a neural network processor according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an end-cloud combined model training system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an embodiment of a data processing method provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of another embodiment of a data processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of a terminal device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of a cloud device provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of a terminal device according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an embodiment of a cloud device provided in an embodiment of the present application.
Detailed Description
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the present application provides a data processing method that can improve the compression ratio of gradient data during gradient transmission between the end side and the cloud side, and effectively reduce the communication overhead of gradient data transmission. The embodiment of the present application also provides corresponding devices. Details are described below.
FIG. 1 shows a schematic diagram of an artificial intelligence main framework that describes the overall workflow of an artificial intelligence system and is applicable to the requirements of the general artificial intelligence field.
The artificial intelligence main framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the implementation of provision and processing technology) up to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by a basic platform. Communication with the outside is carried out through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as a central processing unit (CPU), a neural network processor (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA)); the basic platform includes a distributed computing framework, networks, and related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to the intelligent chips in the distributed computing system provided by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet-of-things data from conventional devices, including service data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They package the overall artificial intelligence solution, turn intelligent information decision making into products, and put it into practical use. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, automatic driving, safe city, intelligent terminals, and the like.
Referring to FIG. 2, a system architecture 200 is provided in an embodiment of the present application. The data collection device 260 is configured to collect training data and store it in the database 230, and the training device 220 generates the target model/rule 201 based on the training data maintained in the database 230. How the training device 220 derives the target model/rule 201 from the training data is described in more detail below. The target model/rule 201 can be used in application scenarios such as image recognition, video classification, speech recognition, and language translation.
The target model/rule 201 may be obtained based on a deep neural network or a convolutional neural network (CNN), both of which are described below.
The work of each layer in a deep neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. At a physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by $W \cdot \vec{x}$, operation 4 is performed by $+b$, and operation 5 is implemented by $a()$. The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the set of all individuals of that class. W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of the many layers). The training process of a neural network is therefore essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
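The per-layer computation can be sketched in a few lines. The sketch assumes the common form y = a(W·x + b), which is consistent with the "+ b" and "a()" terms named above; the function name, the choice of tanh as the activation, and the numeric values are illustrative assumptions only.

```python
import math


def layer_forward(W, x, b, activation=math.tanh):
    """Compute y = a(W.x + b) for one layer using plain Python lists.

    W is a list of rows (one row of weights per output neuron), x is the
    input vector, b is the bias vector. tanh is an assumed activation.
    """
    y = []
    for row, bias in zip(W, b):
        # W.x: weighted sum of the inputs (dimension change, scaling, rotation)
        z = sum(w * xi for w, xi in zip(row, x))
        # + b translates the result, a() applies the non-linear "bending"
        y.append(activation(z + bias))
    return y


# A 2x3 weight matrix maps a 3-dimensional input to a 2-dimensional output.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
x = [1.0, 2.0, 3.0]
b = [0.0, -3.0]
y = layer_forward(W, x, b)
```

Here the 2x3 matrix lowers the dimension from 3 to 2, the bias shifts each weighted sum, and the activation bends the result into the range (-1, 1).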
Because the output of the deep neural network should be as close as possible to the value that is actually desired, the weight vector of each layer can be updated by comparing the predicted value of the current network with the actually desired target value and adjusting the weight vector according to the difference between them (there is usually an initialization process before the first update, in which parameters are pre-configured for each layer of the deep neural network). It is therefore necessary to define in advance how the difference between the predicted value and the target value is measured. This is done by a loss function or an objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the deep neural network becomes a process of reducing the loss as much as possible.
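The idea of reducing the loss by repeatedly updating a weight can be illustrated with a minimal sketch. It assumes a one-weight model (prediction = w * x), a squared-error loss, and plain gradient descent with a fixed step size; none of these specific choices are prescribed by the present application.

```python
def squared_error(pred, target):
    """Loss: a larger value means a larger difference from the target."""
    return (pred - target) ** 2


def train_step(w, x, target, lr=0.1):
    """One gradient-descent update of the single weight w (assumed rule)."""
    pred = w * x
    grad = 2.0 * (pred - target) * x  # derivative of the loss w.r.t. w
    return w - lr * grad


w = 0.0                 # initial weight (initialization step)
x, target = 2.0, 4.0    # one training sample; the ideal weight is 2.0
losses = []
for _ in range(5):
    losses.append(squared_error(w * x, target))
    w = train_step(w, x, target)
```

Each entry of `losses` is smaller than the previous one, and `w` approaches 2.0, the value at which the prediction matches the target.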
The target models/rules obtained by the training device 220 may be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 to interact with data from an external device, and a "user" may input data to the I/O interface 212 via a client device 240.
The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.
The calculation module 211 processes the input data using the target model/rule 201. Taking text-based language translation as an example, the calculation module 211 may analyze the sentences in a text in the first language to obtain words such as the subject, predicate, and object of each sentence.
The association function module 213 may translate words such as subject, predicate, and object in the first sentence in the computation module 211 into the second language, and then logically organize the sentence in combination with the syntax of the second language.
The association function module 214 may translate words such as subject, predicate, and object in the second sentence in the calculation module 211 into the second language, and then organize the sentence according to the syntax logic of the second language.
Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.
Further, the training device 220 may generate corresponding target models/rules 201 based on different data for different targets to provide better results to the user.
In the case shown in FIG. 2, the user may manually specify the data to be input into the execution device 210, for example, by operating in an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically input data to the I/O interface 212 and obtain the result; if automatic data input by the client device 240 requires the user's authorization, the user may set the corresponding permission in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the result may be presented in a specific form such as display, sound, or action. The client device 240 may also act as a data collection end and store the collected training data in the database 230.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
The convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to learning of multiple levels at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.
As shown in FIG. 3, the convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (in which the pooling layer is optional), and a neural network layer 130.
Convolutional layer/pooling layer 120:
Convolutional layer:
As shown in FIG. 3, the convolutional layer/pooling layer 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Convolving with a single weight matrix therefore produces a convolved output with a single depth dimension; in most cases, however, a plurality of weight matrices of the same dimensions are applied rather than a single one, and the outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features of the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. Because the weight matrices have the same dimensions, the feature maps they extract also have the same dimensions, and these feature maps are combined to form the output of the convolution operation.
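The sliding of a weight matrix over an image can be sketched as follows. The sketch assumes a single-channel image without padding; the function name and the two example kernels (a simple edge detector and an averaging blur) are illustrative assumptions.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over a single-channel `image` (no padding assumed)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Weighted sum of the image patch under the kernel
            for u in range(kh):
                for v in range(kw):
                    acc += image[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(acc)
        output.append(row)
    return output


# 4x4 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

edge_kernel = [[-1, 1], [-1, 1]]              # responds to a left-to-right edge
blur_kernel = [[0.25, 0.25], [0.25, 0.25]]    # local average

# One feature map per weight matrix; stacking them forms the depth dimension.
feature_maps = [conv2d(image, k, stride=1) for k in (edge_kernel, blur_kernel)]
```

Both kernels have the same dimensions, so both feature maps are 3x3; the edge kernel responds only at the column where the intensity changes, while the blur kernel produces a smoothed ramp.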
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract increasingly complex features, such as features with high-level semantics; features with higher-level semantics are better suited to the problem to be solved.
Pooling layer:
Because the number of training parameters often needs to be reduced, pooling layers often need to be introduced periodically after convolutional layers. In the layers 121 to 126 illustrated by 120 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator calculates the average of the pixel values in the image within a particular range. The max pooling operator takes the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the input image.
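Average pooling and max pooling can be sketched as follows, assuming non-overlapping square windows on a single-channel image; the function names and window size are illustrative assumptions.

```python
def pool2d(image, size, op):
    """Apply `op` to each non-overlapping size x size window of `image`."""
    output = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(op(window))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


image = [[1, 2, 5, 6],
         [3, 4, 7, 8],
         [9, 10, 13, 14],
         [11, 12, 15, 16]]

max_pooled = pool2d(image, 2, max)    # largest pixel of each 2x2 sub-region
avg_pooled = pool2d(image, 2, mean)   # average pixel of each 2x2 sub-region
```

The 4x4 input becomes a 2x2 output in both cases, and each output pixel summarizes one 2x2 sub-region of the input.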
Neural network layer 130:
After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate an output, or a set of outputs, whose number equals the number of required classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (such as 131, 132, and 13n shown in FIG. 3) and an output layer 140. The parameters contained in the hidden layers may be pre-trained on training data associated with a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
The output layer 140 follows the hidden layers in the neural network layer 130; that is, it is the last layer of the entire convolutional neural network 100. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error. Once forward propagation of the entire convolutional neural network 100 (propagation in the direction from 110 to 140 in FIG. 3) is completed, back propagation (propagation in the direction from 140 to 110 in FIG. 3) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100, that is, the error between the result output through the output layer and the ideal result.
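A loss of the cross-entropy kind mentioned above can be sketched as follows. The sketch assumes that the output layer produces raw scores that are turned into class probabilities with softmax, which is a common pairing but is not stated in the present application; the function names and scores are illustrative.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Prediction error: small when the true class has high probability."""
    return -math.log(probs[true_class])


probs = softmax([2.0, 0.5, -1.0])        # the network favors class 0
loss_if_class0 = cross_entropy(probs, 0)  # true class matches the prediction
loss_if_class2 = cross_entropy(probs, 2)  # true class is the least likely one
```

The loss is smaller when the true class is the one the network favors, which is the error signal that back propagation then tries to reduce.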
It should be noted that the convolutional neural network 100 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
The convolutional neural network based algorithm shown in fig. 3 and 4 may be implemented in the NPU chip shown in fig. 5.
Fig. 5 is a diagram of a chip hardware structure according to an embodiment of the present disclosure.
The neural network processor (NPU) 50 is mounted on a host CPU as a coprocessor, and tasks are allocated to it by the host CPU. The core portion of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to fetch matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 501, performs the matrix operation with matrix B, and stores the partial results or final results of the resulting matrix in the accumulator 508.
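The arithmetic of this example (not the hardware behavior of a systolic array) can be sketched as follows. The function name and the loop order, in which one partial product is accumulated per step, are assumptions chosen only to show how partial results build up to the final matrix C.

```python
def matmul_accumulate(A, B):
    """Compute C = A x B by accumulating partial results step by step."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]   # plays the role of the accumulator
    for step in range(k):
        # After each step, C holds a partial result; after the last step,
        # it holds the final result.
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][step] * B[step][j]
    return C


A = [[1.0, 2.0],
     [3.0, 4.0]]          # input matrix
B = [[5.0, 6.0],
     [7.0, 8.0]]          # weight matrix
C = matmul_accumulate(A, B)
```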
The unified memory 506 is used to store input data and output data. The weight data is transferred directly to the weight memory 502 through the direct memory access controller (DMAC) 505. The input data is also transferred to the unified memory 506 through the DMAC.
The bus interface unit (BIU) 510 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer 509.
The bus interface unit 510 is used by the instruction fetch buffer 509 to obtain instructions from the external memory, and by the direct memory access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 506 or to transfer weight data into the weight memory 502 or to transfer input data into the input memory 501.
The vector calculation unit 507 has a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for computation in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 507 can store the processed output vector in the unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
The instruction fetch buffer 509 is connected to the controller 504 and stores the instructions used by the controller 504.
The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers in the convolutional neural networks shown in FIG. 3 and FIG. 4 may be performed by the arithmetic circuit 503 or the vector calculation unit 507.
Fig. 1 to fig. 5 describe the related contents of artificial intelligence, and the embodiment of the present application provides a method for data processing. The method for data processing provided by the embodiment of the present application may be implemented based on, for example, the training device 220 in fig. 2, where the training device 220 may correspond to the model training system of the present application. It should be noted that the representation form of the model training system provided by the embodiment of the present application may be different from the training apparatus 220 in fig. 2, but the model trained by the embodiment of the present application may be applied to various scenarios described in fig. 1, and the model may adopt any one of the possible neural network structures in fig. 3 to fig. 4.
The model training system of the embodiment of the application can be an end-cloud combined model training system. Referring to fig. 6, a model training system with end cloud combination is provided in the embodiments of the present application.
The model training system comprises cloud equipment and a plurality of terminal equipment, wherein the cloud equipment is in communication connection with the terminal equipment through a communication network.
The cloud device may be a set of resources having computing and data transceiving functions. It may be an independent computer device, a cluster formed by a plurality of independent computer devices, or a virtual machine (VM).
A terminal device (also referred to as User Equipment (UE)) is a device with a wireless transceiving function, and can be deployed on land, including indoors or outdoors, and handheld or vehicle-mounted; can also be deployed on the water surface (such as a ship and the like); and may also be deployed in the air (e.g., airplanes, balloons, satellites, etc.). The terminal may be a mobile phone (mobile phone), a tablet computer (pad), a computer with a wireless transceiving function, a Virtual Reality (VR) terminal, an Augmented Reality (AR) terminal, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote medical (remote medical), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), and the like.
In the model training system based on end cloud combination, a model to be trained is stored on the cloud device and each terminal device, and the representation form of the model to be trained on the cloud device and each terminal device may be a calculation graph containing operators and edges, or files in other forms.
As described above with reference to FIG. 2, model training is a process of updating the weights through continuous training. Because the weights set for the model to be trained during weight initialization are usually large, the weight update usually subtracts the aggregate gradient of a weight in the current round from the value of that weight in the current round to obtain the updated weight, and the updated weight is used for the next round of training.
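The update described above can be sketched as follows. The function name, the dictionary representation of the weights, and the optional `learning_rate` scaling factor (set to 1.0 so that the plain subtraction described above is reproduced) are illustrative assumptions.

```python
def update_weights(weights, aggregate_gradients, learning_rate=1.0):
    """Return the weights for the next round of training.

    Each weight has its aggregate gradient for the current round subtracted
    from it. `learning_rate` is an assumed optional scaling factor.
    """
    return {name: value - learning_rate * aggregate_gradients[name]
            for name, value in weights.items()}


# Hypothetical weights of the current round and their aggregate gradients.
weights = {"a": 0.75, "b": -0.5}
aggregate_gradients = {"a": 0.25, "b": -0.125}

next_round_weights = update_weights(weights, aggregate_gradients)
```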
Model training is typically a process of calculating the gradient of each weight, where the gradient is typically the derivative of the loss with respect to that weight. Each terminal device calculates a gradient for the same weight (for example, weight a), and the terminal devices usually use different training data during model training, so the gradients calculated by the terminal devices for the same weight are usually different. The cloud device is therefore required to aggregate the gradients calculated by the terminal devices to obtain an aggregate gradient. Gradient aggregation is generally a process of adding up the gradients calculated by the terminal devices and then averaging them. For example, for weight a, if the gradient calculated by terminal device 0 is a0, the gradient calculated by terminal device 1 is a1, the gradient calculated by terminal device 2 is a2, and the gradient calculated by terminal device 3 is a3, the aggregate gradient of weight a may be (a0 + a1 + a2 + a3)/4. Gradient aggregation is not limited to this method; other applicable gradient aggregation methods are also applicable to the present application and are not described further here.
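The averaging form of gradient aggregation can be sketched as follows. The function name and the numeric gradients are illustrative assumptions; each inner list stands for the gradients one terminal device computed for the same two weights.

```python
def aggregate_gradients(per_device_gradients):
    """Average the gradients of all terminal devices, element by element."""
    n = len(per_device_gradients)
    return [sum(values) / n for values in zip(*per_device_gradients)]


# Gradients for the same two weights from terminal devices 0 to 3.
per_device_gradients = [
    [0.25, -1.0],    # terminal device 0
    [0.5, 1.0],      # terminal device 1
    [0.125, 2.0],    # terminal device 2
    [0.125, 2.0],    # terminal device 3
]

# First element: (0.25 + 0.5 + 0.125 + 0.125) / 4
aggregate = aggregate_gradients(per_device_gradients)
```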
In the embodiment of the application, the terminal device is required to transmit the gradient data to the cloud device, and communication overhead between the terminal device and the cloud device is generated. In order to reduce communication overhead as much as possible, embodiments of the present application provide a data processing method.
Referring to fig. 7, an embodiment of a data processing method provided in the embodiment of the present application may include:
601. The terminal device acquires A gradient data of the model to be trained.
A is a positive integer. In the embodiment of the present application, the value of A is related to the structure of the model to be trained. Taking a convolutional neural network as an example, the structure of the model to be trained is convolution (Conv) 1 - Conv2 - full connection (FC) 1 - FC2, where Conv1 and Conv2 each include two logic units, namely a filter and a bias, so that A in this model is equal to 6. It can also be understood that the model includes 6 weight-dependent operators. The model with this structure is used only as an example; the value of A depends on the specific model structure, and the application is not limited in this respect. Each gradient data includes at least one element, each of which may represent a gradient.
602. The terminal device divides each first gradient data of the B gradient data into at least two gradient segments.
It should be noted that, for convenience of reference, "first gradient data" is used to refer to any one of the B gradient data. The model to be trained and the source of the A gradient data are not limited in this application.
The first gradient data includes d elements. The B gradient data are contained in the A gradient data, B is a positive integer, and B is less than or equal to A. Each gradient segment includes d' elements, d' is a positive integer greater than 2, and d is evenly divisible by d'.
The B gradient data may be some or all of the A gradient data. For example, when B = A, every one of the A gradient data needs to be divided into gradient segments; when B < A, only some of the gradient data need to be divided into gradient segments, and the remaining (A-B) gradient data do not. When gradient segments are divided, d' can be determined according to the value of d such that d is evenly divisible by d'. For example, when d is 1200 and d' is 16, d/d' is 75, which means that the first gradient data can be divided into 75 gradient segments.
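A minimal sketch of the segmentation in step 602, assuming the gradient data is held as a flat Python list; the function name `split_into_segments` is hypothetical and not taken from the application.

```python
def split_into_segments(gradient, segment_size):
    """Split a flat gradient of d elements into d / d' segments of d' elements."""
    d = len(gradient)
    if segment_size <= 0 or d % segment_size != 0:
        raise ValueError("d must be evenly divisible by d'")
    return [gradient[i:i + segment_size] for i in range(0, d, segment_size)]


# d = 1200 and d' = 16, as in the example above (the values are placeholders).
gradient = [0.001 * k for k in range(1200)]
segments = split_into_segments(gradient, 16)
print(len(segments), len(segments[0]))
```

Keeping the segments in their original order matters, because the cloud device later splices the recovered segments back together in the same order.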
603. The terminal device performs quantization conversion on each gradient segment in the at least two gradient segments by using the first codebook of the first gradient data to obtain a quantized representation corresponding to each gradient segment.
The number of elements in the quantized representation is less than d'. The first codebook is a first matrix with d' rows and m columns or a second matrix with d' columns and m rows, where m is a positive integer greater than 2 and m is greater than or equal to d'.
Each first gradient data corresponds to one codebook. d' may differ between gradient data, and d' and m may differ between codebooks. The condition m ≥ d' ensures that a matching column or row can be found for each gradient segment in the first matrix or the second matrix. During model training, it is usually specified in advance whether the first matrix or the second matrix is used.
The terminal device may perform quantization conversion on the gradient data, and the first matrix or the second matrix may be understood by referring to table 1 below.
Table 1: Relationship between gradient data and codebooks
[Table 1: image not reproduced]
As can be seen from table 1, the model to be trained with the convolutional neural network structure described in step 601 may have 6 gradient data, each with multiple elements: the Con1-filter gradient data has 150 elements, the Con1-bias gradient data has 6 elements, the Con2-filter gradient data has 2400 elements, the Con2-bias gradient data has 96 elements, the FC1-weight gradient data has 48000 elements, and the FC2-weight gradient data has 1200 elements. In the row indicating whether to quantize, N indicates that quantization is not required and Y indicates that quantization is required. If the data amount threshold for quantization is set to 512, the gradient data of Con1-filter, Con1-bias and Con2-bias each have fewer than 512 elements, so they need neither quantization nor a codebook. The gradient data of Con2-filter, FC1-weight and FC2-weight each have more than 512 elements, so they need quantization, and each needs a corresponding codebook. For example, the gradient data of Con2-filter has 2400 elements; if d' is 32, the gradient data can be divided into 75 gradient segments, and m being 32 indicates that the codebook is a 32 × 32 matrix. Similarly, the codebook corresponding to the FC1-weight gradient data is a 64 × 64 matrix, and the codebook corresponding to the FC2-weight gradient data is a 16 × 32 matrix.
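The threshold decision and the resulting segment counts can be checked with a short sketch. The element counts, the threshold of 512 and the d' values are those stated in the text; the dictionary layout and the helper name `needs_quantization` are assumptions made for illustration.

```python
QUANTIZATION_THRESHOLD = 512  # example data amount threshold from the text

# Element counts of the six gradient data described for table 1.
ELEMENT_COUNTS = {
    "Con1-filter": 150,
    "Con1-bias": 6,
    "Con2-filter": 2400,
    "Con2-bias": 96,
    "FC1-weight": 48000,
    "FC2-weight": 1200,
}

# Segment sizes d' stated in the text for the gradient data that are quantized.
SEGMENT_SIZES = {"Con2-filter": 32, "FC1-weight": 64, "FC2-weight": 16}


def needs_quantization(num_elements, threshold=QUANTIZATION_THRESHOLD):
    """A gradient data is quantized when it has at least `threshold` elements."""
    return num_elements >= threshold


to_quantize = [name for name, d in ELEMENT_COUNTS.items() if needs_quantization(d)]
segment_counts = {name: ELEMENT_COUNTS[name] // SEGMENT_SIZES[name] for name in to_quantize}
print(to_quantize)
print(segment_counts)
```

Here B is 3 and A is 6, so three gradient data are quantized and the other three are sent without codebook quantization.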
604. The terminal device sends B quantization sets corresponding to the B gradient data to the cloud device, wherein the first quantization set corresponding to the first gradient data comprises a quantization representation corresponding to each gradient segment.
Here, "sending the B quantization sets corresponding to the B gradient data" may mean sending all B quantization sets together after quantization conversion of all B gradient data is complete, or sending them in several batches, each after quantization conversion of some (including one) of the B gradient data is complete.
Each quantization set corresponds to one gradient data. For example, if B is 3 and there are 75 gradient segments in gradient data 1, there are 75 quantized representations in the quantization set corresponding to gradient data 1. If there are 120 gradient segments in gradient data 2, there are 120 quantized representations in the quantization set corresponding to gradient data 2. If there are 150 gradient segments in gradient data 3, there are 150 quantized representations in the quantization set corresponding to gradient data 3.
When A is larger than B, the terminal device also sends to the cloud device the (A-B) gradient data other than the B gradient data in the A gradient data.
The data volume of the (A-B) gradient data may be small, and quantization conversion itself consumes computing resources. Considering the overall benefit, the (A-B) gradient data may therefore be transmitted directly without compression, or may be compressed by another compression method and then transmitted.
605. After receiving B quantization sets corresponding to B gradient data sent by a terminal device, a cloud device performs inverse quantization on each quantization representation in each first quantization set by using a first codebook corresponding to the first quantization set to obtain a first gradient segment corresponding to each quantization representation.
The first gradient segment includes d' elements, and the number of elements in the quantized representation is smaller than d'. The first codebook is a first matrix with d' rows and m columns or a second matrix with d' columns and m rows, where d' and m are positive integers larger than 2 and m is greater than or equal to d'.
After receiving the quantization representation, the cloud device performs inverse quantization, that is, quantization recovery, using the same codebook used in quantization conversion by the terminal device.
606. The cloud device splices the first gradient segments corresponding to the quantized representations in the first quantization set to obtain the first gradient data corresponding to the first quantization set.
The first gradient data includes d elements, d is a positive integer, and d is divisible by d'.
Splicing is an ordering process: the gradient segments are arranged in the same order in which the terminal device divided them, so that the first gradient data is recovered.
607. For a first weight among the B weights corresponding to the B gradient data, the cloud device performs gradient aggregation on the N first gradient data of the first weight to obtain gradient aggregation data of the first weight.
The N first gradient data of the first weight correspond to N terminal devices, where N is an integer greater than 1.
Gradient aggregation can be understood with reference to the foregoing description of the gradient aggregation process, and the details are not repeated here.
608. The cloud device sends the gradient aggregation data of the B weights to the terminal device.
609. The terminal device updates the weights according to the gradient aggregation data.
The update of the weights can be understood by referring to the foregoing description of the process of updating the weights, and the detailed description is not repeated here.
The gradient data to be transmitted is divided into gradient segments, and quantization conversion is then performed according to a codebook to obtain quantized representations, where the number of elements in each quantized representation is smaller than the number of elements in each gradient segment. The amount of quantized data is thereby further reduced compared with ordinary quantization compression, and the compression ratio is improved. In addition, because the compression ratio is higher, the communication overhead when the terminal device transmits the quantized representations to the cloud device is also reduced.
In the embodiment of the present application, codebooks are used in both quantization conversion and inverse quantization. The codebooks may be generated by the terminal device and the cloud device separately from the same random seed, or generated by the cloud device and then sent to the terminal device. These two cases are described below.
1. The terminal device and the cloud device each generate the codebook themselves.
The terminal device determines, according to the data volume of each gradient data, that B gradient data in the A gradient data need to be quantized;
for each first gradient data in the B gradient data, the terminal device determines d' and m according to the number d of elements of the first gradient data, where d' is 2 to the power p, m is 2 to the power q, p and q are positive integers, and q is greater than or equal to p;
the terminal device randomly generates, according to a random seed, the first matrix with d' rows and m columns or the second matrix with d' columns and m rows for the first gradient data, and scales each Gaussian vector in the first matrix or the second matrix to a modular length of 1 to obtain the first codebook of the first gradient data, where each column in the first matrix is a Gaussian vector and each row in the second matrix is a Gaussian vector.
The cloud device randomly generates, according to the random seed, a first matrix with d' rows and m columns or a second matrix with d' columns and m rows for each first gradient data in the B gradient data, and scales each Gaussian vector in the first matrix or the second matrix to a modular length of 1 to obtain the first codebook of the first gradient data, where each column in the first matrix is a Gaussian vector and each row in the second matrix is a Gaussian vector.
The terminal device may determine, in view of the overall benefit, a data amount threshold for performing quantization conversion. For example, the data amount threshold is set to 512 elements: if the number of elements in a gradient data is greater than or equal to 512, quantization conversion is required; if it is less than 512, quantization conversion is not required. The terminal device only needs to generate a codebook for gradient data that requires quantization conversion. This reduces the computational cost of generating codebooks and improves the overall balance between computational cost and communication cost. To ensure that the cloud device can correctly recover the gradient data, the random seed used by the terminal device to generate the codebook is the same as that used by the cloud device. The random seed may be a numerical value, such as 798; it is usually set by the cloud device and then synchronized to each terminal device participating in the joint training. A codebook for a gradient data can be randomly generated by inputting the random seed into a random function. Because the random seeds are the same, the codebook generated by the terminal device is the same as the codebook generated by the cloud device.
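Seeded codebook generation can be sketched as below. The application does not name a particular random function, so Python's `random.Random` is used here purely as a stand-in, and `generate_codebook` is a hypothetical name. The codebook is stored as a list of m codes (the columns of the first matrix), each with d' elements.

```python
import math
import random


def generate_codebook(d_prime, m, seed):
    """Return m Gaussian vectors of length d', each scaled to modular length 1."""
    rng = random.Random(seed)  # the same seed yields the same sequence of draws
    codes = []
    for _ in range(m):
        vector = [rng.gauss(0.0, 1.0) for _ in range(d_prime)]
        norm = math.sqrt(sum(x * x for x in vector))
        codes.append([x / norm for x in vector])
    return codes


# Terminal device and cloud device both use the synchronized seed 798.
terminal_codebook = generate_codebook(16, 32, seed=798)
cloud_codebook = generate_codebook(16, 32, seed=798)
print(terminal_codebook == cloud_codebook)
```

Identical output on both sides also relies on both sides using the same generator implementation, which is an assumption of this sketch rather than something stated in the text.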
2. The cloud device generates the codebook and then issues it to the terminal device.
The cloud device randomly generates, according to a random seed, a first matrix with d' rows and m columns or a second matrix with d' columns and m rows for each first gradient data in the B gradient data, and scales each Gaussian vector in the first matrix or the second matrix to a modular length of 1 to obtain the first codebook of the first gradient data, where each column in the first matrix is a Gaussian vector and each row in the second matrix is a Gaussian vector.
The cloud device sends the codebooks corresponding to the B gradient data to the terminal device. Correspondingly, the terminal device receives the codebooks corresponding to the B gradient data sent by the cloud device, where these codebooks include the first codebook.
In this implementation, the cloud device sends the generated codebooks to the terminal device, so that the computational overhead of the terminal device can be reduced.
In this embodiment of the present application, step 603, in which the terminal device performs quantization conversion on each of the at least two gradient segments by using the first codebook of the first gradient data to obtain a quantized representation corresponding to each gradient segment, may include:
the terminal device determines a target code for a first gradient segment from the first codebook, where the target code is a column in the first matrix or a row in the second matrix, and the first gradient segment is any one of the at least two gradient segments;
the terminal device determines the pseudo-modular length and the code index of the target code;
the terminal device determines the pseudo-modular length and the code index as the quantized representation of the first gradient segment.
In the embodiment of the application, the pseudo-modular length is named in contrast to the modular length: it is calculated from the codebook and the gradient segment rather than by the usual mathematical method for calculating a vector's modular length. The code index represents an index of one column of the first matrix, for example: index(5) indicates the fifth column. If there are multiple codebooks, an identifier of the codebook may also be needed; for example, index(3, 5) indicates the fifth column of codebook 3. Alternatively, the code index represents an index of one row of the second matrix, for example: index(3, 6) indicates the sixth row of codebook 3. If the pseudo-modular length is denoted by u and the code index by index(c), the quantized representation may be (u, index(c)). In the embodiment of the application, one gradient segment is converted into a representation that includes only two elements, the pseudo-modular length and the code index, so the data volume is greatly reduced, the compression ratio of the gradient data is improved, and the communication overhead of transmitting the gradient data is reduced.
The step of determining the target code for the first gradient segment from the first codebook has two implementations.
Referring to fig. 8, the first mode may include:
701. Determine a vector modular length (vector norm) of the first gradient segment.
702. Determine whether the vector modular length is equal to 0; if so, perform step 703, and if not, perform step 704.
The vector modular length is the square root of the sum of the squares of the elements of the first gradient segment.
703. If the vector modular length is equal to 0, determine any column in the first matrix or any row in the second matrix as the target code.
If the vector modular length is equal to 0, every element is 0, so any one column or row can be selected simply to produce an output in the same format as other quantized representations; for example, the first column or the first row may be selected.
If the first gradient segment is denoted by g, this can be written as: if $\|g\|_2 = 0$, then return $(0, c_1)$. Here $\|g\|_2$ is the square root of the sum of the squares of the elements in the first gradient segment, i.e. the vector modular length of the first gradient segment g. For the first gradient segment g, if the vector modular length is equal to 0, a quantized representation (0, index(x, 1)) may be output, where the vector modular length is equal to 0, the code index is the first row or the first column in the x-th codebook, and x represents the index of the codebook, which may be the identifier or the number of the codebook. Of course, the first column or the first row need not be the one selected; any column or row may be selected at random.
704. If the vector modular length is not equal to 0, determine a first coefficient vector of the first gradient segment according to the first codebook and the first gradient segment.
The first coefficient vector includes m elements.
If the first codebook is denoted by C, the first gradient segment by g and the first coefficient vector by p, then $p = C^T (C C^T)^{-1} g$, where $C^T$ represents the transposed matrix of C. p includes m elements.
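The coefficient vector $p = C^T (C C^T)^{-1} g$ can be computed without forming the inverse explicitly by solving $(C C^T) y = g$ and then taking $p = C^T y$. The sketch below does this in plain Python; the helper names and the tiny d' = 2, m = 3 codebook are illustrative assumptions, and the codebook is stored as a list of its m columns.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("C C^T is singular; the codes do not span the segment space")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def coefficient_vector(codes, g):
    """Compute p = C^T (C C^T)^-1 g, where `codes` lists the m columns of C."""
    d_prime = len(g)
    # (C C^T)[r][s] is the sum over all codes c of c[r] * c[s].
    gram = [[sum(c[r] * c[s] for c in codes) for s in range(d_prime)]
            for r in range(d_prime)]
    y = solve_linear_system(gram, g)
    # p_j is the inner product of the j-th code with y.
    return [sum(cj * yj for cj, yj in zip(c, y)) for c in codes]


codes = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # three unit-length codes, d' = 2
g = [1.0, 2.0]
p = coefficient_vector(codes, g)
reconstructed = [sum(p[j] * codes[j][r] for j in range(len(codes))) for r in range(len(g))]
print(p, reconstructed)
```

Because $C p = g$ for this choice of p, the segment is an exact linear combination of the codes. The solve requires $C C^T$ to be invertible, which is only possible when m ≥ d'.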
705. Normalize each element in the first coefficient vector to obtain a second coefficient vector.
The second coefficient vector includes m normalized elements.
The second coefficient vector is denoted by $\bar{p}$. $\bar{p}$ includes m normalized elements, and the element normalization can be understood as $\bar{p}_b = |p_b| / \|p\|_1$, where $\bar{p}_b$ represents the b-th of the m normalized elements, $|p_b|$ represents the absolute value of the b-th element in the first coefficient vector p, and $\|p\|_1$ represents the sum of the absolute values of the m elements of the first coefficient vector p.
706. For the normalized m elements, use a roulette wheel selection strategy to determine the ith column in the first matrix or the ith row in the second matrix corresponding to the ith element.
The ith element is the element selected by the roulette wheel selection strategy.
In this scheme, each normalized element can be regarded as a probability, and the probabilities sum to 1. An element with a larger probability is more likely to be selected and an element with a smaller probability is less likely to be selected, but the final result is whichever element the roulette wheel lands on. If the b-th normalized element is selected, the code of the column or row corresponding to the b-th element is selected as the target code; likewise, if the ith normalized element is selected, the code of the column or row corresponding to the ith element is selected as the target code.
If the target code selected in this step is $c_b$, the pseudo-modular length is determined as $u = \operatorname{sign}(p_b)\,\|p\|_1$, where $\|p\|_1$ represents the sum of the absolute values of the elements in the first coefficient vector and sign is the sign function. That is, if the target code is $c_b$, the pseudo-modular length is the product of the sign of the b-th element and the sum of the absolute values of the elements in the first coefficient vector.
The code index is index(x, b); return (u, index(x, b)) denotes outputting the quantized representation (u, index(x, b)).
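Steps 705 and 706 can be sketched as follows, taking a coefficient vector p as input. Two assumptions are made here and should be read as such: the pseudo-modular length is taken to be the sign of the selected element times the sum of absolute values, as the mention of the sign function suggests, and the standard library's weighted sampling (`random.Random.choices`) stands in for the roulette wheel. The function name is hypothetical, and indices are 0-based, unlike the 1-based column numbering in the text.

```python
import random


def roulette_quantize(p, rng, zero_index=0):
    """Select a code index with probability |p_b| / ||p||_1 and return (u, b)."""
    l1_norm = sum(abs(x) for x in p)
    if l1_norm == 0.0:
        # All-zero segment (step 703): any code works, so a fixed one is used.
        return (0.0, zero_index)
    probabilities = [abs(x) / l1_norm for x in p]  # normalized elements, sum to 1
    b = rng.choices(range(len(p)), weights=probabilities)[0]
    sign = 1.0 if p[b] > 0 else -1.0
    return (sign * l1_norm, b)  # assumed pseudo-modular length: sign(p_b) * ||p||_1


rng = random.Random(0)
p = [0.5, -1.5, 0.0, 2.0]  # illustrative coefficient vector, ||p||_1 = 4.0
samples = [roulette_quantize(p, rng) for _ in range(1000)]
print(samples[:5])
```

Under these assumptions, averaging u times the selected code over many draws approaches the sum of p_b times c_b, which is why a randomized choice of a single code can still represent the segment in expectation.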
The second mode may include:
determining a projection vector of the first gradient segment according to the first codebook and the first gradient segment, where the projection vector includes m elements;
determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to the ith element, where the absolute value of the ith element is the largest among the absolute values of the m elements.
In this possible implementation, if the first codebook is denoted by C, the first gradient segment by g, and the projection vector by p, then $p = C^T g$, where $C^T$ represents the transposed matrix of C. The projection vector p includes m elements, and $|p_i|$ represents the absolute value of the ith element of the m elements. If i is 5, the 5th column in the first matrix or the 5th row in the second matrix is selected as the target code.
If the target code selected in this step is $c_i$, the pseudo-modular length is $u = p_i$, i.e. the pseudo-modular length is equal to the ith element of the projection vector p.
The code index is index(x, i); return (u, index(x, i)) denotes outputting the quantized representation (u, index(x, i)).
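The second mode is deterministic and can be sketched in a few lines. The function name and the small codebook are illustrative assumptions; the codebook is stored as a list of its m columns, and the returned index is 0-based.

```python
def max_projection_quantize(codes, g):
    """Project g onto every code (p = C^T g) and keep the largest |p_i|."""
    p = [sum(ci * gi for ci, gi in zip(code, g)) for code in codes]
    i = max(range(len(p)), key=lambda j: abs(p[j]))
    return (p[i], i)  # pseudo-modular length u = p_i and the code index


codes = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # three unit-length codes, d' = 2
g = [1.0, 2.0]
u, index = max_projection_quantize(codes, g)
print(u, index)
```

Compared with the first mode, this needs no linear solve and no random draw, at the cost of always picking the single best-aligned code.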
If there is only one codebook, the code index may not include the index x of the codebook.
When performing inverse quantization, the cloud device may: for each first quantized representation in the first quantization set, determine the target code of the first quantized representation from the first codebook according to the code index in the first quantized representation; and recover the first gradient segment corresponding to the first quantized representation according to the pseudo-modular length in the first quantized representation and the target code.
That is, the cloud device may determine the target code $c_i$ from index(i), and then multiply each element of the target code by the pseudo-modular length u to recover the corresponding gradient segment.
The cloud device receives the quantized representations sent by the terminal device and performs the reverse operation on them to obtain an approximation of the gradient data. Taking the FC2-weight gradient as an example, the gradient has 75 quantized gradient segments. Suppose the quantized representation of one gradient segment is (0.523641, (3, 25)), which indicates that the 25th code c is selected from codebook 3; the approximation of that gradient segment is then 0.523641 × c. Splicing the approximations of the 75 gradient segments yields an approximate representation of the FC2-weight gradient data.
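Inverse quantization followed by splicing can be sketched as below, assuming a single codebook stored as a list of codes and quantized representations given as (u, index) pairs with 0-based indices. The function name and the sample values are hypothetical.

```python
def dequantize(quantization_set, codes):
    """Recover an approximate gradient by scaling each code and splicing in order."""
    gradient = []
    for u, index in quantization_set:
        # Each recovered segment is the pseudo-modular length times the target code.
        gradient.extend(u * x for x in codes[index])
    return gradient


codes = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
quantization_set = [(2.0, 2), (0.0, 0), (-1.0, 1)]  # one (u, index) per segment
recovered = dequantize(quantization_set, codes)
print(recovered)
```

The order of the quantized representations in the set determines the order of the recovered segments, so it has to match the order in which the terminal device divided the gradient data.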
It should be noted that the codebook in the above examples is described using the structure of the first matrix with d' rows and m columns. If the codebook is the second matrix with d' columns and m rows, the second matrix needs to be transposed first to convert it into the first matrix, which is then used as C in the corresponding formulas above.
To further save communication overhead, the cloud device may further:
respectively carrying out quantization conversion on the gradient aggregation data of the B weights to obtain respective quantization sets of the B gradient aggregation data;
and sending respective quantization sets of the B gradient aggregation data to the terminal device, wherein each quantization set of the gradient aggregation data comprises a corresponding quantization representation of each gradient segment of the gradient aggregation data.
In this possible implementation manner, the cloud device may also perform quantization conversion on the gradient aggregation data to be sent to the terminal device, which is similar to that performed by the terminal device on the gradient data, so that the communication overhead between the cloud device and the terminal device may be further reduced.
After the terminal device receives the respective quantization sets of the B gradient aggregation data corresponding to the B gradient data sent by the cloud device, the B gradient aggregation data can be restored by adopting the above process of inverse quantization of the cloud device, and then the weight update is performed.
At present, several mainstream methods are used in gradient data transmission: stochastic gradient descent (SGD), quantized SGD (QSGD), sign SGD (signSGD), and ternary gradients (TernGrad). The quantization conversion of the embodiment of the present application is also a gradient compression method, which may be referred to as hyper-sphere quantization (HSQ).
In developing HSQ, engineers tested the above methods on 3 popular deep learning models, VGG19, ResNet50, and ResNet101, using the same dataset in a simulated environment. The compression ratio (with pure SGD as the baseline) and the convergence accuracy of each algorithm are listed in table 2, where d' represents the gradient segment size.
Table 2: Compression ratio and convergence accuracy of several compression methods
[Table 2: image not reproduced]
As can be seen from table 2, when d' is 8 and when d' is 16, HSQ obtains higher convergence accuracy than SGD, with compression ratios of 18.3 times and 36.6 times, respectively. When d' is 64, the compression ratio of HSQ is significantly higher than that of the other algorithms, and the loss of convergence accuracy is small. Compared with the existing gradient compression methods, HSQ therefore has advantages in both gradient compression ratio and convergence accuracy.
The model training system based on end-cloud combination and the data processing method in the model training are described above, and the terminal device and the cloud device provided by the embodiment of the application are introduced below with reference to the accompanying drawings.
As shown in fig. 9, an embodiment of a terminal device 80 provided in the embodiment of the present application may include:
the processing unit 801 is configured to:
obtaining A pieces of gradient data of a model to be trained, wherein A is a positive integer;
for each first gradient data of B gradient data, dividing the first gradient data into at least two gradient segments, where the first gradient data includes d elements, the B gradient data are contained in the A gradient data, B is a positive integer less than or equal to A, each gradient segment includes d' elements, d' is a positive integer greater than 2, and d is evenly divisible by d';
performing quantization conversion on each gradient segment in the at least two gradient segments by using a first codebook of the first gradient data to obtain a quantized representation corresponding to each gradient segment, where the number of elements in the quantized representation is less than d', the first codebook is a first matrix with d' rows and m columns or a second matrix with d' columns and m rows, m is a positive integer greater than 2, and m is greater than or equal to d';
a sending unit 802, configured to send, to a cloud device, B quantization sets corresponding to the B gradient data, where a first quantization set corresponding to the first gradient data includes a respective quantization representation corresponding to each gradient segment.
The gradient data to be transmitted is divided into gradient segments, and quantization conversion is then performed according to a codebook to obtain quantized representations, where the number of elements in each quantized representation is smaller than the number of elements in each gradient segment. The amount of quantized data is thereby further reduced compared with ordinary quantization compression, and the compression ratio is improved. In addition, because the compression ratio is higher, the communication overhead when the terminal device transmits the quantized representations to the cloud device is also reduced.
In a possible embodiment, the sending unit 802 is further configured to send, to the cloud device, the (A-B) gradient data other than the B gradient data in the A gradient data.
In this possible implementation, the data volume of the (A-B) gradient data may be small, and quantization conversion itself consumes computing resources. Considering the overall benefit, the (A-B) gradient data may therefore be transmitted directly without compression, or may be compressed by another compression method and then transmitted.
In a possible embodiment, the processing unit 801 is further configured to:
determining that B gradient data in the A gradient data need to be quantized according to the respective data volume of each gradient data;
for each first gradient data in the B gradient data, determining d' and m according to the number d of elements of the first gradient data, where d' is 2 to the power p, m is 2 to the power q, p and q are both positive integers, and q is greater than or equal to p;
randomly generating the first matrix with d 'rows and m columns or the second matrix with d' columns and m rows for the first gradient data according to a random seed, and scaling each Gaussian vector in the first matrix or the second matrix to be modulo 1 to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
In this possible implementation, a data amount threshold for performing quantization conversion may be determined in view of the overall benefit. For example, the data amount threshold is set to 512 elements: if the number of elements in a gradient data is greater than or equal to 512, quantization conversion is required; if it is less than 512, quantization conversion is not required. The terminal device only needs to generate a codebook for gradient data that requires quantization conversion. This reduces the computational cost of generating codebooks and improves the overall balance between computational cost and communication cost. To ensure that the cloud device can correctly recover the gradient data, the random seed used by the terminal device to generate the codebook is the same as that used by the cloud device. The random seed may be a numerical value, and a codebook for a gradient data can be randomly generated by inputting the random seed into a random function.
In a possible embodiment, the receiving unit 803 is configured to receive a codebook corresponding to each of the B gradient data sent by the cloud device, where the codebook corresponding to each of the B gradient data includes the first codebook.
In this possible implementation manner, the terminal device may not be required to generate the codebook by itself, and the codebook may be generated by the cloud device and then issued to the terminal device, so that the calculation overhead of the terminal device may be reduced.
In a possible embodiment, the processing unit 801 is configured to:
determining a target code for a first gradient segment from the first codebook, wherein the target code is a column in the first matrix or a row in the second matrix, and the first gradient segment is any one of the at least two gradient segments;
determining a pseudo-modulo length and a code index of the target code;
determining the pseudo-modulo length and the code index as a quantized representation of the first gradient segment.
In a possible embodiment, the processing unit 801 is configured to:
determining a vector modulo length of the first gradient segment;
and if the vector modulo length is equal to 0, determining any column in the first matrix or any row in the second matrix as the target code.
In a possible embodiment, the processing unit 801 is further configured to:
if the vector modulo length is not equal to 0, determining a first coefficient vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the first coefficient vector comprises m elements;
normalizing each element in the first coefficient vector to obtain a second coefficient vector, wherein the second coefficient vector comprises m normalized elements;
and adopting a roulette selection strategy for the normalized m elements, and determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to the ith element, wherein the ith element is the element selected by adopting the roulette selection strategy.
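The roulette selection described above can be sketched as follows. The text does not state in this passage how the first coefficient vector is computed or normalized, so this sketch assumes that each coefficient is the inner product of a code with the gradient segment and that normalization divides each absolute value by the sum of absolute values. The codebook is assumed to be stored in the second-matrix form (m rows, each row a code of d' elements), and all function names are hypothetical.

```python
import math
import random


def coefficient_vector(codebook_rows, segment):
    """Assumed form: one inner product per code (m elements in total)."""
    return [sum(c * g for c, g in zip(code, segment)) for code in codebook_rows]


def roulette_select(codebook_rows, segment, rng):
    """Return the index i of the code chosen for `segment`."""
    modulo_length = math.sqrt(sum(g * g for g in segment))
    if modulo_length == 0.0:
        return 0  # any code may be chosen for an all-zero segment

    coefficients = coefficient_vector(codebook_rows, segment)
    total = sum(abs(c) for c in coefficients)
    if total == 0.0:
        return 0  # assumption: fall back when every coefficient is zero

    # Normalized elements act as selection probabilities that sum to 1.
    probabilities = [abs(c) / total for c in coefficients]

    # Spin the wheel: walk the cumulative distribution until it passes r.
    r = rng.random()
    cumulative = 0.0
    for index, probability in enumerate(probabilities):
        cumulative += probability
        if r < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point rounding
```

A code whose coefficient has a larger absolute value is selected with proportionally higher probability, while a code with a zero coefficient is never selected.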
In a possible embodiment, the processing unit 801 is configured to:
determining a projection vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the projection vector comprises m elements;
determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to an ith element, wherein the absolute value of the ith element is the largest among the absolute values of the m elements.
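The projection-based selection above can be sketched as follows. It is assumed here that the projection vector consists of the inner products of the gradient segment with each code, and, purely for illustration, that the pseudo-modulo length is the signed projection onto the selected code; the passage above does not define the pseudo-modulo length, so that choice is an assumption. The codebook is again assumed to be in the second-matrix form, and the function name is hypothetical.

```python
def quantize_by_projection(codebook_rows, segment):
    """Return (pseudo_modulo_length, code_index) for one gradient segment."""
    # Projection vector with m elements, one per code (assumed: inner products).
    projection = [
        sum(c * g for c, g in zip(code, segment)) for code in codebook_rows
    ]
    # Index of the element with the largest absolute value.
    code_index = max(range(len(projection)), key=lambda i: abs(projection[i]))
    # Assumption: the signed projection serves as the pseudo-modulo length.
    pseudo_modulo_length = projection[code_index]
    return pseudo_modulo_length, code_index
```

With this sketch, a segment of d' elements is represented by two values, which is fewer than d' elements because d' is greater than 2.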
In a possible embodiment, the receiving unit 803 is configured to receive a respective quantization set of B gradient aggregation data corresponding to the B gradient data sent by the cloud device, where each quantization set of gradient aggregation data includes a quantization representation corresponding to each gradient segment of the gradient aggregation data.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules of the terminal device 80 are based on the same concept as the method embodiment of the present application, the technical effect brought by the contents is the same as the method embodiment of the present invention, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
Referring to fig. 10, an embodiment of a cloud device 90 provided in the embodiment of the present application may include:
a receiving unit 901, configured to receive B quantization sets corresponding to B gradient data sent by a terminal device, where each quantization set includes a respective quantization representation corresponding to each gradient segment of the corresponding gradient data;
the processing unit 902 is configured to:
for each first quantization set in the B quantization sets, performing inverse quantization on each quantization representation in the first quantization set by using a first codebook corresponding to the first quantization set to obtain a first gradient segment corresponding to each quantization representation, wherein the first gradient segment comprises d 'elements, the number of the elements in the quantization representation is less than d', the first codebook is a first matrix with d 'rows and m columns or a second matrix with d' columns and m rows, d 'and m are positive integers greater than 2, and m is greater than or equal to d';
splicing first gradient segments corresponding to each quantization representation in the first quantization set to obtain first gradient data corresponding to the first quantization set, wherein the first gradient data comprises d elements, d is a positive integer, and d can be evenly divided by d';
performing gradient aggregation on N first gradient data of a first weight to obtain gradient aggregation data of the first weight, wherein the N first gradient data of the first weight correspond to N terminal devices, and N is an integer greater than 1;
a sending unit 903, configured to send the gradient aggregation data of each of the B weights to the terminal device.
In the embodiment of the application, after receiving the quantized representations, the cloud device performs inverse quantization, that is, quantization recovery, by using the same codebook that the terminal device used for quantization conversion, and then performs gradient aggregation. Gradient aggregation means that the cloud device aggregates the gradients of the same weight received from a plurality of terminal devices. After the cloud device sends the gradient aggregation data to the terminal device, the terminal device can update the weights, so that the model to be trained converges further. The weight may be updated by subtracting the gradient aggregation data calculated in the current round from the weight of the current round to obtain an updated weight, which is used for the next round of training. Because the cloud device receives from the terminal device the quantized representations obtained through quantization conversion, and the number of elements in each quantized representation is less than the number d' of elements in each gradient segment, the gradient data can be received with small communication overhead.
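The aggregation and weight update described above can be sketched as follows. The text does not specify the aggregation operation, so element-wise averaging over the N terminal devices is assumed here; the update follows the text by subtracting the gradient aggregation data from the current weight, without any learning-rate factor. The function names are hypothetical.

```python
def aggregate_gradients(gradients_from_devices):
    """Aggregate N recovered gradients of the same weight.

    Assumption: aggregation is an element-wise mean over the N devices.
    """
    n = len(gradients_from_devices)
    return [sum(values) / n for values in zip(*gradients_from_devices)]


def update_weight(current_weight, gradient_aggregation_data):
    """Subtract the aggregated gradient from the current-round weight."""
    return [w - g for w, g in zip(current_weight, gradient_aggregation_data)]
```

The updated weight returned by `update_weight` would then be used by the terminal device for the next round of training.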
In one possible embodiment, the processing unit 902 is further configured to: for each first gradient data in B gradient data, randomly generating a first matrix with d' rows and m columns or a second matrix with d' columns and m rows for the first gradient data according to a random seed, and scaling each Gaussian vector in the first matrix or the second matrix so that its modulo length is 1, to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
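Codebook generation from a random seed can be sketched as follows. The sketch produces the second-matrix form (m rows and d' columns, each row a Gaussian vector scaled to a modulo length of 1). The use of Python's `random.Random` as the random function is an assumption for illustration; both sides would need to use the same generator implementation for identical seeds to yield identical codebooks.

```python
import math
import random


def generate_codebook(seed, d_prime, m):
    """Return m unit-length Gaussian vectors, each with d_prime elements."""
    rng = random.Random(seed)  # same seed on both devices gives same codebook
    codebook_rows = []
    for _ in range(m):
        # Draw a Gaussian vector with independent standard normal elements.
        vector = [rng.gauss(0.0, 1.0) for _ in range(d_prime)]
        # Scale the vector so that its modulo length (Euclidean norm) is 1.
        modulo_length = math.sqrt(sum(x * x for x in vector))
        codebook_rows.append([x / modulo_length for x in vector])
    return codebook_rows
```

Calling `generate_codebook(seed, d_prime, m)` with the same arguments on the terminal device and on the cloud device yields the same matrix, so the codebook itself does not have to be transmitted in this variant.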
In a possible embodiment, the sending unit 903 is further configured to send, to the terminal device, codebooks corresponding to the B pieces of gradient data.
In one possible embodiment, the processing unit 902 is configured to:
for each first quantized representation in the first quantization set, determining a target code of the first quantized representation from the first codebook according to a code index of the first quantized representation;
and restoring a first gradient segment corresponding to the first quantized representation according to the pseudo-modulo length in the first quantized representation and the target code.
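Inverse quantization and splicing can be sketched as follows. It is assumed, for illustration only, that a gradient segment is restored by multiplying the target code by the pseudo-modulo length; the passage above states only that the segment is restored according to these two values. The codebook is assumed to be in the second-matrix form, and the function name is hypothetical.

```python
def dequantize_set(quantization_set, codebook_rows):
    """Recover gradient data of d elements from one quantization set.

    `quantization_set` is a list of (pseudo_modulo_length, code_index) pairs,
    one pair per gradient segment, in segment order.
    """
    gradient_data = []
    for pseudo_modulo_length, code_index in quantization_set:
        target_code = codebook_rows[code_index]  # look up the code by index
        # Assumption: restored segment = pseudo-modulo length * target code.
        segment = [pseudo_modulo_length * c for c in target_code]
        gradient_data.extend(segment)  # splice the segments in order
    return gradient_data
```

Each pair yields one segment of d' elements, so a quantization set with d/d' pairs is spliced back into gradient data with d elements.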
In one possible embodiment, the processing unit 902 is further configured to: respectively carrying out quantization conversion on the gradient aggregation data of the B weights to obtain respective quantization sets of the B gradient aggregation data;
a sending unit 903, configured to send a quantization set of each of the B gradient aggregation data to the terminal device, where the quantization set of each gradient aggregation data includes a quantization representation corresponding to each gradient segment of the gradient aggregation data.
In this possible embodiment, the cloud device may also perform quantization conversion on the gradient aggregation data to be sent to the terminal device, similarly to that performed by the terminal device on the gradient data, so that the communication overhead of the cloud device and the terminal device may be further reduced.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules of the cloud device 90 are based on the same concept as the method embodiment of the present application, the technical effect brought by the contents is the same as the method embodiment of the present invention, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
Fig. 11 is a schematic structural diagram of another device in the embodiment of the present application. The device is a terminal device, and the terminal device may include: a processor 1001 (e.g., a CPU), a memory 1002, a transmitter 1004, and a receiver 1003. The transmitter 1004 and the receiver 1003 are coupled to the processor 1001, and the processor 1001 controls the transmitting action of the transmitter 1004 and the receiving action of the receiver 1003. The memory 1002 may include a high-speed RAM, and may also include a non-volatile memory (NVM), such as at least one disk memory, in which various instructions may be stored for performing various processing functions and implementing the method steps of the embodiments of the present application. Optionally, the terminal device in the embodiment of the present application may further include one or more of a power supply 1005 and a communication port 1006. The components of the terminal device may be connected through a communication bus or in other manners, which is not limited in the embodiment of the present application. The receiver 1003 and the transmitter 1004 may be integrated into a transceiver of the terminal device, or may be separate transmitting and receiving antennas on the terminal device. The communication bus is used to implement communication connections among the components. The communication port 1006 is used to implement connection and communication between the terminal device and other peripherals.
In some embodiments, the processor 1001 in the terminal device may perform the actions performed by the processing unit 801 in fig. 9, the receiver 1003 in the terminal device may perform the actions performed by the receiving unit 803 in fig. 9, and the transmitter 1004 in the terminal device may perform the actions performed by the transmitting unit 802 in fig. 9, which have similar implementation principles and technical effects and are not described herein again.
The present application further provides a chip system, which includes a processor, and is configured to support the terminal device to implement the functions related thereto, for example, to receive or process data related to the foregoing method embodiments. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the terminal device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
Fig. 12 is a schematic diagram of a possible logical structure of the cloud device 110 in the foregoing embodiments, according to an embodiment of the present application. The cloud device 110 includes: a processor 1101, a communication port 1102, a memory 1103, and a bus 1104. The processor 1101, the communication port 1102, and the memory 1103 are interconnected by the bus 1104. In the embodiment of the present application, the processor 1101 is configured to control and manage the actions of the cloud device 110; for example, the processor 1101 is configured to execute the functions executed by the processing unit 902 in fig. 10. The communication port 1102 is used to support communication of the cloud device 110. The memory 1103 is used to store program codes and data of the cloud device 110.
The processor 1101 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or another programmable logic device, transistor logic, a hardware component, or any combination thereof. The processor may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The bus 1104 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 12, but this does not mean that there is only one bus or one type of bus.
The present application further provides a chip system, which includes a processor, configured to support the cloud device to implement the functions related thereto, for example, to receive or process data related to the foregoing method embodiments. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the cloud device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
In another embodiment of the present application, a computer-readable storage medium is further provided, in which computer-executable instructions are stored, and when at least one processor of a device executes the computer-executable instructions, the device performs the method described in the above-mentioned embodiments of fig. 6 to 8.
In another embodiment of the present application, there is also provided a computer program product comprising computer executable instructions stored in a computer readable storage medium; the computer executable instructions may be read by at least one processor of the device from a computer readable storage medium, and execution of the computer executable instructions by the at least one processor causes the device to perform the method described in the embodiments of fig. 6-8 above.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application essentially, or the part thereof that contributes to the prior art, or a part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (27)

1. A method of data processing, comprising:
obtaining A pieces of gradient data of a model to be trained, wherein A is a positive integer;
for each first gradient data of B gradient data, dividing the first gradient data into at least two gradient segments, wherein the first gradient data comprises d elements, the B gradient data is contained in the A gradient data, B is a positive integer and is less than or equal to A, each gradient segment comprises d' elements, d' is a positive integer greater than 2, and d is evenly divisible by d';
performing quantization conversion on each gradient segment in the at least two gradient segments by using a first codebook of the first gradient data to obtain a quantization representation corresponding to each gradient segment, wherein the number of elements in the quantization representation is less than d', the first codebook is a first matrix with d' rows and m columns or a second matrix with d' columns and m rows, m is a positive integer greater than 2, and m is greater than or equal to d';
transmitting B quantized sets of the B gradient data to a cloud device, wherein a first quantized set of the first gradient data includes a quantized representation of each of the gradient segments.
2. The method of claim 1, further comprising:
determining that B gradient data in the A gradient data need to be quantized according to the respective data volume of each gradient data;
for each first gradient data in the B gradient data, determining d' and m according to the number d of elements of the first gradient data, wherein d' is 2 to the power of p, m is 2 to the power of q, p and q are both positive integers, and q is greater than or equal to p;
randomly generating the first matrix with d' rows and m columns or the second matrix with d' columns and m rows for the first gradient data according to a random seed, and scaling each Gaussian vector in the first matrix or the second matrix so that its modulo length is 1, to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
3. The method of claim 1, further comprising:
and receiving a codebook corresponding to each of the B gradient data sent by the cloud device, wherein the codebook corresponding to each of the B gradient data includes the first codebook.
4. The method according to any of claims 1-3, wherein said performing a quantization transformation on each of the at least two gradient segments using the first codebook of the first gradient data to obtain a respective quantized representation of each of the at least two gradient segments comprises:
determining a target code for a first gradient segment from the first codebook, wherein the target code is a column in the first matrix or a row in the second matrix, and the first gradient segment is any one of the at least two gradient segments;
determining a pseudo-modulo length and a code index of the target code;
determining the pseudo-modulo length and the code index as a quantized representation of the first gradient segment.
5. The method of claim 4, wherein determining a target code for a first gradient segment from the first codebook comprises:
determining a vector modulo length of the first gradient segment;
and if the vector modulo length is equal to 0, determining any column in the first matrix or any row in the second matrix as the target code.
6. The method of claim 5, further comprising:
if the vector modulo length is not equal to 0, determining a first coefficient vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the first coefficient vector comprises m elements;
normalizing each element in the first coefficient vector to obtain a second coefficient vector, wherein the second coefficient vector comprises m normalized elements;
and adopting a roulette selection strategy for the normalized m elements, and determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to the ith element, wherein the ith element is the element selected by adopting the roulette selection strategy.
7. The method of claim 4, wherein determining a target code for a first gradient segment from the first codebook comprises:
determining a projection vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the projection vector comprises m elements;
determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to an ith element, wherein the absolute value of the ith element is the largest among the absolute values of the m elements.
8. The method of any one of claims 1-7, further comprising:
receiving respective quantization sets of B gradient aggregation data corresponding to the B gradient data sent by the cloud device, where each quantization set of gradient aggregation data includes a quantization representation corresponding to each gradient segment of the gradient aggregation data.
9. A method of data processing, comprising:
receiving B quantization sets corresponding to B pieces of gradient data sent by a terminal device, wherein each quantization set comprises a quantization representation corresponding to each gradient segment of the corresponding gradient data;
for each first quantization set in the B quantization sets, performing inverse quantization on each quantization representation in the first quantization set by using a first codebook corresponding to the first quantization set to obtain a first gradient segment corresponding to each quantization representation, wherein the first gradient segment comprises d 'elements, the number of the elements in the quantization representation is less than d', the first codebook is a first matrix with d 'rows and m columns or a second matrix with d' columns and m rows, d 'and m are positive integers greater than 2, and m is greater than or equal to d';
splicing first gradient segments corresponding to each quantization representation in the first quantization set to obtain first gradient data corresponding to the first quantization set, wherein the first gradient data comprises d elements, d is a positive integer, and d can be evenly divided by d';
performing gradient aggregation on N first gradient data of a first weight to obtain gradient aggregation data of the first weight, wherein the N first gradient data of the first weight correspond to N terminal devices, and N is an integer greater than 1;
and sending the gradient aggregation data of the B weights to the terminal equipment.
10. The method of claim 9, further comprising:
for each first gradient data in B gradient data, randomly generating a first matrix with d' rows and m columns or a second matrix with d' columns and m rows for the first gradient data according to a random seed, and scaling each Gaussian vector in the first matrix or the second matrix so that its modulo length is 1, to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
11. The method of claim 10, further comprising:
and sending the codebook corresponding to the B gradient data to the terminal equipment.
12. The method according to any of claims 9-11, wherein said inverse quantizing each quantized representation in said first quantized set using a first codebook corresponding to said first quantized set to obtain a respective first gradient segment corresponding to said each quantized representation comprises:
for each first quantized representation in the first quantization set, determining a target code of the first quantized representation from the first codebook according to a code index of the first quantized representation;
and restoring a first gradient segment corresponding to the first quantized representation according to the pseudo-modulo length in the first quantized representation and the target code.
13. The method according to any one of claims 9-12, further comprising:
respectively carrying out quantization conversion on the gradient aggregation data of the B weights to obtain respective quantization sets of the B gradient aggregation data;
the sending, to the terminal device, gradient aggregation data of each of the B weights includes:
and sending respective quantization sets of the B gradient aggregation data to the terminal device, wherein each quantization set of the gradient aggregation data comprises a corresponding quantization representation of each gradient segment of the gradient aggregation data.
14. A terminal device comprising a transceiver, a processor and a memory, said processor being coupled to said memory, characterized in that said memory is adapted to store a program;
the processor is configured to:
obtaining A pieces of gradient data of a model to be trained, wherein A is a positive integer;
for each first gradient data of B gradient data, dividing the first gradient data into at least two gradient segments, wherein the first gradient data comprises d elements, the B gradient data is contained in the A gradient data, B is a positive integer and is less than or equal to A, each gradient segment comprises d' elements, d' is a positive integer greater than 2, and d is evenly divisible by d';
performing quantization conversion on each gradient segment in the at least two gradient segments by using a first codebook of the first gradient data to obtain a quantization representation corresponding to each gradient segment, wherein the number of elements in the quantization representation is less than d', the first codebook is a first matrix with d' rows and m columns or a second matrix with d' columns and m rows, m is a positive integer greater than 2, and m is greater than or equal to d';
the transceiver is configured to transmit, to a cloud device, B quantized sets of the B gradient data, wherein a first quantized set of the first gradient data includes a respective quantized representation of each of the gradient segments.
15. The terminal device of claim 14,
the processor is further configured to:
determining that B gradient data in the A gradient data need to be quantized according to the respective data volume of each gradient data;
for each first gradient data in the B gradient data, determining d' and m according to the number d of elements of the first gradient data, wherein d' is 2 to the power of p, m is 2 to the power of q, p and q are both positive integers, and q is greater than or equal to p;
randomly generating the first matrix with d' rows and m columns or the second matrix with d' columns and m rows for the first gradient data according to a random seed, and scaling each Gaussian vector in the first matrix or the second matrix so that its modulo length is 1, to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
16. The terminal device of claim 14,
the transceiver is configured to receive a codebook corresponding to each of the B gradient data sent by the cloud device, where the codebook corresponding to each of the B gradient data includes the first codebook.
17. The terminal device according to any of claims 14-16,
the processor is configured to:
determining a target code for a first gradient segment from the first codebook, wherein the target code is a column in the first matrix or a row in the second matrix, and the first gradient segment is any one of the at least two gradient segments;
determining a pseudo-modulo length and a code index of the target code;
determining the pseudo-modulo length and the code index as a quantized representation of the first gradient segment.
18. The terminal device of claim 17,
the processor is configured to:
determining a vector modulo length of the first gradient segment;
and if the vector modulo length is equal to 0, determining any column in the first matrix or any row in the second matrix as the target code.
19. The terminal device of claim 18,
the processor is further configured to:
if the vector modulo length is not equal to 0, determining a first coefficient vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the first coefficient vector comprises m elements;
normalizing each element in the first coefficient vector to obtain a second coefficient vector, wherein the second coefficient vector comprises m normalized elements;
and adopting a roulette selection strategy for the normalized m elements, and determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to the ith element, wherein the ith element is the element selected by adopting the roulette selection strategy.
20. The terminal device of claim 17,
the processor is configured to:
determining a projection vector of the first gradient segment according to the first codebook and the first gradient segment, wherein the projection vector comprises m elements;
determining, as the target code, the ith column in the first matrix or the ith row in the second matrix corresponding to an ith element, wherein the absolute value of the ith element is the largest among the absolute values of the m elements.
21. The terminal device according to any of claims 14-20,
the transceiver is configured to receive a quantization set of each of B gradient aggregation data corresponding to the B gradient data sent by the cloud device, where each quantization set of gradient aggregation data includes a quantization representation corresponding to each gradient segment of the gradient aggregation data.
22. A cloud device comprising a communication port, a processor and a memory, the processor coupled with the memory, wherein the memory is configured to store a program;
the communication port is used for receiving B quantization sets corresponding to B gradient data sent by a terminal device, wherein each quantization set comprises a respective corresponding quantization representation of each gradient segment of the corresponding gradient data;
the processor is configured to:
for each first quantization set in the B quantization sets, performing inverse quantization on each quantization representation in the first quantization set by using a first codebook corresponding to the first quantization set to obtain a first gradient segment corresponding to each quantization representation, wherein the first gradient segment comprises d 'elements, the number of the elements in the quantization representation is less than d', the first codebook is a first matrix with d 'rows and m columns or a second matrix with d' columns and m rows, d 'and m are positive integers greater than 2, and m is greater than or equal to d';
splicing first gradient segments corresponding to each quantization representation in the first quantization set to obtain first gradient data corresponding to the first quantization set, wherein the first gradient data comprises d elements, d is a positive integer, and d can be evenly divided by d';
performing gradient aggregation on N first gradient data of a first weight to obtain gradient aggregation data of the first weight, wherein the N first gradient data of the first weight correspond to N terminal devices, and N is an integer greater than 1;
and the communication port is used for sending the gradient aggregation data of the B weights to the terminal device.
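The splicing and aggregation steps of the cloud device can be sketched as follows, starting from gradient segments that have already been inverse-quantized. The element-wise mean is an assumption: the claim only says "gradient aggregation" and does not name the operation. The function names `splice` and `aggregate` are hypothetical.

```python
def splice(segments):
    """Concatenate d/d' segments of d' elements each into gradient data of d elements."""
    return [x for segment in segments for x in segment]


def aggregate(gradients):
    """Element-wise mean over the gradient data of N terminal devices (assumption)."""
    n = len(gradients)
    d = len(gradients[0])
    if any(len(g) != d for g in gradients):
        raise ValueError("all gradient data must contain the same number of elements")
    return [sum(values) / n for values in zip(*gradients)]


# Two terminal devices (N = 2), d = 4 elements split into segments of d' = 2.
device_a = splice([[1.0, 2.0], [3.0, 4.0]])
device_b = splice([[3.0, 4.0], [5.0, 6.0]])
aggregated = aggregate([device_a, device_b])
```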
23. The cloud device of claim 22,
the processor is further configured to: for each first gradient data in B gradient data, randomly generating a first matrix with d' rows and m columns or a second matrix with d' columns and m rows for the first gradient data according to a random seed, and scaling each Gaussian vector in the first matrix or the second matrix to have a modulus (norm) of 1 to obtain a first codebook of the first gradient data, wherein each column in the first matrix is a Gaussian vector, and each row in the second matrix is a Gaussian vector.
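A minimal sketch of this codebook generation, using the second-matrix layout (m rows, d' columns, each row a Gaussian vector scaled to norm 1). Standard normal entries and the name `make_codebook` are assumptions made for illustration only.

```python
import math
import random


def make_codebook(seed, m, d_prime):
    """Build m unit-norm Gaussian vectors of d' elements from a random seed."""
    rng = random.Random(seed)
    codebook = []
    for _ in range(m):
        # Draw a Gaussian vector (standard normal entries are an assumption).
        vector = [rng.gauss(0.0, 1.0) for _ in range(d_prime)]
        # Scale the vector so that its Euclidean norm is 1.
        norm = math.sqrt(sum(x * x for x in vector))
        codebook.append([x / norm for x in vector])
    return codebook


codebook = make_codebook(seed=7, m=8, d_prime=4)
```

Because the generator is seeded, calling `make_codebook` again with the same seed and dimensions reproduces the same codebook.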
24. The cloud device of claim 23,
the communication port is further configured to send codebooks corresponding to the B gradient data to the terminal device.
25. The cloud device of any of claims 22-24,
the processor is configured to:
for each first quantization representation in the first quantization set, determining a target code of the first quantization representation from the first codebook according to a code index of the first quantization representation;
and restoring a first gradient segment corresponding to the first quantized representation according to the pseudo modular length in the first quantized representation and the target code.
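The restoration of a gradient segment can be sketched as below. It assumes the quantization representation is a pair (code index, pseudo modular length) and that restoration multiplies the target code by the pseudo modular length, with any sign carried by that scalar; the claim text does not state the formula, and the name `dequantize` is hypothetical.

```python
def dequantize(codebook_rows, code_index, pseudo_length):
    """Restore a gradient segment of d' elements from a two-element representation."""
    # Look up the target code by its index (second-matrix layout: one code per row).
    target_code = codebook_rows[code_index]
    # Assumption: the segment is the target code scaled by the pseudo modular length.
    return [pseudo_length * c for c in target_code]


# Two unit-norm codes with d' = 4 elements each.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.5, -0.5, 0.5, 0.5]]
segment = dequantize(codebook, code_index=1, pseudo_length=4.0)
```

Under this assumption the representation carries two numbers for every d' gradient elements, which matches the requirement that the quantization representation contains fewer than d' elements.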
26. The cloud device of any of claims 22-25,
the processor is further configured to perform quantization conversion on the gradient aggregation data of the B weights respectively to obtain respective quantization sets of the B gradient aggregation data;
the communication port is configured to send respective quantization sets of the B gradient aggregation data to the terminal device, where each quantization set of the gradient aggregation data includes a quantization representation corresponding to each gradient segment of the gradient aggregation data.
27. A computer-readable storage medium comprising a program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 8 or causes the computer to perform the method of any one of claims 9 to 13.
CN201910891200.3A 2019-09-17 2019-09-17 Data processing method and device Pending CN112532251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910891200.3A CN112532251A (en) 2019-09-17 2019-09-17 Data processing method and device

Publications (1)

Publication Number Publication Date
CN112532251A true CN112532251A (en) 2021-03-19

Family

ID=74974543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910891200.3A Pending CN112532251A (en) 2019-09-17 2019-09-17 Data processing method and device

Country Status (1)

Country Link
CN (1) CN112532251A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6035057A (en) * 1997-03-10 2000-03-07 Hoffman; Efrem H. Hierarchical data matrix pattern recognition and identification system
EP2224746A1 (en) * 2009-02-27 2010-09-01 Research In Motion Limited Optimization of image encoding using perceptual weighting
US20170286830A1 (en) * 2016-04-04 2017-10-05 Technion Research & Development Foundation Limited Quantized neural network training and inference
CN108304354A (en) * 2018-01-25 2018-07-20 腾讯科技(深圳)有限公司 Prediction model training method and device, storage medium, and electronic device
CN108805257A (en) * 2018-04-26 2018-11-13 北京大学 Neural network quantization method based on parameter norm
CN109951438A (en) * 2019-01-15 2019-06-28 中国科学院信息工程研究所 Communication optimization method and system for distributed deep learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656272A (en) * 2021-08-16 2021-11-16 Oppo广东移动通信有限公司 Data processing method and device, storage medium, user equipment and server
CN114584436A (en) * 2022-05-06 2022-06-03 北京理工大学 Message aggregation system and method in concurrent communication network of single handshake
CN114584436B (en) * 2022-05-06 2022-07-01 北京理工大学 Message aggregation system and method in concurrent communication network of single handshake

Similar Documents

Publication Publication Date Title
EP3738082B1 (en) Accelerated quantized multiply-and-add operations
WO2022083536A1 (en) Neural network construction method and apparatus
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
US10417525B2 (en) Object recognition with reduced neural network weight precision
CN117456297A (en) Image generation method, neural network compression method, related device and equipment
WO2022111617A1 (en) Model training method and apparatus
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN111882031A (en) Neural network distillation method and device
WO2022228425A1 (en) Model training method and apparatus
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
CN111797970B (en) Method and device for training neural network
CN112580720B (en) Model training method and device
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN111368656A (en) Video content description method and video content description device
CN113326930A (en) Data processing method, neural network training method, related device and equipment
CN113191241A (en) Model training method and related equipment
WO2022179588A1 (en) Data coding method and related device
CN115081588A (en) Neural network parameter quantification method and device
CN111738403A (en) Neural network optimization method and related equipment
CN111950700A (en) Neural network optimization method and related equipment
CN115238909A (en) Data value evaluation method based on federated learning and related equipment thereof
CN113723603A (en) Method, device and storage medium for updating parameters
CN113627163A (en) Attention model, feature extraction method and related device
CN112532251A (en) Data processing method and device
CN114841361A (en) Model training method and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination