CN114841342A - Tensor-based efficient Transformer construction method - Google Patents

Tensor-based efficient Transformer construction method

Info

Publication number
CN114841342A
Authority
CN
China
Prior art keywords
tensor
weight
matrix
att
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210556441.4A
Other languages
Chinese (zh)
Inventor
朱晨露
刘德彬
阮一恒
张立杰
邓贤君
杨天若
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Chutian High Speed Digital Technology Co ltd
Original Assignee
Hubei Chutian High Speed Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Chutian High Speed Digital Technology Co ltd
Priority to CN202210556441.4A
Publication of CN114841342A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention is applicable to the field of artificial intelligence and provides a tensor-based efficient Transformer construction method.

Description

Tensor-based efficient Transformer construction method
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a tensor-based efficient Transformer construction method.
Background
With the rapid development of artificial intelligence, communication technology and electronic chips, the industrial Internet of Things tightly connects billions of terminal devices, such as smart mobile devices, wearables and sensors, in green infrastructure, smart cities, smart healthcare, smart grids and intelligent transportation systems, providing more intelligent services and making production and daily life more convenient. However, the industrial Internet of Things generates a huge amount of multi-source heterogeneous data. Owing to their excellent performance, deep neural network models are widely used for feature extraction and intelligent decision making, but because of the complexity of the tasks and the sheer volume of data, large deep neural network models are usually required to achieve good task performance.
The Transformer is currently a popular deep neural network model and has been widely applied in the industrial Internet of Things. The Transformer completely abandons the design of convolution and recurrence modules; the network consists only of attention layers and fully connected layers. The Transformer and models derived from it, such as BERT, ViT, GPT and the Universal Transformer, achieve excellent task performance in natural language processing, computer vision, recommendation systems, intelligent transportation and other areas. However, these network models are very large, typically containing hundreds of millions of trainable parameters, so high-performance chips running for weeks or even months are needed to train them, and the training process consumes a large amount of computing and energy resources. Furthermore, with the advent of the federated learning paradigm and the increasing demand for real-time applications, smart mobile devices and other embedded devices need to participate in the training of and decision making for intelligent tasks. However, such devices usually have limited chip performance, while high-performance chips are bulky and cannot be mounted on edge devices, so existing large-scale deep neural network models cannot be trained or deployed on smart mobile devices and other embedded devices. In addition, under the federated learning paradigm each client needs to upload network model data to the cloud, where model fusion is performed. Because the network model is huge and its data must be protected during uploading, the model data has to be encrypted; encryption effectively prevents privacy leakage but further increases the amount of communication data. To reduce the communication volume in federated learning, lower bandwidth occupation and improve communication efficiency, it therefore becomes important to reduce the number of training parameters in the network model. Moreover, in real-time applications such as traffic recognition and fault detection, data is vulnerable to network attacks while being uploaded to the cloud, which can cause privacy leakage and safety accidents; uploading the data, processing it in the cloud and returning the result to the terminal also takes time, which increases latency and fails to meet real-time requirements. Offloading the computing task to the cloud is therefore not the best option. Finally, training large-scale network models consumes a large amount of energy, which increases carbon emissions and aggravates the deterioration of the global environment. Therefore, how to reduce the number of training parameters and floating-point operations of a deep neural network model on a large scale, while keeping the model performance unchanged or the accuracy loss within a tolerable range, so as to accelerate training, reduce training energy consumption and facilitate deployment on resource-constrained edge devices, is a problem that urgently needs to be solved.
Fortunately, large-scale network models usually contain a large number of redundant learning parameters and floating-point operations. Eliminating these redundant parts reduces the number of learning parameters and the computational complexity of the model, accelerates training, lowers power consumption and promotes the development of a green economy.
Although the Transformer shows excellent performance in many fields, the complexity of the network model limits its training and deployment on smart mobile devices and other embedded devices, so more and more researchers are studying how to reduce model complexity. Directly optimizing the structure of the Transformer model by reducing the number of blocks shrinks the model and its complexity, but greatly degrades its performance. How to reduce the number of parameters and the computational complexity of the network model while keeping its performance unchanged is therefore a research hotspot. Current model-compression methods mainly include model pruning, low-rank approximation, model quantization and knowledge distillation. Liu et al. quantize a trained Vision Transformer model, reducing its memory footprint and computation cost. Chung et al. propose a mixed-precision quantization strategy that represents Transformer weights with a small number of bits, reducing memory occupation and improving inference speed. Zhu et al. reduce the number of model parameters by pruning the Vision Transformer. Mao et al. prune Transformer components after analyzing their attributes, reducing both the number of parameters and the inference time of the model. Jiao et al. propose a new Transformer distillation method that transfers the knowledge of a complex teacher model to a small student model. As an emerging and efficient compression method, tensor decomposition has achieved excellent results in compressing other network models. Hrinchunk et al. compress the parameters of the embedding layer with a tensor-train decomposition, reducing model complexity. Ma et al. propose a self-attention model based on block-term (BT) decomposition, combining the ideas of tensor decomposition and parameter sharing to reduce the number of Transformer parameters. Although these methods can efficiently reduce the number of learning parameters and the computational complexity of a model, they still have shortcomings. Model quantization greatly reduces memory occupation, but usually at a considerable loss of performance. Model pruning and knowledge distillation reduce model complexity, but are often cumbersome and require the compression scheme to be redesigned for each new model, so their reusability is low. Hrinchunk et al. compress only the embedding layer and offer no compression for models that contain no embedding layer. Ma et al. reduce the number of Transformer learning parameters but break the characteristics of the attention mechanism.
Disclosure of Invention
The invention aims to provide a tensor-based efficient Transformer construction method, in order to solve the problem, identified in the background, that current Transformer models cannot be deployed on smart mobile devices and other embedded devices for training and inference.
The invention provides a tensor-based efficient Transformer construction method, which comprises the following steps:

Step 10: map the weight matrices of the multi-head attention layer, W^Q, W^K, W^V and W^O, to tensor space, and express each of them as a k-mode tensor-chain decomposition.

Step 20: map the input data (Q_att, K_att, V_att) to tensor space, operate on them with the corresponding weight tensor chains of W^Q, W^K and W^V, apply the Attention operation to the result, and then operate on that result with the weight tensor chain of W^O to obtain the final output result, thereby constructing a lightweight tensorized multi-head attention mechanism.

Step 30: integrate the similar linear operations in the multi-head attention layer of the encoder layer and in the multi-head attention layer of the first sublayer of the decoder layer, i.e. splice the corresponding weight tensors into a weight tensor W^QKV, and integrate the similar linear operations in the second sublayer of the decoder layer into a weight tensor W^KV, thereby constructing a lightweight tensorized multi-head attention mechanism++.

Step 40: map the weight matrices W_1 and W_2 of the position-wise feed-forward neural network to tensor space, express them as m-mode tensor decompositions, and operate on them with the input data, thereby constructing a lightweight tensorized position-wise feed-forward network.

Step 50: combine the lightweight tensorized multi-head attention mechanism and the lightweight tensorized position-wise feed-forward network into LTensorized_Transformer, and combine the lightweight tensorized position-wise feed-forward network and the lightweight tensorized multi-head attention mechanism++ into LTensorized_Transformer++, thereby constructing the lightweight Transformer framework.
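Steps 10 and 40 both rely on re-expressing a dimension such as d_model or d_ff as a product of small positive factors before the corresponding weight is mapped to tensor space. As a purely illustrative preliminary (the greedy strategy, the function name and the chosen factor counts below are assumptions, not taken from the patent), such a factorization could be sketched as:

```python
def factorize(n: int, num_factors: int) -> list[int]:
    """Greedily split n into num_factors positive integer factors whose product is n."""
    factors = [1] * num_factors
    d, i = 2, 0
    while n > 1:
        while n % d == 0:
            factors[i % num_factors] *= d  # distribute prime factors round-robin
            n //= d
            i += 1
        d += 1
    return sorted(factors)

print(factorize(512, 4))   # d_model = 512  -> [4, 4, 4, 8]
print(factorize(2048, 4))  # d_ff   = 2048 -> [4, 8, 8, 8]
```

The later sketches in this description fix two factors per dimension simply to keep the contractions short to read.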
Further, the step 10 includes the following specific steps:
Pack the queries, keys and values into the matrices Q_att, K_att and V_att, and apply h linear projections to Q_att, K_att and V_att; the weight matrices involved can be uniformly expressed as
W^Q, W^K and W^V, together with the output projection matrix W^O.

Express d_model as a product of several positive factors, d_model = {d_1 × ... × d_k} × {d_(k+1) × ... × d_(2k)} × ... × {d_((m-1)k+1) × ... × d_(mk)}.

Map the weight matrices W^Q, W^K, W^V and W^O to tensor space to obtain the corresponding weight tensors.

According to equation (1), express these weight tensors in the k-mode tensor-chain decomposition form, i.e. as a chain of k small weight tensor cores contracted in sequence.
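As a concrete illustration of step 10, the sketch below assumes k = 2, d_model = 16 × 32 = 512 and a hypothetical chain rank of 8; it stores a d_model × d_model attention weight as two small cores and rebuilds the full matrix only to make the chain form visible (in the actual method the full matrix would never be materialized). None of the shapes or the rank come from the patent.

```python
import torch

d1, d2, rank = 16, 32, 8           # assumed factorization d_model = d1 * d2
d_model = d1 * d2                   # 512

# Two small weight tensor cores standing in for one d_model x d_model weight matrix.
g1 = torch.randn(d1, d1, rank)      # (input factor 1, output factor 1, rank)
g2 = torch.randn(rank, d2, d2)      # (rank, input factor 2, output factor 2)

# Chain form: W[(i1,i2),(j1,j2)] = sum_r g1[i1, j1, r] * g2[r, i2, j2]
w = torch.einsum('iar,rjb->iajb', g1, g2)                  # (16, 16, 32, 32)
w_matrix = w.permute(0, 2, 1, 3).reshape(d_model, d_model)

print(w_matrix.shape)                                      # torch.Size([512, 512])
print(d_model * d_model, g1.numel() + g2.numel())          # 262144 dense vs 10240 chained
```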
further, the step 20 includes the following specific steps:
Step 21: map the input data to tensor space and operate on the result with the corresponding small weight tensor cores; the operation process is specified by formula (2).

Step 22: apply a reshape operation to the operation result, split it into h equal parts, and store the split results in a list L; the whole calculation process is as follows:

D' = Reshape(D, [-1, d_model])    (3)
T = Split(D') = (D'_1, ..., D'_h)    (4)

Step 23: take the stored data out of the list L; the projected queries, keys and values each consist of h matrices. Use formula (5) to fetch the parts with the corresponding subscript i and perform the attention calculation to obtain the corresponding attention output. Formula (5) is defined as follows:

R_i = Get(R, i)    (5)

Step 24: calculate the output result head_i of each attention head using formula (6), concatenate the results of all heads, and operate on the concatenation, via the multi-step feature calculation of formula (2), with the small weight tensor cores of W^O in their k-mode tensor-decomposition form, so as to obtain the final output result of the multi-head attention layer.
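The following sketch puts steps 21-24 together as a lightweight tensorized multi-head attention module. It assumes two cores per projection (k = 2), d_model = 16 × 32, 8 heads and a chain rank of 8; the class names, parameter shapes and initialization scale are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class ChainProjection(nn.Module):
    """A d_model x d_model projection stored as two small cores:
    W[(i1,i2),(j1,j2)] = sum_r g1[i1,j1,r] * g2[r,i2,j2]."""
    def __init__(self, d1, d2, rank):
        super().__init__()
        self.d1, self.d2 = d1, d2
        self.g1 = nn.Parameter(torch.randn(d1, d1, rank) * 0.02)
        self.g2 = nn.Parameter(torch.randn(rank, d2, d2) * 0.02)

    def forward(self, x):                                # x: (..., d1*d2)
        lead = x.shape[:-1]
        x = x.reshape(-1, self.d1, self.d2)              # map input to tensor space
        y = torch.einsum('bxy,xar,ryc->bac', x, self.g1, self.g2)
        return y.reshape(*lead, self.d1 * self.d2)

class LightweightTensorizedMHA(nn.Module):
    """Sketch of steps 21-24: project with weight tensor chains, reshape/split
    into h heads, run attention per head, concatenate, project with the W^O chain."""
    def __init__(self, d1=16, d2=32, heads=8, rank=8):
        super().__init__()
        self.h, self.d_model = heads, d1 * d2
        self.wq = ChainProjection(d1, d2, rank)
        self.wk = ChainProjection(d1, d2, rank)
        self.wv = ChainProjection(d1, d2, rank)
        self.wo = ChainProjection(d1, d2, rank)

    def forward(self, q_att, k_att, v_att):              # each (B, L, d_model)
        B, L, _ = q_att.shape
        dk = self.d_model // self.h

        def split_heads(x):                              # reshape/split, formulas (3)-(5) analogue
            return x.reshape(B, L, self.h, dk).transpose(1, 2)   # (B, h, L, dk)

        Q = split_heads(self.wq(q_att))
        K = split_heads(self.wk(k_att))
        V = split_heads(self.wv(v_att))
        att = torch.softmax(Q @ K.transpose(-2, -1) / dk ** 0.5, dim=-1)
        heads = (att @ V).transpose(1, 2).reshape(B, L, self.d_model)  # concat heads
        return self.wo(heads)                            # output projection via the W^O chain

x = torch.randn(2, 10, 512)
print(LightweightTensorizedMHA()(x, x, x).shape)         # torch.Size([2, 10, 512])
```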
further, the step 30 includes the following specific steps:
Step 31: the multi-head attention layer of the encoder layer and that of the first sublayer of the decoder layer have the same structure, and the query matrix Q_att, key matrix K_att and value matrix V_att undergo similar linear mapping operations; the weight matrices W^Q, W^K and W^V are therefore concatenated into one large weight matrix W^QKV, as shown in equation (7). The weight matrix W^QKV is then mapped to tensor space using equation (1), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 32: apply a reshape operation to the input data M, then use formula (2) to operate on M with the small weight tensor chain G^QKV, and denote the result by D^QKV. Apply a reshape operation to D^QKV and slice it to obtain the corresponding D^Q, D^K and D^V; the specific process is given by equations (8)-(13). Split D^Q, D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.

Step 33: the linear projection processes of the key matrix and the value matrix in the multi-head attention layer of the second sublayer of the decoder layer are similar, so the weight matrices W^K and W^V are concatenated into a weight matrix W^KV. Likewise, the weight matrix W^KV is mapped to tensor space by equation (14), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 34: apply a reshape operation to the input data N, then use formula (2) to operate on N with the small weight tensor chain G^KV, and denote the result by D^KV. Apply a reshape operation to D^KV and slice it to obtain the corresponding D^K and D^V; the specific process is given by equations (15)-(19). The calculation flow of D^Q here is consistent with that of formula (2) in step 21. Likewise, split D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.
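A sketch of the mechanism++ idea of steps 31-32 (and, analogously, steps 33-34 with only K and V) is given below: the projection weights are held as one concatenated chain, a single contraction produces D^QKV, and the result is reshaped and sliced back into its parts. The shapes, the rank and the layout of the second core are assumptions made for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class ChainQKVProjection(nn.Module):
    """One weight tensor chain standing in for the concatenated W^QKV of equation (7):
    a single contraction yields D^QKV, which is reshaped and sliced into D^Q, D^K, D^V
    (the role of equations (8)-(13)). With num_outputs=2 the same idea covers the
    decoder's second sublayer (W^KV, equations (14)-(19))."""
    def __init__(self, d1=16, d2=32, rank=8, num_outputs=3):
        super().__init__()
        self.d1, self.d2, self.n = d1, d2, num_outputs
        self.d_model = d1 * d2
        self.g1 = nn.Parameter(torch.randn(d1, d1, rank) * 0.02)
        self.g2 = nn.Parameter(torch.randn(rank, d2, num_outputs * d2) * 0.02)

    def forward(self, m):                                 # m: (B, L, d_model)
        B, L, _ = m.shape
        x = m.reshape(B * L, self.d1, self.d2)            # map input to tensor space
        y = torch.einsum('bxy,xar,ryc->bac', x, self.g1, self.g2)   # (B*L, d1, n*d2)
        y = y.reshape(B * L, self.d1, self.n, self.d2).permute(0, 2, 1, 3)
        y = y.reshape(B, L, self.n, self.d_model)
        return y.unbind(dim=2)                            # slice into the n projections

m = torch.randn(2, 10, 512)
d_q, d_k, d_v = ChainQKVProjection(num_outputs=3)(m)      # encoder / decoder first sublayer
d_k2, d_v2 = ChainQKVProjection(num_outputs=2)(m)         # decoder second sublayer (W^KV)
print(d_q.shape, d_k2.shape)                              # torch.Size([2, 10, 512]) each
```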
Further, the step 40 includes the following specific steps:
Step 41: convert d_model and d_ff into products of smaller positive integer factors, d_model = {O_1 × ... × O_m} × {O_(m+1) × ... × O_(2m)} × ... × {O_((n-1)m+1) × ... × O_(nm)} and d_ff = {P_1 × ... × P_m} × {P_(m+1) × ... × P_(2m)} × ... × {P_((n-1)m+1) × ... × P_(nm)}; the weight matrices W_1 and W_2 then become weight tensors, and the bias vectors b_1 and b_2 become bias tensors.

Step 42: use equation (1) to express the weight tensors corresponding to W_1 and W_2 in the m-mode tensor-decomposition form, so as to reduce the number of training parameters and the computational complexity of the network model.

Step 43: map the input data of the position-wise feed-forward network to tensor space and perform the multi-step calculation with the small weight tensor chains, as specified by formulas (20) and (21).
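The sketch below illustrates the lightweight tensorized position-wise feed-forward network of steps 41-43 with m = 2 assumed, d_model = 16 × 32 = 512 split into O-factors, d_ff = 32 × 64 = 2048 split into P-factors, and a hypothetical chain rank of 8. The ReLU between the two tensorized maps follows the standard Transformer feed-forward layer and is an assumption here, as are all concrete shapes.

```python
import torch
import torch.nn as nn

class LightweightTensorizedFFN(nn.Module):
    """W1 (d_model x d_ff) and W2 (d_ff x d_model) each held as a 2-core chain (m = 2);
    the input is mapped to tensor space and contracted core by core, corresponding to
    the multi-step calculation of formulas (20) and (21)."""
    def __init__(self, o1=16, o2=32, p1=32, p2=64, rank=8):
        super().__init__()
        self.o1, self.o2, self.p1, self.p2 = o1, o2, p1, p2
        self.d_model, self.d_ff = o1 * o2, p1 * p2
        self.a1 = nn.Parameter(torch.randn(o1, p1, rank) * 0.02)   # W1 cores
        self.a2 = nn.Parameter(torch.randn(rank, o2, p2) * 0.02)
        self.b1 = nn.Parameter(torch.randn(p1, o1, rank) * 0.02)   # W2 cores
        self.b2 = nn.Parameter(torch.randn(rank, p2, o2) * 0.02)
        self.bias1 = nn.Parameter(torch.zeros(self.d_ff))          # bias terms, kept flat in this sketch
        self.bias2 = nn.Parameter(torch.zeros(self.d_model))

    def forward(self, x):                                  # x: (B, L, d_model)
        B, L, _ = x.shape
        t = x.reshape(B * L, self.o1, self.o2)             # map input to tensor space
        h = torch.einsum('bxy,xar,ryc->bac', t, self.a1, self.a2)   # first chain (W1)
        h = torch.relu(h.reshape(B * L, self.d_ff) + self.bias1)    # assumed standard FFN ReLU
        h = h.reshape(B * L, self.p1, self.p2)
        y = torch.einsum('bxy,xar,ryc->bac', h, self.b1, self.b2)   # second chain (W2)
        return y.reshape(B, L, self.d_model) + self.bias2

x = torch.randn(2, 10, 512)
print(LightweightTensorizedFFN()(x).shape)                 # torch.Size([2, 10, 512])
```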
Further, step 30 and step 40 are not required to be performed in a fixed order.
The invention has the following beneficial effects:

(1) A plug-and-play lightweight tensorized multi-head attention mechanism and a lightweight tensorized position-wise feed-forward network are designed, which reduce the number of learning parameters and floating-point operations of the Transformer, accelerate the training of the model, and lower the energy consumed during training.

(2) According to the characteristics of the encoding and decoding processes of the Transformer model, the weight matrices of the multi-head attention layers are spliced in different ways and the spliced weight matrices are expressed in a low-rank multi-modal tensor-decomposition form, yielding the lightweight tensorized multi-head attention mechanism++; this further reduces the number of learning parameters and floating-point operations of LTensorized_Transformer, so that the network model can be deployed on resource-constrained edge devices (an illustrative parameter count is sketched after this list).

(3) The efficient lightweight tensorized Transformer extracts features from the input data step by step through several small weight tensor cores, preserving the performance of the Transformer model while respecting its design concept; the three lightweight modules can be flexibly embedded into various network models to reduce their numbers of training parameters and their computational complexity.
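To make the reduction in learning parameters concrete, the following back-of-the-envelope count reuses the assumed shapes of the sketches above (d_model = 512 = 16 × 32, d_ff = 2048 = 32 × 64, chain rank 8, two cores per weight); the actual savings achieved by the method depend entirely on the factorizations and ranks chosen, so these numbers are illustrative only.

```python
d_model, d_ff, r = 512, 2048, 8

dense_attention = 4 * d_model * d_model            # dense W^Q, W^K, W^V, W^O
chain_attention = 4 * (16 * 16 * r + r * 32 * 32)  # two small cores per projection

dense_ffn = 2 * d_model * d_ff                     # dense W_1 and W_2 (biases ignored)
chain_ffn = (16 * 32 * r + r * 32 * 64) + (32 * 16 * r + r * 64 * 32)

print(dense_attention, chain_attention)            # 1048576 vs 40960
print(dense_ffn, chain_ffn)                        # 2097152 vs 40960
```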
Drawings
Fig. 1 is a flowchart of an implementation of the tensor-based efficient Transformer construction method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
Example:

Fig. 1 shows the implementation flow of the tensor-based efficient Transformer construction method provided by the embodiment of the present invention; for convenience of description, only the parts relevant to the embodiment of the present invention are shown. The details are as follows:
Step 10: map the weight matrices of the multi-head attention layer, W^Q, W^K, W^V and W^O, to tensor space, and express each of them as a k-mode tensor-chain decomposition.

Step 20: map the input data (Q_att, K_att, V_att) to tensor space, operate on them with the corresponding weight tensor chains of W^Q, W^K and W^V, apply the Attention operation to the result, and then operate on that result with the weight tensor chain of W^O to obtain the final output result, thereby constructing a lightweight tensorized multi-head attention mechanism.

Step 30: integrate the similar linear operations in the multi-head attention layer of the encoder layer and in the multi-head attention layer of the first sublayer of the decoder layer, i.e. splice the corresponding weight tensors into a weight tensor W^QKV, and integrate the similar linear operations in the second sublayer of the decoder layer into a weight tensor W^KV, thereby constructing a lightweight tensorized multi-head attention mechanism++.

Step 40: map the weight matrices W_1 and W_2 of the position-wise feed-forward neural network to tensor space, express them as m-mode tensor decompositions, and operate on them with the input data, thereby constructing a lightweight tensorized position-wise feed-forward network.

Step 50: combine the lightweight tensorized multi-head attention mechanism and the lightweight tensorized position-wise feed-forward network into LTensorized_Transformer, and combine the lightweight tensorized position-wise feed-forward network and the lightweight tensorized multi-head attention mechanism++ into LTensorized_Transformer++, thereby constructing the lightweight Transformer framework.
Further, the step 10 includes the following specific steps:
Pack the queries, keys and values into the matrices Q_att, K_att and V_att, and apply h linear projections to Q_att, K_att and V_att; the weight matrices involved can be uniformly expressed as
W^Q, W^K and W^V, together with the output projection matrix W^O.

Express d_model as a product of several positive factors, d_model = {d_1 × ... × d_k} × {d_(k+1) × ... × d_(2k)} × ... × {d_((m-1)k+1) × ... × d_(mk)}.

Map the weight matrices W^Q, W^K, W^V and W^O to tensor space to obtain the corresponding weight tensors.

According to equation (1), express these weight tensors in the k-mode tensor-chain decomposition form, i.e. as a chain of k small weight tensor cores contracted in sequence.
further, step 20 includes the following specific steps:
Step 21: map the input data to tensor space and operate on the result with the corresponding small weight tensor cores; the operation process is specified by formula (2).

Step 22: apply a reshape operation to the operation result, split it into h equal parts, and store the split results in a list L; the whole calculation process is as follows:

D' = Reshape(D, [-1, d_model])    (3)
T = Split(D') = (D'_1, ..., D'_h)    (4)

Step 23: take the stored data out of the list L; the projected queries, keys and values each consist of h matrices. Use formula (5) to fetch the parts with the corresponding subscript i and perform the attention calculation to obtain the corresponding attention output. Formula (5) is defined as follows:

R_i = Get(R, i)    (5)

Step 24: calculate the output result head_i of each attention head using formula (6), concatenate the results of all heads, and operate on the concatenation, via the multi-step feature calculation of formula (2), with the small weight tensor cores of W^O in their k-mode tensor-decomposition form, so as to obtain the final output result of the multi-head attention layer.
further, step 30 includes the following specific steps:
Step 31: the multi-head attention layer of the encoder layer and that of the first sublayer of the decoder layer have the same structure, and the query matrix Q_att, key matrix K_att and value matrix V_att undergo similar linear mapping operations; the weight matrices W^Q, W^K and W^V are therefore concatenated into one large weight matrix W^QKV, as shown in equation (7). The weight matrix W^QKV is then mapped to tensor space using equation (1), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 32: apply a reshape operation to the input data M, then use formula (2) to operate on M with the small weight tensor chain G^QKV, and denote the result by D^QKV. Apply a reshape operation to D^QKV and slice it to obtain the corresponding D^Q, D^K and D^V; the specific process is given by equations (8)-(13). Split D^Q, D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.

Step 33: the linear projection processes of the key matrix and the value matrix in the multi-head attention layer of the second sublayer of the decoder layer are similar, so the weight matrices W^K and W^V are concatenated into a weight matrix W^KV. Likewise, the weight matrix W^KV is mapped to tensor space by equation (14), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 34: apply a reshape operation to the input data N, then use formula (2) to operate on N with the small weight tensor chain G^KV, and denote the result by D^KV. Apply a reshape operation to D^KV and slice it to obtain the corresponding D^K and D^V; the specific process is given by equations (15)-(19). The calculation flow of D^Q here is consistent with that of formula (2) in step 21. Likewise, split D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.
Further, step 40 includes the following specific steps:
Step 41: convert d_model and d_ff into products of smaller positive integer factors, d_model = {O_1 × ... × O_m} × {O_(m+1) × ... × O_(2m)} × ... × {O_((n-1)m+1) × ... × O_(nm)} and d_ff = {P_1 × ... × P_m} × {P_(m+1) × ... × P_(2m)} × ... × {P_((n-1)m+1) × ... × P_(nm)}; the weight matrices W_1 and W_2 then become weight tensors, and the bias vectors b_1 and b_2 become bias tensors.

Step 42: use equation (1) to express the weight tensors corresponding to W_1 and W_2 in the m-mode tensor-decomposition form, so as to reduce the number of training parameters and the computational complexity of the network model.

Step 43: map the input data of the position-wise feed-forward network to tensor space and perform the multi-step calculation with the small weight tensor chains, as specified by formulas (20) and (21).

Further, step 30 and step 40 are not required to be performed in a fixed order.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalents and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A tensor-based efficient Transformer construction method, comprising:

step 10, mapping the weight matrices of the multi-head attention layer, W^Q, W^K, W^V and W^O, to tensor space, and expressing each of them as a k-mode tensor-chain decomposition;

step 20, mapping the input data (Q_att, K_att, V_att) to tensor space, operating on them with the corresponding weight tensor chains of W^Q, W^K and W^V, applying the Attention operation to the result, and then operating on that result with the weight tensor chain of W^O to obtain the final output result, thereby constructing a lightweight tensorized multi-head attention mechanism;

step 30, integrating the similar linear operations in the multi-head attention layer of the encoder layer and in the multi-head attention layer of the first sublayer of the decoder layer, namely splicing the corresponding weight tensors into a weight tensor W^QKV, and integrating the similar linear operations in the second sublayer of the decoder layer into a weight tensor W^KV, thereby constructing a lightweight tensorized multi-head attention mechanism++;

step 40, mapping the weight matrices W_1 and W_2 of the position-wise feed-forward neural network to tensor space, expressing them as m-mode tensor decompositions, and operating on them with the input data, thereby constructing a lightweight tensorized position-wise feed-forward network;

and step 50, combining the lightweight tensorized multi-head attention mechanism and the lightweight tensorized position-wise feed-forward network into LTensorized_Transformer, and combining the lightweight tensorized position-wise feed-forward network and the lightweight tensorized multi-head attention mechanism++ into LTensorized_Transformer++, thereby constructing the lightweight Transformer framework.
2. The tensor-based efficient Transformer construction method as recited in claim 1, wherein the step 10 comprises the following specific steps:

packing the queries, keys and values into the matrices Q_att, K_att and V_att, and applying h linear projections to Q_att, K_att and V_att, wherein the weight matrices involved can be uniformly expressed as W^Q, W^K and W^V, together with the output projection matrix W^O;

expressing d_model as a product of several positive factors, d_model = {d_1 × ... × d_k} × {d_(k+1) × ... × d_(2k)} × ... × {d_((m-1)k+1) × ... × d_(mk)};

mapping the weight matrices W^Q, W^K, W^V and W^O to tensor space to obtain the corresponding weight tensors; and

expressing these weight tensors, according to equation (1), in the k-mode tensor-chain decomposition form, i.e. as a chain of k small weight tensor cores contracted in sequence.
3. The tensor-based efficient Transformer construction method as recited in claim 2, wherein the step 20 comprises the following specific steps:

step 21, mapping the input data to tensor space and operating on the result with the corresponding small weight tensor cores, the operation process being specified by formula (2);

step 22, applying a reshape operation to the operation result, splitting it into h equal parts, and storing the split results in a list L, the whole calculation process being:

D' = Reshape(D, [-1, d_model])    (3)
T = Split(D') = (D'_1, ..., D'_h)    (4)

step 23, taking the stored data out of the list L, wherein the projected queries, keys and values each consist of h matrices, using formula (5) to fetch the parts with the corresponding subscript i and performing the attention calculation to obtain the corresponding attention output, formula (5) being defined as follows:

R_i = Get(R, i)    (5)

step 24, calculating the output result head_i of each attention head using formula (6), concatenating the results of all heads, and operating on the concatenation, via the multi-step feature calculation of formula (2), with the small weight tensor cores of W^O in their k-mode tensor-decomposition form, so as to obtain the final output result of the multi-head attention layer.
4. The tensor-based efficient Transformer construction method as recited in claim 3, wherein the step 30 comprises the following specific steps:

step 31, wherein the multi-head attention layer of the encoder layer and that of the first sublayer of the decoder layer have the same structure and the query matrix Q_att, key matrix K_att and value matrix V_att undergo similar linear mapping operations, concatenating the weight matrices W^Q, W^K and W^V into one large weight matrix W^QKV as shown in equation (7), then mapping the weight matrix W^QKV to tensor space using equation (1), and expressing the resulting weight tensor in the k-mode tensor-decomposition form;

step 32, applying a reshape operation to the input data M, using formula (2) to operate on M with the small weight tensor chain G^QKV and denoting the result by D^QKV, applying a reshape operation to D^QKV and slicing it to obtain the corresponding D^Q, D^K and D^V according to equations (8)-(13), then splitting D^Q, D^K and D^V, storing the split results in lists, and applying the operations of step 23 and step 24 to the split results to obtain the final output;

step 33, wherein the linear projection processes of the key matrix and the value matrix in the multi-head attention layer of the second sublayer of the decoder layer are similar, concatenating the weight matrices W^K and W^V into a weight matrix W^KV, likewise mapping the weight matrix W^KV to tensor space by equation (14), and expressing the resulting weight tensor in the k-mode tensor-decomposition form;

step 34, applying a reshape operation to the input data N, using formula (2) to operate on N with the small weight tensor chain G^KV and denoting the result by D^KV, applying a reshape operation to D^KV and slicing it to obtain the corresponding D^K and D^V according to equations (15)-(19), wherein the calculation flow of D^Q is consistent with that of formula (2) in step 21, and likewise splitting D^K and D^V, storing the split results in lists, and applying the operations of step 23 and step 24 to the split results to obtain the final output.
5. The tensor-based efficient Transformer construction method as recited in claim 4, wherein the step 40 comprises the following specific steps:

step 41, converting d_model and d_ff into products of smaller positive integer factors, d_model = {O_1 × ... × O_m} × {O_(m+1) × ... × O_(2m)} × ... × {O_((n-1)m+1) × ... × O_(nm)} and d_ff = {P_1 × ... × P_m} × {P_(m+1) × ... × P_(2m)} × ... × {P_((n-1)m+1) × ... × P_(nm)}, whereby the weight matrices W_1 and W_2 become weight tensors and the bias vectors b_1 and b_2 become bias tensors;

step 42, using equation (1) to express the weight tensors corresponding to W_1 and W_2 in the m-mode tensor-decomposition form, so as to reduce the number of training parameters and the computational complexity of the network model;

step 43, mapping the input data of the position-wise feed-forward network to tensor space and performing the multi-step calculation with the small weight tensor chains, as specified by formulas (20) and (21).
6. The tensor-based efficient Transformer construction method as recited in claim 1, wherein step 30 and step 40 are not required to be performed in a fixed order.
CN202210556441.4A · Filed 2022-05-19 · Tensor-based efficient Transformer construction method · Pending · Published as CN114841342A

Priority Applications (1)

Application number: CN202210556441.4A · Priority date: 2022-05-19 · Filing date: 2022-05-19 · Title: Tensor-based efficient Transformer construction method

Publications (1)

Publication number: CN114841342A · Publication date: 2022-08-02

Family ID: 82571464

Family Applications (1)

Application number: CN202210556441.4A · Title: Tensor-based efficient Transformer construction method · Status: Pending (CN114841342A)

Country Status (1)

CN: CN114841342A

Cited By (1)

* Cited by examiner, † Cited by third party

Publication number: CN115861762A * · Priority date: 2023-02-27 · Publication date: 2023-03-28 · Assignee: Ocean University of China (中国海洋大学) · Title: Plug-and-play infinite deformation fusion feature extraction method and application thereof

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination