CN114841342A - Tensor-based efficient Transformer construction method - Google Patents

Tensor-based efficient Transformer construction method

Info

Publication number
CN114841342A
Authority
CN
China
Prior art keywords
tensor
weight
matrix
att
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210556441.4A
Other languages
Chinese (zh)
Inventor
朱晨露
刘德彬
阮一恒
张立杰
邓贤君
杨天若
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Chutian High Speed Digital Technology Co ltd
Original Assignee
Hubei Chutian High Speed Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Chutian High Speed Digital Technology Co ltd
Priority to CN202210556441.4A
Publication of CN114841342A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention is applicable to the field of artificial intelligence and provides a tensor-based efficient Transformer construction method.

Description

Tensor-based efficient Transformer construction method
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a tensor-based efficient Transformer construction method.
Background
With the rapid development of artificial intelligence, communication technology and electronic chips, the industrial Internet of Things tightly connects billions of terminal devices, such as smart mobile devices, wearables and sensors, in green infrastructure, smart cities, smart healthcare, smart grids and intelligent transportation systems, providing more intelligent services and making production and daily life more convenient. However, the industrial Internet of Things generates a huge amount of multi-source heterogeneous data. Owing to their excellent performance, deep neural network models are widely used for feature extraction and intelligent decision making, but because of the complexity of the tasks and the sheer volume of data, large deep neural network models are usually required to achieve good task performance.
The Transformer is currently a popular deep neural network model and has been widely applied in the industrial Internet of Things. The Transformer completely abandons the design of convolution and recurrence modules; the network consists only of attention layers and fully connected layers. The Transformer and models derived from it, such as BERT, ViT, GPT and the Universal Transformer, achieve excellent task performance in natural language processing, computer vision, recommendation systems, intelligent transportation and other areas. However, these network models are very large, typically containing hundreds of millions of trainable parameters, so high-performance chips running for weeks or even months are needed to train them, and the training process consumes a large amount of computing and energy resources. Furthermore, with the advent of the federated learning paradigm and the increasing demand for real-time applications, smart mobile devices and other embedded devices need to participate in the training of and decision making for intelligent tasks. However, such devices usually have limited chip performance, while high-performance chips are bulky and cannot be mounted on edge devices, so existing large-scale deep neural network models cannot be trained or deployed on smart mobile devices and other embedded devices. In addition, under the federated learning paradigm each client needs to upload network model data to the cloud, where model fusion is performed. Because the network model is huge and its data must be protected during uploading, the model data has to be encrypted; encryption effectively prevents privacy leakage but further increases the amount of communication data. To reduce the communication volume in federated learning, lower bandwidth occupation and improve communication efficiency, it therefore becomes important to reduce the number of training parameters in the network model. Moreover, in real-time applications such as traffic recognition and fault detection, data is vulnerable to network attacks while being uploaded to the cloud, which can cause privacy leakage and safety accidents; uploading the data, processing it in the cloud and returning the result to the terminal also takes time, which increases latency and fails to meet real-time requirements. Offloading the computing task to the cloud is therefore not the best option. Finally, training large-scale network models consumes a large amount of energy, which increases carbon emissions and aggravates the deterioration of the global environment. Therefore, how to reduce the number of training parameters and floating-point operations of a deep neural network model on a large scale, while keeping the model performance unchanged or the accuracy loss within a tolerable range, so as to accelerate training, reduce training energy consumption and facilitate deployment on resource-constrained edge devices, is a problem that urgently needs to be solved.
Fortunately, large-scale network models usually contain a large number of redundant learning parameters and floating-point operations. Eliminating these redundant parts reduces the number of learning parameters and the computational complexity of the model, accelerates training, lowers power consumption and promotes the development of a green economy.
Although the Transformer shows excellent performance in many fields, the complexity of the network model limits its training and deployment on smart mobile devices and other embedded devices, so more and more researchers are studying how to reduce model complexity. Directly optimizing the structure of the Transformer model by reducing the number of blocks shrinks the model and its complexity, but greatly degrades its performance. How to reduce the number of parameters and the computational complexity of the network model while keeping its performance unchanged is therefore a research hotspot. Current model-compression methods mainly include model pruning, low-rank approximation, model quantization and knowledge distillation. Liu et al. quantize a trained Vision Transformer model, reducing its memory footprint and computation cost. Chung et al. propose a mixed-precision quantization strategy that represents Transformer weights with a small number of bits, reducing memory occupation and improving inference speed. Zhu et al. reduce the number of model parameters by pruning the Vision Transformer. Mao et al. prune Transformer components after analyzing their attributes, reducing both the number of parameters and the inference time of the model. Jiao et al. propose a new Transformer distillation method that transfers the knowledge of a complex teacher model to a small student model. As an emerging and efficient compression method, tensor decomposition has achieved excellent results in compressing other network models. Hrinchunk et al. compress the parameters of the embedding layer with a tensor-train decomposition, reducing model complexity. Ma et al. propose a self-attention model based on block-term (BT) decomposition, combining the ideas of tensor decomposition and parameter sharing to reduce the number of Transformer parameters. Although these methods can efficiently reduce the number of learning parameters and the computational complexity of a model, they still have shortcomings. Model quantization greatly reduces memory occupation, but usually at a considerable loss of performance. Model pruning and knowledge distillation reduce model complexity, but are often cumbersome and require the compression scheme to be redesigned for each new model, so their reusability is low. Hrinchunk et al. compress only the embedding layer and offer no compression for models that contain no embedding layer. Ma et al. reduce the number of Transformer learning parameters but break the characteristics of the attention mechanism.
Disclosure of Invention
The invention aims to provide a tensor-based efficient Transformer construction method, in order to solve the problem, identified in the background, that current Transformer models cannot be deployed on smart mobile devices and other embedded devices for training and inference.
The invention provides a tensor-based efficient Transformer construction method, which comprises the following steps:

Step 10: map the weight matrices of the multi-head attention layer, W^Q, W^K, W^V and W^O, to tensor space, and express each of them as a k-mode tensor-chain decomposition.

Step 20: map the input data (Q_att, K_att, V_att) to tensor space, operate on them with the corresponding weight tensor chains of W^Q, W^K and W^V, apply the Attention operation to the result, and then operate on that result with the weight tensor chain of W^O to obtain the final output result, thereby constructing a lightweight tensorized multi-head attention mechanism.

Step 30: integrate the similar linear operations in the multi-head attention layer of the encoder layer and in the multi-head attention layer of the first sublayer of the decoder layer, i.e. splice the corresponding weight tensors into a weight tensor W^QKV, and integrate the similar linear operations in the second sublayer of the decoder layer into a weight tensor W^KV, thereby constructing a lightweight tensorized multi-head attention mechanism++.

Step 40: map the weight matrices W_1 and W_2 of the position-wise feed-forward neural network to tensor space, express them as m-mode tensor decompositions, and operate on them with the input data, thereby constructing a lightweight tensorized position-wise feed-forward network.

Step 50: combine the lightweight tensorized multi-head attention mechanism and the lightweight tensorized position-wise feed-forward network into LTensorized_Transformer, and combine the lightweight tensorized position-wise feed-forward network and the lightweight tensorized multi-head attention mechanism++ into LTensorized_Transformer++, thereby constructing the lightweight Transformer framework.
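Steps 10 and 40 both rely on re-expressing a dimension such as d_model or d_ff as a product of small positive factors before the corresponding weight is mapped to tensor space. As a purely illustrative preliminary (the greedy strategy, the function name and the chosen factor counts below are assumptions, not taken from the patent), such a factorization could be sketched as:

```python
def factorize(n: int, num_factors: int) -> list[int]:
    """Greedily split n into num_factors positive integer factors whose product is n."""
    factors = [1] * num_factors
    d, i = 2, 0
    while n > 1:
        while n % d == 0:
            factors[i % num_factors] *= d  # distribute prime factors round-robin
            n //= d
            i += 1
        d += 1
    return sorted(factors)

print(factorize(512, 4))   # d_model = 512  -> [4, 4, 4, 8]
print(factorize(2048, 4))  # d_ff   = 2048 -> [4, 8, 8, 8]
```

The later sketches in this description fix two factors per dimension simply to keep the contractions short to read.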
Further, the step 10 includes the following specific steps:
Pack the queries, keys and values into the matrices Q_att, K_att and V_att, and apply h linear projections to Q_att, K_att and V_att; the weight matrices involved can be uniformly expressed as
W^Q, W^K and W^V, together with the output projection matrix W^O.

Express d_model as a product of several positive factors, d_model = {d_1 × ... × d_k} × {d_(k+1) × ... × d_(2k)} × ... × {d_((m-1)k+1) × ... × d_(mk)}.

Map the weight matrices W^Q, W^K, W^V and W^O to tensor space to obtain the corresponding weight tensors.

According to equation (1), express these weight tensors in the k-mode tensor-chain decomposition form, i.e. as a chain of k small weight tensor cores contracted in sequence.
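As a concrete illustration of step 10, the sketch below assumes k = 2, d_model = 16 × 32 = 512 and a hypothetical chain rank of 8; it stores a d_model × d_model attention weight as two small cores and rebuilds the full matrix only to make the chain form visible (in the actual method the full matrix would never be materialized). None of the shapes or the rank come from the patent.

```python
import torch

d1, d2, rank = 16, 32, 8           # assumed factorization d_model = d1 * d2
d_model = d1 * d2                   # 512

# Two small weight tensor cores standing in for one d_model x d_model weight matrix.
g1 = torch.randn(d1, d1, rank)      # (input factor 1, output factor 1, rank)
g2 = torch.randn(rank, d2, d2)      # (rank, input factor 2, output factor 2)

# Chain form: W[(i1,i2),(j1,j2)] = sum_r g1[i1, j1, r] * g2[r, i2, j2]
w = torch.einsum('iar,rjb->iajb', g1, g2)                  # (16, 16, 32, 32)
w_matrix = w.permute(0, 2, 1, 3).reshape(d_model, d_model)

print(w_matrix.shape)                                      # torch.Size([512, 512])
print(d_model * d_model, g1.numel() + g2.numel())          # 262144 dense vs 10240 chained
```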
further, the step 20 includes the following specific steps:
Step 21: map the input data to tensor space and operate on the result with the corresponding small weight tensor cores; the operation process is specified by formula (2).

Step 22: apply a reshape operation to the operation result, split it into h equal parts, and store the split results in a list L; the whole calculation process is as follows:

D' = Reshape(D, [-1, d_model])    (3)
T = Split(D') = (D'_1, ..., D'_h)    (4)

Step 23: take the stored data out of the list L; the projected queries, keys and values each consist of h matrices. Use formula (5) to fetch the parts with the corresponding subscript i and perform the attention calculation to obtain the corresponding attention output. Formula (5) is defined as follows:

R_i = Get(R, i)    (5)

Step 24: calculate the output result head_i of each attention head using formula (6), concatenate the results of all heads, and operate on the concatenation, via the multi-step feature calculation of formula (2), with the small weight tensor cores of W^O in their k-mode tensor-decomposition form, so as to obtain the final output result of the multi-head attention layer.
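The following sketch puts steps 21-24 together as a lightweight tensorized multi-head attention module. It assumes two cores per projection (k = 2), d_model = 16 × 32, 8 heads and a chain rank of 8; the class names, parameter shapes and initialization scale are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class ChainProjection(nn.Module):
    """A d_model x d_model projection stored as two small cores:
    W[(i1,i2),(j1,j2)] = sum_r g1[i1,j1,r] * g2[r,i2,j2]."""
    def __init__(self, d1, d2, rank):
        super().__init__()
        self.d1, self.d2 = d1, d2
        self.g1 = nn.Parameter(torch.randn(d1, d1, rank) * 0.02)
        self.g2 = nn.Parameter(torch.randn(rank, d2, d2) * 0.02)

    def forward(self, x):                                # x: (..., d1*d2)
        lead = x.shape[:-1]
        x = x.reshape(-1, self.d1, self.d2)              # map input to tensor space
        y = torch.einsum('bxy,xar,ryc->bac', x, self.g1, self.g2)
        return y.reshape(*lead, self.d1 * self.d2)

class LightweightTensorizedMHA(nn.Module):
    """Sketch of steps 21-24: project with weight tensor chains, reshape/split
    into h heads, run attention per head, concatenate, project with the W^O chain."""
    def __init__(self, d1=16, d2=32, heads=8, rank=8):
        super().__init__()
        self.h, self.d_model = heads, d1 * d2
        self.wq = ChainProjection(d1, d2, rank)
        self.wk = ChainProjection(d1, d2, rank)
        self.wv = ChainProjection(d1, d2, rank)
        self.wo = ChainProjection(d1, d2, rank)

    def forward(self, q_att, k_att, v_att):              # each (B, L, d_model)
        B, L, _ = q_att.shape
        dk = self.d_model // self.h

        def split_heads(x):                              # reshape/split, formulas (3)-(5) analogue
            return x.reshape(B, L, self.h, dk).transpose(1, 2)   # (B, h, L, dk)

        Q = split_heads(self.wq(q_att))
        K = split_heads(self.wk(k_att))
        V = split_heads(self.wv(v_att))
        att = torch.softmax(Q @ K.transpose(-2, -1) / dk ** 0.5, dim=-1)
        heads = (att @ V).transpose(1, 2).reshape(B, L, self.d_model)  # concat heads
        return self.wo(heads)                            # output projection via the W^O chain

x = torch.randn(2, 10, 512)
print(LightweightTensorizedMHA()(x, x, x).shape)         # torch.Size([2, 10, 512])
```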
further, the step 30 includes the following specific steps:
Step 31: the multi-head attention layer of the encoder layer and that of the first sublayer of the decoder layer have the same structure, and the query matrix Q_att, key matrix K_att and value matrix V_att undergo similar linear mapping operations; the weight matrices W^Q, W^K and W^V are therefore concatenated into one large weight matrix W^QKV, as shown in equation (7). The weight matrix W^QKV is then mapped to tensor space using equation (1), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 32: apply a reshape operation to the input data M, then use formula (2) to operate on M with the small weight tensor chain G^QKV, and denote the result by D^QKV. Apply a reshape operation to D^QKV and slice it to obtain the corresponding D^Q, D^K and D^V; the specific process is given by equations (8)-(13). Split D^Q, D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.

Step 33: the linear projection processes of the key matrix and the value matrix in the multi-head attention layer of the second sublayer of the decoder layer are similar, so the weight matrices W^K and W^V are concatenated into a weight matrix W^KV. Likewise, the weight matrix W^KV is mapped to tensor space by equation (14), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 34: apply a reshape operation to the input data N, then use formula (2) to operate on N with the small weight tensor chain G^KV, and denote the result by D^KV. Apply a reshape operation to D^KV and slice it to obtain the corresponding D^K and D^V; the specific process is given by equations (15)-(19). The calculation flow of D^Q here is consistent with that of formula (2) in step 21. Likewise, split D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.
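A sketch of the mechanism++ idea of steps 31-32 (and, analogously, steps 33-34 with only K and V) is given below: the projection weights are held as one concatenated chain, a single contraction produces D^QKV, and the result is reshaped and sliced back into its parts. The shapes, the rank and the layout of the second core are assumptions made for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class ChainQKVProjection(nn.Module):
    """One weight tensor chain standing in for the concatenated W^QKV of equation (7):
    a single contraction yields D^QKV, which is reshaped and sliced into D^Q, D^K, D^V
    (the role of equations (8)-(13)). With num_outputs=2 the same idea covers the
    decoder's second sublayer (W^KV, equations (14)-(19))."""
    def __init__(self, d1=16, d2=32, rank=8, num_outputs=3):
        super().__init__()
        self.d1, self.d2, self.n = d1, d2, num_outputs
        self.d_model = d1 * d2
        self.g1 = nn.Parameter(torch.randn(d1, d1, rank) * 0.02)
        self.g2 = nn.Parameter(torch.randn(rank, d2, num_outputs * d2) * 0.02)

    def forward(self, m):                                 # m: (B, L, d_model)
        B, L, _ = m.shape
        x = m.reshape(B * L, self.d1, self.d2)            # map input to tensor space
        y = torch.einsum('bxy,xar,ryc->bac', x, self.g1, self.g2)   # (B*L, d1, n*d2)
        y = y.reshape(B * L, self.d1, self.n, self.d2).permute(0, 2, 1, 3)
        y = y.reshape(B, L, self.n, self.d_model)
        return y.unbind(dim=2)                            # slice into the n projections

m = torch.randn(2, 10, 512)
d_q, d_k, d_v = ChainQKVProjection(num_outputs=3)(m)      # encoder / decoder first sublayer
d_k2, d_v2 = ChainQKVProjection(num_outputs=2)(m)         # decoder second sublayer (W^KV)
print(d_q.shape, d_k2.shape)                              # torch.Size([2, 10, 512]) each
```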
Further, the step 40 includes the following specific steps:
Step 41: convert d_model and d_ff into products of smaller positive integer factors, d_model = {O_1 × ... × O_m} × {O_(m+1) × ... × O_(2m)} × ... × {O_((n-1)m+1) × ... × O_(nm)} and d_ff = {P_1 × ... × P_m} × {P_(m+1) × ... × P_(2m)} × ... × {P_((n-1)m+1) × ... × P_(nm)}; the weight matrices W_1 and W_2 then become weight tensors, and the bias vectors b_1 and b_2 become bias tensors.

Step 42: use equation (1) to express the weight tensors corresponding to W_1 and W_2 in the m-mode tensor-decomposition form, so as to reduce the number of training parameters and the computational complexity of the network model.

Step 43: map the input data of the position-wise feed-forward network to tensor space and perform the multi-step calculation with the small weight tensor chains, as specified by formulas (20) and (21).
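The sketch below illustrates the lightweight tensorized position-wise feed-forward network of steps 41-43 with m = 2 assumed, d_model = 16 × 32 = 512 split into O-factors, d_ff = 32 × 64 = 2048 split into P-factors, and a hypothetical chain rank of 8. The ReLU between the two tensorized maps follows the standard Transformer feed-forward layer and is an assumption here, as are all concrete shapes.

```python
import torch
import torch.nn as nn

class LightweightTensorizedFFN(nn.Module):
    """W1 (d_model x d_ff) and W2 (d_ff x d_model) each held as a 2-core chain (m = 2);
    the input is mapped to tensor space and contracted core by core, corresponding to
    the multi-step calculation of formulas (20) and (21)."""
    def __init__(self, o1=16, o2=32, p1=32, p2=64, rank=8):
        super().__init__()
        self.o1, self.o2, self.p1, self.p2 = o1, o2, p1, p2
        self.d_model, self.d_ff = o1 * o2, p1 * p2
        self.a1 = nn.Parameter(torch.randn(o1, p1, rank) * 0.02)   # W1 cores
        self.a2 = nn.Parameter(torch.randn(rank, o2, p2) * 0.02)
        self.b1 = nn.Parameter(torch.randn(p1, o1, rank) * 0.02)   # W2 cores
        self.b2 = nn.Parameter(torch.randn(rank, p2, o2) * 0.02)
        self.bias1 = nn.Parameter(torch.zeros(self.d_ff))          # bias terms, kept flat in this sketch
        self.bias2 = nn.Parameter(torch.zeros(self.d_model))

    def forward(self, x):                                  # x: (B, L, d_model)
        B, L, _ = x.shape
        t = x.reshape(B * L, self.o1, self.o2)             # map input to tensor space
        h = torch.einsum('bxy,xar,ryc->bac', t, self.a1, self.a2)   # first chain (W1)
        h = torch.relu(h.reshape(B * L, self.d_ff) + self.bias1)    # assumed standard FFN ReLU
        h = h.reshape(B * L, self.p1, self.p2)
        y = torch.einsum('bxy,xar,ryc->bac', h, self.b1, self.b2)   # second chain (W2)
        return y.reshape(B, L, self.d_model) + self.bias2

x = torch.randn(2, 10, 512)
print(LightweightTensorizedFFN()(x).shape)                 # torch.Size([2, 10, 512])
```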
Further, step 30 and step 40 are not required to be performed in a fixed order.
The invention has the following beneficial effects:

(1) A plug-and-play lightweight tensorized multi-head attention mechanism and a lightweight tensorized position-wise feed-forward network are designed, which reduce the number of learning parameters and floating-point operations of the Transformer, accelerate the training of the model, and lower the energy consumed during training.

(2) According to the characteristics of the encoding and decoding processes of the Transformer model, the weight matrices of the multi-head attention layers are spliced in different ways and the spliced weight matrices are expressed in a low-rank multi-modal tensor-decomposition form, yielding the lightweight tensorized multi-head attention mechanism++; this further reduces the number of learning parameters and floating-point operations of LTensorized_Transformer, so that the network model can be deployed on resource-constrained edge devices (an illustrative parameter count is sketched after this list).

(3) The efficient lightweight tensorized Transformer extracts features from the input data step by step through several small weight tensor cores, preserving the performance of the Transformer model while respecting its design concept; the three lightweight modules can be flexibly embedded into various network models to reduce their numbers of training parameters and their computational complexity.
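To make the reduction in learning parameters concrete, the following back-of-the-envelope count reuses the assumed shapes of the sketches above (d_model = 512 = 16 × 32, d_ff = 2048 = 32 × 64, chain rank 8, two cores per weight); the actual savings achieved by the method depend entirely on the factorizations and ranks chosen, so these numbers are illustrative only.

```python
d_model, d_ff, r = 512, 2048, 8

dense_attention = 4 * d_model * d_model            # dense W^Q, W^K, W^V, W^O
chain_attention = 4 * (16 * 16 * r + r * 32 * 32)  # two small cores per projection

dense_ffn = 2 * d_model * d_ff                     # dense W_1 and W_2 (biases ignored)
chain_ffn = (16 * 32 * r + r * 32 * 64) + (32 * 16 * r + r * 64 * 32)

print(dense_attention, chain_attention)            # 1048576 vs 40960
print(dense_ffn, chain_ffn)                        # 2097152 vs 40960
```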
Drawings
Fig. 1 is a flowchart of an implementation of the tensor-based efficient Transformer construction method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
Example:

Fig. 1 shows the implementation flow of the tensor-based efficient Transformer construction method provided by the embodiment of the present invention; for convenience of description, only the parts relevant to the embodiment of the present invention are shown. The details are as follows:
Step 10: map the weight matrices of the multi-head attention layer, W^Q, W^K, W^V and W^O, to tensor space, and express each of them as a k-mode tensor-chain decomposition.

Step 20: map the input data (Q_att, K_att, V_att) to tensor space, operate on them with the corresponding weight tensor chains of W^Q, W^K and W^V, apply the Attention operation to the result, and then operate on that result with the weight tensor chain of W^O to obtain the final output result, thereby constructing a lightweight tensorized multi-head attention mechanism.

Step 30: integrate the similar linear operations in the multi-head attention layer of the encoder layer and in the multi-head attention layer of the first sublayer of the decoder layer, i.e. splice the corresponding weight tensors into a weight tensor W^QKV, and integrate the similar linear operations in the second sublayer of the decoder layer into a weight tensor W^KV, thereby constructing a lightweight tensorized multi-head attention mechanism++.

Step 40: map the weight matrices W_1 and W_2 of the position-wise feed-forward neural network to tensor space, express them as m-mode tensor decompositions, and operate on them with the input data, thereby constructing a lightweight tensorized position-wise feed-forward network.

Step 50: combine the lightweight tensorized multi-head attention mechanism and the lightweight tensorized position-wise feed-forward network into LTensorized_Transformer, and combine the lightweight tensorized position-wise feed-forward network and the lightweight tensorized multi-head attention mechanism++ into LTensorized_Transformer++, thereby constructing the lightweight Transformer framework.
Further, the step 10 includes the following specific steps:
Pack the queries, keys and values into the matrices Q_att, K_att and V_att, and apply h linear projections to Q_att, K_att and V_att; the weight matrices involved can be uniformly expressed as
W^Q, W^K and W^V, together with the output projection matrix W^O.

Express d_model as a product of several positive factors, d_model = {d_1 × ... × d_k} × {d_(k+1) × ... × d_(2k)} × ... × {d_((m-1)k+1) × ... × d_(mk)}.

Map the weight matrices W^Q, W^K, W^V and W^O to tensor space to obtain the corresponding weight tensors.

According to equation (1), express these weight tensors in the k-mode tensor-chain decomposition form, i.e. as a chain of k small weight tensor cores contracted in sequence.
further, step 20 includes the following specific steps:
Step 21: map the input data to tensor space and operate on the result with the corresponding small weight tensor cores; the operation process is specified by formula (2).

Step 22: apply a reshape operation to the operation result, split it into h equal parts, and store the split results in a list L; the whole calculation process is as follows:

D' = Reshape(D, [-1, d_model])    (3)
T = Split(D') = (D'_1, ..., D'_h)    (4)

Step 23: take the stored data out of the list L; the projected queries, keys and values each consist of h matrices. Use formula (5) to fetch the parts with the corresponding subscript i and perform the attention calculation to obtain the corresponding attention output. Formula (5) is defined as follows:

R_i = Get(R, i)    (5)

Step 24: calculate the output result head_i of each attention head using formula (6), concatenate the results of all heads, and operate on the concatenation, via the multi-step feature calculation of formula (2), with the small weight tensor cores of W^O in their k-mode tensor-decomposition form, so as to obtain the final output result of the multi-head attention layer.
further, step 30 includes the following specific steps:
Step 31: the multi-head attention layer of the encoder layer and that of the first sublayer of the decoder layer have the same structure, and the query matrix Q_att, key matrix K_att and value matrix V_att undergo similar linear mapping operations; the weight matrices W^Q, W^K and W^V are therefore concatenated into one large weight matrix W^QKV, as shown in equation (7). The weight matrix W^QKV is then mapped to tensor space using equation (1), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 32: apply a reshape operation to the input data M, then use formula (2) to operate on M with the small weight tensor chain G^QKV, and denote the result by D^QKV. Apply a reshape operation to D^QKV and slice it to obtain the corresponding D^Q, D^K and D^V; the specific process is given by equations (8)-(13). Split D^Q, D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.

Step 33: the linear projection processes of the key matrix and the value matrix in the multi-head attention layer of the second sublayer of the decoder layer are similar, so the weight matrices W^K and W^V are concatenated into a weight matrix W^KV. Likewise, the weight matrix W^KV is mapped to tensor space by equation (14), and the resulting weight tensor is expressed in the k-mode tensor-decomposition form.

Step 34: apply a reshape operation to the input data N, then use formula (2) to operate on N with the small weight tensor chain G^KV, and denote the result by D^KV. Apply a reshape operation to D^KV and slice it to obtain the corresponding D^K and D^V; the specific process is given by equations (15)-(19). The calculation flow of D^Q here is consistent with that of formula (2) in step 21. Likewise, split D^K and D^V, store the split results in lists, and apply the operations of step 23 and step 24 to the split results to obtain the final output.
Further, step 40 includes the following specific steps:
Step 41: convert d_model and d_ff into products of smaller positive integer factors, d_model = {O_1 × ... × O_m} × {O_(m+1) × ... × O_(2m)} × ... × {O_((n-1)m+1) × ... × O_(nm)} and d_ff = {P_1 × ... × P_m} × {P_(m+1) × ... × P_(2m)} × ... × {P_((n-1)m+1) × ... × P_(nm)}; the weight matrices W_1 and W_2 then become weight tensors, and the bias vectors b_1 and b_2 become bias tensors.

Step 42: use equation (1) to express the weight tensors corresponding to W_1 and W_2 in the m-mode tensor-decomposition form, so as to reduce the number of training parameters and the computational complexity of the network model.

Step 43: map the input data of the position-wise feed-forward network to tensor space and perform the multi-step calculation with the small weight tensor chains, as specified by formulas (20) and (21).

Further, step 30 and step 40 are not required to be performed in a fixed order.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalents and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A tensor-based efficient Transformer construction method, comprising:

step 10, mapping the weight matrices of the multi-head attention layer, W^Q, W^K, W^V and W^O, to tensor space, and expressing each of them as a k-mode tensor-chain decomposition;

step 20, mapping the input data (Q_att, K_att, V_att) to tensor space, operating on them with the corresponding weight tensor chains of W^Q, W^K and W^V, applying the Attention operation to the result, and then operating on that result with the weight tensor chain of W^O to obtain the final output result, thereby constructing a lightweight tensorized multi-head attention mechanism;

step 30, integrating the similar linear operations in the multi-head attention layer of the encoder layer and in the multi-head attention layer of the first sublayer of the decoder layer, namely splicing the corresponding weight tensors into a weight tensor W^QKV, and integrating the similar linear operations in the second sublayer of the decoder layer into a weight tensor W^KV, thereby constructing a lightweight tensorized multi-head attention mechanism++;

step 40, mapping the weight matrices W_1 and W_2 of the position-wise feed-forward neural network to tensor space, expressing them as m-mode tensor decompositions, and operating on them with the input data, thereby constructing a lightweight tensorized position-wise feed-forward network;

and step 50, combining the lightweight tensorized multi-head attention mechanism and the lightweight tensorized position-wise feed-forward network into LTensorized_Transformer, and combining the lightweight tensorized position-wise feed-forward network and the lightweight tensorized multi-head attention mechanism++ into LTensorized_Transformer++, thereby constructing the lightweight Transformer framework.
2. The tensor-based efficient Transformer construction method as recited in claim 1, wherein the step 10 comprises the following specific steps:

packing the queries, keys and values into the matrices Q_att, K_att and V_att, and applying h linear projections to Q_att, K_att and V_att, wherein the weight matrices involved can be uniformly expressed as W^Q, W^K and W^V, together with the output projection matrix W^O;

expressing d_model as a product of several positive factors, d_model = {d_1 × ... × d_k} × {d_(k+1) × ... × d_(2k)} × ... × {d_((m-1)k+1) × ... × d_(mk)};

mapping the weight matrices W^Q, W^K, W^V and W^O to tensor space to obtain the corresponding weight tensors; and

expressing these weight tensors, according to equation (1), in the k-mode tensor-chain decomposition form, i.e. as a chain of k small weight tensor cores contracted in sequence.
3. The tensor-based efficient Transformer construction method as recited in claim 2, wherein the step 20 comprises the following specific steps:

step 21, mapping the input data to tensor space and operating on the result with the corresponding small weight tensor cores, the operation process being specified by formula (2);

step 22, applying a reshape operation to the operation result, splitting it into h equal parts, and storing the split results in a list L, the whole calculation process being:

D' = Reshape(D, [-1, d_model])    (3)
T = Split(D') = (D'_1, ..., D'_h)    (4)

step 23, taking the stored data out of the list L, wherein the projected queries, keys and values each consist of h matrices, using formula (5) to fetch the parts with the corresponding subscript i and performing the attention calculation to obtain the corresponding attention output, formula (5) being defined as follows:

R_i = Get(R, i)    (5)

step 24, calculating the output result head_i of each attention head using formula (6), concatenating the results of all heads, and operating on the concatenation, via the multi-step feature calculation of formula (2), with the small weight tensor cores of W^O in their k-mode tensor-decomposition form, so as to obtain the final output result of the multi-head attention layer.
4. The tensor-based efficient Transformer construction method as recited in claim 3, wherein the step 30 comprises the following specific steps:

step 31, wherein the multi-head attention layer of the encoder layer and that of the first sublayer of the decoder layer have the same structure and the query matrix Q_att, key matrix K_att and value matrix V_att undergo similar linear mapping operations, concatenating the weight matrices W^Q, W^K and W^V into one large weight matrix W^QKV as shown in equation (7), then mapping the weight matrix W^QKV to tensor space using equation (1), and expressing the resulting weight tensor in the k-mode tensor-decomposition form;

step 32, applying a reshape operation to the input data M, using formula (2) to operate on M with the small weight tensor chain G^QKV and denoting the result by D^QKV, applying a reshape operation to D^QKV and slicing it to obtain the corresponding D^Q, D^K and D^V according to equations (8)-(13), then splitting D^Q, D^K and D^V, storing the split results in lists, and applying the operations of step 23 and step 24 to the split results to obtain the final output;

step 33, wherein the linear projection processes of the key matrix and the value matrix in the multi-head attention layer of the second sublayer of the decoder layer are similar, concatenating the weight matrices W^K and W^V into a weight matrix W^KV, likewise mapping the weight matrix W^KV to tensor space by equation (14), and expressing the resulting weight tensor in the k-mode tensor-decomposition form;

step 34, applying a reshape operation to the input data N, using formula (2) to operate on N with the small weight tensor chain G^KV and denoting the result by D^KV, applying a reshape operation to D^KV and slicing it to obtain the corresponding D^K and D^V according to equations (15)-(19), wherein the calculation flow of D^Q is consistent with that of formula (2) in step 21, and likewise splitting D^K and D^V, storing the split results in lists, and applying the operations of step 23 and step 24 to the split results to obtain the final output.
5. The tensor-based efficient Transformer construction method as recited in claim 4, wherein the step 40 comprises the following specific steps:

step 41, converting d_model and d_ff into products of smaller positive integer factors, d_model = {O_1 × ... × O_m} × {O_(m+1) × ... × O_(2m)} × ... × {O_((n-1)m+1) × ... × O_(nm)} and d_ff = {P_1 × ... × P_m} × {P_(m+1) × ... × P_(2m)} × ... × {P_((n-1)m+1) × ... × P_(nm)}, whereby the weight matrices W_1 and W_2 become weight tensors and the bias vectors b_1 and b_2 become bias tensors;

step 42, using equation (1) to express the weight tensors corresponding to W_1 and W_2 in the m-mode tensor-decomposition form, so as to reduce the number of training parameters and the computational complexity of the network model;

step 43, mapping the input data of the position-wise feed-forward network to tensor space and performing the multi-step calculation with the small weight tensor chains, as specified by formulas (20) and (21).
6. The tensor-based efficient Transformer construction method as recited in claim 1, wherein step 30 and step 40 are not required to be performed in a fixed order.
CN202210556441.4A · Filed 2022-05-19 · Tensor-based efficient Transformer construction method · Pending · Published as CN114841342A

Priority Applications (1)

Application number: CN202210556441.4A · Priority date: 2022-05-19 · Filing date: 2022-05-19 · Title: Tensor-based efficient Transformer construction method

Publications (1)

Publication number: CN114841342A · Publication date: 2022-08-02

Family ID: 82571464

Family Applications (1)

Application number: CN202210556441.4A · Title: Tensor-based efficient Transformer construction method · Status: Pending (CN114841342A)

Country Status (1)

CN: CN114841342A

Cited By (1)

* Cited by examiner, † Cited by third party

Publication number: CN115861762A * · Priority date: 2023-02-27 · Publication date: 2023-03-28 · Assignee: Ocean University of China (中国海洋大学) · Title: Plug-and-play infinite deformation fusion feature extraction method and application thereof

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination