Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a system on a chip for a neural network, which improves the computational energy efficiency ratio by improving the processor architecture and the resource scheduling of the system on a chip.
According to a first aspect of the present invention, a system on a chip for a neural network is provided. The system-on-chip comprises a plurality of computing clusters, a forward data forwarding path, a backward data sharing path and a task allocation unit, wherein:
the multiple computing clusters are used for realizing multiplication operation of an input neuron matrix and a weight matrix in the neural network, wherein each computing cluster comprises a local on-chip memory and a corresponding off-chip memory;
the forward data forwarding path is used for forwarding input neuron data among the plurality of computing clusters;
the backward data sharing path is used for transmitting weight data or calculation results among the plurality of computing clusters;
the task allocation unit is used for determining a task allocation strategy of each computing cluster according to the size of the input neuron matrix to be computed, so that the input neuron data for performing the matrix multiplication operation is allocated to each computing cluster.
In one embodiment, the computing cluster includes a data flow control module, a data buffer module, a multiply-accumulate module, a data transfer module, and an on-chip memory, wherein:
the data buffer module is used for storing neuron data, weight data or calculation result data;
the multiply-accumulate module is used for realizing the multiplication operation of the input neuron matrix and the corresponding weight matrix;
the data flow control module is used for controlling the loading of data to the data buffer module, the multiply-accumulate module, the data transfer module and the on-chip memory;
the data transfer module is used for forwarding the neuron data to other computing clusters.
In one embodiment, the backward data sharing path is formed by a plurality of repeaters connected in sequence, wherein each repeater corresponds to one computing cluster and transmits weight data or computation results received from other repeaters to the corresponding computing cluster.
In one embodiment, the task allocation unit is further configured to determine a storage policy of the weight matrix on local on-chip memories of the plurality of computing clusters and corresponding off-chip memories according to at least one of a size of the weight matrix or a computing capability of the plurality of computing clusters.
In one embodiment, where the input neuron matrix is B×N×K, the weight matrix is K×M, there are b computing clusters, each computing cluster has a computing power of k×m, and N, M, K, k, m and b are any positive integers:
the task allocation strategy is that each computing cluster is allocated B/b input neuron data matrices in parallel.
In one embodiment, the storage strategy of the weight matrix is that the weight matrix is stored in the off-chip memory corresponding to each computing cluster, or the weight matrix is stored in the off-chip memory corresponding to only one computing cluster, or the weight matrix is divided equally into a plurality of sub-matrices which are stored in the off-chip memories corresponding to the respective computing clusters.
In one embodiment, when the weight matrix is stored in the off-chip memory corresponding to one computing cluster, that computing cluster loads the weight matrix from its corresponding off-chip memory into its local on-chip memory and transmits the weight matrix to the remaining computing clusters through the backward data sharing path.
In one embodiment, when the weight matrix is divided equally into a plurality of sub-matrices stored in the off-chip memories corresponding to the respective computing clusters, each computing cluster, when performing a matrix multiplication operation, loads its sub-matrix from its corresponding off-chip memory into its local on-chip memory and transmits it to the remaining computing clusters through the backward data sharing path.
In one embodiment, where the input neuron matrix is B×N×K, the weight matrix is K×M, there are b computing clusters, each computing cluster has a computing power of k×m, N, M, K, k, m and b are any positive integers, and M ≥ b×m:
the task allocation strategy is that each computing cluster is allocated B/b input neuron matrices in parallel;
the storage strategy of the weight matrix is to divide the weight matrix into a plurality of sub-matrices according to the computing capability of the plurality of computing clusters and to distribute the plurality of sub-matrices among the off-chip memories corresponding to the plurality of computing clusters.
In one embodiment, the forward data forwarding path sequentially connects the plurality of computing clusters in series in a first direction to form a loop for transferring input neuron data, and the backward data sharing path sequentially connects the plurality of computing clusters in series in a second direction to form a loop for transferring weight data or computation results.
According to a second aspect of the present invention, an electronic device is provided. The electronic device comprises the system on chip of the invention.
Compared with the prior art, the invention has the following advantages: aiming at the operation characteristics of different layers in neural network inference applications, a unified, coordinated multi-computing-cluster system-on-chip architecture is provided, which solves the problem of low computational efficiency of a single operation unit in the shallow layers of inference applications; data sharing among the multiple computing clusters is realized by designing dedicated data forwarding paths and a network-on-chip; and the heavier bandwidth load required by an operation is scheduled inside the computing cluster according to the scale of the input neuron matrix or the weight matrix, while the lighter bandwidth load is transmitted through the data forwarding paths, thereby optimizing the energy consumption of local memory accesses.
Detailed Description
The present invention will be further described in detail with reference to the following specific examples, which are given by way of illustration, in order to make the objects, technical solutions, design methods and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In the description herein, input neuron data is node data in a neural network model, weights refer to coefficients connecting two nodes, and are obtainable by training, and data generally refers to various types of data such as input neuron data, weight data, and calculation results, unless otherwise indicated by the context.
According to an embodiment of the present invention, a system on a chip for neural network processing is provided. Referring to fig. 1, the system on a chip comprises a plurality of computing clusters (or processor clusters), of which a computing cluster 101, a computing cluster 102, a computing cluster 103 and a computing cluster 104 are shown, a forward data forwarding path 120, a backward data sharing path 130, and a task allocation unit 140.
The computing clusters 101-104 are configured to perform matrix multiplication operations and may each be formed of one or more processing units, e.g., including only matrix multiplication processing units, or including matrix multiplication processing units and other types of units. The computing clusters may have the same or different circuit structures, for example, may be implemented by various types of circuits such as ASICs or DSPs, and the computing capabilities of the computing clusters may be the same or different. Furthermore, each computing cluster has its own on-chip memory (also referred to herein as local on-chip memory) and off-chip memory, where the on-chip memory may be, for example, SRAM or another type, and the off-chip memory may be, for example, DDR memory chips or another type; the invention is not limited in this regard.
The forward data forwarding path 120 forms a ring path for forwarding input neuron data between a plurality of computing clusters, and each computing cluster may sequentially forward, via the forward data forwarding path 120, the neuron data read from the outside (e.g., off-chip memory) or the received neuron data forwarded by other computing clusters to other computing clusters connected thereto, so that the neuron data may flow cyclically between the plurality of computing clusters.
The backward data sharing path 130 forms a ring path for transferring weight data or a calculation result of matrix multiplication between the plurality of computing clusters, and each computing cluster may sequentially forward weight data read from the outside (e.g., off-chip memory) or weight data received from other computing clusters to other computing clusters connected thereto via the backward data sharing path 130, so that the weight data may circulate between the plurality of computing clusters. In this way, each computing cluster is able to achieve access to on-chip memory and off-chip memory.
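By way of illustration only, the following Python sketch models the two ring paths connecting the computing clusters; it assumes, purely for the example, that the second direction is opposite to the first, and the function names and cluster identifiers are illustrative rather than part of the claimed hardware.

```python
# Illustrative model of the two ring paths connecting four computing clusters:
# the forward data forwarding path passes neuron data to the next cluster,
# while the backward data sharing path (assumed here to run in the opposite
# direction) passes weight data or computation results to the previous cluster.

CLUSTERS = [101, 102, 103, 104]

def forward_neighbor(cluster_id: int) -> int:
    """Next cluster on the forward data forwarding path (first direction)."""
    i = CLUSTERS.index(cluster_id)
    return CLUSTERS[(i + 1) % len(CLUSTERS)]

def backward_neighbor(cluster_id: int) -> int:
    """Next cluster on the backward data sharing path (second direction)."""
    i = CLUSTERS.index(cluster_id)
    return CLUSTERS[(i - 1) % len(CLUSTERS)]

# Neuron data circulates 101 -> 102 -> 103 -> 104 -> 101, while shared weight
# data or results circulate 101 -> 104 -> 103 -> 102 -> 101.
print([forward_neighbor(c) for c in CLUSTERS])   # [102, 103, 104, 101]
print([backward_neighbor(c) for c in CLUSTERS])  # [104, 101, 102, 103]
```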
In one embodiment, the on-chip memory resources of each computing cluster may be uniformly addressed for data sharing, and the off-chip memory resources may likewise be uniformly addressed, the address bits including a portion for identifying off-chip memory versus on-chip memory, a portion for identifying the selected computing cluster, and a portion for identifying the specific off-chip or on-chip memory address. For example, the unified address is 35 bits wide, where the highest bit, bit 34, is used to identify whether the access is on-chip or off-chip, bits 33 and 32 are used to select a computing cluster (for the case of 4 computing clusters, any one of them can be selected using these 2 bits), and the lower 32 bits, bit 31 to bit 0, can represent a 4 GB address space. See table 1.
Table 1: bit identification
bit 34          identifies on-chip or off-chip memory
bits 33-32      select the computing cluster
bits 31-0       address within the selected 4 GB memory space
As can be seen from table 1, in this way, each computing cluster is able to access a 4 GB on-chip memory space as well as a 4 GB off-chip memory space.
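By way of illustration only, the following Python sketch models the unified 35-bit addressing scheme described above for the case of four computing clusters; the function names are hypothetical and chosen only for this example.

```python
# Illustrative sketch of the unified 35-bit address layout described above:
# bit 34 selects on-chip vs. off-chip memory, bits 33-32 select the computing
# cluster, and bits 31-0 address a 4 GB space.

OFFCHIP_BIT = 34
CLUSTER_SHIFT = 32
OFFSET_MASK = (1 << 32) - 1          # bits 31..0, a 4 GB address space

def encode_address(off_chip: bool, cluster: int, offset: int) -> int:
    assert 0 <= cluster < 4 and 0 <= offset <= OFFSET_MASK
    return (int(off_chip) << OFFCHIP_BIT) | (cluster << CLUSTER_SHIFT) | offset

def decode_address(addr: int) -> tuple[bool, int, int]:
    off_chip = bool((addr >> OFFCHIP_BIT) & 1)
    cluster = (addr >> CLUSTER_SHIFT) & 0b11
    offset = addr & OFFSET_MASK
    return off_chip, cluster, offset

# Example: offset 0x1000 in the on-chip memory of computing cluster 2.
addr = encode_address(off_chip=False, cluster=2, offset=0x1000)
assert decode_address(addr) == (False, 2, 0x1000)
```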
The task allocation unit 140 is configured to determine a task allocation policy and on-chip and off-chip storage policies of the multiple computing clusters according to requirements of a task to be computed and computing capabilities of each computing cluster. The task allocation unit 140 may be implemented by a software module of a system on chip. The task allocation policies and on-chip, off-chip storage policies are described further below.
Fig. 2 shows a block diagram of a computing cluster including a data flow control module 210, a data buffer module 220, a multiply-accumulate module 230, a data transfer module 240, an on-chip memory 250, and a repeater 260, according to one embodiment of the invention.
The data flow control module 210 has communication connections with the data buffer module 220, the multiply-accumulate module 230, the data transfer module 240 and the repeater 260 (the connection to the repeater 260 is not shown). It is operable to receive the task allocation policies and the on-chip and off-chip storage policies from the task allocation unit and, in accordance with these policies and the task execution conditions, to control the transfer of data (including neuron data, weights, matrix multiplication results, etc.) between the modules in the computing cluster and the exchange of data with the outside of the computing cluster. This includes, for example and without limitation: controlling the data buffer module 220 to fetch data from the on-chip memory 250; controlling the repeater 260 to receive weight data from the repeater of another computing cluster; controlling the loading of data from outside the computing cluster into the data buffer module 220 and its subsequent transfer to the multiply-accumulate module 230; or, after the multiply-accumulate module 230 performs a matrix multiplication operation, controlling the transfer of the neuron data to the data transfer module 240 and controlling the data transfer module 240 to pass the neuron data to a subsequent computing cluster.
The data buffer module 220 is used for buffering various types of data, including but not limited to weight data to be subjected to matrix multiplication, input neuron data, and the computation results of the multiply-accumulate module 230.
The multiply-accumulate module 230 is configured to perform multiplication operations of the weight matrix and the input neuron matrix, and may include one or more matrix multiplication processing units to rapidly process matrix multiplication operations of different scales.
The data transfer module 240 is configured to form part of the forward data forwarding path; it has a communication connection with the data flow control module 210 within the computing cluster and also with other computing clusters, so as to pass data to the other computing clusters, for example to the data buffer modules of the other computing clusters.
The on-chip memory 250, i.e., the local on-chip memory of the computing cluster, is used to store various types of data, such as neuron data or weight data.
The repeater 260 is configured to form a backward data sharing path, and may load weight data from an external memory, receive weight data from other computing clusters, or forward the weight data to other computing clusters (e.g., by interacting with repeaters of other computing clusters), or receive matrix multiplication results from other computing clusters and store them in the on-chip memory 250, or forward the matrix multiplication results to other computing clusters.
For clarity of illustration, the connection between the data flow control module 210 and the on-chip memory 250 and the repeater 260 is not shown in fig. 2. Such connection relationships are understood by those of ordinary skill in the art to implement the functionality of the present invention. Moreover, it is also possible for one of ordinary skill in the art to add, delete, and change some components according to the needs and purposes of the system, without being limited to the components and the connection relationships between the components shown in fig. 2.
In connection with the computing cluster architecture of fig. 2, when the system-on-chip includes multiple computing clusters, a forward data forwarding path and a backward data sharing path may be formed under the control of the data flow control module 210.
For example, the forward data forwarding path is formed by the data buffer module 220, the multiply-accumulate module 230 and the data transfer module 240 together with the corresponding modules in other computing clusters; that is, the neuron data may be forwarded in sequence to the data buffer module 220, the multiply-accumulate module 230 and the data transfer module 240, and then to the data buffer module, multiply-accumulate module and data transfer module of other computing clusters. As another example, the forward data forwarding path is formed by the data buffer module 220 and the data transfer module 240 together with the data buffer modules, multiply-accumulate modules and data transfer modules of other computing clusters; in this case, some of the neuron data may be forwarded directly to other computing clusters without participating in the local multiply-accumulate operation.
For example, the backward data sharing path comprises a repeater 260 and a repeater in other computing clusters, i.e. the weight data or the calculation result of the matrix multiplication is sent by the repeater 260 to the repeater of the computing cluster to which it is connected.
It should be understood that the connection between the modules in fig. 2 is only for illustration, and those skilled in the art can make appropriate modifications in practical applications, for example, the calculation result of the multiply-accumulate module 230 can also be temporarily stored in the data buffer module 220 and forwarded to other computing clusters at appropriate time.
Fig. 3 shows a system on a chip according to another embodiment of the invention, which is similar to the system shown in fig. 1 and likewise comprises computing clusters 301-304, a forward data forwarding path 320, a backward data sharing path 430 and a task allocation unit (not shown); in contrast to fig. 1, however, the repeaters constituting the backward data sharing path 430 are arranged outside the computing clusters and are shown explicitly. The backward data sharing path 430 is formed by connecting a plurality of repeaters, which are in one-to-one correspondence with the computing clusters and can exchange data with their corresponding computing clusters; they are respectively labeled repeater 401, repeater 402, repeater 403, and repeater 404.
The task allocation policies and on-chip, off-chip storage policies and corresponding data processing procedures are described below in connection with fig. 3.
In one embodiment, the tasks to be executed by each computing cluster and the on-chip and off-chip storage strategies used in the computation are determined according to the scale of the matrices to be computed (including the scale of the input neuron matrix and/or the scale of the weight matrix) or the computing capability of each computing cluster, so that different schemes can be selected to achieve efficient utilization of the computing resources and to minimize data transmission.
For example, let the input neuron data matrix be of size B×N×K, representing B input neuron data matrices of size N×K (N is the row dimension, K is the column dimension); let there be a plurality of weight matrices, each weight matrix being of size K×M (K is the row dimension, M is the column dimension); and, for convenience of explanation, let there be a total of b computing clusters with the same computing power k×m (i.e., in one matrix multiplication operation, the row dimension of the weight matrix that can be processed is k and the column dimension is m), where N, M, K, k, m and b are any positive integers. The task allocation strategy and the on-chip and off-chip storage strategies in the two cases of a small weight scale and a large weight scale are introduced below.
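By way of illustration only, the following Python sketch shows how a task allocation unit of this kind might choose between the two strategies described below, using the scales defined above; the function name, the returned fields and the example values are assumptions made only for this explanation.

```python
# Hypothetical sketch of the strategy selection performed by the task allocation
# unit, using the scales defined above: B input neuron matrices of size N x K,
# a K x M weight matrix, and b computing clusters each with computing power k x m.

def choose_strategy(B: int, N: int, K: int, M: int, b: int, k: int, m: int) -> dict:
    batches_per_cluster = -(-B // b)   # ceil(B / b): input matrices per cluster
    if M <= m:
        # Case 1 (small weight scale): each cluster can process all output
        # columns at once; the weight matrix is replicated off-chip or split
        # equally and rotated over the backward data sharing path.
        return {"batches_per_cluster": batches_per_cluster,
                "weight_storage": "replicated_or_rotated"}
    # Case 2 (large weight scale, e.g. M >= b * m): the weight matrix is
    # partitioned by columns across the clusters; partial results are shared
    # and spliced over the backward data sharing path.
    return {"batches_per_cluster": batches_per_cluster,
            "weight_storage": "column_partitioned_across_clusters"}

print(choose_strategy(B=8, N=64, K=128, M=32, b=4, k=128, m=64))
```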
1) The case of a small weight scale
For example, M ≤ m. Such computations typically occur in the shallower layers of image recognition applications, where the value of M is generally small and K is also small, so the size of the K×M weight matrix is small as well.
In this case, the sub-task allocation strategy is to allocate, in an input-neuron-matrix-parallel manner, B/b input neuron matrices to be calculated to each computing cluster.
In one embodiment, the storage strategy for the weight matrix is that each computing cluster loads the weight matrix from its corresponding off-chip memory into its local on-chip memory when performing the matrix multiplication operation, so that as the operation proceeds all input neuron matrices and the weight matrix are processed locally by the computing clusters and the computing clusters do not need to perform data communication with one another; in this case, the backward data sharing path is not used. In this way, access latency and access power consumption can be reduced.
In another embodiment, the storage strategy of the weight matrix is that the weight matrix is distributed equally across the on-chip memories of the computing clusters; when the matrix multiplication operation is executed, the multiply-accumulate module in each computing cluster loads its portion of the weight matrix from the local on-chip memory, and the portions held by the other computing clusters are obtained through the backward data sharing path.
For ease of understanding, in connection with the system on chip shown in fig. 3, table 2 below illustrates the behavior of the computing clusters at different times, and table 3 illustrates the behavior of the repeaters at different times. Specifically, taking as an example the case where each computing cluster is allocated B/4 input neuron matrices in parallel, the weight matrix is equally divided into four sub-weight matrices, labeled weight parts 1-4, which are assigned to the computing clusters 301-304, respectively. At time T0, the computing cluster 301 executes the matrix multiplication of one of its allocated neuron matrices with weight part 1, while the repeater 401 corresponding to this computing cluster reads weight part 2 from the repeater 402; at time T1, the computing cluster 301 performs the matrix multiplication of that neuron matrix with weight part 2. The other computing clusters and their corresponding repeaters behave similarly, see tables 2 and 3.
Table 2: computing cluster behavior at different moments
Table 3: repeater behavior at different times
As can be seen from tables 2 and 3, while each computing cluster performs a matrix multiplication, its corresponding repeater can read, via the backward data sharing path, the weight data to be involved in the matrix multiplication at a subsequent time from other computing clusters, so that the weight data flows between the repeaters and can be loaded by the computing clusters when needed, for example into the data buffer module, the on-chip memory, and the like. In this way, the flow of weight data between the computing clusters can be controlled, thereby improving the resource utilization of the computing clusters and the efficiency of the matrix multiplication operation.
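Purely as an illustrative sketch, and not as the claimed implementation, the following Python code models the rotation of the four weight parts among four computing clusters over successive time steps, in the spirit of tables 2 and 3; the cluster numbering and names are assumptions.

```python
# Illustrative model of case 1 with distributed weight storage: the weight
# matrix is split into four parts that rotate among four computing clusters
# over the backward data sharing path, while each cluster keeps its own input
# neuron matrices local.

NUM_CLUSTERS = 4

def weight_rotation_schedule(num_steps: int = NUM_CLUSTERS):
    """Return, per time step, which weight part each cluster multiplies with."""
    schedule = []
    for t in range(num_steps):
        # Cluster i starts with weight part i + 1 and receives the next part
        # from a neighbouring repeater at every step (ring rotation).
        step = {f"cluster_{301 + i}": (i + t) % NUM_CLUSTERS + 1
                for i in range(NUM_CLUSTERS)}
        schedule.append(step)
    return schedule

for t, step in enumerate(weight_rotation_schedule()):
    print(f"T{t}: {step}")
# After NUM_CLUSTERS time steps every cluster has multiplied its local neuron
# matrices with all four weight parts.
```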
2) The case of a large weight scale
If M ≥ b×m, the scale of the weight matrix is large while the scale of the neuron matrix is small, and a computing cluster cannot complete the multiplication of the input neuron matrix and the weight matrix in a single operation. In this case, each computing cluster may still be allocated B/b matrices to be calculated in parallel, while the weight matrix is divided into a plurality of smaller matrices distributed across different computing clusters; for example, the computing cluster 101 is assigned the sub-matrix formed by columns [0, m-1] (of size K×m), the computing cluster 102 is assigned the sub-matrix formed by columns [m, 2m-1], and so on. In this way, when performing matrix multiplication operations, the traffic requiring the larger communication bandwidth, i.e., the large-scale weights, remains local, while the neuron data may be read once from on-chip memory (e.g., SDRAM) and propagated to the other computing clusters of the system on chip via the forward data forwarding path. As a result, only the result of the matrix multiplication, or an intermediate calculation result, is written back to the on-chip shared memory through the backward data sharing path, and all other accesses occur inside the computing cluster.
Still in connection with fig. 3, tables 4 and 5 below illustrate the behavior of the computing clusters and of the repeaters, respectively, at different times. Specifically, taking again as an example the case where each computing cluster is allocated B/4 input neuron matrices in parallel, one weight matrix is divided into four sub-weight matrices, labeled weight parts 1-4, which are respectively distributed to the computing clusters 301-304; when the multiplication of a neuron matrix with this weight matrix is performed, the operation results for the four sub-weight matrices need to be spliced together.
In this example, at time T0, the computing cluster 301 executes the matrix multiplication of one neuron matrix with weight part 1 and the computing cluster 302 performs the matrix multiplication of another neuron matrix with weight part 2; at time T1, the computing cluster 301 executes the matrix multiplication of the next neuron matrix with weight part 1 and the computing cluster 302 performs the matrix multiplication of the next neuron matrix with weight part 2; at time T2, the repeater 401 corresponding to the computing cluster 301 reads the result of its neuron matrix with weight part 2 from the repeater 402; at time T3, the repeater 401 reads the result of its neuron matrix with weight part 3 from the repeater 403; and so on. After a computing cluster has obtained the results for the plurality of sub-weight matrices of one weight matrix, the calculation result of the neuron matrix and the weight matrix can be obtained by splicing these partial results. The other computing clusters and their corresponding repeaters behave similarly, see tables 4 and 5.
Table 4: processor behavior at different times
Table 5: repeater behavior at different times
As can be seen from tables 4 and 5, while each computing cluster performs a matrix multiplication, its corresponding repeater can read the result of the matrix multiplication performed at the previous time from other computing clusters via the backward data sharing path, so that the computation results flow sequentially between the repeaters and can be spliced by the computing clusters when needed.
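As an illustrative sketch only (using NumPy, which the invention does not prescribe), the following code mimics the large-weight-scale case: the weight matrix is partitioned by columns across the computing clusters, each cluster computes a partial product with its local weight part, and the partial results are spliced along the column dimension; all names and sizes are assumptions for this example.

```python
# Illustrative model of case 2: column-wise weight partitioning and splicing of
# the partial results, as described above for the large-weight-scale case.
import numpy as np

def partitioned_matmul(neuron: np.ndarray, weight: np.ndarray, num_clusters: int) -> np.ndarray:
    """neuron: N x K, weight: K x M; each 'cluster' holds one column block of the weight."""
    weight_parts = np.array_split(weight, num_clusters, axis=1)   # weight parts 1..b
    # Each cluster multiplies the (forwarded) neuron matrix with its local weight part.
    partial_results = [neuron @ part for part in weight_parts]
    # The partial results are gathered over the backward data sharing path and spliced.
    return np.concatenate(partial_results, axis=1)

rng = np.random.default_rng(0)
neuron = rng.standard_normal((8, 16))    # N = 8, K = 16
weight = rng.standard_normal((16, 20))   # K = 16, M = 20
assert np.allclose(partitioned_matmul(neuron, weight, num_clusters=4), neuron @ weight)
```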
It should be understood that the above-mentioned timing of the flow of weight data and calculation results between the computing clusters is not fixed; the transfer order of data between the respective modules may be controlled according to the scale of the data to be processed, the computing capability of the multiply-accumulate module, and the capacities of the data buffer module and the on-chip memory. For example, a computing cluster need not process all of its B/b matrices to be calculated before forwarding calculation results over the backward data sharing path; it may forward the calculation results of a part of the matrices once that part has been processed. Further, although tables 4 and 5 illustrate the processing of one weight matrix, the case of a plurality of weight matrices is similar and only requires sequential processing.
The system on a chip of the invention provides a unified and coordinated multiprocessor architecture tailored to the operation characteristics of different layers in neural network inference applications. Each computing cluster has its own relatively large storage and can access it efficiently, which solves the problems of low shallow-layer processing efficiency and high data-movement energy consumption of the first type of architecture, as well as the problems caused by the limited storage of the second type of architecture. In addition, by coordinated scheduling of processing tasks and selection of different storage strategies, the invention can adapt to the operation characteristics of different layers in the neural network, thereby alleviating the performance imbalance of the third type of architecture. In terms of software and hardware coordination, the task division scheme and the non-uniform memory storage strategy applied at the software layer concentrate the heavy bandwidth load required by an operation layer inside the computing cluster, while the light bandwidth load is transmitted through the on-chip interconnection network, thereby optimizing the energy consumption of local memory accesses.
The invention improves the computational energy efficiency ratio in the field of artificial intelligence inference, and is especially suitable for application scenarios with high-performance inference demands, such as data centers and autonomous driving.
The system on a chip of the invention can be applied to various electronic devices, such as mobile devices, embedded electronic devices, intelligent computing processing devices, robots and the like, and can be applied to the fields of word processing, voice recognition and processing, multi-language translation, image recognition, biological feature recognition, intelligent control and the like.
It should be noted that, although the steps are described above in a specific order, it is not meant to necessarily be performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order, as long as the required functions are achieved.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.