CN112232497A - Method, system, device and medium for compiling AI chip - Google Patents

Method, system, device and medium for compiling AI chip

Info

Publication number
CN112232497A
CN112232497A (application CN202011083320.XA)
Authority
CN
China
Prior art keywords
network model
calculation
chip
network
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011083320.XA
Other languages
Chinese (zh)
Inventor
沈付旺
景璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011083320.XA priority Critical patent/CN112232497A/en
Publication of CN112232497A publication Critical patent/CN112232497A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for compiling an AI chip, wherein the method comprises the following steps: quantizing the network model of the AI chip, and performing precision adjustment on the quantized network model based on the weights of the network model; in the precision-adjusted network model, distributing the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and acquiring hardware architecture parameters, and generating a continuous calculation flow according to the hardware architecture parameters and the network model. By quantizing the network model and adjusting its precision, the invention makes the quantization function more complete and the quantization precision higher; the calculation of the network is optimized to the greatest extent, the number of calculation steps is reduced, and the utilization rate of the MAC calculation units is maximized; and by generating a calculation flow that keeps the computing device busy throughout network inference, the utilization rate of the AI device is improved.

Description

Method, system, device and medium for compiling AI chip
Technical Field
The present invention relates to the field of artificial intelligence, and more particularly, to a method, system, computer device and readable medium for compiling an AI chip.
Background
Artificial Intelligence (AI) is currently one of the most popular and promising technologies and has developed rapidly in recent years with the support of powerful computing devices. Related applications have penetrated many aspects of life, such as weather prediction, automatic driving and natural language translation. The basis of AI technology, however, is massive mathematical computation that relies on powerful computing equipment. In the early stage of AI applications, because there was no device dedicated to AI computation, general-purpose processors such as CPUs and GPUs were typically used to deploy AI calculation instances. Although these general-purpose computing devices are flexible and versatile, they tend to have low computational efficiency, high power consumption and high hardware cost when performing AI-related calculations, so people have sought a computing device dedicated to AI calculation, the AI ASIC (AI application-specific chip). The purpose of the AI-specific chip is to make AI calculation more efficient, the computing power of AI equipment stronger, and the energy consumption and hardware cost lower, thereby reducing the overall cost of deploying AI equipment and improving its processing speed.
The flow for running a calculation instance on an AI chip or AI device is: 1) obtain the algorithm or deep network model file to be run from an existing deep learning framework such as TensorFlow, PyTorch, Caffe or ONNX; 2) use an AI compiler to generate an instruction set or calculation configuration file that can be executed directly on the AI device; 3) use a top-level application or runtime to dispatch the instruction set or file from step 2) to the driver; 4) the driver forwards the instruction set or related calculation configuration to the AI device; 5) the AI device completes the calculation and outputs the calculation result or precision data.
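To make the five-step flow concrete, the following is a minimal, purely illustrative Python sketch of such a host-side pipeline; every name in it (load_model, AICompiler, Driver, run_instance) is a hypothetical placeholder invented for this example and is not an API defined by the patent or by any framework.

# Minimal illustrative sketch of the compile-and-deploy flow described above.
# All names are hypothetical placeholders, not a real API.

class AICompiler:
    def compile(self, model: dict) -> dict:
        # 2) turn the imported model into a device-executable instruction set / configuration
        return {"instructions": [f"exec:{layer}" for layer in model["layers"]]}

class Driver:
    def dispatch(self, program: dict) -> list:
        # 4) forward the instruction set to the AI device; 5) the device computes and returns results
        return [f"done:{instr}" for instr in program["instructions"]]

def load_model(path: str, framework: str) -> dict:
    # 1) import an algorithm / deep network model file from TensorFlow, PyTorch, Caffe or ONNX
    return {"path": path, "framework": framework, "layers": ["conv1", "relu1", "fc1"]}

def run_instance(path: str, framework: str) -> list:
    model = load_model(path, framework)
    program = AICompiler().compile(model)
    return Driver().dispatch(program)   # 3) the top-level application/runtime hands the program to the driver

if __name__ == "__main__":
    print(run_instance("resnet50.onnx", "ONNX"))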
However, the prior art for compiling AI chips has the following disadvantages:
(1) the quantization function is not complete enough, and the quantization precision needs to be improved;
(2) calculation optimization only supports operator-fusion-type optimization, and cannot optimize the calculation comprehensively and deeply;
(3) in terms of flexibility, some mainstream frameworks such as PyTorch cannot be supported directly, which imposes considerable limitations on users;
(4) the compiler targets only certain specific hardware platforms and cannot be extended;
(5) the usage process is too complicated, which is not conducive to users deploying actual calculation instances;
(6) the computing device is sometimes idle during network inference calculation, and the utilization rate of the device needs to be improved;
(7) calculation optimization only supports very basic operator-fusion-type optimization, so the calculation of the network cannot be optimized to the greatest extent, the number of calculation steps cannot be reduced, and the utilization rate of the MAC calculation units cannot be maximized.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, a system, a computer device and a computer-readable storage medium for compiling an AI chip, which make the quantization function more complete by quantizing the network model and adjusting its precision; optimize the calculation of the network to the greatest extent, reduce the number of calculation steps, and maximize the utilization rate of the MAC calculation units; and improve the utilization rate of the AI device by generating a calculation flow that keeps the computing device busy throughout network inference.
In view of the above, an aspect of the embodiments of the present invention provides a method for compiling an AI chip, including the following steps: quantizing the network model of the AI chip, and performing precision adjustment on the quantized network model based on the weights of the network model; in the precision-adjusted network model, distributing the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and acquiring hardware architecture parameters, and generating a continuous calculation flow according to the hardware architecture parameters and the network model.
In some embodiments, performing precision adjustment on the quantized network model based on the weights of the network model includes: performing linear quantization on the weights of the network model to generate first measurement data; generating a feature map of the network model and quantizing the network model based on the feature map to generate second measurement data; and performing quantized inference using the first measurement data and the second measurement data, comparing the inference result with the inference result of the original network model, and adjusting the network model according to the comparison result.
In some embodiments, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model includes: judging whether the size of the picture input into the network model exceeds a threshold value; and in response to the size of the picture input into the network model exceeding the threshold value, slicing the picture and sending the resulting sub-pictures to different AI chip cores.
In some embodiments, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model includes: judging whether the number of channels of the picture input into the network model exceeds a second threshold value; and in response to the number of channels exceeding the second threshold value, splitting the weights of the network model and sending the resulting sub-weights to different AI chip cores.
In some embodiments, generating a continuous calculation flow according to the hardware architecture parameters and the network model includes: in response to generating a new calculation flow, obtaining the current computing-power usage of each time period and assigning the new calculation flow to the time period with the lowest current computing-power usage.
In some embodiments, the method further comprises: retraining the network model so as to clip and compress the network model.
In some embodiments, the method further comprises: combining like terms among the calculations in the computation graph of the network model so as to simplify the calculation process.
In another aspect of the embodiments of the present invention, a system for compiling an AI chip is further provided, including: a quantization module configured to quantize the network model of the AI chip and perform precision adjustment on the quantized network model based on the weights of the network model; a calculation module configured to distribute, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and a flow module configured to acquire hardware architecture parameters and generate a continuous calculation flow according to the hardware architecture parameters and the network model.
In another aspect of the embodiments of the present invention, a computer device is also provided, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of the above method.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the steps of the above method.
The invention has the following beneficial technical effects:
(1) the invention solves the interface problems of the different deep learning frameworks PyTorch, TensorFlow, Caffe and ONNX, and can provide a comprehensive and reliable quantization function;
(2) it provides comprehensive calculation optimization functions, including operator fusion, clipping and compression, sparse deployment, hardware-oriented adaptive optimization based on AutoML and reinforcement learning, and automatic selection of the optimal computation graph, thereby maximizing the utilization rate of the MAC calculation units, maximizing the calculation speed and minimizing the latency;
(3) according to the hardware architecture and the network calculation process, a calculation flow that maximally utilizes the AI device is generated automatically, so that the calculation efficiency approaches the ideal 100%;
(4) it provides users with several flexible development modes, such as the development of user-defined operators and user-defined networks; users can choose the development language they are most proficient in to implement operators and networks not yet supported by the compiler;
(5) through an implementation of an LLVM (Low Level Virtual Machine) back end, users can adapt the AI compiler to their own hardware, and a change of hardware architecture does not require extensive secondary development of the AI compiler, which saves users a great deal of research and development time and allows them to focus on the design of the hardware architecture.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 is a diagram illustrating an embodiment of a method for compiling an AI chip according to the present invention;
FIG. 2 is a flow chart of the custom operator and network provided by the present invention;
FIG. 3 is a flow diagram of a customized underlying compiler provided by the present invention;
FIG. 4 is a schematic diagram of a hardware structure of a computer device for compiling an AI chip according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that have the same name but are not identical. Thus "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and subsequent embodiments will not explain this again.
In view of the above objects, a first aspect of an embodiment of the present invention proposes an embodiment of a method for compiling an AI chip. Fig. 1 is a schematic diagram illustrating an embodiment of a method for compiling an AI chip according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S1, quantizing the network model of the AI chip, and performing precision adjustment on the quantized network model based on the weights of the network model;
S2, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and
S3, acquiring hardware architecture parameters, and generating a continuous calculation flow according to the hardware architecture parameters and the network model.
The basis of AI technology is deep learning together with dedicated AI computing chips, that is, chips designed specifically for deep learning calculation. Designing a powerful and efficient chip requires comprehensive consideration of the algorithm top layer, the AI compiler, the architecture, the hardware design and other aspects. The algorithms are mainly deep neural networks, such as convolutional networks and recurrent neural networks. These algorithms are existing open-source algorithms whose accuracy has been verified on related data sets, and the user selects the appropriate algorithm according to the actual application scenario. The chip is designed to support these algorithms more flexibly, comprehensively and efficiently.
The network model of the AI chip is quantized, and precision adjustment is performed on the quantized network model based on the weights of the network model. The embodiment of the invention provides a network model quantization function that can quantize an original FP32 high-bit network model based on TensorFlow, PyTorch, ONNX, Caffe and the like to INT16/FP16, INT8 and even INT4 low-bit types, while the quantized network still satisfies the accuracy requirements.
There are currently many deep learning algorithms, and many deep learning frameworks that implement them. Before an AI compiler can compile an instruction set or configuration file that runs directly on hardware, it must first parse the original algorithms implemented in these different frameworks, so a comprehensive model interface and parsing function are required.
In some embodiments, performing precision adjustment on the quantized network model based on the weights of the network model includes: performing linear quantization on the weights of the network model to generate first measurement data; generating a feature map of the network model and quantizing the network model based on the feature map to generate second measurement data; and performing quantized inference using the first measurement data and the second measurement data, comparing the inference result with the inference result of the original network model, and adjusting the network model according to the comparison result. Specifically, the weights of the network model are first linearly quantized to generate the Scale data of the weights (the first measurement data); then the deployed network model is run on the corresponding data set to generate FP32 inference results, that is, the feature map files; next, the feature maps or activation values of the network model are quantized to generate the inter-layer quantization coefficients or Scale data (the second measurement data); finally, quantized inference is performed using the quantized weights, the weight Scale data and the feature-map Scale data, the final inference result is compared with the original FP32 inference result, and fine adjustment is made according to the error of a given layer or the precision on the data set, finally obtaining a quantized model with no precision loss or a precision loss within 1%.
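As a concrete illustration of this procedure, the sketch below performs symmetric linear quantization of one layer's weights and activations with per-tensor Scale factors and compares the dequantized result against the FP32 reference; it is a minimal example under those assumptions, not the patent's implementation, and all function names are the author's own.

import numpy as np

def scale_for(tensor, num_bits=8):
    # Symmetric linear quantization: map the largest |value| onto the largest signed integer.
    qmax = 2 ** (num_bits - 1) - 1
    return float(np.max(np.abs(tensor))) / qmax if np.any(tensor) else 1.0

def quantize(tensor, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(tensor / scale), -qmax - 1, qmax).astype(np.int32)

# FP32 reference layer: y = x @ W
x = np.random.randn(4, 64).astype(np.float32)        # activation / feature map
w = np.random.randn(64, 32).astype(np.float32)       # layer weights
y_fp32 = x @ w

w_scale = scale_for(w)                                # "first measurement data": weight Scale
x_scale = scale_for(x)                                # "second measurement data": feature-map Scale
y_int = quantize(x, x_scale) @ quantize(w, w_scale)   # INT8 multiply-accumulate (INT32 accumulator)
y_deq = y_int * (x_scale * w_scale)                   # dequantize the result

# Compare quantized inference with the FP32 reference and decide whether fine adjustment is needed.
rel_err = np.abs(y_deq - y_fp32).mean() / np.abs(y_fp32).mean()
print(f"mean relative error: {rel_err:.4%}")

Per-channel Scale factors, asymmetric quantization or lower bit widths such as INT4 follow the same pattern with a different scale computation.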
After quantization of the network model is completed, the calculation flow of the whole network model is further optimized and merged, so that the data-movement and calculation steps are reduced as much as possible and the utilization rate of the MAC calculation units is maximized.
In some embodiments, the method further comprises: retraining the network model so as to clip and compress the network model. For example, the compiler evaluates the network model and performs the necessary clipping on it. Specifically, clipping the network model means removing, through retraining, layers or channels that are unimportant in the network model, until further clipping would make the precision unrecoverable, thereby achieving the maximum clipping ratio without precision loss and realizing a smaller network model with a greatly reduced calculation load. After network clipping and compression, the calculation structure of each layer becomes sparse, so sparse calculation needs to be matched with the hardware architecture to complete the deployment of sparse calculation.
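A minimal sketch of such clipping by channel importance is given below; the L1-norm importance criterion, the accuracy tolerance and the stubbed retraining step are illustrative assumptions of this example, not details specified by the patent.

import numpy as np

def clip_channels(weights, keep_ratio):
    # weights: (out_channels, in_channels, kh, kw); rank output channels by L1 norm.
    importance = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    keep = max(1, int(round(keep_ratio * weights.shape[0])))
    kept_idx = np.argsort(importance)[::-1][:keep]
    return weights[np.sort(kept_idx)]

def retrain_and_measure(clipped):
    # Placeholder: in practice the clipped model is retrained on the data set and its
    # accuracy is measured; here accuracy is simply assumed to degrade smoothly.
    return 0.76 - 0.05 * (1.0 - clipped.shape[0] / 64)

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
baseline_acc, tolerance = 0.76, 0.01
w_best = w

# Clip progressively harder until retraining can no longer recover the accuracy.
for keep_ratio in (0.9, 0.75, 0.5, 0.25):
    clipped = clip_channels(w, keep_ratio)
    acc = retrain_and_measure(clipped)
    if baseline_acc - acc > tolerance:
        break
    w_best = clipped
print("kept output channels:", w_best.shape[0])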
In some embodiments, the method further comprises: combining like terms among the calculations in the computation graph of the network model so as to simplify the calculation process. For example, within one operator or one calculation step, calculations of the same type are grouped and merged, and common subexpressions are eliminated, which simplifies the calculation steps of the operator and further simplifies the calculation. Operators or operations in the computation graph that can be merged are fused, which reduces the number of calculation steps, reduces the data movement before and after each calculation, and saves a large amount of calculation time.
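As a toy illustration of these two transformations, the sketch below runs common-subexpression elimination and a conv+bias+ReLU fusion pass over a tiny hand-written expression graph; the dictionary-based graph representation and all node names are illustrative choices for this example, not the patent's internal representation.

# Toy computation graph: each node is (op, inputs). All names are illustrative only.
graph = {
    "a":   ("input", ()),
    "t1":  ("mul", ("a", "a")),
    "t2":  ("mul", ("a", "a")),      # duplicate of t1
    "c1":  ("conv", ("t1",)),
    "b1":  ("bias_add", ("c1",)),
    "r1":  ("relu", ("b1",)),
    "out": ("add", ("r1", "t2")),
}

def eliminate_common_subexpressions(g):
    seen, alias, out = {}, {}, {}
    for name, (op, ins) in g.items():
        ins = tuple(alias.get(i, i) for i in ins)
        key = (op, ins)
        if key in seen and op != "input":
            alias[name] = seen[key]       # reuse the earlier identical node
        else:
            seen[key] = name
            out[name] = (op, ins)
    return out

def fuse_conv_bias_relu(g):
    # Fuse conv -> bias_add -> relu chains into one fused operator.
    fused, consumed = {}, set()
    for name, (op, ins) in g.items():
        if op == "relu" and g[ins[0]][0] == "bias_add" and g[g[ins[0]][1][0]][0] == "conv":
            conv = g[ins[0]][1][0]
            fused[name] = ("conv_bias_relu", g[conv][1])
            consumed.update({ins[0], conv})
        else:
            fused[name] = (op, ins)
    return {k: v for k, v in fused.items() if k not in consumed}

print(fuse_conv_bias_relu(eliminate_common_subexpressions(graph)))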
The method can also use the latest reinforcement learning and adaptive-learning AutoML techniques to perform adaptively learned calculation optimization for the user's current AI device, so that the calculation optimization reaches the best match between calculation and hardware.
In the precision-adjusted network model, the calculation of the convolutional neural network is distributed to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model. A convolutional network usually consists of many convolutional layers, and the basic calculation process of each convolutional layer is: feature map (Feature) data input, weight (Weight) loading, multiply-accumulate calculation, and result output or temporary storage. When multiple pictures are input, each picture can be assigned to a separate core; because the weights are the same for every picture, the weights are shared and reused across the cores.
In some embodiments, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model includes: judging whether the size of the picture input into the network model exceeds a threshold value; and in response to the size of the picture input into the network model exceeding the threshold value, slicing the picture and sending the resulting sub-pictures to different AI chip cores. In some practical scenarios the picture may be very large; when its size exceeds a preset threshold, the picture can be sliced and the slices sent to different AI chip cores, and the weights in this case are shared and reused.
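One simple way to picture this slicing is sketched below: when the input picture exceeds a size threshold it is split into row bands, each band is processed by a different core with the same shared weights, and the partial outputs are concatenated; the threshold value, the 1x1-convolution stand-in for a core's MAC array and the core count are illustrative assumptions only.

import numpy as np

SIZE_THRESHOLD = 512          # illustrative threshold on the picture height/width
NUM_CORES = 4

def conv_on_core(tile, weight):
    # Stand-in for one core's MAC array: a 1x1 convolution expressed as a channel-mixing matmul.
    h, w, cin = tile.shape
    return (tile.reshape(-1, cin) @ weight).reshape(h, w, -1)

def run_layer(picture, weight):
    if max(picture.shape[:2]) <= SIZE_THRESHOLD:
        return conv_on_core(picture, weight)            # small picture: single core
    # Large picture: slice into row bands, one band per core, weights shared and reused.
    bands = np.array_split(picture, NUM_CORES, axis=0)
    outputs = [conv_on_core(band, weight) for band in bands]
    return np.concatenate(outputs, axis=0)

picture = np.random.randn(1024, 1024, 3).astype(np.float32)
weight = np.random.randn(3, 16).astype(np.float32)      # the same weights on every core
print(run_layer(picture, weight).shape)                  # (1024, 1024, 16)

Row bands are exact for the 1x1 stand-in; a real convolution with a larger kernel would additionally exchange halo rows between neighbouring slices.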
In some embodiments, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model includes: judging whether the number of channels of the picture input into the network model exceeds a second threshold value; and in response to the number of channels exceeding the second threshold value, splitting the weights of the network model and sending the resulting sub-weights to different AI chip cores. During network calculation the picture may become smaller and smaller while its number of channels grows larger and larger; when the number of channels exceeds the preset second threshold, the weights can be split and sent to different calculation cores, and in this case the input (feature map) is shared and reused.
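The complementary case, splitting the weights along the output-channel dimension while the feature map is shared and reused by all cores, can be sketched in the same style; again the second threshold, the 1x1-convolution stand-in and the core count are illustrative assumptions.

import numpy as np

CHANNEL_THRESHOLD = 256        # illustrative "second threshold" on the number of channels
NUM_CORES = 4

def run_layer(feature, weight):
    # feature: (h, w, cin); weight: (cin, cout) -- a 1x1 convolution stand-in for one layer.
    h, w, cin = feature.shape
    flat = feature.reshape(-1, cin)
    if cin <= CHANNEL_THRESHOLD:
        return (flat @ weight).reshape(h, w, -1)                  # few channels: single core
    # Many channels: split the weights by output channel; the feature map is shared by all cores.
    sub_weights = np.array_split(weight, NUM_CORES, axis=1)
    partial = [flat @ sw for sw in sub_weights]                   # each core computes with one sub-weight
    return np.concatenate(partial, axis=1).reshape(h, w, -1)

feature = np.random.randn(14, 14, 512).astype(np.float32)        # small picture, many channels
weight = np.random.randn(512, 1024).astype(np.float32)
print(run_layer(feature, weight).shape)                           # (14, 14, 1024)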
Hardware architecture parameters are acquired, and a continuous calculation flow is generated according to the hardware architecture parameters and the network model. The AI compiler adapts itself to the hardware configuration and to the calculation flow of the network model, ensuring that after the hardware deploys the calculation task there is no hardware idle state caused by data waiting, data movement, hardware sub-task configuration and so on, so that the hardware utilization rate approaches the ideal 100%. Specifically, the method comprises the following steps: first, the hardware architecture file is imported; if this parameter does not exist or no file is imported, the compiler defaults to its built-in hardware architecture; then the network model file is imported; and finally an efficient, seamless calculation flow is generated according to the hardware architecture, the network model and the related calculation configuration file data.
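As a rough illustration of this adaptation step, the sketch below imports an optional hardware architecture description (falling back to a built-in default when no file is supplied) and emits one calculation step per layer sized to the described MAC resources; the JSON schema and every field name in it are hypothetical.

import json, os

DEFAULT_ARCH = {"cores": 4, "macs_per_core": 1024, "sram_kb": 2048}   # built-in default architecture

def load_arch(path):
    # Import the hardware architecture file; if the parameter is absent or the file
    # does not exist, fall back to the compiler's built-in default architecture.
    if path and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return DEFAULT_ARCH

def build_stream(arch, layers):
    # Emit one back-to-back calculation step per layer, sized to keep the MAC units busy.
    total_macs = arch["cores"] * arch["macs_per_core"]
    steps = []
    for layer in layers:
        steps.append({
            "layer": layer["name"],
            "passes": -(-layer["macs"] // total_macs),   # ceiling division: MAC-array passes needed
            "prefetch_next_weights": True,                # hide data movement behind computation
        })
    return steps

layers = [{"name": "conv1", "macs": 118_013_952}, {"name": "fc1", "macs": 4_194_304}]
print(build_stream(load_arch(None), layers))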
In some embodiments, generating a continuous calculation flow according to the hardware architecture parameters and the network model includes: in response to generating a new calculation flow, obtaining the current computing-power usage of each time period and assigning the new calculation flow to the time period with the lowest current computing-power usage.
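This placement rule amounts to a greedy least-loaded assignment over time periods, as the short sketch below shows; the number of periods and the load units are illustrative.

# Greedy placement of a new calculation flow into the least-loaded time period.
# Slot layout and load units are illustrative, not from the patent.

time_slots = [0.85, 0.40, 0.65, 0.30, 0.90]   # current computing-power usage per period

def place_new_stream(slots, stream_load):
    target = min(range(len(slots)), key=lambda i: slots[i])   # period with the lowest usage
    slots[target] += stream_load                               # schedule the new flow there
    return target

print("assigned to period", place_new_stream(time_slots, 0.25))
print("updated usage:", time_slots)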
For a novel network or a user-defined network, the AI compiler of the invention provides interfaces for implementing user-defined networks and user-defined operators. Users can program custom operators and network definitions in the language they are most proficient in, for example Python or C++.
FIG. 2 is a flow chart illustrating the user-defined operators and networks provided by the present invention. As shown in FIG. 2, the flow is as follows: analyze the specific function and mathematical expression of the operator, make its inputs and outputs explicit, determine the development mode and the calculation interface to be used, and determine the file name, operator name and operator type of the operator implementation; then define the operator information and prototype separately, and implement the corresponding operator function in code. Operator compilation comprises: writing a Makefile for the implemented operator and compiling it with make. The operator test comprises: testing, with specific input data, whether the operator's output is consistent with the expected output; if so, proceed to the next step; otherwise check the operator definition and implementation code and redevelop until the test finally passes. The AI compiler is then recompiled, and the newly developed operator is automatically added to the operator library of the AI compiler. The user-defined network model is then converted with the regenerated AI compiler, an instruction set or configuration file that the hardware can execute directly is generated, and the network inference result is verified. Likewise, the next task is carried out according to the final inference result, or problems in the operator development process are checked.
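The register-implement-test loop just described can be summarized as in the sketch below, where an operator is defined with its name, type and implementation, checked against an expected output on specific input data, and only then kept in the operator library; the registry functions shown are a hypothetical stand-in, not the AI compiler's real interface.

import numpy as np

OPERATOR_LIBRARY = {}   # hypothetical stand-in for the AI compiler's operator library

def register_operator(name, op_type, func):
    # Define the operator information/prototype and add the implementation to the library.
    OPERATOR_LIBRARY[name] = {"type": op_type, "impl": func}

def test_operator(name, test_input, expected):
    # Test with specific input data whether the operator output matches the expected output.
    actual = OPERATOR_LIBRARY[name]["impl"](test_input)
    return np.allclose(actual, expected, atol=1e-6)

def leaky_relu(x, slope=0.1):
    # Example user-defined operator implementation.
    return np.where(x > 0, x, slope * x)

register_operator("leaky_relu", "activation", leaky_relu)

x = np.array([-2.0, 0.5, 3.0], dtype=np.float32)
ok = test_operator("leaky_relu", x, np.array([-0.2, 0.5, 3.0], dtype=np.float32))
print("operator test passed" if ok else "check the definition and implementation, then redevelop")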
A programming language is ultimately compiled into the corresponding machine instructions for execution by the hardware. However, the target platform of the machine code generated by the compilers of these languages is mainly the CPU, so a customized compiler is needed for a specific hardware platform so that the compiled code can be executed efficiently not only on general-purpose architecture platforms such as CPUs but also on the designed AI chip. The embodiment of the invention provides, based on LLVM, a customized compiler development function for the user's hardware architecture. The LLVM compiler architecture is divided mainly into a front end, a middle part and a back end. Front-end development mainly adds support for a language: LLVM analyzes the grammar of the language and generates the intermediate representation, LLVM IR. The middle part is the LLVM optimizer, which is independent of both the language and the target platform; neither the front end nor the middle part requires the user's attention, and the existing implementations can be used. The back end mainly generates the executable code for the target platform, and the AI compiler implements a customized LLVM back end supporting the specific AI chip ASIC.
FIG. 3 illustrates a flow diagram of the customized underlying compiler provided by the present invention. As shown in FIG. 3, the customized underlying compiler involves creating the target machine, target machine registration, creating the register set, instruction set selection, instruction scheduling, and JIT support. Creating the target machine means describing the characteristics of one's own target machine by creating a TargetMachine subclass. Target registration registers the target machine through LLVM's target registration interface. Creating the register set for the target machine hardware is done with TableGen, which generates code for register definitions, register aliases and register classes; a subclass inheriting from the TargetRegisterInfo class represents the information that facilitates register allocation and inter-register interaction. Instruction set selection converts device-independent IR instructions into device-dependent DAG (Directed Acyclic Graph) nodes. Instruction scheduling schedules and executes instructions efficiently by means of list scheduling, a greedy heuristic. JIT (Just-In-Time) support implements runtime compilation and execution of a called function or program segment by writing a subclass inheriting from the TargetJITInfo class.
The embodiment of the invention solves the interface problems of the different deep learning frameworks PyTorch, TensorFlow, Caffe and ONNX, and can provide a comprehensive and reliable quantization function. The invention provides comprehensive calculation optimization functions, including operator fusion, clipping and compression, sparse deployment, hardware-oriented adaptive optimization based on AutoML and reinforcement learning, and automatic selection of the optimal computation graph, so that the utilization rate of the MAC calculation units is maximized, the calculation speed is maximized and the latency is minimized. The invention can automatically generate, according to the hardware architecture and the network calculation flow, a calculation flow that maximally utilizes the AI device, so that the calculation efficiency approaches the ideal 100%. The invention provides users with several flexible development modes, such as the development of user-defined operators and user-defined networks; users can choose the development language they are most proficient in to implement operators and networks that the compiler does not yet support. As for extensibility, through the implementation of an LLVM back end, users can adapt the AI compiler to their own hardware, and a change of hardware architecture does not require extensive secondary development of the AI compiler, which saves users a great deal of research and development time and lets them focus on the design of the hardware architecture.
It should be particularly noted that the steps in the embodiments of the method for compiling an AI chip described above can be interleaved, replaced, added or deleted with respect to one another; therefore, methods for compiling an AI chip obtained by such reasonable permutations, combinations and transformations also belong to the scope of the present invention, and the scope of the present invention should not be limited to the embodiments.
In view of the above object, according to a second aspect of the embodiments of the present invention, there is provided a system for compiling an AI chip, including: a quantization module configured to quantize the network model of the AI chip and perform precision adjustment on the quantized network model based on the weights of the network model; a calculation module configured to distribute, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and a flow module configured to acquire hardware architecture parameters and generate a continuous calculation flow according to the hardware architecture parameters and the network model.
In some embodiments, the quantization module is configured to: perform linear quantization on the weights of the network model to generate first measurement data; generate a feature map of the network model and quantize the network model based on the feature map to generate second measurement data; and perform quantized inference using the first measurement data and the second measurement data, compare the inference result with the inference result of the original network model, and adjust the network model according to the comparison result.
In some embodiments, the calculation module is configured to: judge whether the size of the picture input into the network model exceeds a threshold value; and in response to the size of the picture input into the network model exceeding the threshold value, slice the picture and send the resulting sub-pictures to different AI chip cores.
In some embodiments, the calculation module is configured to: judge whether the number of channels of the picture input into the network model exceeds a second threshold value; and in response to the number of channels exceeding the second threshold value, split the weights of the network model and send the resulting sub-weights to different AI chip cores.
In some embodiments, the flow module is configured to: in response to generating a new calculation flow, obtain the current computing-power usage of each time period and assign the new calculation flow to the time period with the lowest current computing-power usage.
In some embodiments, the system further comprises: a retraining module configured to retrain the network model so as to clip and compress the network model.
In some embodiments, the system further comprises: a simplification module configured to combine like terms among the calculations in the computation graph of the network model so as to simplify the calculation process.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to perform the following steps: S1, quantizing the network model of the AI chip, and performing precision adjustment on the quantized network model based on the weights of the network model; S2, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and S3, acquiring hardware architecture parameters, and generating a continuous calculation flow according to the hardware architecture parameters and the network model.
In some embodiments, performing precision adjustment on the quantized network model based on the weights of the network model includes: performing linear quantization on the weights of the network model to generate first measurement data; generating a feature map of the network model and quantizing the network model based on the feature map to generate second measurement data; and performing quantized inference using the first measurement data and the second measurement data, comparing the inference result with the inference result of the original network model, and adjusting the network model according to the comparison result.
In some embodiments, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model includes: judging whether the size of the picture input into the network model exceeds a threshold value; and in response to the size of the picture input into the network model exceeding the threshold value, slicing the picture and sending the resulting sub-pictures to different AI chip cores.
In some embodiments, distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model includes: judging whether the number of channels of the picture input into the network model exceeds a second threshold value; and in response to the number of channels exceeding the second threshold value, splitting the weights of the network model and sending the resulting sub-weights to different AI chip cores.
In some embodiments, generating a continuous calculation flow according to the hardware architecture parameters and the network model includes: in response to generating a new calculation flow, obtaining the current computing-power usage of each time period and assigning the new calculation flow to the time period with the lowest current computing-power usage.
In some embodiments, the steps further comprise: retraining the network model so as to clip and compress the network model.
In some embodiments, the steps further comprise: combining like terms among the calculations in the computation graph of the network model so as to simplify the calculation process.
Fig. 4 is a schematic diagram of a hardware structure of an embodiment of the computer device for compiling the AI chip according to the present invention.
Taking the apparatus shown in fig. 4 as an example, the apparatus includes a processor 301 and a memory 302, and may further include: an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and fig. 4 illustrates the connection by a bus as an example.
The memory 302, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for compiling an AI chip in the embodiments of the present application. The processor 301 executes the various functional applications and data processing of the server, that is, implements the method for compiling an AI chip of the above method embodiments, by running the non-volatile software programs, instructions and modules stored in the memory 302.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the method of compiling the AI chip, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, which may be connected to a local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 303 may receive information such as a user name and a password that are input. The output means 304 may comprise a display device such as a display screen.
One or more program instructions/modules corresponding to the method of compiling the AI chip are stored in the memory 302, and when executed by the processor 301, perform the method of compiling the AI chip in any of the above-described method embodiments.
Any embodiment of a computer device for executing the method for compiling the AI chip can achieve the same or similar effects as any corresponding method embodiment.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
Finally, it should be noted that, as those of ordinary skill in the art will appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the related hardware. The program of the method for compiling an AI chip can be stored in a computer-readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM) or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples; within the idea of the embodiments of the invention, the technical features of the above embodiments or of different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like made within the spirit and principles of the embodiments of the present invention shall be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method of compiling an AI chip, comprising the steps of:
quantizing the network model of the AI chip, and performing precision adjustment on the quantized network model based on the weights of the network model;
in the precision-adjusted network model, distributing the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and
acquiring hardware architecture parameters, and generating a continuous calculation flow according to the hardware architecture parameters and the network model.
2. The method of claim 1, wherein performing precision adjustment on the quantized network model based on the weights of the network model comprises:
performing linear quantization on the weights of the network model to generate first measurement data;
generating a feature map of the network model and quantizing the network model based on the feature map to generate second measurement data; and
performing quantized inference using the first measurement data and the second measurement data, comparing the inference result with the inference result of the original network model, and adjusting the network model according to the comparison result.
3. The method of claim 1, wherein distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model comprises:
judging whether the size of the picture input into the network model exceeds a threshold value; and
in response to the size of the picture input into the network model exceeding the threshold value, slicing the picture and sending the resulting sub-pictures to different AI chip cores.
4. The method of claim 1, wherein distributing, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model comprises:
judging whether the number of channels of the picture input into the network model exceeds a second threshold value; and
in response to the number of channels exceeding the second threshold value, splitting the weights of the network model and sending the resulting sub-weights to different AI chip cores.
5. The method of claim 1, wherein generating a continuous calculation flow according to the hardware architecture parameters and the network model comprises:
in response to generating a new calculation flow, obtaining the current computing-power usage of each time period and assigning the new calculation flow to the time period with the lowest current computing-power usage.
6. The method of claim 1, further comprising:
the network model is retrained to crop and compress the network model.
7. The method of claim 1, further comprising:
combining like terms among the calculations in the computation graph of the network model so as to simplify the calculation process.
8. A system for compiling an AI chip, comprising:
a quantization module configured to quantize the network model of the AI chip and perform precision adjustment on the quantized network model based on the weights of the network model;
a calculation module configured to distribute, in the precision-adjusted network model, the calculation of the convolutional neural network to different MAC calculation units of the AI chip so as to optimize the calculation flow of the network model; and
a flow module configured to acquire hardware architecture parameters and generate a continuous calculation flow according to the hardware architecture parameters and the network model.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202011083320.XA 2020-10-12 2020-10-12 Method, system, device and medium for compiling AI chip Withdrawn CN112232497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083320.XA CN112232497A (en) 2020-10-12 2020-10-12 Method, system, device and medium for compiling AI chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083320.XA CN112232497A (en) 2020-10-12 2020-10-12 Method, system, device and medium for compiling AI chip

Publications (1)

Publication Number Publication Date
CN112232497A true CN112232497A (en) 2021-01-15

Family

ID=74112076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083320.XA Withdrawn CN112232497A (en) 2020-10-12 2020-10-12 Method, system, device and medium for compiling AI chip

Country Status (1)

Country Link
CN (1) CN112232497A (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12020476B2 (en) 2017-03-23 2024-06-25 Tesla, Inc. Data synthesis for autonomous control systems
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US12086097B2 (en) 2017-07-24 2024-09-10 Tesla, Inc. Vector computational unit
US11681649B2 (en) 2017-07-24 2023-06-20 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11797304B2 (en) 2018-02-01 2023-10-24 Tesla, Inc. Instruction set architecture for a vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US12079723B2 (en) 2018-07-26 2024-09-03 Tesla, Inc. Optimizing neural network structures for embedded systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US11983630B2 (en) 2018-09-03 2024-05-14 Tesla, Inc. Neural networks for embedded devices
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11908171B2 (en) 2018-12-04 2024-02-20 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US12014553B2 (en) 2019-02-01 2024-06-18 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
CN113918507A (en) * 2021-12-09 2022-01-11 之江实验室 Method and device for adapting deep learning framework to AI acceleration chip
CN114004352A (en) * 2021-12-31 2022-02-01 杭州雄迈集成电路技术股份有限公司 Simulation implementation method, neural network compiler and computer readable storage medium
US12136030B2 (en) 2023-03-16 2024-11-05 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
CN117851270A (en) * 2024-03-07 2024-04-09 中国电子科技集团公司第十五研究所 Method and device for testing system-on-chip compiler, electronic equipment and storage medium
CN117851270B (en) * 2024-03-07 2024-05-03 中国电子科技集团公司第十五研究所 Method and device for testing system-on-chip compiler, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112232497A (en) Method, system, device and medium for compiling AI chip
CN107563512B (en) Data processing method, device and storage medium
CN115659281B (en) Method and device for fusing adaptive acceleration operators
CN110058883A (en) A kind of CNN accelerated method and system based on OPU
CN113703775B (en) Compiling method, compiling device, compiling equipment and storage medium
CN109656544B (en) Cloud service API (application program interface) adaptation method based on execution path similarity
CN108786112B (en) Application scene configuration method, device and storage medium
CN111527501A (en) Chip adaptation determining method and related product
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
CN110750298B (en) AI model compiling method, equipment and storage medium
US20220172044A1 (en) Method, electronic device, and computer program product for deploying machine learning model
CN113204373A (en) Operation method, device and related product
CN115829006A (en) Compiling method and device of neural network model, electronic equipment and storage medium
CN115525436A (en) Model deployment and operation method and device, offline analysis tool and electronic equipment
CN117196000A (en) Edge side model reasoning acceleration method for containerized deployment
CN113885845B (en) Calculation map generation method, system, equipment and medium of deep learning compiler
Grimaldi et al. Optimality assessment of memory-bounded convnets deployed on resource-constrained risc cores
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN110968404B (en) Equipment data processing method and device
CN114936015A (en) Deep learning compiler based on hardware computation graph
CN112015426B (en) Code management method, device and equipment
CN116560666B (en) AI front end unified computing method, device and medium based on multi-level code generation
CN112001494A (en) Method for realizing support of FPGA (field programmable Gate array) back-end equipment by nGraph framework
CN117271057A (en) Large model deployment method, device and product based on server non-perception calculation
CN114217881B (en) Task unloading method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210115

WW01 Invention patent application withdrawn after publication