CN118051779B - Automatic parameter searching method and device for large model training and electronic equipment - Google Patents
- Publication number: CN118051779B (application CN202410438532.7A)
- Authority: CN (China)
- Legal status: Active (assumed by Google; not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention relates to the technical field of deep learning, and in particular to an automatic parameter searching method and device for large model training and an electronic device. The method comprises the following steps: acquiring a parameter configuration file that comprises a large model training framework name, a plurality of parameters, and a parameter interval for each parameter; determining a target model training framework according to the large model training framework name, and determining training flows for all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; and starting the training flows of all configuration combinations, and determining, based on evaluation indexes, the optimal parameter combination for large model training from the training results of those flows. The target model training framework thus trains the parameter configuration combinations by enumeration to obtain the optimal configuration, solving the problem that manually determining the optimal parameter configuration is tedious and time-consuming and lengthens the model development cycle, improving the efficiency with which a user determines the optimal parameter configuration, and reducing development cost.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to an automatic parameter searching method and device for large model training and electronic equipment.
Background
Large models are neural network models with a very large number of parameters (typically one billion or more) and are widely used in fields such as images, text, and audio. Before large model training frameworks appeared, users had to implement the model structure themselves and, when the parameter count was large, also implement the model's parallel strategy themselves in order to speed up training and reduce GPU memory usage. However, developing a model parallel strategy demands considerable expertise: an inefficient distributed training strategy can greatly reduce training efficiency or even corrupt training results, and hand-implementing a parallel strategy for each model tends to yield code with poor maintainability and scalability. Open source large model frameworks that support efficient distributed training therefore reduce the user's cost of developing parallel strategies, and such frameworks are an important research direction in large model training.
In the related art, large model training frameworks supporting efficient distributed training include: Megatron-LM, a large model training framework built on the deep learning framework PyTorch; Megatron-DeepSpeed, an open source large model training framework that adopts the Zero Redundancy Optimizer (ZeRO) memory optimization technique; and large model training frameworks that integrate seamlessly with the mainstream deep learning framework PyTorch.
However, determining the optimal parameter configuration with such a framework remains tedious and time-consuming, and whenever the user changes the cluster topology or machine model, the optimal training configuration must be searched for again by hand, which lengthens the model development cycle. This problem needs to be solved.
Disclosure of Invention
The invention provides an automatic parameter searching method and device for large model training, and an electronic device, which solve the problem that the current process of determining the optimal parameter configuration is tedious and time-consuming and thus lengthens the model development cycle, improve the efficiency with which a user determines the optimal parameter configuration, and reduce development cost.
To achieve the above object, an embodiment of a first aspect of the present invention provides an automatic parameter searching method for large model training, including the following steps:
Acquiring a parameter configuration file, wherein the parameter configuration file comprises a large model training framework name, a plurality of parameters to be permuted and combined, and a parameter interval for each parameter, and the parameters comprise model structure parameters and parallel training parameters;
Determining a target model training framework according to the large model training framework name, and determining training flows for all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter;
And starting the training flows of all the configuration combinations, and determining, based on evaluation indexes, the optimal parameter combination for large model training from the training results of the training flows of all the configuration combinations.
According to one embodiment of the present invention, after the parameter configuration file is obtained, the method further includes:
Identifying a target parameter in the plurality of parameters in the parameter configuration file for which no parameter interval is given;
And acquiring a default parameter interval of the target parameter, and taking the default parameter interval as a parameter interval of the target parameter.
According to one embodiment of the present invention, after determining the target model training framework according to the large model training framework name, the method further comprises:
Checking, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions;
If the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions, raising an error prompt for the incompatible parameters.
According to one embodiment of the present invention, after performing the error reporting reminder for the incompatible parameter, the method further includes:
receiving a parameter modification instruction fed back by a user aiming at the incompatible parameters;
Modifying the incompatible parameter based on the parameter modification instruction.
According to one embodiment of the present invention, the determining a training process of all configuration combinations according to the target model training framework, the plurality of parameters and the parameter interval of each parameter includes:
acquiring the number of iterations for each training run from the parameter configuration file;
Determining the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter;
And determining the training flows of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
According to one embodiment of the present invention, the determining, based on the evaluation index, an optimal parameter combination for training the large model from training results of the training flows of all configuration combinations includes:
acquiring, from the parameter configuration file, the number of optimal parameter combinations to be retained;
acquiring an evaluation index value of each configuration combination based on the training result;
and determining the optimal parameter combination based on the number of the optimal parameter combinations and the evaluation index value of each configuration combination.
According to one embodiment of the present invention, when the training process of all configuration combinations is started, the method further includes:
Recording the configuration combinations for which training failed to start.
According to the automatic parameter searching method for large model training provided by the embodiment of the invention, by acquiring a parameter configuration file comprising a large model training framework name, a plurality of parameters, and a parameter interval for each parameter, a target model training framework can be determined according to the framework name; training flows for all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter intervals; the training flows of all configuration combinations are started; and the optimal parameter combination for large model training is determined, based on evaluation indexes, from the training results of those flows. The target model training framework thus trains the parameter configuration combinations by enumeration to obtain the optimal configuration, which solves the problem that manually determining the optimal parameter configuration is tedious and time-consuming and lengthens the model development cycle, improves the efficiency with which a user determines the optimal parameter configuration, and reduces development cost.
To achieve the above object, a second aspect of the present invention provides an automatic parameter searching apparatus for large model training, comprising:
an acquisition module, configured to acquire a parameter configuration file, wherein the parameter configuration file comprises a large model training framework name, a plurality of parameters to be permuted and combined, and a parameter interval for each parameter, and the parameters comprise model structure parameters and parallel training parameters;
a first determining module, configured to determine a target model training framework according to the large model training framework name, and to determine training flows for all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter;
and a second determining module, configured to start the training flows of all the configuration combinations and to determine, based on evaluation indexes, the optimal parameter combination for large model training from the training results of the training flows of all the configuration combinations.
According to one embodiment of the present invention, after the parameter configuration file is acquired, the acquiring module is further configured to:
Identifying a target parameter in the plurality of parameters in the parameter configuration file for which no parameter interval is given;
And acquiring a default parameter interval of the target parameter, and taking the default parameter interval as a parameter interval of the target parameter.
According to one embodiment of the present invention, after the target model training framework is determined according to the large model training framework name, the first determining module further includes:
a verification unit, configured to check, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions;
and an error reporting unit, configured to raise an error prompt for the incompatible parameters when the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions.
According to one embodiment of the present invention, after performing the error reporting on the incompatible parameter, the error reporting unit is further configured to:
receiving a parameter modification instruction fed back by a user aiming at the incompatible parameters;
Modifying the incompatible parameter based on the parameter modification instruction.
According to one embodiment of the present invention, the first determining module is specifically configured to:
acquiring the number of iterations for each training run from the parameter configuration file;
Determining the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter;
And determining the training flows of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
According to an embodiment of the present invention, the second determining module is specifically configured to:
acquiring, from the parameter configuration file, the number of optimal parameter combinations to be retained;
acquiring an evaluation index value of each configuration combination based on the training result;
and determining the optimal parameter combination based on the number of the optimal parameter combinations and the evaluation index value of each configuration combination.
According to an embodiment of the present invention, when the training process of all configuration combinations is started, the second determining module is further configured to:
Recording the configuration combinations for which training failed to start.
According to the automatic parameter searching device for large model training provided by the embodiment of the invention, by acquiring a parameter configuration file comprising a large model training framework name, a plurality of parameters, and a parameter interval for each parameter, a target model training framework can be determined according to the framework name; training flows for all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter intervals; the training flows of all configuration combinations are started; and the optimal parameter combination for large model training is determined, based on evaluation indexes, from the training results of those flows. The target model training framework thus trains the parameter configuration combinations by enumeration to obtain the optimal configuration, which solves the problem that manually determining the optimal parameter configuration is tedious and time-consuming and lengthens the model development cycle, improves the efficiency with which a user determines the optimal parameter configuration, and reduces development cost.
To achieve the above object, an embodiment of a third aspect of the present invention provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor executes the program to implement the automatic parameter searching method for large model training as described in the above embodiments.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the automatic parameter searching method for large model training as described in the above embodiments.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for automatic searching of parameters for large model training according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of automatic searching for parameters for large model training in accordance with another embodiment of the present invention;
FIG. 3 is a block diagram of an automatic parameter searching apparatus for large model training according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes the automatic parameter searching method and device for large model training, and the electronic device, according to embodiments of the invention, with reference to the accompanying drawings.
Before introducing the automatic parameter searching method for large model training provided by the embodiment of the invention, the large model training frameworks in the related art that implement efficient parallel strategies are briefly introduced.
In the related art: (1) the large model training framework Megatron-LM, built on the deep learning framework PyTorch, implements efficient parallel strategies including model parallelism, data parallelism, and pipeline parallelism, and also supports mixed precision training, which improves computational performance and reduces memory consumption; (2) the mainstream open source large model training framework Megatron-DeepSpeed adopts the Zero Redundancy Optimizer memory optimization technique to reduce the GPU memory footprint of model training while remaining scalable as cluster machine nodes are added; (3) such large model training frameworks generally integrate seamlessly with the mainstream deep learning framework PyTorch, provide efficient GPU memory optimization techniques and model parallel policy interfaces, and let the user adjust suitable model parameters and parallel strategies through the provided interfaces.
However, current mainstream large model training frameworks expose their interfaces to users directly: in each training run the user manually adjusts the model structure, parameter count, and parallel strategy. To obtain the highest training efficiency on a specific cluster, the user must try many permutations of parameter configurations, wait for each run to produce training efficiency data, and record it; only after all permutations have been tried can a suitable training configuration be selected from the recorded data. This process is very tedious and time-consuming, and whenever the user switches cluster topology or machine model, the search for the best training configuration must be repeated by hand, lengthening the model development cycle.
To address these problems, an embodiment of the invention provides an automatic parameter searching method for large model training. It performs enumeration training over the provided parameter configuration combinations through a target model training framework, collects indexes such as GPU memory usage and training speed in real time for each configuration combination, and sorts and filters the results of the different combinations, helping the user efficiently find the optimal parameter configuration and parallel strategy configuration for a given hardware topology and model structure, and reducing the user's cost of searching for a training configuration.
FIG. 1 is a flow chart of a method of automatic search of parameters for large model training in accordance with one embodiment of the present invention.
Illustratively, as shown in FIG. 1, the automatic parameter searching method for large model training includes the following steps:
In step S101, a parameter configuration file is acquired, where the parameter configuration file includes a large model training framework name, a plurality of parameters to be permuted and combined, and a parameter interval for each parameter, and the parameters include model structure parameters and parallel training parameters.
It will be appreciated that the parameter configuration file may be provided by the user and includes a large model training framework name, a plurality of parameters to be permuted and combined, and a parameter interval for each parameter, where the parameters include model structure parameters and parallel training parameters. A large model training framework is a tool specialized for training large-scale deep learning models; it supports efficient large-scale parallel computation and can handle large-scale data and models. Mainstream frameworks currently include TensorFlow (developed by Google, supporting distributed training, with a strong ecosystem and extensive community support), PyTorch (developed by Facebook, with a compact, easy-to-use API and flexible dynamic graph features), PaddlePaddle (developed by Baidu and its open source community, supporting a variety of hardware platforms and application scenarios), and so on. To achieve a good search result under a specific model configuration, the user may provide a plurality of parameters to be permuted and combined. Common parameters include:
(1) model structure parameters, such as num-layers (number of layers of the large language model), hidden-size (hidden layer dimension), seq-length (input sequence length), micro-batch-size (number of samples in a single training step), and train-iters (number of training iterations);
(2) parallel training parameters, including but not limited to nproc-per-node (number of processes per node, i.e., the number of GPUs (Graphics Processing Units) used), tensor-model-parallel-size (tensor model parallel degree, i.e., the degree to which model parameters are split across GPUs), pipeline-model-parallel-size (pipeline parallel degree), sequence-parallel, and so on. In addition, the user must also provide a parameter interval (a value range or list) for each parameter, such as seq-length = {1024, 2048, 4096}.
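As a concrete illustration of the contents described above, the following is a minimal sketch of a parsed parameter configuration file, written as a Python dict. The patent does not fix a file format, and every field name and value here is a hypothetical example (the parameter names follow Megatron-LM command-line conventions):

```python
# Hypothetical parsed parameter configuration file. All names and values are
# illustrative, not prescribed by the patent.
param_config = {
    "framework": "Megatron-LM",   # large model training framework name
    "total_iters": 50,            # number of iterations for each trial run
    "top_k": 3,                   # number of optimal combinations to retain
    "parameters": {
        # model structure parameters: each maps to its parameter interval
        "num-layers": [24, 32],
        "hidden-size": [2048, 4096],
        "seq-length": [1024, 2048, 4096],
        # parallel training parameters
        "tensor-model-parallel-size": [1, 2, 4],
        "pipeline-model-parallel-size": [1, 2],
    },
}

# Every parameter interval must be a non-empty list of candidate values.
assert all(len(v) > 0 for v in param_config["parameters"].values())
```

In practice such a file would likely be YAML or JSON on disk; the dict above stands in for its parsed contents.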
In step S102, a target model training framework is determined according to the large model training framework name, and training flows for all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter.
That is, once the large model training framework name has been provided in step S101, the target model training framework can be determined from that name, and the training flows for all configuration combinations can be determined based on the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the target training framework can automatically enumerate the configuration combinations of all parameters, each combination serving as the configuration for a single training run.
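Determining the target model training framework from its name can be sketched as a simple registry lookup. The registry contents and entry-point script names below are assumptions for illustration, not part of the patent:

```python
# Hypothetical registry mapping framework names to training entry points.
# Both the registry contents and the script names are assumptions.
FRAMEWORK_REGISTRY = {
    "Megatron-LM": "pretrain_gpt.py",
    "Megatron-DeepSpeed": "pretrain_gpt.py",
}

def resolve_framework(name: str) -> str:
    """Return the training entry point for a named framework, or raise."""
    try:
        return FRAMEWORK_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unsupported large model training framework: {name!r}")
```

An unknown framework name fails fast with an explicit error rather than silently falling back to a default.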
For further understanding, the following details how the training process for all configuration combinations is determined from the goal model training framework, the plurality of parameters, and the parameter intervals for each parameter.
As one possible implementation, determining a training process of all configuration combinations according to the target model training framework, the plurality of parameters and the parameter interval of each parameter includes: acquiring iteration times of each training from a parameter configuration file; determining configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; the training process for all configuration combinations is determined based on the number of iterations and the configuration combinations for all parameters.
Specifically, the number of iterations for each training run is given by the total-iters (total training iteration count) parameter in the parameter configuration file, and the configuration combinations of all parameters can be determined from the plurality of parameters and the parameter interval of each parameter. For example, suppose the parameter interval of seq-length is {1024, 2048, 4096}. When enumerating configuration combinations, each of the three values 1024, 2048, and 4096 is combined in order with the candidate values of the other parameters (num-layers, hidden-size, ..., pipeline-model-parallel-size, sequence-parallel), and each resulting tuple serves as the configuration for one training run. After the training runs covering one parameter's values finish, the combinations led by the next parameter are enumerated in the same way and the next training runs are started. In this way the configuration combinations of all parameters are determined from the plurality of parameters and the parameter interval of each parameter, and once all combinations are obtained, the training flows for all configuration combinations are determined based on the iteration count and the combinations, until every configuration combination has been enumerated.
It should be noted that the total number of training runs is the cardinality of the Cartesian product of the parameter intervals of all parameters (i.e., the number of elements in that product). The Cartesian product, also called the direct product, is defined in mathematics as follows: given two sets X and Y, each element of X paired with each element of Y forms an ordered pair, and the set of all such ordered pairs is the Cartesian product of X and Y. For example, if X = {a, b} and Y = {0, 1, 2}, their Cartesian product is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}. The Cartesian product of three or more sets is defined analogously and is not repeated here.
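The enumeration of all configuration combinations described above is exactly the Cartesian product of the parameter intervals. A minimal Python sketch using `itertools.product` (the parameters and their intervals are illustrative):

```python
from itertools import product

# Enumerate every configuration combination as the Cartesian product of the
# parameter intervals. The parameters and values here are toy examples.
intervals = {
    "seq-length": [1024, 2048, 4096],
    "tensor-model-parallel-size": [1, 2],
}

names = list(intervals)
combinations = [dict(zip(names, values)) for values in product(*intervals.values())]

# The total number of training runs equals the cardinality of the product:
# 3 values * 2 values = 6 combinations in this toy example.
assert len(combinations) == 6
```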
In step S103, the training flows of all the configuration combinations are started, and the optimal parameter combination for large model training is determined from the training results of the training flows of all the configuration combinations based on the evaluation index.
That is, based on the provided parameters and the configuration combinations of all parameters, the training can be split into a plurality of parallel tasks, each corresponding to one specific configuration combination. The combinations are distributed to multiple computing nodes for parallel processing, and each node performs large model training with its assigned configuration combination. After every node completes training, the training results of the training flows of all configuration combinations are available, and the optimal parameter combination for large model training can be determined by comprehensively evaluating all training results with suitable evaluation indexes (such as accuracy or cross-validation loss).
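The step of starting one training flow per configuration combination while recording the combinations whose training fails to start might be sketched as follows. The command template and the injectable `start_fn` launcher interface are assumptions for illustration, not the patent's actual implementation:

```python
# Sketch: start every configuration combination's training flow and record
# the combinations for which training fails to start. All names are
# illustrative assumptions.
def build_command(entry_point: str, combo: dict) -> str:
    """Build a hypothetical launch command for one configuration combination."""
    flags = " ".join(f"--{k} {v}" for k, v in combo.items())
    return f"python {entry_point} {flags}"

def launch_all(entry_point, combos, start_fn):
    """Start every combination; return (results, failed_combinations)."""
    results, failed = {}, []
    for i, combo in enumerate(combos):
        try:
            results[i] = start_fn(build_command(entry_point, combo))
        except Exception:
            failed.append(combo)  # record the training-start failure
    return results, failed
```

In a real deployment, `start_fn` would dispatch the command to a compute node (e.g. via `subprocess` or a cluster scheduler); keeping it injectable lets the control flow be exercised in isolation.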
How to determine the optimal parameter combination for large model training from the training results of the training flows of all configuration combinations based on the evaluation index is described in detail below.
As one possible implementation, determining an optimal parameter combination for large model training from training results of training flows of all configuration combinations based on the evaluation index includes: acquiring the number of the optimal parameter combinations to be reserved from the parameter configuration file; acquiring an evaluation index value of each configuration combination based on the training result; an optimal parameter combination is determined based on the number of optimal parameter combinations and the evaluation index value of each configuration combination.
Specifically, in connection with the illustration of fig. 2, each time a training instance is started, the indexes recorded in the training process may be written into a report file through the report module of the target model training framework. The report file records the training results of the configuration combinations of all the parameters provided by the user, and the training results are measured by the evaluation indexes provided by the user in the parameter configuration file. When the target model training framework is used for model training, it can start a process in the last step of training and obtain the evaluation index values of each configuration combination (such as training time, video memory occupation, and MFU (Model FLOPs Utilization, model calculation power utilization rate)) from the standard output through logs and regular expressions. Some evaluation index values, such as video memory occupation, can be obtained directly through monitoring tools such as NVTOP (NVidia TOP, an NVIDIA graphics card monitoring tool); if multiple graphics cards are used, the average value over the cards needs to be calculated. Evaluation index values that cannot be obtained directly, such as TFLOPS (Tera Floating Point Operations Per Second, i.e., one trillion floating point operations per second) and MFU, can be calculated by the target model training framework. For evaluation indexes customized by the user (e.g., defined in the parameter configuration file), the evaluation index values can be obtained by invoking the user-implemented index calculation mode through the target model training framework.
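The extraction of index values from standard output via regular expressions, and the averaging over multiple cards, can be sketched as below; the log-line format and field names are hypothetical, not the actual output of any particular framework.

```python
import re

# A hypothetical log line from a training framework's standard output.
log = "iteration 100 | elapsed time per iteration (ms): 412.5 | mem (GB): 31.2"

# Map each evaluation index name to a regular expression that captures
# its numeric value from the log line.
patterns = {
    "step_time_ms": r"elapsed time per iteration \(ms\):\s*([\d.]+)",
    "mem_gb": r"mem \(GB\):\s*([\d.]+)",
}

metrics = {}
for name, pattern in patterns.items():
    m = re.search(pattern, log)
    if m:
        metrics[name] = float(m.group(1))

# Per-card memory readings (e.g. from a monitoring tool) are averaged
# when multiple graphics cards are in use.
per_card_mem = [31.2, 30.8, 31.0, 31.4]
avg_mem = sum(per_card_mem) / len(per_card_mem)
```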
It will be appreciated that after all the parameter configuration combinations have been enumerated and trained, a report of the training results needs to be output to the user. If the training results of the training flows of all the configuration combinations were output directly, the report would be too lengthy, and it would be difficult to determine the optimal parameter combination for large model training from among so many training results; therefore, the number of optimal parameter combinations to be retained must also be determined. Based on this number, the target model training framework can sort and filter the training results of the training flows of all the configuration combinations according to the evaluation indexes the user is interested in, obtain a screened set of optimal configuration combinations, and then output a report of the training results of that set to the user, so that the user can finally determine the optimal parameter combination based on the number of optimal parameter combinations and the evaluation index value of each configuration combination.
For example, the training results of the training flows of all configuration combinations may be sorted by the training time of one step and further filtered by video memory occupation or hardware utilization, so as to screen out the training results of the configuration combinations that meet the requirements and thereby determine the optimal parameter combination. For a user-customized index, the user can specify how the training results are filtered and sorted when implementing the index calculation mode.
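The sorting and top-k filtering described above can be sketched as follows; the result records, the memory budget, and the value of k are illustrative assumptions.

```python
# Hypothetical per-combination training results (names and values invented).
results = [
    {"config": "A", "step_time": 1.8, "mem_gb": 30.0},
    {"config": "B", "step_time": 1.2, "mem_gb": 42.0},
    {"config": "C", "step_time": 1.5, "mem_gb": 28.0},
]

top_k = 2          # number of optimal parameter combinations to retain
mem_limit_gb = 40.0  # illustrative video-memory budget

# Filter out combinations exceeding the memory budget, sort the rest by
# per-step training time, and keep only the top-k for the report.
kept = sorted((r for r in results if r["mem_gb"] <= mem_limit_gb),
              key=lambda r: r["step_time"])[:top_k]
```

Combination B is dropped by the memory filter even though it has the fastest step time, so the report contains C and A in that order.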
Furthermore, in some embodiments, when the training flows of all configuration combinations are started, the method further comprises: recording the configuration combinations whose training failed to start.
That is, if the training flow of a configuration combination fails to start, for example because too many parameters or an unreasonable parallel policy configuration leads to insufficient machine video memory, the configuration combination whose training failed to start can be recorded. This facilitates subsequent screening of configurations, since the parameters that cause insufficient video memory or other training failures can be removed directly.
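Recording start failures can be sketched as below; the `launch` function and its simulated out-of-memory condition are hypothetical stand-ins for starting a real training instance.

```python
def launch(config):
    """Stand-in for starting one training instance; raises when the
    configuration cannot fit in GPU memory (condition simulated here)."""
    if config["mbs"] > 4:  # illustrative out-of-memory threshold
        raise MemoryError("insufficient GPU video memory")
    return "started"

failed = []
for cfg in [{"mbs": 2}, {"mbs": 8}]:
    try:
        launch(cfg)
    except MemoryError:
        # Record the combination that failed to start so it can be
        # excluded from subsequent screening.
        failed.append(cfg)
```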
Further, in some embodiments, after obtaining the parameter configuration file, the method further comprises: identifying, among the plurality of parameters in the parameter configuration file, a target parameter for which no parameter interval is given; and acquiring a default parameter interval of the target parameter and taking the default parameter interval as the parameter interval of the target parameter.
If the user does not provide a parameter interval for a certain parameter, or does not list the parameter in the parameter configuration file, the target parameter for which no parameter interval is given can be identified among the plurality of parameters, the default parameter interval used when the user starts training can be obtained, and the default parameter interval can be used as the parameter interval of the target parameter.
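Substituting default intervals for missing ones can be sketched as follows; the parameter names and default values are illustrative assumptions, not defaults of any actual framework.

```python
# Intervals from the user's configuration file; "num-layers" is listed
# but has no interval, so the framework default is substituted for it.
user_intervals = {"micro-batch-size": [1, 2, 4], "num-layers": None}

# Hypothetical framework defaults used when training is started.
default_intervals = {"num-layers": [24], "micro-batch-size": [1]}

# Keep user-supplied intervals; fall back to the default only where the
# user gave none.
resolved = {
    name: interval if interval else default_intervals[name]
    for name, interval in user_intervals.items()
}
```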
Further, in some embodiments, after determining the target model training frame from the large model training frame name, further comprising: checking whether incompatible parameters which do not meet preset compatibility conditions exist in the plurality of parameters or not by utilizing a target model training framework; if the incompatible parameters which do not meet the preset compatible conditions exist in the plurality of parameters, error reporting reminding is conducted on the incompatible parameters.
It can be understood that, in order to be compatible with a plurality of mainstream large model training frameworks, after the target model training framework is determined according to the large model training framework name, the target model training framework can be used to check whether incompatible parameters that do not meet preset compatibility conditions exist among the plurality of parameters. When the preset compatibility conditions are defined and implemented, they should be reasonable and necessary, and they can be adjusted and updated at any time as technical and business requirements change; no specific limitation is made here. When incompatible parameters that do not meet the preset compatibility conditions exist among the plurality of parameters, an error report can be issued for the incompatible parameters, so that the user learns of the situation in time.
For example, when the stage parameter zero-stage of the zero redundancy optimizer ZeRO is 2, it is incompatible with the pipeline parallelism parameter pipeline-model-parallel-size; when such incompatible parameters are set, an error should be reported in advance at the framework level to alert the user.
Further, in some embodiments, after performing the error notification for the incompatible parameter, the method further includes: receiving a parameter modification instruction fed back by a user aiming at incompatible parameters; the incompatible parameters are modified based on the parameter modification instructions.
That is, after the error report is issued for the incompatible parameters, the user may issue a parameter modification instruction for the incompatible parameters to modify them, together with the desired indexes, before the enumeration of the training parameters is started. For example, if the stage parameter zero-stage of the zero redundancy optimizer is set to 2, the pipeline parallelism parameter pipeline-parallel-size may only be set to 0, indicating that pipeline parallelism is not turned on.
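A framework-level check of this single compatibility rule can be sketched as below; the function name and the dictionary-based parameter representation are illustrative assumptions, and only the ZeRO-stage-2 versus pipeline-parallelism rule from the example above is encoded.

```python
def check_compatibility(params):
    """Raise early, at the framework level, when a known-incompatible
    pair of parameters is configured (rule taken from the example above)."""
    if params.get("zero-stage") == 2 and params.get("pipeline-parallel-size", 0) > 0:
        raise ValueError(
            "zero-stage=2 is incompatible with pipeline parallelism; "
            "set pipeline-parallel-size to 0 or change the ZeRO stage")
    return True

# Passes: pipeline parallelism is not turned on.
check_compatibility({"zero-stage": 2, "pipeline-parallel-size": 0})
```

A real framework would hold a table of such rules and run every configuration combination through it before enumeration starts, so incompatible combinations never consume compute time.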
According to the automatic parameter searching method for large model training provided by the embodiment of the invention, by obtaining a parameter configuration file that includes a large model training framework name, a plurality of parameters, and a parameter interval of each parameter, a target model training framework can be determined according to the large model training framework name; the training flows of all configuration combinations can be determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the training flows of all configuration combinations can be started; and the optimal parameter combination for large model training can be determined from the training results of the training flows of all configuration combinations based on the evaluation indexes. In this way, the optimal parameter configuration combination can be obtained through enumeration training of the parameter configuration combinations by the target model training framework, which solves the problem that the current tedious and time-consuming process of determining the optimal parameter configuration leads to a long model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development costs.
The parameter automatic searching device for large model training according to the embodiment of the invention is described next with reference to the accompanying drawings.
FIG. 3 is a block schematic diagram of an automatic parameter search apparatus for large model training according to one embodiment of the present invention.
As shown in fig. 3, the parameter automatic search device 10 for large model training includes: the acquisition module 100, the first determination module 200 and the second determination module 300.
The acquiring module 100 is configured to acquire a parameter configuration file, where the parameter configuration file includes a name of a large model training frame, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters include a model structure parameter and a parallel training parameter;
the first determining module 200 is configured to determine a target model training frame according to the name of the large model training frame, and determine training flows of all configuration combinations according to the target model training frame, the plurality of parameters and a parameter interval of each parameter;
the second determining module 300 is configured to start the training flows of all configuration combinations, and determine the optimal parameter combination for training the large model from the training results of the training flows of all configuration combinations based on the evaluation index.
Further, in some embodiments, after the parameter configuration file is obtained, the obtaining module 100 is further configured to:
identifying, among the plurality of parameters in the parameter configuration file, a target parameter for which no parameter interval is given;
and acquiring a default parameter interval of the target parameter, and taking the default parameter interval as a parameter interval of the target parameter.
Further, in some embodiments, after determining the target model training frame from the large model training frame name, the first determination module 200 further includes:
The verification unit is used for utilizing the target model training framework to verify whether incompatible parameters which do not meet preset compatibility conditions exist in the plurality of parameters;
and the error reporting unit is used for reporting error reminding aiming at the incompatible parameters when the incompatible parameters which do not meet the preset compatible conditions exist in the plurality of parameters.
Further, in some embodiments, after performing the error reporting alert for the incompatible parameter, the error reporting unit is further configured to:
Receiving a parameter modification instruction fed back by a user aiming at incompatible parameters;
the incompatible parameters are modified based on the parameter modification instructions.
Further, in some embodiments, the first determining module 200 is specifically configured to:
Acquiring iteration times of each training from a parameter configuration file;
determining configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter;
the training process for all configuration combinations is determined based on the number of iterations and the configuration combinations for all parameters.
Further, in some embodiments, the second determining module 300 is specifically configured to:
acquiring the number of the optimal parameter combinations to be reserved from the parameter configuration file;
acquiring an evaluation index value of each configuration combination based on the training result;
an optimal parameter combination is determined based on the number of optimal parameter combinations and the evaluation index value of each configuration combination.
Further, in some embodiments, when the training process of all configuration combinations is started, the second determining module 300 is further configured to:
Configuration combinations of training start failures are recorded.
It should be noted that the foregoing explanation of the embodiment of the automatic parameter searching method for large model training is also applicable to the automatic parameter searching device for large model training of this embodiment, and will not be repeated here.
According to the automatic parameter searching device for large model training provided by the embodiment of the invention, by obtaining a parameter configuration file that includes a large model training framework name, a plurality of parameters, and a parameter interval of each parameter, a target model training framework can be determined according to the large model training framework name; the training flows of all configuration combinations can be determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the training flows of all configuration combinations can be started; and the optimal parameter combination for large model training can be determined from the training results of the training flows of all configuration combinations based on the evaluation indexes. In this way, the optimal parameter configuration combination can be obtained through enumeration training of the parameter configuration combinations by the target model training framework, which solves the problem that the current tedious and time-consuming process of determining the optimal parameter configuration leads to a long model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development costs.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device may include:
Memory 401, processor 402, and a computer program stored on memory 401 and executable on processor 402.
The processor 402, when executing the program, implements the parameter automatic search method for large model training provided in the above-described embodiment.
Further, the electronic device further includes:
A communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing a computer program executable on the processor 402.
The memory 401 may include high-speed RAM (Random Access Memory) and may also include non-volatile memory, such as at least one disk memory.
If the memory 401, the processor 402, and the communication interface 403 are implemented independently, the communication interface 403, the memory 401, and the processor 402 may be connected to each other by a bus and communicate with each other. The bus may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may perform communication with each other through internal interfaces.
The processor 402 may be a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the parameter automatic search method for large model training as above.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Claims (8)
1. An automatic parameter searching method for large model training is characterized by comprising the following steps:
Acquiring a parameter configuration file, wherein the parameter configuration file comprises a large model training frame name, a plurality of parameters used for arrangement and combination and a parameter interval of each parameter, and the parameters comprise model structure parameters and parallel training parameters;
Determining a target model training frame according to the name of the large model training frame, and determining training flows of all configuration combinations according to the target model training frame, the plurality of parameters and the parameter interval of each parameter;
Starting the training flows of all configuration combinations, and determining the optimal parameter combination for training the large model from the training results of the training flows of all configuration combinations based on the evaluation indexes;
The determining the training process of all configuration combinations according to the target model training framework, the plurality of parameters and the parameter interval of each parameter includes: acquiring iteration times of each training from the parameter configuration file; determining configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; determining a training flow of all configuration combinations based on the iteration times and the configuration combinations of all parameters;
Wherein the determining, based on the evaluation index, an optimal parameter combination for training the large model from training results of the training flows of all configuration combinations includes: acquiring the number of the optimal parameter combinations to be reserved from the parameter configuration file; acquiring an evaluation index value of each configuration combination based on the training result; determining the optimal parameter combinations based on the number of the optimal parameter combinations and the evaluation index value of each configuration combination;
Wherein the evaluation index values include: evaluation index values acquired directly by a monitoring tool, and/or evaluation index values that cannot be acquired directly and are calculated by the target model training framework, and/or evaluation index values acquired by invoking a user-defined index calculation mode through the target model training framework.
2. The method for automatic searching for parameters for large model training according to claim 1, further comprising, after obtaining the parameter profile:
Identifying a target parameter in the plurality of parameters in the parameter configuration file for which no parameter interval is given;
And acquiring a default parameter interval of the target parameter, and taking the default parameter interval as a parameter interval of the target parameter.
3. The method for automatic searching of parameters for large model training according to claim 1, further comprising, after determining the target model training frame from the large model training frame name:
Checking whether incompatible parameters which do not meet preset compatibility conditions exist in the plurality of parameters or not by utilizing the target model training framework;
If the incompatible parameters which do not meet the preset compatible conditions exist in the plurality of parameters, error reporting reminding is conducted on the incompatible parameters.
4. The method for automatic searching of parameters for large model training according to claim 3, further comprising, after performing an error notification for the incompatible parameters:
receiving a parameter modification instruction fed back by a user aiming at the incompatible parameters;
Modifying the incompatible parameter based on the parameter modification instruction.
5. The method for automatic searching of parameters for large model training according to any one of claims 1 to 4, further comprising, when starting the training flow of all configuration combinations:
Configuration combinations of training start failures are recorded.
6. An automatic parameter searching device for large model training, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a parameter configuration file, the parameter configuration file comprises a large model training frame name, a plurality of parameters used for arrangement and combination and a parameter interval of each parameter, and the parameters comprise model structure parameters and parallel training parameters;
The first determining module is used for determining a target model training frame according to the name of the large model training frame and determining training flows of all configuration combinations according to the target model training frame, the parameters and the parameter interval of each parameter;
The second determining module is used for starting the training flows of all the configuration combinations and determining the optimal parameter combination for training the large model from the training results of the training flows of all the configuration combinations based on the evaluation indexes;
The first determining module is specifically configured to: acquiring iteration times of each training from the parameter configuration file; determining configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; determining a training flow of all configuration combinations based on the iteration times and the configuration combinations of all parameters;
The second determining module is specifically configured to: acquiring the number of the optimal parameter combinations to be reserved from the parameter configuration file; acquiring an evaluation index value of each configuration combination based on the training result; determining the optimal parameter combinations based on the number of the optimal parameter combinations and the evaluation index value of each configuration combination;
Wherein the evaluation index values include: evaluation index values acquired directly by a monitoring tool, and/or evaluation index values that cannot be acquired directly and are calculated by the target model training framework, and/or evaluation index values acquired by invoking a user-defined index calculation mode through the target model training framework.
7. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method for automatic searching of parameters for large model training of any of claims 1-5.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor for implementing the parameter automatic search method for large model training according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410438532.7A CN118051779B (en) | 2024-04-12 | 2024-04-12 | Automatic parameter searching method and device for large model training and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410438532.7A CN118051779B (en) | 2024-04-12 | 2024-04-12 | Automatic parameter searching method and device for large model training and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118051779A CN118051779A (en) | 2024-05-17 |
CN118051779B true CN118051779B (en) | 2024-07-16 |
Family
ID=91045125
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410438532.7A Active CN118051779B (en) | 2024-04-12 | 2024-04-12 | Automatic parameter searching method and device for large model training and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118051779B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116629352A (en) * | 2023-04-10 | 2023-08-22 | 苏州互微智速科技有限公司 | Hundred million-level parameter optimizing platform |
CN117407713A (en) * | 2023-10-17 | 2024-01-16 | 支付宝(杭州)信息技术有限公司 | Training management method and related device for distributed model training |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109032671B (en) * | 2018-06-25 | 2022-05-03 | 电子科技大学 | Distributed deep learning method and system based on data parallel strategy |
CN110991658A (en) * | 2019-11-28 | 2020-04-10 | 重庆紫光华山智安科技有限公司 | Model training method and device, electronic equipment and computer readable storage medium |
TW202312042A (en) * | 2021-09-06 | 2023-03-16 | 財團法人資訊工業策進會 | Automatic optimization method and automatic optimization system of diagnosis model |
WO2023123275A1 (en) * | 2021-12-30 | 2023-07-06 | 华为技术有限公司 | Method, device, and system for determining distributed training algorithm framework configuration |
CN115996173B (en) * | 2022-11-14 | 2023-06-20 | 中国科学技术大学 | Communication optimization method and system for parallel training of distributed deep learning operator |
CN116128019A (en) * | 2022-11-17 | 2023-05-16 | 北京大学 | Parallel training method and device for transducer model |
CN116956991B (en) * | 2023-09-21 | 2024-01-09 | 牛津大学(苏州)科技有限公司 | Multi-layer perceptron model parameter adjustment method, device, equipment and storage medium |
CN117093871B (en) * | 2023-10-16 | 2024-02-13 | 之江实验室 | Deep learning-oriented distributed training evaluation method and system |
- 2024-04-12: CN CN202410438532.7A patent/CN118051779B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN118051779A (en) | 2024-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP4369180A2 (en) | Callpath finder | |
CN110647999A (en) | Method and device for improving deep learning training speed based on topological structure | |
CN105700956A (en) | Distributed job processing method and system | |
CN115827636B (en) | Method for storing and reading simulation data of logic system design from waveform database | |
US11663113B2 (en) | Real time fault localization using combinatorial test design techniques and test case priority selection | |
US11176026B2 (en) | Assignment of test case priorities based on combinatorial test design model analysis | |
CN112685275B (en) | Algorithm policy search method and device, electronic equipment and storage medium | |
CN118051779B (en) | Automatic parameter searching method and device for large model training and electronic equipment | |
CN110704620B (en) | Method and device for identifying same entity based on knowledge graph | |
JP5206268B2 (en) | Rule creation program, rule creation method and rule creation device | |
CN116166967B (en) | Data processing method, equipment and storage medium based on meta learning and residual error network | |
CN110362294A (en) | Development task executes method, apparatus, electronic equipment and storage medium | |
CN115599401A (en) | Publishing method, device, equipment and medium of user-defined model | |
US9818078B1 (en) | Converting a non-workflow program to a workflow program using workflow inferencing | |
CN114443141A (en) | Method and device for determining cyclic constraint fault of measurement and control instruction | |
CN113076237B (en) | Memory performance testing method and system and computer readable storage medium | |
CN111782641A (en) | Data error repairing method and system | |
US10552760B2 (en) | Training set creation for classifying features of a system under agile development | |
CN116955342B (en) | Service data consistency rate verification method and device | |
CN109002287B (en) | Management method and device for software in cloud data system | |
CN112650679B (en) | Test verification method, device and computer system | |
JP6916327B1 (en) | Derived test equipment, derived test methods, and derived test programs | |
US20230185653A1 (en) | Fault diagnosis in complex systems | |
CN117743070A (en) | Method, device, equipment and medium for testing register bit flash | |
CN118377804A (en) | Attention optimization combination searching method based on automatic enumeration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||