JP7419843B2

JP7419843B2 - parallel processing device

Info

Publication number: JP7419843B2
Application number: JP2020014221A
Authority: JP
Inventors: 直樹末安; 克己一瀬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2020-01-30
Filing date: 2020-01-30
Publication date: 2024-01-23
Anticipated expiration: 2040-01-30
Also published as: JP2021120824A

Description

The present invention relates to a parallel processing device.

Techniques for determining the parallel processing parameters of a parallel processing program have been disclosed. For example, in Patent Document 1, a processing unit determines the number of nodes and the number of processes recommended for a first program to be evaluated. Specifically, the processing unit executes the first program according to first sample points, each indicating a number of nodes and a number of processes, and calculates a first statistic based on the rate of change of the evaluation value between the first sample points. The processing unit generates first sample points until the first statistic becomes equal to or less than a first threshold. The processing unit also adds second sample points within a predetermined distance from a first sample point, and calculates a second statistic for each first sample point using the evaluation values obtained when the first program is executed according to the second sample points. The processing unit generates second sample points until the second statistic becomes equal to or less than a second threshold. The processing unit then complements the evaluation values of the first and second sample points to determine the recommended number of nodes and number of processes.

Furthermore, when a program having a double loop is executed in parallel, appropriate parallel processing parameters must be set. The parallel processing parameters of the thread parallelization standard OpenMP include the scheduling method, the chunk size, and the processor core allocation. Conventionally, for each combination of parallel processing parameters, the combination was set in the program, the program was executed, and the parameters were adjusted toward appropriate values by referring to the execution results. Alternatively, the parallelization parameters were adjusted by experienced engineers based on their experience.

Japanese Unexamined Patent Application Publication No. 2018-120387
Japanese Unexamined Patent Application Publication No. 2016-9972

However, when a program having a loop is executed in parallel, it is difficult to quickly estimate the optimal parallel processing parameters to set in the program.

For example, the technique for determining the recommended number of nodes and processes executes the first program to be evaluated many times, according to the first sample points indicating the number of nodes and processes, before it determines the recommendation. Moreover, the only parallel processing parameters it determines are the number of nodes and the number of processes. It is therefore difficult to quickly estimate optimal parallel processing parameters that include the scheduling method, the chunk size, and the processor core allocation.

Likewise, when the parallelization parameters for parallel execution of a program having a double loop are determined, the program is executed for every combination of parallel processing parameters to find the optimal combination. In this case too, it is difficult to quickly estimate the optimal parallel processing parameters.

In one aspect, an object of the present invention is to quickly estimate the optimal parallel processing parameters for executing a program having a loop in parallel.

In one aspect, a parallel processing device includes: an execution unit that executes, only once and in parallel, an optimization target program that has a double loop and in which predetermined parallelization parameters are set; a calculation unit that calculates feature values of the optimization target program from the result of the execution by the execution unit; and an extraction unit that refers to a database indicating the correlation between feature values and optimal parallelization parameters in sample programs having a double loop, and extracts the optimal parallelization parameters corresponding to the feature values calculated by the calculation unit.

According to one embodiment, the optimal parallel processing parameters for executing a program having a loop in parallel can be estimated quickly.

FIG. 1 is a block diagram showing the functional configuration of a parallel processing device according to an embodiment.
FIG. 2 is a diagram showing an example of a program.
FIG. 3 is a diagram showing an example of a relational database according to the embodiment.
FIG. 4 is a diagram showing an example of hardware that executes a program.
FIG. 5 is a diagram showing an example of processor core allocation.
FIG. 6 is a diagram showing an example of execution of a DO loop.
FIG. 7 is a diagram showing an example of optimization parameter extraction.
FIG. 8 is a diagram showing an example of an optimized program.
FIG. 9 is a diagram showing an example of a flowchart of generation processing according to the embodiment.
FIG. 10 is a diagram showing an example of a flowchart of optimization parameter extraction processing according to the embodiment.
FIG. 11 is a diagram showing an example of a flowchart of program main execution processing according to the embodiment.
FIG. 12 is a diagram showing an example of a computer that executes a parallel processing program.

Embodiments of the parallel processing device disclosed in the present application are described in detail below with reference to the drawings. Note that the present invention is not limited by the embodiments.

[Functional configuration of parallel processing device according to embodiment]
FIG. 1 is a block diagram showing the functional configuration of a parallel processing device according to an embodiment. The parallel processing device 1 shown in FIG. 1 selects optimal parallelization parameters for an application program that can be parallelized at multiple levels, and thereby tunes its execution performance. The embodiment assumes that OpenMP, a thread parallelization standard, is applied to the application programs that can be parallelized at multiple levels. The embodiment describes, as an example, the case of selecting appropriate parallelization parameters when an application program having a double loop is executed in parallel. The OpenMP standard follows, for example, the description in "OpenMP Application Program Interface Version 5.0", and a detailed explanation is omitted.

The parallel processing device 1 includes a control unit 10 and a storage unit 20. The control unit 10 corresponds to an electronic circuit such as a CPU (Central Processing Unit). The control unit 10 has an internal memory for storing programs that define various processing procedures and control data, and executes various processes using them. The control unit 10 includes a generation unit 11, a program temporary execution unit 12, a feature value calculation unit 13, an optimization parameter extraction unit 14, and a program main execution unit 15. The program temporary execution unit 12 is an example of an execution unit. The feature value calculation unit 13 is an example of a calculation unit. The optimization parameter extraction unit 14 is an example of an extraction unit. The generation unit 11 is an example of a generation unit.

The storage unit 20 is, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 20 holds reference programs 21 and a relational database 22.

A reference program 21 is a program that exhibits the feature values of one particular combination of the execution feature values that affect the parallelization parameters; one reference program exists for each such combination. A reference program 21 has variables for setting the parallelization parameters and has a double loop. Here, a combination of a plurality of execution feature values of a program is called an "application execution feature value set". A reference program 21 is created for each application execution feature value set.

An example of a program handled in the embodiment is described with reference to FIG. 2. FIG. 2 is a diagram showing an example of a program, written here in Fortran. As shown in FIG. 2, the program parallelizes DO loops at two levels. L1 and U1 in the program are constants giving the range of the first-dimension loop. L2 and U2 are constants giving the range of the second-dimension loop.

Such a program has variables for setting the parallelization parameters of OpenMP, a thread parallelization standard. The parallelization parameters are the variables P1, P2, D1, D2, C1, and C2. P1 and P2 are set to values indicating the allocation of processor cores to the DO loops. D1 and D2 are set to values indicating the scheduling method. C1 and C2 are set to the chunk size. The scheduling method can be, for example, static, dynamic, or guided. The chunk size can be any integer, for example 1, 8, or 32. Taking 48 cores as an example, the allocation of processor cores to the DO loops can be (1, 48), (2, 24), (3, 16), (4, 12), (6, 8), (8, 6), (12, 4), (16, 3), (24, 2), or (48, 1).

Such a program uses the parallel do construct to execute the iterations of a DO loop over its range in parallel. Parallel here means thread parallelism within a program, not process parallelism. The parallel do construct assigns part of the DO loop's iterations to each thread according to the processor core allocation.

In the above example, there are 810 combinations of parallelization parameters that can be set in such a program. With three scheduling methods and three chunk sizes, there are 81 (= 3 × 3 × 3 × 3) combinations when they are applied to the two-level DO loops. Since there are also 10 ways of allocating processor cores to the DO loops, the number of combinations that can be set in the program is 810 (= 10 × 81).
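The count of 810 can be checked by enumerating the combinations directly. The following is a minimal Python sketch; the function and constant names are illustrative and do not appear in the patent, and the values are the example values given above (48 cores, three scheduling methods, chunk sizes 1, 8, and 32).

```python
from itertools import product

CORES = 48
SCHEDULES = ("static", "dynamic", "guided")
CHUNKS = (1, 8, 32)


def core_allocations(cores):
    """Return every (P1, P2) pair with P1 * P2 == cores."""
    return [(p1, cores // p1) for p1 in range(1, cores + 1) if cores % p1 == 0]


def parameter_combinations(cores=CORES):
    """Enumerate every (P1, P2, D1, D2, C1, C2) tuple for the example values."""
    return [
        (p1, p2, d1, d2, c1, c2)
        for (p1, p2), d1, d2, c1, c2 in product(
            core_allocations(cores), SCHEDULES, SCHEDULES, CHUNKS, CHUNKS
        )
    ]


if __name__ == "__main__":
    print(len(core_allocations(CORES)))   # number of core allocations
    print(len(parameter_combinations()))  # total number of combinations
```

The ten core allocations are the divisor pairs of 48, which is why the total is 10 × 81.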

In the embodiment, the parallel processing device 1 extracts, from among the combinations of parallelization parameters that can be set in a program, the optimal parallelization parameters, that is, the combination with the shortest execution time.

Returning to FIG. 1, the application execution feature value set is now described. An application execution feature value set is a combination of a plurality of execution feature values of a program that affect the parallelization parameters. The execution feature values that affect the parallelization parameters may be, for example, the following x, y, and z.
x) The average number of instructions executed by the processing portion per iteration of each DO loop
y) The relative standard deviation of the number of instructions executed by the processing portion per iteration of each DO loop
z) The average cache reuse rate of the processing portion per iteration of each DO loop
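As an illustration of the three feature values, the sketch below computes x, y, and z for one DO loop from per-iteration measurements. It assumes that the per-iteration instruction counts and cache reuse rates are already available as lists, and that the relative standard deviation is the population standard deviation divided by the mean; the patent does not state which estimator is used, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def loop_features(instr_counts, cache_reuse_rates):
    """Return (x, y, z) for one DO loop.

    x: mean instruction count per iteration
    y: relative standard deviation of the instruction count
       (assumption: population standard deviation / mean)
    z: mean cache reuse rate per iteration
    """
    x = mean(instr_counts)
    y = pstdev(instr_counts) / x if x else 0.0
    z = mean(cache_reuse_rates)
    return x, y, z


if __name__ == "__main__":
    # Perfectly balanced iterations give y == 0
    print(loop_features([100, 100, 100, 100], [0.5, 0.5]))
    # Unbalanced iterations give a larger y
    print(loop_features([50, 150], [0.2, 0.4]))
```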

x is used as an execution feature value that affects the parallelization parameters for the following reason. The number of instructions executed by the processing portion affects, for example, the scheduling method, which is one of the parallelization parameters. Specifically, x is used to judge whether the dynamic scheduling method, which is adopted when the processing cost (overhead) is large, can be adopted.

y is used as an execution feature value that affects the parallelization parameters for the following reason. The degree of load balance of the processing portion affects, for example, the scheduling method, which is one of the parallelization parameters. Specifically, the static scheduling method is adopted when the load of the processing portion does not vary, and the dynamic scheduling method is adopted when it does.

z is used as an execution feature value that affects the parallelization parameters for the following reason. The cache reuse rate affects, for example, the chunk size, which is one of the parallelization parameters. Specifically, when the cache reuse rate is high, a large chunk size is adopted to promote cache reuse.

In the embodiment, the parallel processing device 1 calculates these three execution feature values for the first-dimension DO loop and for the second-dimension DO loop, and defines the combination of the six execution feature values as the application execution feature value set.

The relational database 22 holds the correspondence between application execution feature value sets and optimal parallelization parameters. The relational database 22 is generated by the generation unit 11, which is described below. An example of the relational database 22 is given later.

The generation unit 11 generates the relational database 22.

For example, the generation unit 11 generates a plurality of reference programs 21 in which each element of the application execution feature value set is varied in the following steps. The average number of instructions executed by the processing portion per iteration of each DO loop is 1 to 10 times the number of instructions executed by the runtime library under the dynamic scheduling method; that is, it takes the values 1, 2, ..., 10 for each DO loop. The relative standard deviation of the number of instructions executed by the processing portion per iteration of each DO loop ranges from 0.1 to 1.0 in steps of 0.1; that is, it takes the values 0.1, 0.2, ..., 1.0 for each DO loop. The average cache reuse rate of the processing portion per iteration of each DO loop likewise ranges from 0.1 to 1.0 in steps of 0.1; that is, it takes the values 0.1, 0.2, ..., 1.0 for each DO loop. In this example, there are ultimately 10^6 (= 1 million) application execution feature value sets. The generation unit 11 generates one reference program 21 for each of the one million application execution feature value sets.
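The grid of feature value sets described above can be sketched as a Cartesian product of the six axes. This is only an illustration of the enumeration; the names are hypothetical, and generating the actual reference programs is outside the scope of the sketch.

```python
import math
from itertools import product

# Grid values for one DO loop, as given in the example
X_VALUES = list(range(1, 11))                       # 1, 2, ..., 10
Y_VALUES = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
Z_VALUES = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0

# (X1, Y1, Z1) for the first-dimension loop, (X2, Y2, Z2) for the second
FEATURE_AXES = (X_VALUES, Y_VALUES, Z_VALUES, X_VALUES, Y_VALUES, Z_VALUES)


def feature_value_sets():
    """Lazily yield every (X1, Y1, Z1, X2, Y2, Z2) tuple on the grid."""
    return product(*FEATURE_AXES)


def feature_value_set_count():
    """Number of grid points, computed without enumerating them."""
    return math.prod(len(axis) for axis in FEATURE_AXES)


if __name__ == "__main__":
    print(feature_value_set_count())
    print(next(feature_value_sets()))
```

Yielding the sets lazily avoids holding all one million tuples in memory at once.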

The generation unit 11 then sets each of a plurality of parallelization parameter sets in the reference program 21 having one application execution feature value set, and executes the program with each. For example, if 810 combinations of parallelization parameters can be set in a program, the generation unit 11 sets each of the 810 combinations in one reference program 21 and executes it.

Based on the resulting executions, the generation unit 11 determines the parallelization parameter set with the shortest execution time as the optimal parallelization parameters. The generation unit 11 then associates the application execution feature value set of the reference program 21 with the optimal parallelization parameters and adds the pair to the relational database 22. The generation unit 11 does the same for every other application execution feature value set: it determines the optimal parallelization parameters of the reference program 21 having that set and adds them to the relational database 22.
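The generation procedure amounts to an exhaustive search per feature value set. The sketch below shows that structure only. The `measure_time` callback stands in for "generate the reference program, set the parameters, run it, and time it", which cannot be reproduced here; `toy_measure_time` is an invented placeholder cost model, not measured data, and all names are hypothetical.

```python
def build_relational_database(feature_sets, parameter_sets, measure_time):
    """Map each feature value set to the parameter set with the shortest time.

    measure_time(features, params) is assumed to run the reference program
    for `features` with `params` and return its execution time.
    """
    parameter_sets = list(parameter_sets)  # reused for every feature set
    database = {}
    for features in feature_sets:
        database[features] = min(
            parameter_sets, key=lambda params: measure_time(features, params)
        )
    return database


def toy_measure_time(features, params):
    """Invented cost model for demonstration only.

    Pretends that dynamic scheduling of the first loop pays off only when
    the per-iteration work X1 is large.
    """
    x1 = features[0]
    first_loop_schedule = params[2]
    if first_loop_schedule == "dynamic":
        return 5.0 / x1
    return 1.0


if __name__ == "__main__":
    features = [(1, 0.1, 0.1, 1, 0.1, 0.1), (10, 0.1, 0.1, 1, 0.1, 0.1)]
    params = [(6, 8, "static", "static", 1, 1), (6, 8, "dynamic", "static", 1, 1)]
    for key, value in build_relational_database(features, params, toy_measure_time).items():
        print(key, "->", value)
```

With the example sizes in the text, the real procedure would involve 810 runs for each of one million reference programs, which is why it is done once in advance rather than per optimization target.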

An example of the relational database 22 is described with reference to FIG. 3. FIG. 3 is a diagram showing an example of a relational database according to the embodiment. The relational database 22 shown in FIG. 3 stores application execution feature value sets (X1, Y1, Z1, X2, Y2, Z2) in association with optimal parallelization parameters (P1, P2, D1, D2, C1, C2).

The application execution feature value set (X1, Y1, Z1, X2, Y2, Z2) corresponds to the above x, y, and z of the first-dimension DO loop and the above x, y, and z of the second-dimension DO loop. That is, X1 is x of the first-dimension DO loop: the average number of instructions executed by the processing portion per iteration of the first-dimension DO loop. Y1 is y of the first-dimension DO loop: the relative standard deviation of the number of instructions executed by the processing portion per iteration of the first-dimension DO loop. Z1 is z of the first-dimension DO loop: the average cache reuse rate of the processing portion per iteration of the first-dimension DO loop. X2, Y2, and Z2 are the corresponding values of x, y, and z for the second-dimension DO loop. Here, X1 and X2 take the values 1, 2, ..., 10. Y1 and Y2 take the values 0.1, 0.2, ..., 1.0. Z1 and Z2 take the values 0.1, 0.2, ..., 1.0.

The optimal parallelization parameters (P1, P2, D1, D2, C1, C2) are the optimal parallelization parameters of the reference program 21 having the application execution feature value set (X1, Y1, Z1, X2, Y2, Z2). P1 and P2 are values indicating the allocation of processor cores to the DO loops. D1 is a value indicating the scheduling method of the first-dimension DO loop. D2 is a value indicating the scheduling method of the second-dimension DO loop. C1 is the chunk size of the first-dimension DO loop. C2 is the chunk size of the second-dimension DO loop.

As an example, when the application execution feature value set is (1, 0.1, 0.1, 1, 0.1, 0.1), the optimal parallelization parameters (6, 8, static, static, 1, 1) are stored. When the application execution feature value set is (10, 0.1, 0.1, 1, 0.1, 0.1), the optimal parallelization parameters (6, 8, dynamic, static, 1, 1) are stored.

The program temporary execution unit 12 temporarily executes the optimization target program for which optimization parameters are to be extracted. For example, to obtain the characteristics of the optimization target program, the program temporary execution unit 12 temporarily executes the optimization target program, with predetermined parallelization parameters set, only once. The predetermined parallelization parameters (P1, P2, D1, D2, C1, C2) set in the optimization target program are, for example, (6, 8, static, static, 1, 1), but are not limited to these values.

The feature value calculation unit 13 calculates each feature value of the application execution feature value set for the execution performed by the program temporary execution unit 12. Each feature value of the set at execution time may be obtained using a performance information acquisition tool such as a profiler. As such a tool, for example, the application development support tools for the K supercomputer (see "Performance profile of the K supercomputer") may be used.

The optimization parameter extraction unit 14 refers to the relational database 22 and extracts the optimal parallelization parameters corresponding to the application execution feature value set calculated by the feature value calculation unit 13. For example, the optimization parameter extraction unit 14 obtains, from the relational database 22, the application execution feature value set closest in distance to the set calculated by the feature value calculation unit 13. The closest set is used because the application execution feature value sets stored in the relational database 22 are discrete. The optimization parameter extraction unit 14 then extracts, from the relational database 22, the optimal parallelization parameters corresponding to the obtained application execution feature value set. In this way, the optimization parameter extraction unit 14 can quickly estimate the optimal parallelization parameters for executing an optimization target program having a loop in parallel.
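The nearest-set lookup can be sketched as follows. The two database rows are the examples quoted earlier in the description; the use of unweighted Euclidean distance is an assumption, since the text only says "closest in distance", and the function name is hypothetical. Because X ranges over 1 to 10 while Y and Z range over 0.1 to 1.0, a real implementation might normalize the axes first.

```python
import math

# Two example rows: feature value set -> optimal parallelization parameters
RELATIONAL_DB = {
    (1, 0.1, 0.1, 1, 0.1, 0.1): (6, 8, "static", "static", 1, 1),
    (10, 0.1, 0.1, 1, 0.1, 0.1): (6, 8, "dynamic", "static", 1, 1),
}


def nearest_parameters(database, measured):
    """Return the parameters stored for the feature set closest to `measured`.

    Assumption: closeness is plain Euclidean distance over the six values.
    """
    closest = min(database, key=lambda stored: math.dist(stored, measured))
    return database[closest]


if __name__ == "__main__":
    # Feature values as they might come out of a single temporary execution
    print(nearest_parameters(RELATIONAL_DB, (1.3, 0.12, 0.1, 1.0, 0.08, 0.1)))
    print(nearest_parameters(RELATIONAL_DB, (9.2, 0.1, 0.1, 1.0, 0.1, 0.1)))
```

A linear scan is enough for a sketch; with a million stored sets on a regular grid, rounding each measured value to the nearest grid step would give the same answer without scanning.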

The program main execution unit 15 performs the main execution of the optimization target program. For example, the program main execution unit 15 executes the optimization target program with the optimal parallelization parameters extracted by the optimization parameter extraction unit 14 set in it.

[Example of hardware that executes the program]
FIG. 4 is a diagram showing an example of hardware that executes a program. As shown in FIG. 4, assume that a computer has 48 processor cores (computation cores) that share one memory. Each computation core has a cache to reduce the cost of memory access. Every 12 computation cores form a UMA group, and the four UMA groups are connected by NUMA (Non-Uniform Memory Access). Although the example of FIG. 4 describes a computer with 48 computation cores, the number of cores is not limited to this.

[Example of processor core allocation]
FIG. 5 is a diagram showing an example of processor core allocation, that is, an example of the allocation of processor cores to the DO loops expressed by the parallelization parameter variables P1 and P2. The example of FIG. 5 is for 48 processor cores (computation cores).

The upper part of FIG. 5 shows the case where the allocation of the 48 processor cores is (12, 4). In this case, the program is executed core by core with the first-dimension (first-level) DO loop divided into 12 parts and the second-dimension (second-level) DO loop divided into 4 parts.

The lower part of FIG. 5 shows the case where the allocation of the 48 processor cores is (8, 6). In this case, the program is executed core by core with the first-dimension (first-level) DO loop divided into 8 parts and the second-dimension (second-level) DO loop divided into 6 parts.

[Execution example of DO loop]
FIG. 6 is a diagram showing an example of execution of a DO loop. In the example of FIG. 6, L1 and L2 of the program shown in FIG. 2 are 1, and U1 and U2 are 100. That is, the range (L1, U1) of the first-dimension (first-level) loop is (1, 100), and the range (L2, U2) of the second-dimension (second-level) loop is (1, 100). The processor core allocation (P1, P2) among the optimization parameters is (8, 6).

ここでは、ＣＯＲＥ＃０が示すプロセッサコアは、１重目のＤＯループの変動範囲（１，１３）を担い、２重目のＤＯループの変動範囲（１，１７）を担う。ＣＯＲＥ＃１が示すプロセッサコアは、１重目のＤＯループの変動範囲（１４，２６）を担い、２重目のＤＯループの変動範囲（１，１７）を担う。ＣＯＲＥ＃２が示すプロセッサコアは、１重目のＤＯループの変動範囲（２７，３９）を担い、２重目のＤＯループの変動範囲（１，１７）を担う。・・・ＣＯＲＥ＃７が示すプロセッサコアは、１重目のＤＯループの変動範囲（９２，１００）を担い、２重目のＤＯループの変動範囲（１，１７）を担う。また、ＣＯＲＥ＃８が示すプロセッサコアは、１重目のＤＯループの変動範囲（１，１３）を担い、２重目のＤＯループの変動範囲（１８，３４）を担う。ＣＯＲＥ＃１６が示すプロセッサコアは、１重目のＤＯループの変動範囲（１，１３）を担い、２重目のＤＯループの変動範囲（３５，５１）を担う。ＣＯＲＥ＃２４が示すプロセッサコアは、１重目のＤＯループの変動範囲（１，１３）を担い、２重目のＤＯループの変動範囲（５２，６８）を担う。ＣＯＲＥ＃３２が示すプロセッサコアは、１重目のＤＯループの変動範囲（１，１３）を担い、２重目のＤＯループの変動範囲（６９，８５）を担う。ＣＯＲＥ＃４０が示すプロセッサコアは、１重目のＤＯループの変動範囲（１，１３）を担い、２重目のＤＯループの変動範囲（８６，１００）を担う。このように、プロセッサコアの割り当てに従って、スレッドにｄｏループの繰り返し処理の一部を割り当てる。 Here, the processor core indicated by CORE#0 handles the range (1, 13) of the first-level DO loop and the range (1, 17) of the second-level DO loop. The processor core indicated by CORE#1 handles the range (14, 26) of the first-level DO loop and the range (1, 17) of the second-level DO loop. The processor core indicated by CORE#2 handles the range (27, 39) of the first-level DO loop and the range (1, 17) of the second-level DO loop. ... The processor core indicated by CORE#7 handles the range (92, 100) of the first-level DO loop and the range (1, 17) of the second-level DO loop. Likewise, the processor core indicated by CORE#8 handles the range (1, 13) of the first-level DO loop and the range (18, 34) of the second-level DO loop. The processor core indicated by CORE#16 handles the range (1, 13) of the first-level DO loop and the range (35, 51) of the second-level DO loop. The processor core indicated by CORE#24 handles the range (1, 13) of the first-level DO loop and the range (52, 68) of the second-level DO loop. The processor core indicated by CORE#32 handles the range (1, 13) of the first-level DO loop and the range (69, 85) of the second-level DO loop. The processor core indicated by CORE#40 handles the range (1, 13) of the first-level DO loop and the range (86, 100) of the second-level DO loop. In this way, a part of the iterations of the DO loops is assigned to each thread according to the processor core allocation.
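The ranges listed for FIG. 6 are consistent with a simple block partition, which can be sketched as follows. This is only an illustrative reconstruction: the rounded-up block size and the numbering rule (core number modulo P1 selects the first-level block, integer division by P1 selects the second-level block) are assumptions inferred from the listed values, and the function names are hypothetical.

```python
import math


def block_range(lower, upper, parts, index):
    """Return the inclusive (start, end) range of block `index` (0-based).

    Assumption: each block holds ceil(iterations / parts) iterations and
    the last block is simply cut off at `upper`.
    """
    size = math.ceil((upper - lower + 1) / parts)
    start = lower + index * size
    end = min(start + size - 1, upper)
    return start, end


def core_ranges(core, p1, p2, l1, u1, l2, u2):
    """Return the (first-level, second-level) ranges handled by `core`.

    Assumption: cores are numbered so that the first-level block index
    varies fastest, as the CORE#0 to CORE#8 values suggest.
    """
    first_index = core % p1
    second_index = core // p1
    return (block_range(l1, u1, p1, first_index),
            block_range(l2, u2, p2, second_index))


# Allocation (P1, P2) = (8, 6) over the ranges (1, 100) and (1, 100).
for core in (0, 1, 7, 8, 40):
    print(f"CORE#{core}:", core_ranges(core, 8, 6, 1, 100, 1, 100))
```

With these assumptions, CORE#7 receives the shortened final block (92, 100) because 8 blocks of 13 iterations would otherwise run past 100.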

［最適化パラメータ抽出の一例］
図７は、最適化パラメータ抽出の一例を示す図である。図７に示すように、生成部１１によって生成された関係データベース２２が表わされている。なお、最適化対象プログラムに対応するアプリ実行特徴値セットが特徴値算出部１３によって算出されたものとする。 [Example of optimization parameter extraction]
FIG. 7 is a diagram showing an example of optimization parameter extraction. FIG. 7 shows the relational database 22 generated by the generation unit 11. It is assumed that the application execution feature value set corresponding to the optimization target program has already been calculated by the feature value calculation unit 13.

最適化パラメータ抽出部１４は、関係データベース２２を参照して、特徴値算出部１３によって算出されたアプリ実行特徴値セットと最も距離が近いアプリ実行特徴値セットを取得する。ここでは、アプリ実行特徴値セットとして（５，０．８，０．２，１，０．２，０．１）が取得される。すると、最適化パラメータ抽出部１４は、関係データベース２２を参照して、取得したアプリ実行特徴値セットに対応する最適並列化パラメータを抽出する。ここでは、最適並列化パラメータとして（１２，４，ｇｕｉｄｅｄ，ｓｔａｔｉｃ，３２，１）が抽出される。 The optimization parameter extraction unit 14 refers to the relational database 22 and obtains an application execution feature value set that is closest in distance to the application execution feature value set calculated by the feature value calculation unit 13 . Here, (5, 0.8, 0.2, 1, 0.2, 0.1) is acquired as the application execution feature value set. Then, the optimization parameter extraction unit 14 refers to the relational database 22 and extracts the optimal parallelization parameter corresponding to the acquired application execution feature value set. Here, (12, 4, guided, static, 32, 1) is extracted as the optimal parallelization parameter.
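The nearest-set lookup described above can be sketched as follows. The distance measure is not specified in this passage, so Euclidean distance is used purely as an assumption; the dictionary `RELATION_DB` is a hypothetical in-memory stand-in for the relational database 22, and its second entry is invented for illustration.

```python
import math

# Hypothetical stand-in for relational database 22:
# application execution feature value set -> optimal parallelization parameters.
RELATION_DB = {
    (5, 0.8, 0.2, 1, 0.2, 0.1): (12, 4, "guided", "static", 32, 1),  # from the text
    (10, 0.2, 0.6, 5, 0.4, 0.3): (8, 6, "static", "dynamic", 1, 4),  # invented entry
}


def extract_optimal_parameters(features, db=RELATION_DB):
    """Return (nearest stored feature set, its optimal parameters).

    Assumption: "closest" means smallest Euclidean distance in the
    six-dimensional feature space.
    """
    nearest = min(db, key=lambda stored: math.dist(stored, features))
    return nearest, db[nearest]


# A measured feature set that does not match any stored set exactly.
measured = (5.2, 0.75, 0.25, 1.1, 0.2, 0.1)
print(extract_optimal_parameters(measured))
```

Because the feature values have different scales (instruction counts versus rates), a real implementation might normalize each dimension before computing distances; the passage does not say whether this is done.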

［最適化されたプログラムの一例］
図８は、最適化されたプログラムの一例を示す図である。ここでは、最適化対象プログラムの最適並列化パラメータとして（１２，４，ｇｕｉｄｅｄ，ｓｔａｔｉｃ，３２，１）が抽出されたとする。符号ｐ１に示すように、ＤＯループへのプロセッサコアの割り当てを示す値として、１２，４が設定されている。符号ｐ２に示すように、１次元目のＤＯループのスケジューリング方式として「ｇｕｉｄｅｄ」が設定されている。符号ｐ３に示すように、２次元目のＤＯループのスケジューリング方式として「ｓｔａｔｉｃ」が設定されている。符号ｐ４に示すように、１次元目のＤＯループのチャンク数として「３２」が設定されている。符号ｐ５に示すように、２次元目のＤＯループのチャンク数として「１」が設定されている。 [An example of an optimized program]
FIG. 8 is a diagram showing an example of an optimized program. Here, it is assumed that (12, 4, guided, static, 32, 1) has been extracted as the optimal parallelization parameters of the optimization target program. As indicated by reference sign p1, "12, 4" is set as the values indicating the allocation of processor cores to the DO loops. As indicated by reference sign p2, "guided" is set as the scheduling method for the first-dimension DO loop. As indicated by reference sign p3, "static" is set as the scheduling method for the second-dimension DO loop. As indicated by reference sign p4, "32" is set as the number of chunks for the first-dimension DO loop. As indicated by reference sign p5, "1" is set as the number of chunks for the second-dimension DO loop.
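As a rough illustration of how the six extracted values could be turned into settings for the two loops, the sketch below names each value and renders one OpenMP directive string per loop. How FIG. 8 actually writes these settings is not reproduced here, so the use of the `num_threads` and `schedule` clauses is an assumption, as are the names `ParallelizationParams` and `omp_directive`. Note also that the second argument of OpenMP's `schedule` clause is a chunk size, while the text speaks of a number of chunks.

```python
from typing import NamedTuple


class ParallelizationParams(NamedTuple):
    """Hypothetical container for the six parallelization parameters."""
    p1: int        # cores allocated to the first-dimension DO loop (p1)
    p2: int        # cores allocated to the second-dimension DO loop (p1)
    sched1: str    # scheduling method of the first-dimension loop (p2)
    sched2: str    # scheduling method of the second-dimension loop (p3)
    chunk1: int    # chunk value of the first-dimension loop (p4)
    chunk2: int    # chunk value of the second-dimension loop (p5)


def omp_directive(threads, schedule, chunk):
    """Render a Fortran OpenMP directive line (assumed form, not FIG. 8)."""
    return (f"!$omp parallel do num_threads({threads}) "
            f"schedule({schedule},{chunk})")


params = ParallelizationParams(12, 4, "guided", "static", 32, 1)
print(omp_directive(params.p1, params.sched1, params.chunk1))
print(omp_directive(params.p2, params.sched2, params.chunk2))
```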

そして、最適並列化パラメータを設定した最適化対象プログラムを、プログラム本実行部１５は、本実行する。 Then, the program main execution unit 15 performs the main execution of the optimization target program in which the optimal parallelization parameters have been set.

［生成処理のフローチャートの一例］
図９は、実施例に係る生成処理のフローチャートの一例を示す図である。なお、アプリ実行特徴値セットは、図３で示す（Ｘ１，Ｙ１，Ｚ１，Ｘ２，Ｙ２，Ｚ２）であるとする。 [Example of flowchart of generation process]
FIG. 9 is a diagram illustrating an example of a flowchart of the generation process according to the embodiment. It is assumed that the application execution feature value set is (X1, Y1, Z1, X2, Y2, Z2) shown in FIG. 3.

図９に示すように、生成部１１は、アプリ実行特徴値セットの６次元空間でグリッド状に配置される値を取り得る複数の基準プログラム２１を生成する（ステップＳ１１）。 As shown in FIG. 9, the generation unit 11 generates a plurality of reference programs 21 that can take values arranged in a grid in the six-dimensional space of the application execution feature value set (step S11).

そして、生成部１１は、それぞれの基準プログラム２１において、並列化パラメータの全組み合わせの実行を行う。生成部１１は、最も実行時間が短い並列化パラメータを、最適並列化パラメータとして、基準プログラム２１に対応するアプリ実行特徴値セットと対応付けて関係データベース２２に格納する（ステップＳ１２）。 Then, the generation unit 11 executes all combinations of parallelization parameters in each reference program 21. The generation unit 11 stores the parallelization parameter with the shortest execution time as the optimal parallelization parameter in the relational database 22 in association with the application execution feature value set corresponding to the reference program 21 (step S12).
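Steps S11 and S12 amount to an exhaustive search per grid point, which can be sketched as follows. The callback `measure_time` is a hypothetical placeholder for actually running a reference program and timing it, the cost model `fake_time` is invented, and the toy usage uses a two-dimensional grid only to keep the example short (the embodiment uses a six-dimensional feature space).

```python
import itertools


def build_relation_db(feature_grid, param_candidates, measure_time):
    """Map each grid feature set to its fastest parallelization parameters.

    feature_grid: feature tuples, one per reference program (step S11).
    param_candidates: every parallelization-parameter combination to try.
    measure_time(features, params): hypothetical callback returning the
    execution time of the reference program under those parameters.
    """
    candidates = list(param_candidates)
    db = {}
    for features in feature_grid:
        # Step S12: try all combinations and keep the shortest run.
        db[features] = min(candidates,
                           key=lambda p: measure_time(features, p))
    return db


# Toy usage with an invented cost model instead of real executions.
grid = list(itertools.product([1, 2], [0.0, 0.5]))
params = [(12, 4), (8, 6), (6, 8)]


def fake_time(features, p):
    return abs(p[0] - 6 * features[0]) + features[1]


relation_db = build_relation_db(grid, params, fake_time)
print(relation_db)
```

The cost of this step grows with the number of grid points times the number of parameter combinations, which is why it is done once in advance rather than for each optimization target program.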

［最適化パラメータ抽出処理のフローチャートの一例］
図１０は、実施例に係る最適化パラメータ抽出処理のフローチャートの一例を示す図である。 [Example of flowchart of optimization parameter extraction process]
FIG. 10 is a diagram illustrating an example of a flowchart of the optimization parameter extraction process according to the embodiment.

図１０に示すように、最適化対象プログラムを受け付けたプログラム仮実行部１２は、受け付けた最適化対象プログラムを１回実行する。そして、特徴値算出部１３は、アプリ実行特徴値セットを取得する（ステップＳ２１）。実行時のアプリ実行特徴値セットのそれぞれの特定値は、特定のプロファイル等の性能情報取得ツールを用いて求められれば良い。 As shown in FIG. 10, upon receiving the optimization target program, the program temporary execution unit 12 executes it once. Then, the feature value calculation unit 13 obtains the application execution feature value set (step S21). Each value of the application execution feature value set at run time may be obtained using a performance information acquisition tool such as a specific profiler.

そして、最適化パラメータ抽出部１４は、取得されたアプリ実行特徴値セットをまるめる（ステップＳ２２）。 Then, the optimization parameter extraction unit 14 rounds the acquired application execution feature value set (step S22).

そして、最適化パラメータ抽出部１４は、まるめた後のアプリ実行特徴値セットをキーに、関係データベース２２を参照して、最適並列化パラメータを抽出する（ステップＳ２３）。
［プログラム本実行処理のフローチャートの一例］
図１１は、実施例に係るプログラム本実行処理のフローチャートの一例を示す図である。 Then, the optimization parameter extraction unit 14 refers to the relational database 22 using the rounded application execution feature value set as a key and extracts the optimal parallelization parameter (step S23).
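Steps S22 and S23 can be sketched as rounding each measured value to the nearest grid point and using the result as a lookup key. The grid spacing `STEPS` is an assumption (the text does not give it), the measured values are invented, and integer grid indices are used as keys here only to avoid floating-point comparison problems; the names `grid_key` and `lookup` are hypothetical.

```python
# Assumed grid spacing for (X1, Y1, Z1, X2, Y2, Z2); not given in the text.
STEPS = (1, 0.1, 0.1, 1, 0.1, 0.1)


def grid_key(features, steps=STEPS):
    """Step S22: round each value to the index of its nearest grid point."""
    return tuple(round(value / step) for value, step in zip(features, steps))


# Hypothetical stand-in for relational database 22, keyed by grid index.
db = {
    grid_key((5, 0.8, 0.2, 1, 0.2, 0.1)): (12, 4, "guided", "static", 32, 1),
}


def lookup(features):
    """Step S23: use the rounded feature set as the key into the database."""
    return db.get(grid_key(features))


measured = (5.3, 0.77, 0.21, 1.2, 0.18, 0.12)  # invented measurement
print(grid_key(measured))
print(lookup(measured))
```

Because the reference programs are placed on a grid, rounding to the grid gives a direct key lookup rather than a search over all stored entries.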
[Example of flowchart of program main execution process]
FIG. 11 is a diagram illustrating an example of a flowchart of the program main execution process according to the embodiment.

プログラム本実行部１５は、最適化対象プログラムに最適並列化パラメータを設定して実行する（ステップＳ３１）。 The program main execution unit 15 sets the optimum parallelization parameter to the optimization target program and executes it (step S31).

［実施例の効果］
上記実施例によれば、並列処理装置１は、所定の並列化パラメータを設定した、２重ループを有する最適化対象のプログラムを並列処理で１回だけ実行する。並列処理装置１は、実行された結果から最適化対象のプログラムの特徴値を算出する。並列処理装置１は、２重ループを有するサンプルプログラムにおける特徴値と最適な並列化パラメータとの相関関係を示すデータベースを参照して、算出された特徴値に対応する最適な並列化パラメータを抽出する。かかる構成によれば、並列処理装置１は、２重ループを有する最適化対象のプログラムを並列処理で実行する場合に、１回だけ実行するだけで、最適な並列化パラメータを高速に求めることができる。 [Effects of Examples]
According to the above embodiment, the parallel processing device 1 executes, only once and in parallel, an optimization target program that has a double loop and in which predetermined parallelization parameters have been set. The parallel processing device 1 calculates the feature values of the optimization target program from the execution result. The parallel processing device 1 then refers to a database showing the correlation between feature values and optimal parallelization parameters in sample programs having a double loop, and extracts the optimal parallelization parameters corresponding to the calculated feature values. With this configuration, when executing an optimization target program having a double loop in parallel, the parallel processing device 1 can obtain the optimal parallelization parameters quickly from a single execution.

また、上記実施例によれば、並列処理装置１は、特徴値を持つサンプルプログラムに対して複数通りの並列化パラメータをそれぞれ設定して実行した際の結果を基に当該プログラムにおける特徴値と最適な並列化パラメータとの相関関係を示すデータベースを生成する。並列処理装置１は、生成されたデータベースを参照して、算出された前記最適化対象のプログラムの特徴値に対応する最適な並列化パラメータを抽出する。かかる構成によれば、並列処理装置１は、２重ループを有するプログラムにおける特徴値と並列化パラメータとの相関関係を生成することで、最適な並列化パラメータを高速に求めることができる。 Further, according to the above embodiment, the parallel processing device 1 generates a database showing the correlation between feature values and optimal parallelization parameters for a sample program having the feature values, based on the results of executing the sample program with each of a plurality of parallelization parameter settings. The parallel processing device 1 refers to the generated database and extracts the optimal parallelization parameters corresponding to the calculated feature values of the optimization target program. With this configuration, by generating the correlation between feature values and parallelization parameters for programs having a double loop, the parallel processing device 1 can obtain the optimal parallelization parameters quickly.

また、上記実施例によれば、並列処理装置１は、以下の相関関係を示すデータベースを生成する。相関関係は、ループごとの、実行命令数の平均、実行命令数の相対標準偏差およびキャッシュの再利用率の平均を示す複数の特徴値と、ループごとの、プロセッサコアの割り当て、並列化のスケジューリング方式および並列化のスケジューリング方式のチャンク数を示す複数の並列化パラメータとの相関関係である。かかる構成によれば、並列処理装置１は、２重ループを有するプログラムにおける特徴値と並列化パラメータとの相関関係を生成することで、最適な並列化パラメータを高速に求めることができる。 Further, according to the above embodiment, the parallel processing device 1 generates a database showing the following correlation. The correlation is between a plurality of feature values indicating, for each loop, the average number of executed instructions, the relative standard deviation of the number of executed instructions, and the average cache reuse rate, and a plurality of parallelization parameters indicating, for each loop, the processor core allocation, the parallelization scheduling method, and the number of chunks of the parallelization scheduling method. With this configuration, by generating the correlation between feature values and parallelization parameters for programs having a double loop, the parallel processing device 1 can obtain the optimal parallelization parameters quickly.

［その他］
なお、上記実施例では、プログラムとしてＦｏｒｔｒａｎ言語を一例として説明した。しかしながら、プログラムは、Ｃ言語やＣ++言語であっても良く、ＯｐｅｎＭＰ規格で扱える言語であれば良い。 [others]
In the above embodiment, a program written in Fortran was described as an example. However, the program may be written in C or C++, or in any other language supported by the OpenMP standard.

また、上記実施例では、複数レベルの並列化を行なえるアプリケーションプログラムとしてスレッド並列化規格であるＯｐｅｎＭＰ規格を一例として説明した。しかしながら、Ｏｐｅｎ規格に限定されず、複数レベルの並列化を行なえるアプリケーションプログラムとしてスレッド並列化規格である所定の規格であれば良い。 Further, in the above embodiment, the OpenMP standard, which is a thread parallelization standard, was described as an example for application programs capable of multi-level parallelization. However, the embodiment is not limited to the OpenMP standard; any predetermined thread parallelization standard for application programs capable of multi-level parallelization may be used.

また、図示した並列処理装置１の各構成要素は、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、並列処理装置１の分散・統合の具体的態様は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。例えば、特徴値算出部１３および最適化パラメータ抽出部１４を１つの部として統合しても良い。また、生成部１１を、基準プログラム２１を生成する第１生成部と、関係データベース２２を生成する第２生成部とに分散しても良い。また、記憶部２０を並列処理装置１の外部装置としてネットワーク経由で接続するようにしても良い。 Further, each component of the illustrated parallel processing device 1 does not necessarily need to be physically configured as illustrated. In other words, the specific manner of distributing and integrating the parallel processing device 1 is not limited to what is shown in the diagram, and all or part of it can be functionally or physically distributed in arbitrary units depending on various loads and usage conditions. It can be configured in a distributed/integrated manner. For example, the feature value calculation unit 13 and the optimization parameter extraction unit 14 may be integrated as one unit. Further, the generation unit 11 may be distributed into a first generation unit that generates the reference program 21 and a second generation unit that generates the relational database 22. Further, the storage unit 20 may be connected as an external device to the parallel processing device 1 via a network.

また、上記実施例で説明した各種の処理は、予め用意されたプログラムをパーソナルコンピュータやワークステーションなどのコンピュータで実行することによって実現することができる。そこで、以下では、図１に示した並列処理装置１と同様の機能を実現する分析プログラムを実行するコンピュータの一例を説明する。図１２は、並列処理プログラムを実行するコンピュータの一例を示す図である。 Moreover, the various processes described in the above embodiments can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. Therefore, an example of a computer that executes an analysis program that implements the same functions as the parallel processing device 1 shown in FIG. 1 will be described below. FIG. 12 is a diagram illustrating an example of a computer that executes a parallel processing program.

図１２に示すように、コンピュータ２００は、各種演算処理を実行するＣＰＵ２０３と、ユーザからのデータの入力を受け付ける入力装置２１５と、表示装置２０９を制御する表示制御部２０７とを有する。また、コンピュータ２００は、記憶媒体からプログラムなどを読取るドライブ装置２１３と、ネットワークを介して他のコンピュータとの間でデータの授受を行う通信制御部２１７とを有する。また、コンピュータ２００は、各種情報を一時記憶するメモリ２０１と、ＨＤＤ２０５を有する。そして、メモリ２０１、ＣＰＵ２０３、ＨＤＤ２０５、表示制御部２０７、ドライブ装置２１３、入力装置２１５、通信制御部２１７は、バス２１９で接続されている。 As shown in FIG. 12, the computer 200 includes a CPU 203 that executes various calculation processes, an input device 215 that receives data input from a user, and a display control unit 207 that controls a display device 209. The computer 200 also includes a drive device 213 that reads programs and the like from a storage medium, and a communication control unit 217 that exchanges data with other computers via a network. Further, the computer 200 includes a memory 201 that temporarily stores various information, and an HDD 205. The memory 201, CPU 203, HDD 205, display control section 207, drive device 213, input device 215, and communication control section 217 are connected via a bus 219.

ドライブ装置２１３は、例えばリムーバブルディスク２１１用の装置である。ＨＤＤ２０５は、並列処理プログラム２０５ａおよび並列処理関連情報２０５ｂを記憶する。 The drive device 213 is, for example, a device for the removable disk 211. The HDD 205 stores a parallel processing program 205a and parallel processing related information 205b.

ＣＰＵ２０３は、並列処理プログラム２０５ａを読み出して、メモリ２０１に展開し、プロセスとして実行する。かかるプロセスは、並列処理装置１の各機能部に対応する。並列処理関連情報２０５ｂは、基準プログラム２１および関係データベース２２に対応する。そして、例えばリムーバブルディスク２１１が、並列処理プログラム２０５ａなどの各情報を記憶する。 The CPU 203 reads the parallel processing program 205a, develops it in the memory 201, and executes it as a process. Such processes correspond to each functional unit of the parallel processing device 1. The parallel processing related information 205b corresponds to the reference program 21 and the relational database 22. For example, the removable disk 211 stores information such as the parallel processing program 205a.

なお、並列処理プログラム２０５ａについては、必ずしも最初からＨＤＤ２０５に記憶させておかなくても良い。例えば、コンピュータ２００に挿入されるフレキシブルディスク（ＦＤ）、ＣＤ－ＲＯＭ、ＤＶＤディスク、光磁気ディスク、ＩＣカードなどの「可搬用の物理媒体」に当該プログラムを記憶させておく。そして、コンピュータ２００がこれらから並列処理プログラム２０５ａを読み出して実行するようにしても良い。 Note that the parallel processing program 205a does not necessarily have to be stored in the HDD 205 from the beginning. For example, the program may be stored on a "portable physical medium" inserted into the computer 200, such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card. The computer 200 may then read the parallel processing program 205a from such a medium and execute it.

１並列処理装置
１０制御部
１１生成部
１２プログラム仮実行部
１３特徴値算出部
１４最適化パラメータ抽出部
１５プログラム本実行部
２０記憶部
２１基準プログラム
２２関係データベース

1 Parallel processing device
10 Control unit
11 Generation unit
12 Program temporary execution unit
13 Feature value calculation unit
14 Optimization parameter extraction unit
15 Program main execution unit
20 Storage unit
21 Reference program
22 Relational database

Claims

1. A parallel processing device comprising:
an execution unit that executes, only once and in parallel, an optimization target program having a double loop, with predetermined parallelization parameters set;
a calculation unit that calculates a feature value of the optimization target program from the result of the execution by the execution unit; and
an extraction unit that refers to a database showing the correlation between feature values and optimal parallelization parameters in a sample program having a double loop, and extracts the optimal parallelization parameter corresponding to the feature value calculated by the calculation unit.

2. The parallel processing device according to claim 1, further comprising a generation unit that generates a database showing the correlation between feature values and optimal parallelization parameters for a sample program having the feature values, based on the results of executing the sample program with each of a plurality of parallelization parameter settings,
wherein the extraction unit refers to the database generated by the generation unit and extracts the optimal parallelization parameter corresponding to the feature value of the optimization target program calculated by the calculation unit.

3. The parallel processing device according to claim 1 or 2, wherein the correlation is a correlation between a plurality of feature values indicating, for each loop, the average number of executed instructions, the relative standard deviation of the number of executed instructions, and the average cache reuse rate, and a plurality of parallelization parameters indicating, for each loop, the processor core allocation, the parallelization scheduling method, and the number of chunks of the parallelization scheduling method.