Detailed Changes in the Pull Request
This pull request includes the following major changes to the mod_micro_nogtom module:

1. Added Compiler Directives
The code was optimized by adding OpenMP directives that let the compiler use SIMD instructions, which significantly improved performance. The !$omp simd directives were placed on loops to enable vectorization, so that the compiler can process multiple data elements in parallel. The compiler vectorization report was a valuable resource during this process: it pointed to potential areas for optimization and guided the placement of the OpenMP directives.

The !dir$ ivdep directive was added to inform the compiler that the loop has no dependencies that would prevent vectorization. It asks the compiler to ignore assumed dependencies between iterations, so that the loop can be vectorized.

The !dir$ vector always directive was added above the initialization of matrices like sumh1(:,:,:) = d_zero to ensure that the compiler always vectorizes them.

The directive !dir$ novector was added above loops that iterate from 1 to nqx to instruct the compiler not to vectorize those loops. This decision was based on the observation that nqx is relatively small (found to be 5), so vectorizing these loops may incur an overhead that could decrease performance.

We also added !$omp parallel do directives to check whether threading could bring any performance improvement, but the overheads of threading turned out to outweigh the gains. We did not remove these directives; instead, we run the application after exporting OMP_NUM_THREADS=1, which makes them redundant.

2. Performed Scalar Expansion
Scalar expansion has been performed on several arrays to allow for better vectorization of the loops. The following arrays have been expanded:
tnew_expanded
dp_expanded
qe_expanded
tmpl_expanded
tmpi_expanded
zdelta_expanded
phases_expanded
This optimization technique helped vectorize some loops that could not otherwise have been vectorized, because a scalar variable was being overwritten inside the loop.
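A minimal sketch of the idea, not the actual mod_micro_nogtom code: the arrays a and b, the scalar temporary t and its expanded counterpart t_expanded are hypothetical names chosen only for this illustration. A scalar that is overwritten in every iteration is replaced with an array holding one element per iteration, and the expanded loop is annotated with !$omp simd.

```fortran
program scalar_expansion_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: n = 8          ! hypothetical loop length
  real(wp) :: a(n), b(n)
  real(wp) :: c_scalar(n), c_expanded(n)
  real(wp) :: t                        ! scalar temporary, overwritten every iteration
  real(wp) :: t_expanded(n)            ! expanded version: one slot per iteration
  integer :: i

  ! Fill the hypothetical inputs.
  do i = 1, n
    a(i) = real(i, wp)
    b(i) = 0.5_wp * real(i, wp)
  end do

  ! Before: every iteration writes to the same scalar t.
  do i = 1, n
    t = a(i) + b(i)
    c_scalar(i) = t * t
  end do

  ! After: the scalar is expanded into an array, so no storage
  ! location is shared between iterations.
  !$omp simd
  do i = 1, n
    t_expanded(i) = a(i) + b(i)
    c_expanded(i) = t_expanded(i) * t_expanded(i)
  end do

  ! Both versions must produce the same values.
  if (maxval(abs(c_scalar - c_expanded)) > 1.0e-12_wp) then
    error stop "scalar and expanded versions differ"
  end if
  print *, "scalar and expanded versions agree"
end program scalar_expansion_sketch
```

The !$omp simd line only takes effect when OpenMP SIMD support is enabled at compile time (for example -fopenmp-simd with gfortran or -qopenmp-simd with the Intel compilers); otherwise it is treated as a comment. The expanded array costs extra memory, one element per iteration, in exchange for removing the shared scalar.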
Consider the following loop in the original code
All the scalars that were being assigned to, i.e., tnew, dp, qe, tmpl and tmpi, were replaced with their vector versions.

Similar changes have been performed for the variables zdelta and phases.

3. Restructured Loops for Efficiency
The structure of some loops has been modified to make the code more efficient. Consider the following loop in the original code

which was restructured in the following manner to avoid the extra computation taking place kz times for each combination of (i, j).

The modified code stores the sum values in a temporary array, cloud_sum_calc, first, which is then used to modify the cldtopdist array.

Correctness Validation
The team has ensured the correctness of the changes by comparing the output file generated by the modified implementation with the output file generated by the original implementation. The experiments were conducted on the PARAMSANGANAK supercomputer at IIT Kanpur.
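The same compare-against-the-original approach can be sketched at small scale for the loop restructuring described above. This is an illustration under assumptions, not the code from the pull request: field, dist_ref, dist_new, col_sum and both loop bodies are hypothetical. The reference loop recomputes a sum over k for every (i, j, k), the restructured loop stores the sums in a temporary array first, and the two results are then compared.

```fortran
program loop_restructure_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: ni = 4, nj = 3, kz = 5   ! hypothetical sizes
  real(wp) :: field(nj, ni, kz)
  real(wp) :: dist_ref(nj, ni, kz), dist_new(nj, ni, kz)
  real(wp) :: col_sum(nj, ni)                    ! temporary array of sums
  integer :: i, j, k

  ! Fill the hypothetical input with positive values.
  do k = 1, kz
    do i = 1, ni
      do j = 1, nj
        field(j, i, k) = real(j + 10*i + 100*k, wp)
      end do
    end do
  end do

  ! Reference shape: the sum over k is recomputed kz times
  ! for each combination of (i, j).
  do k = 1, kz
    do i = 1, ni
      do j = 1, nj
        dist_ref(j, i, k) = field(j, i, k) / sum(field(j, i, :))
      end do
    end do
  end do

  ! Restructured shape: compute each sum once per (i, j) ...
  do i = 1, ni
    do j = 1, nj
      col_sum(j, i) = sum(field(j, i, :))
    end do
  end do

  ! ... then reuse the stored sums in the main loop.
  do k = 1, kz
    do i = 1, ni
      do j = 1, nj
        dist_new(j, i, k) = field(j, i, k) / col_sum(j, i)
      end do
    end do
  end do

  ! Validation: the restructured loop must reproduce the reference.
  if (maxval(abs(dist_new - dist_ref)) > 1.0e-12_wp) then
    error stop "restructured loop changed the result"
  end if
  print *, "restructured loop matches the reference"
end program loop_restructure_sketch
```

The check at the end plays the role of the output-file comparison: the restructured version is accepted only if it reproduces the reference values.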
lrcemip_perturb was set to false to disable any randomization, so that the validity of our output could be checked.

Build Script
Run Script
Performance Improvements
We checked the performance of the application, specifically the nogtom module, by profiling it using VTune on PARAMSANGANAK. Since the code in the module is serial, we used 48 processes, all on one node, and measured the total compute time of the nogtom subroutine. The input files were altered to run for 1 day instead of the 10 days in the original input file.

For the smaller input file, isc24_small.in, we observed a speedup of about 112.3% (roughly 1.12x), from about 300 seconds to 267 seconds. The time data is the overall compute time of the nogtom subroutine for all the 48 processes.

As we had expected from vectorization of instructions, we got a larger improvement on the larger input file, isc24.in: a speedup of about 123.1% (roughly 1.23x), from 6331 seconds to 5143 seconds.

Submission for the Bonus Task
This pull request is the submission for the bonus task of RegCM in the Student Cluster Competition (SCC) at ISC'24 from Team ExaDecimals, IIT Kanpur.

The changes described above aim to improve the performance and efficiency of the mod_micro_nogtom module, while maintaining the correctness of the implementation. The team has put significant effort into optimizing the code and is confident that these changes will contribute to the overall performance of the RegCM model.