ISC24 bonus task by IIT Kanpur #44

Open
wants to merge 3 commits into
base: ISC24

@divc13 divc13 commented Apr 22, 2024

Detailed Changes in the Pull Request

This pull request includes the following major changes to the mod_micro_nogtom module:

1. Added Compiler Directives

We added OpenMP !$omp simd directives so that the compiler vectorizes the hot loops with SIMD instructions, processing multiple data elements per instruction. The compiler's vectorization report guided this work: it showed which loops were not being vectorized and why, which told us where to place the directives.
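As a minimal illustration of the directive (not code from the module), the sketch below applies !$omp simd to a loop whose iterations are independent. The program name, array names, and sizes are hypothetical. The directive takes effect when the code is compiled with an OpenMP SIMD option such as -qopenmp-simd (Intel) or -fopenmp-simd (GNU); without one it is treated as a comment and the loop still produces the same result.

```fortran
program simd_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024          ! hypothetical problem size
  real(dp) :: a(n), b(n), c(n)
  integer :: j

  do j = 1, n
    a(j) = real(j, dp)
    b(j) = 2.0_dp * real(j, dp)
  end do

  ! Each iteration touches only element j, so the loop is a legal SIMD candidate.
  !$omp simd simdlen(8)
  do j = 1, n
    c(j) = a(j) + 0.5_dp * b(j)
  end do

  ! Self-check: c(j) should equal 2*j for every j.
  do j = 1, n
    if (abs(c(j) - 2.0_dp * real(j, dp)) > 1.0e-12_dp) error stop "simd sketch mismatch"
  end do
  print *, "simd sketch ok"
end program simd_sketch
```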

The !dir$ ivdep directive was added to tell the compiler to ignore assumed (unproven) loop-carried dependencies when vectorizing a loop. It is an assertion by the programmer rather than a check by the compiler, so it is only valid on loops whose iterations are in fact independent.

The !dir$ vector always directive was added above array initializations such as sumh1(:,:,:) = d_zero so that the compiler vectorizes them even when its cost heuristics would otherwise decide against it.

The !dir$ novector directive was added above loops that iterate from 1 to nqx to tell the compiler not to vectorize them. Because nqx is small (found to be 5), the overhead of vectorizing these loops could outweigh any gain and decrease performance.

We also added !$omp parallel do directives to check whether threading could improve performance, but the threading overhead outweighed the gain. We did not remove these directives; instead we run the application with OMP_NUM_THREADS=1, which makes them effectively inactive.
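The placement of the three Intel directives can be sketched as follows. This is an illustrative toy program, not code from mod_micro_nogtom: the arrays field, acc, q and idx and the size n are hypothetical, and only nqx = 5 is taken from the description above. The !dir$ lines are Intel-specific hints; compilers that do not recognize them typically treat them as plain comments, so the results are the same either way.

```fortran
program intel_directive_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 512, nqx = 5   ! n is hypothetical
  real(dp) :: field(n), acc(n), q(nqx)
  integer :: idx(n), j, m

  ! vector always: vectorize the initialization even if the cost model disagrees.
  !dir$ vector always
  field(:) = 0.0_dp

  do j = 1, n
    idx(j) = n + 1 - j        ! a permutation: no two iterations write the same element
    acc(j) = real(j, dp)
  end do

  ! ivdep: we assert that the indirect accesses through idx carry no dependence.
  !dir$ ivdep
  do j = 1, n
    field(idx(j)) = field(idx(j)) + acc(j)
  end do

  ! novector: the trip count is only nqx = 5, so keep this loop scalar.
  !dir$ novector
  do m = 1, nqx
    q(m) = real(m, dp) ** 2
  end do

  ! Self-checks: sum of 1..n and sum of squares 1..5.
  if (abs(sum(field) - real(n * (n + 1) / 2, dp)) > 1.0e-9_dp) error stop "field mismatch"
  if (abs(sum(q) - 55.0_dp) > 1.0e-12_dp) error stop "q mismatch"
  print *, "directive sketch ok"
end program intel_directive_sketch
```

The ivdep assertion is safe here only because idx is a permutation; with repeated indices the loop would carry a real dependence and the directive would be incorrect.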

2. Performed Scalar Expansion

Scalar expansion was performed on several scalar temporaries to allow better vectorization of the loops. Each scalar was replaced by an array with an _expanded suffix:

  • tnew_expanded
  • dp_expanded
  • qe_expanded
  • tmpl_expanded
  • tmpi_expanded
  • zdelta_expanded
  • phases_expanded

This technique allowed some loops to be vectorized that otherwise could not have been, because every iteration overwrote the same scalar variable.

Consider the following loop in the original code:

do k = 1 , kz
	do i = ici1 , ici2
		do j = jci1 , jci2
			tnew = tx(j,i,k)
			dp = dpfs(j,i,k)
			qe = mo2mc%qdetr(j,i,k)

			if ( k > 1 ) then
				sumq0(j,i,k) = sumq0(j,i,k-1) ! total water
				sumh0(j,i,k) = sumh0(j,i,k-1) ! liquid water temperature
			end if

			tmpl = qx(iqql,j,i,k)+qx(iqqr,j,i,k)
			tmpi = qx(iqqi,j,i,k)+qx(iqqs,j,i,k)
			tnew = tnew - wlhvocp*tmpl - wlhsocp*tmpi
			sumq0(j,i,k) = sumq0(j,i,k)+(tmpl+tmpi+qx(iqqv,j,i,k))*dp*regrav

			! Detrained water treated here
			if ( lmicro .and. abs(qe) > activqx ) then
				sumq0(j,i,k) = sumq0(j,i,k) + qe*dp*regrav
				alfaw = qliq(j,i,k)
				tnew = tnew-(wlhvocp*alfaw+wlhsocp*(d_one-alfaw))*qe
			end if
			sumh0(j,i,k) = sumh0(j,i,k) + dp*tnew
		end do
	end do
end do

All the scalars that were assigned in the loop body (tnew, dp, qe, tmpl and tmpi) were replaced with their array versions, and the temporary alfaw was replaced by a direct read of qliq(j,i,k):

do k = 1 , kz
	do i = ici1 , ici2
		!$omp simd simdlen(8)
		do j = jci1 , jci2
			tnew_expanded(j,i,k) = tx(j,i,k)
			dp_expanded(j,i,k) = dpfs(j,i,k)
			qe_expanded(j,i,k) = mo2mc%qdetr(j,i,k)

			if ( k > 1 ) then
				sumq0(j,i,k) = sumq0(j,i,k-1) ! total water
				sumh0(j,i,k) = sumh0(j,i,k-1) ! liquid water temperature
			end if

			tmpl_expanded(j,i,k) = qx(iqql,j,i,k)+qx(iqqr,j,i,k)
			tmpi_expanded(j,i,k) = qx(iqqi,j,i,k)+qx(iqqs,j,i,k)
			tnew_expanded(j,i,k) = tnew_expanded(j,i,k) - wlhvocp*tmpl_expanded(j,i,k) - wlhsocp*tmpi_expanded(j,i,k)
			sumq0(j,i,k) = sumq0(j,i,k)+(tmpl_expanded(j,i,k)+tmpi_expanded(j,i,k)+qx(iqqv,j,i,k))*dp_expanded(j,i,k)*regrav

			! Detrained water treated here
			if ( lmicro .and. abs(qe_expanded(j,i,k)) > activqx ) then
				sumq0(j,i,k) = sumq0(j,i,k) + qe_expanded(j,i,k)*dp_expanded(j,i,k)*regrav
				tnew_expanded(j,i,k) = tnew_expanded(j,i,k)-(wlhvocp*qliq(j,i,k)+wlhsocp*(d_one-qliq(j,i,k)))*qe_expanded(j,i,k)
			end if
			sumh0(j,i,k) = sumh0(j,i,k) + dp_expanded(j,i,k)*tnew_expanded(j,i,k)
		end do
	end do
end do

Similar changes have been performed for the variables zdelta and phases.

3. Restructured Loops for Efficiency

The structure of some loops was modified to make the code more efficient. Consider the following loop in the original code:

do k = 2 , kz
	do i = ici1 , ici2
		do j = jci1 , jci2
			do kk = 2 , k
				if ( mc2mo%fcc(j,i,kk-1) > cldtopcf .and. &
					mc2mo%fcc(j,i,kk)  <= cldtopcf ) then
					cldtopdist(j,i,k) = cldtopdist(j,i,k) + mo2mc%delz(j,i,kk)
				end if
			end do
		end do
	end do
end do

This loop was restructured as follows to avoid repeating the inner kk summation for every level k at each (i, j) point. The modified code:

!dir$ vector always
cloud_sum_calc(:,:) = d_zero
!$omp parallel do
do k = 2 , kz
	do i = ici1 , ici2
		!$omp simd simdlen(8)
		do j = jci1 , jci2
			if ( mc2mo%fcc(j,i,k-1) > cldtopcf .and. &
				mc2mo%fcc(j,i,k)  <= cldtopcf ) then
				cloud_sum_calc(j,i) = cloud_sum_calc(j,i) + mo2mc%delz(j,i,k)
			end if
		end do
	end do
end do

!$omp parallel do
do k = 2 , kz
	do i = ici1 , ici2
		!$omp simd simdlen(8)
		do j = jci1 , jci2
			cldtopdist(j,i,k) = cloud_sum_calc(j, i)
		end do
	end do
end do

The modified code stores the sum values in a temporary array cloud_sum_calc first, which is then used to modify the cldtopdist array.
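Note that in the original loop cldtopdist(j,i,k) accumulates only over kk = 2..k, so it is a running sum over levels, whereas cloud_sum_calc holds the total over all levels from 2 to kz. If the per-level running sum needs to be preserved, a single-pass variant could look like the sketch below. It is a standalone toy program under stated assumptions, not the module code: the arrays fcc, delz, ref, fast and running, the grid sizes, and the random test data are all hypothetical. The k loop is kept sequential because level k depends on the accumulation from level k-1; only the innermost j loop is marked for SIMD.

```fortran
program cldtop_running_sum
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nj = 6, ni = 5, kz = 8      ! hypothetical grid sizes
  real(dp), parameter :: cldtopcf = 0.1_dp          ! hypothetical threshold value
  real(dp) :: fcc(nj,ni,kz), delz(nj,ni,kz)
  real(dp) :: ref(nj,ni,kz), fast(nj,ni,kz), running(nj,ni)
  integer :: i, j, k, kk

  ! Random test data: cloud fraction in [0, 0.2), layer thickness in [100, 200).
  call random_number(fcc)
  fcc = 0.2_dp * fcc
  call random_number(delz)
  delz = 100.0_dp + 100.0_dp * delz

  ! Reference: the original quadruple loop, O(kz**2) work per (i, j) point.
  ref = 0.0_dp
  do k = 2, kz
    do i = 1, ni
      do j = 1, nj
        do kk = 2, k
          if (fcc(j,i,kk-1) > cldtopcf .and. fcc(j,i,kk) <= cldtopcf) then
            ref(j,i,k) = ref(j,i,k) + delz(j,i,kk)
          end if
        end do
      end do
    end do
  end do

  ! Single pass: carry a running sum per (i, j) and store it at every level.
  running = 0.0_dp
  fast = 0.0_dp
  do k = 2, kz                       ! sequential: level k needs level k-1
    do i = 1, ni
      !$omp simd
      do j = 1, nj
        if (fcc(j,i,k-1) > cldtopcf .and. fcc(j,i,k) <= cldtopcf) then
          running(j,i) = running(j,i) + delz(j,i,k)
        end if
        fast(j,i,k) = running(j,i)
      end do
    end do
  end do

  if (maxval(abs(ref - fast)) > 1.0e-9_dp) error stop "running sum differs from reference"
  print *, "running sum matches reference"
end program cldtop_running_sum
```

Because the running sum adds the same terms in the same order as the reference, the two results should agree to rounding, and the work per (i, j) point drops from quadratic to linear in kz.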

Correctness Validation

The team verified the correctness of the changes by comparing the output file generated by the modified implementation with the one generated by the original implementation. The experiments were conducted on the PARAM Sanganak supercomputer at IIT Kanpur. lrcemip_perturb was set to false to disable randomization so that the two outputs could be compared directly.

Build Script

source $PROJECT/RegCM-setvars.sh
source $PROJECT/IPM-setvars.sh
./configure CC=icc FC=ifort CXX=icpc MPICC=mpiicc MPIFC=mpiifort MPIF90=mpiifort CFLAGS="-g -O3" FCFLAGS="-g -O3 -qopenmp -diag-disable=10448 -qopenmp-simd -march=core-avx2 -align array64byte -assume contiguous_assumed_shape -assume contiguous_pointer"
make version
make install

Run Script

#!/bin/sh
#SBATCH -N 4
#SBATCH --error=err.out
#SBATCH --output=out.out
#SBATCH --time=01:00:00
#SBATCH --partition=RM

source /jet/packages/oneapi/v2023.2.0/setvars.sh
source $PROJECT/RegCM-setvars.sh
source $PROJECT/IPM-setvars.sh
cp $REGCM_ROOT/Testing/isc24.in .
cp $REGCM_ROOT/Testing/rcemip.in profile.in
mkdir output
ln -sf $REGCM_ROOT/bin/regcmMPIRCEMIP .
export OMP_NUM_THREADS=1
mpirun -ppn 128 ./regcmMPIRCEMIP ./isc24.in 

Performance Improvements

We measured the performance of the application, specifically the nogtom module, by profiling it with VTune on PARAM Sanganak. Since the code in the module is serial, we ran 48 processes on a single node and measured the total compute time of the nogtom subroutine. The input files were altered to run for 1 day instead of the 10 days in the original input file.

For the smaller input file, isc24_small.in, the compute time dropped from about 300 seconds to 267 seconds, a speedup of about 1.12x (112.3%). The times are the total compute time of the nogtom subroutine across all 48 processes.
As expected from vectorization, the improvement was larger on the larger input file, isc24.in: from 6331 seconds to 5143 seconds, a speedup of about 1.23x (123.1%).
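The speedup figures are the ratio of the original compute time to the optimized compute time. As a quick check of the arithmetic, the short program below recomputes both ratios from the times quoted above; the program and variable names are hypothetical, and the 300-second figure is the approximate value given in the text.

```fortran
program speedup_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: s_small, s_large

  ! speedup = original time / optimized time
  s_small = 300.0_dp / 267.0_dp      ! isc24_small.in (300 s is approximate)
  s_large = 6331.0_dp / 5143.0_dp    ! isc24.in

  print '(a,f6.3)', "speedup, isc24_small.in: ", s_small
  print '(a,f6.3)', "speedup, isc24.in:       ", s_large

  ! Both ratios should round to the 1.12x and 1.23x quoted in the text.
  if (abs(s_small - 1.124_dp) > 5.0e-3_dp) error stop "unexpected small-input speedup"
  if (abs(s_large - 1.231_dp) > 5.0e-3_dp) error stop "unexpected large-input speedup"
end program speedup_check
```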

Submission for the Bonus Task

This pull request is the submission for the bonus task of RegCM in the Student Cluster Competition (SCC) at ISC'24 from Team ExaDecimals, IIT Kanpur.

The changes described above aim to improve the performance and efficiency of the mod_micro_nogtom module, while maintaining the correctness of the implementation. The team has put significant effort into optimizing the code and is confident that these changes will contribute to the overall performance of the RegCM model.

@divc13 divc13 changed the title Added optimized version for ISC24 bonus task ISC24 bonus task by IIT Kanpur Apr 23, 2024