WRF detailed setup procedure #689
Thanks for all these detailed instructions, but azurehpc/apps/* (e.g., wrf) should only contain scripts and code to build, install and run applications, independent of cluster deployment. I think the best location for deploying WRF on a CycleCloud cluster would be under the experimental directory.
Would it be possible to update/add the WRF build and install scripts (including creating the WRF data) in azurehpc/wrf, and put the complete deployment of WRF on CycleCloud under the experimental directory?
@@ -0,0 +1,105 @@
# Install and Setup CycleCloud for a Lab environment
Do examples/cycleserver_msi and examples/cycleserver deploy the VNET and cycleserver automatically via a simple azurehpc config file?
It seems you are deploying the same thing, but with all the steps done manually.
Yes, I'm using manual steps. They can be useful in two scenarios:
- where people may not want to install it using the azurehpc scripts; or
- for learning purposes, where people want to understand exactly what is installed and required.
I can mention examples/cycleserver_msi and examples/cycleserver as an alternative option.
apps/wrf/readme.md
Outdated
Summary of this procedure:
- Installs CycleCloud environment from scratch
- Creates NFS storage server using CycleCloud cluster template
Would Azure NetApp Files or a parallel file system (PFS) be better for production?
Yes, indeed. But this is a procedure to set up a lab environment.
I will add comments about the lab environment, and about ANF or a PFS as options for production.
## Download azurehpc GitHub repository
cd /data
#git clone https://github.com/Azure/azurehpc.git
git clone https://github.com/marcusgaspar/azurehpc.git
Is this URL correct? It points to your fork.
It is temporary, as I'm currently using my fork during POCs.
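One way to keep the fork usable during POCs while leaving the upstream repository as the documented default is to make the URL a parameter. This is only a dry-run sketch, not part of the PR: `clone_cmd` is a hypothetical helper that prints the command instead of running it, and both URLs are the ones shown in the hunk above.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): print the clone command.
# With no argument it uses the upstream repository; pass a fork URL to override.
clone_cmd() {
    repo=${1:-https://github.com/Azure/azurehpc.git}
    printf '%s\n' "git clone ${repo}"
}

# Default: upstream repository.
clone_cmd
# Temporary override while testing from a fork.
clone_cmd https://github.com/marcusgaspar/azurehpc.git
```

With this shape the readme can document the upstream URL, and the fork only appears as an explicit override.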
apps/wrf/readme.md
Outdated
mkdir ~/test1
cd ~/test1

qsub -l select=1:nodearray=execute1:ncpus=60:mpiprocs=60,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
For HBv2, should this be ncpus=120?
These were tests I ran to measure the execution time with different configurations. I forgot to add the execution time results. I will add a chart with them.
- Test 2
Why so many tests? Is the only difference between them the number of nodes (select=N)?
These were tests I ran to measure the execution time with different configurations. I forgot to add the execution time results. I will add a chart with them.
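If the test cases differ mainly in `select=N`, one option is to generate the submissions in a loop rather than repeating the block. The following is a minimal dry-run sketch, not part of the PR: `wrf_qsub_cmd` is a hypothetical helper, the node counts and directory names are illustrative, and the `qsub` arguments are copied from the commands under review. It only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): print the qsub command for a given
# node count, keeping every other argument as in the reviewed readme.
wrf_qsub_cmd() {
    nodes=$1
    ncpus=${2:-60}          # per-node core count used in the reviewed tests
    mpiprocs=${3:-$ncpus}   # MPI ranks per node; defaults to ncpus
    printf '%s\n' "qsub -l select=${nodes}:nodearray=execute1:ncpus=${ncpus}:mpiprocs=${mpiprocs},place=scatter:excl -v \"SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run\" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs"
}

# Dry run: show what would be submitted for 1, 2 and 3 nodes.
for n in 1 2 3; do
    echo "mkdir -p ~/test_${n}nodes && cd ~/test_${n}nodes"
    wrf_qsub_cmd "$n"
done
```

Printing instead of submitting keeps the sketch safe to run anywhere; on a real cluster the printed lines could be reviewed and then executed.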
mkdir ~/test5
cd ~/test5

qsub -l select=3:nodearray=execute1:ncpus=60:mpiprocs=60,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
For HB120rs_v3, should this be ncpus=120? There are also references to hbv2.
During my tests, I used the hbv2 reference and was able to run the tests successfully on both HBv2 and HBv3.
Do you recommend changing to an hbv3 reference when running on HBv3?
If I change it, do I need to run the WRF and WPS builds again?
Yes, it would be better to use HBv3, although it is not absolutely necessary. The latest is now HBv4, and it will keep changing.
apps/wrf/readme.md
Outdated
mkdir ~/test6
cd ~/test6

qsub -l select=3:nodearray=execute1:ncpus=64:mpiprocs=64,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
This is an HB120-64rs_v3 test, but it references hbv2?
Same as above. During my tests, I used the hbv2 reference and was able to run the tests successfully on both HBv2 and HBv3.
Do you recommend changing to an hbv3 reference when running on HBv3?
If I change it, do I need to run the WRF and WPS builds again?
I think it's good practice to build on the SKU you are running on. It's confusing to run on HBv3 but reference hbv2. To simplify the documentation, I would just pick HBv3 (because it's newer than HBv2) and give a few examples running specifically on HBv3. You could then add a note stating that a very similar procedure can also be used to run WRF on HBv2.
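To keep the SKU name and the install path from drifting apart, the `-v` argument passed to `qsub` could be derived from a single SKU value. This is a sketch under assumptions, not the PR's approach: `wrf_job_vars` is a hypothetical helper, and it assumes a per-SKU build installed under `/apps/<sku>/wrf-openmpi/WRF-4.1.5/run`, mirroring the hbv2 path used in the reviewed commands. An hbv3 tree would exist only after building WRF and WPS on that SKU.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): build the qsub -v value from one SKU name.
# Assumption: the install tree follows /apps/<sku>/wrf-openmpi/WRF-4.1.5/run.
wrf_job_vars() {
    sku=${1:-hbv3}
    printf '%s\n' "SKU_TYPE=${sku},INPUTDIR=/apps/${sku}/wrf-openmpi/WRF-4.1.5/run"
}

# Print the variable string for each SKU.
wrf_job_vars hbv3
wrf_job_vars hbv2
```

The output can be passed as `qsub -v "$(wrf_job_vars hbv3)" ...`, so that switching SKU changes both values together.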
## Prerequisites

Cluster is built with the desired configuration for networking, storage, compute etc. You can see the tutorial or examples folder in this repo for how to set this up. Spack is installed (See [here](../spack/readme.md) for details).
- You need a cluster built with the desired configuration for networking, storage, compute etc. You can use the tutorial in this repo with an end-to-end instructions to setup a lab environment on Cycle Cloud to run WPS and WRF v4. (See [Install and run WPS and WRF v4 - end-to-end setup guide](../../experimental/wrf_on_cyclecloud/readme.md) for details).
Please link to master branch and not your branch.
- As this procedure uses HBv2 VMs to run WRFv4 simulations, you may need to request quota increase for this type of SKU in the subscription and region you will deploy the environment. You can use different SKU if you want to.
It may be worth mentioning that other specialty VMs can also work with some minor modifications.
mkdir ~/test1
cd ~/test1

qsub -l select=1:nodearray=execute1:ncpus=60:mpiprocs=30,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
For HBv2, with the new NUMA topology (30 NUMA domains --> 4 NUMA domains), the only ncpus or mpiprocs values that would make sense are 32, 64, 96 or 120. I do not think 60 and 30 are optimal with the new NUMA topology.
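Before settling on `ncpus`/`mpiprocs` values, it can help to check what topology a compute node actually reports. The sketch below is an illustration rather than part of the PR: `numa_nodes_from_lscpu` is a hypothetical helper that reads `lscpu` output on stdin and prints the NUMA node count. It assumes the English-locale `NUMA node(s):` line that `lscpu` prints on Linux.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): extract the NUMA node count from
# lscpu output supplied on stdin. Assumes the "NUMA node(s):" summary line.
numa_nodes_from_lscpu() {
    awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage on a compute node (skipped if lscpu is not installed).
if command -v lscpu >/dev/null 2>&1; then
    echo "NUMA domains reported: $(LC_ALL=C lscpu | numa_nodes_from_lscpu)"
fi
```

Running this inside an interactive job on the execute nodes shows whether the image exposes the 4-domain layout discussed here, which is the layout the suggested process counts assume.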
mkdir ~/test2
cd ~/test2

qsub -l select=2:nodearray=execute1:ncpus=60:mpiprocs=30,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
See the comment above about 30 and 60 ncpus/mpiprocs on HBv2 with 4 NUMA domains.
mkdir ~/test3
cd ~/test3

qsub -l select=3:nodearray=execute1:ncpus=60:mpiprocs=30,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
See the same comment above.
#### Run real.exe
echo "-- Run real.exe"
mpirun $mpi_options -n $NPROCS --hostfile $PBS_NODEFILE --bind-to numa ./real.exe
With the new HBv2 NUMA topology, might --bind-to l3cache be a better choice?
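One way to compare the two binding policies without editing the script each time is to make the policy a parameter and ask Open MPI to print the resulting bindings. This is a dry-run sketch under assumptions, not the PR's script: `BIND_TO` and `real_exe_cmd` are hypothetical names, while `--bind-to` and `--report-bindings` are existing Open MPI `mpirun` options. Whether `l3cache` is usable as a binding target depends on the L3 caches being detected on the node.

```shell
#!/bin/sh
# Hypothetical wrapper (not in the PR): build the real.exe launch line with a
# selectable binding policy. Defaults to "numa", the value in the reviewed script.
real_exe_cmd() {
    bind_to=${1:-${BIND_TO:-numa}}
    # The \$ escapes keep the PBS script variables literal in the printed line.
    printf '%s\n' "mpirun \$mpi_options -n \$NPROCS --hostfile \$PBS_NODEFILE --bind-to ${bind_to} --report-bindings ./real.exe"
}

# Dry run: print both variants for comparison.
real_exe_cmd numa
real_exe_cmd l3cache
```

With `--report-bindings`, each rank's binding is written to the job's stderr, which makes it possible to confirm how ranks were placed before comparing run times.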
apps/wrf/run_wps-real_openmpi.pbs
Outdated
#### Run real.exe
echo "-- Run real.exe"
mpirun $mpi_options -n $NPROCS --hostfile $PBS_NODEFILE --bind-to numa ./real.exe
Same comment as above.
Follow the procedures [here](https://docs.microsoft.com/en-us/azure/cyclecloud/tutorials/modify-cluster-template?view=cyclecloud-8#import-the-new-cluster-template) to upload the Cycle Cloud custom template created for WRF.
Use the template: [opbswrf-template.txt](opbswrf-template.txt)
Please change the link to master instead of the fork.
### Install WPS/WRF v4 software (via “azurehpc” scripts)
Now you have a WRF cluster properly configured and running in Cycle Cloud.

You can follow the instructions here to finish the WPS/WRF installation: [Install and run WPS and WRF v4 - Setup guide](/apps/wrf/readme.md)
Please change the link to the master branch instead of the forked branch.
### Test Results
In the graph below you can compare the execution time from tests performed, with different number of nodes, cores, mpicores and SKUs:
![Import-Template1](images/wrf-test-results.png)
In the chart, ncpus/mpiprocs=60 does not make sense for HBv2 with 4 NUMA domains (see comments above).
fix permissions related to managed identity on the subscription
In this pull request I'm detailing all the setup procedures to run and test WRF v4 using CycleCloud.
The original setup procedure was not clear enough and some steps were missing. It took me a long time to figure out the missing steps and make it work.
I'm sharing this back with the community because I believe it will be useful for anyone who wants to run a WRF v4 test on Azure using CycleCloud.