WRF detailed setup procedure #689

Open · wants to merge 40 commits into master

Conversation

marcusgaspar:

In this Pull Request I'm detailing all the setup procedures to run and test WRF v4 using CycleCloud.
The original setup procedure was not clear enough and there were some missing steps; it took me a long time to figure them out and make everything work.
I'm sharing this back with the community, as I believe it will be useful for anyone who wants to run a WRF v4 test on Azure using CycleCloud.

@xpillons requested a review from garvct on October 27, 2022 16:28
garvct (Collaborator) left a comment:

Thanks for all these detailed instructions, but azurehpc/apps/* (e.g., wrf) should only contain scripts and code to build, install, and run applications (independent of cluster deployment). I think the best location for deploying WRF on a CycleCloud cluster would be under the experimental directory.
Would it be possible to update/add the wrf build and install scripts (including creating the wrf data) in azurehpc/wrf and put the complete deployment of WRF on CycleCloud under the experimental directory?

@@ -0,0 +1,105 @@
# Install and Setup CycleCloud for a Lab environment
garvct (Collaborator):

Do examples/cycleserver_msi and examples/cycleserver deploy the VNET and cycleserver automatically via a simple azurehpc config file?
It seems you are deploying the same thing, but with all the manual steps?

marcusgaspar (Author):

Yes, I'm using manual steps. These manual steps can be useful in scenarios:

  • where people may not want to install it using the azurehpc scripts; or
  • for learning purposes, where people want to understand exactly what is installed/required.

I can add a mention of examples/cycleserver_msi and examples/cycleserver as an alternative option.
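
For reference, the scripted alternative mentioned above reduces the manual steps to a short build; a minimal sketch (the example directory layout and the default config file name are assumptions, check the repo's top-level readme):

```
# Scripted alternative (assumes the azhpc CLI from this repo is already installed and sourced)
cd azurehpc/examples/cycleserver
# review/edit the example azurehpc config (resource group, region, VNET) before building
azhpc-build    # provisions the VNET and the CycleCloud server described in the example config
```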


Summary of this procedure:
- Installs CycleCloud environment from scratch
- Creates NFS storage server using CycleCloud cluster template
garvct (Collaborator), Oct 31, 2022:
Would Azure NetApp Files or a PFS be better for production?

marcusgaspar (Author), Nov 1, 2022:

Yes, indeed. But this is a procedure to set up a Lab environment.
I will add comments noting that this is a Lab environment and mentioning ANF or a PFS as options for production.
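
As an illustration of the production option, compute nodes would mount an Azure NetApp Files export like any other NFS share; a minimal sketch with a placeholder address and volume path (both hypothetical):

```
# Mount an Azure NetApp Files NFSv3 volume on a compute node (placeholder address/volume)
sudo mkdir -p /mnt/anf
sudo mount -t nfs -o rw,hard,tcp,vers=3,rsize=262144,wsize=262144 10.0.2.4:/wrfvol /mnt/anf
```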

## Download azurehpc GitHub repository
cd /data
#git clone https://github.com/Azure/azurehpc.git
git clone https://github.com/marcusgaspar/azurehpc.git
garvct (Collaborator):

Is this URL correct? You are pointing to your fork.

marcusgaspar (Author):

It is temporary, as I'm currently using my fork during POCs.

mkdir ~/test1
cd ~/test1

qsub -l select=1:nodearray=execute1:ncpus=60:mpiprocs=60,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
garvct (Collaborator):

For HBv2, should ncpus=120?

marcusgaspar (Author):

These were tests I did to measure the execution time with different configurations. I forgot to add the execution-time results; I will add a chart with them.

qsub -l select=1:nodearray=execute1:ncpus=60:mpiprocs=60,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
```

- Test 2
garvct (Collaborator):

Why so many tests? Is the only difference between each test the number of nodes (select=N)?

marcusgaspar (Author):

These were tests I did to measure the execution time with different configurations. I forgot to add the execution-time results; I will add a chart with them.

mkdir ~/test5
cd ~/test5

qsub -l select=3:nodearray=execute1:ncpus=60:mpiprocs=60,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
garvct (Collaborator):

For HB120rs_v3, should ncpus=120? There are also references to hbv2?

marcusgaspar (Author):

During my tests I used the hbv2 reference, and I was able to run tests successfully on both HBv2 and HBv3.
Do you recommend changing to the hbv3 reference when running on HBv3?
If I change it, do I need to run the WRF and WPS build again?

garvct (Collaborator):

Yes, it would be better to use HBv3 (but it's not absolutely necessary; the latest is now HBv4, and it will keep changing).

mkdir ~/test6
cd ~/test6

qsub -l select=3:nodearray=execute1:ncpus=64:mpiprocs=64,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
garvct (Collaborator):

This is an HB120-64rs_v3 test, but there are hbv2 references?

marcusgaspar (Author):

Same as above. During my tests I used the hbv2 reference and was able to run tests successfully on both HBv2 and HBv3.
Do you recommend changing to the hbv3 reference when running on HBv3?
If I change it, do I need to run the WRF and WPS build again?

garvct (Collaborator):

I think it's good practice to build on the SKU you are running on. It's confusing to run on HBv3 but reference hbv2. To simplify the documentation, I would just pick HBv3 (because it's newer than HBv2) and give a few examples running specifically on HBv3. You could then add a note stating that a very similar procedure can also be used to run WRF on HBv2.
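
For illustration, an HBv3 full-node submission along those lines might look like the sketch below; it assumes WRF was rebuilt on HBv3 so that an /apps/hbv3/... input path exists (that path is hypothetical here):

```
# Hypothetical HBv3 run: 120 ranks per node on 3 nodes, binaries/data rebuilt under /apps/hbv3
qsub -l select=3:nodearray=execute1:ncpus=120:mpiprocs=120,place=scatter:excl \
     -v "SKU_TYPE=hbv3,INPUTDIR=/apps/hbv3/wrf-openmpi/WRF-4.1.5/run" \
     /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
```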

## Prerequisites

Cluster is built with the desired configuration for networking, storage, compute etc. You can see the tutorial or examples folder in this repo for how to set this up. Spack is installed (See [here](../spack/readme.md) for details).
- You need a cluster built with the desired configuration for networking, storage, compute, etc. You can use the tutorial in this repo with end-to-end instructions to set up a lab environment on CycleCloud to run WPS and WRF v4. (See [Install and run WPS and WRF v4 - end-to-end setup guide](../../experimental/wrf_on_cyclecloud/readme.md) for details.)
garvct (Collaborator):

Please link to the master branch and not your branch.

## Prerequisites

Cluster is built with the desired configuration for networking, storage, compute etc. You can see the tutorial or examples folder in this repo for how to set this up. Spack is installed (See [here](../spack/readme.md) for details).
- You need a cluster built with the desired configuration for networking, storage, compute, etc. You can use the tutorial in this repo with end-to-end instructions to set up a lab environment on CycleCloud to run WPS and WRF v4. (See [Install and run WPS and WRF v4 - end-to-end setup guide](../../experimental/wrf_on_cyclecloud/readme.md) for details.)
- As this procedure uses HBv2 VMs to run WRF v4 simulations, you may need to request a quota increase for this SKU type in the subscription and region where you will deploy the environment. You can use a different SKU if you want to.
garvct (Collaborator):

It may be worth mentioning that other specialty VMs can also work with some minor modifications.

mkdir ~/test1
cd ~/test1

qsub -l select=1:nodearray=execute1:ncpus=60:mpiprocs=30,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
garvct (Collaborator):

For HBv2, with the new NUMA topology (30 NUMA domains --> 4 NUMA domains), the only ncpus or mpiprocs values that would make sense are 32, 64, 96, or 120. I do not think 60 and 30 are optimal with the new NUMA topology.
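
As an illustration of this point, a full-node HBv2 submission that divides evenly across the 4 NUMA domains could look like the sketch below (same nodearray, paths, and PBS script as the examples in this readme):

```
# HBv2 full-node run: 120 ranks, one per core, split evenly across the 4 NUMA domains
qsub -l select=1:nodearray=execute1:ncpus=120:mpiprocs=120,place=scatter:excl \
     -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" \
     /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
```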

mkdir ~/test2
cd ~/test2

qsub -l select=2:nodearray=execute1:ncpus=60:mpiprocs=30,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
garvct (Collaborator):

See the comment above about 30 and 60 ncpus/mpiprocs on HBv2 with 4 NUMA domains.

mkdir ~/test3
cd ~/test3

qsub -l select=3:nodearray=execute1:ncpus=60:mpiprocs=30,place=scatter:excl -v "SKU_TYPE=hbv2,INPUTDIR=/apps/hbv2/wrf-openmpi/WRF-4.1.5/run" /data/azurehpc/apps/wrf/run_wrf_openmpi.pbs
garvct (Collaborator):

See the same comment above.


#### Run real.exe
echo "-- Run real.exe"
mpirun $mpi_options -n $NPROCS --hostfile $PBS_NODEFILE --bind-to numa ./real.exe
garvct (Collaborator):

Now, with the new HBv2 NUMA topology, --bind-to l3cache may be more optimal?
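
For reference, that change would only touch the binding option on the mpirun line in the PBS script; a sketch assuming Open MPI (which accepts l3cache as a --bind-to target):

```
# Bind each MPI rank to an L3 cache domain instead of a NUMA domain (Open MPI)
mpirun $mpi_options -n $NPROCS --hostfile $PBS_NODEFILE --bind-to l3cache ./real.exe
```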


#### Run real.exe
echo "-- Run real.exe"
mpirun $mpi_options -n $NPROCS --hostfile $PBS_NODEFILE --bind-to numa ./real.exe
garvct (Collaborator):

Same comment as above.

```

Follow the procedures [here](https://docs.microsoft.com/en-us/azure/cyclecloud/tutorials/modify-cluster-template?view=cyclecloud-8#import-the-new-cluster-template) to upload the Cycle Cloud custom template created for WRF.
Use the template: [opbswrf-template.txt](opbswrf-template.txt)
garvct (Collaborator):

Please change the link to master instead of the fork.
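
As a side note on the template-upload step quoted above, the import can also be done from the CycleCloud CLI; a minimal sketch (assuming the CLI is installed and initialized against your CycleCloud server):

```
# Import the custom PBS/WRF cluster template into CycleCloud
cyclecloud import_template -f opbswrf-template.txt
```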

### Install WPS/WRF v4 software (via “azurehpc” scripts)
Now you have a WRF cluster properly configured and running in Cycle Cloud.

You can follow the instructions here to finish the WPS/WRF installation: [Install and run WPS and WRF v4 - Setup guide](/apps/wrf/readme.md)
garvct (Collaborator):

Please change the link to the master branch instead of the forked branch.


### Test Results
In the graph below you can compare the execution times from the tests performed with different numbers of nodes, cores, mpiprocs, and SKUs:
![Import-Template1](images/wrf-test-results.png)
garvct (Collaborator):

In the chart, ncpus/mpiprocs=60 does not make sense for HBv2 with 4 NUMA domains? (See comments above.)
