Wolffe is the name of the HPC cluster at WSU; the wiki page for Wolffe is available here.
To log onto Wolffe, you will need to use the SSH protocol. Open a terminal and type the following command:

ssh <username>@wolffe.cdms.westernsydney.edu.au

Replace <username> with your WSU username. You will be prompted to enter your password. Once logged in, you will be in your home directory on the Wolffe cluster.
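If you connect to Wolffe often, you can optionally save the connection details in the ~/.ssh/config file on your own machine so that a short alias is enough. This is just a convenience sketch, and the alias name "wolffe" below is only an example:

Host wolffe
    HostName wolffe.cdms.westernsydney.edu.au
    User <username>

With this in place, ssh wolffe is equivalent to the full command above.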
You’ll also need to be on the CDMS network to access Wolffe. To connect to the VPN, use the OpenVPN client and the profile you have received. Note that the profile is only valid until the end of the year, so staff should request a new profile each year, and students should request a new profile through their supervisors.
Once logged in, you will see the message:
For help with HPC matters, see:
https://wiki.cdms.westernsydney.edu.au/index.php?title=HPC_documentation
Last login: xxxxx from xxx.xxx.xxx.xxx
and any message of the day that may be displayed.
Unlike a standard computer, HPC systems do not have all applications available for use by default. Instead, applications are made available through modules. Modules allow you to load and unload software packages as needed.
To see the available modules, you can use the command:
module avail
This will list all the available modules on Wolffe:
---------------------------------- /usr/share/Modules/modulefiles ----------------------------------
dot module-git module-info modules null use.own
-------------------------------------- /usr/share/modulefiles --------------------------------------
mp-x86_64 mpi/openmpi-x86_64
-------------------------------------- /software/modulefiles ---------------------------------------
10x/cellranger-3.0.2 funtools/funtools mpi/openmpi-4.0.2 R/4.4.2
anaconda/conda3 gnu/gcc-7.4.0 mpi/openmpi-4.1.5 T-RECS
caffe/caffe-cuda-10.0 gnu/gcc-10 nccl/nccl-2.26.2-1
caffe/caffe-fedora gnu/gcc-10.5.0 nextflow/nextflow-25.04.6
casa/casa-5.4.1 gnu/gcc-11.1.0 PyCharm-community-2023.2.3
casa/casa-5.6.0 Java/java-24 Python/Python3.6
casa/casa-5.7.0.pre Julialang/Julia-1.6.2 Python/Python3.7.0
cmake/cmake-3.24.4 karma/karma-1.7.25 Python/Python3.9
cmake/cmake-3.25.3 lammps/lammps-stable-20230801 Python/Python3.10
colmap/colmap-3.11 lammps/lammps-stable-20230822 Python/Python3.11
cuda/cuda-10.0 matlab/matlab2016a Python/Python3.12.3
cuda/cuda-10.2 matlab/matlab2018a PyTorch/Python3.7.0
cuda/cuda-11.0 matlab/matlab2019a PyTorch/Python3.9
cuda/cuda-11.2 miriad/miriad PyTorch/Python3.10
cuda/cuda-11.6 Montage/Montage-6.0 PyTorch/Python3.11
cuda/cuda-12.6 mpi/openmpi-1.8.8 PyTorch/Python3.12.3
You can load a module using the command:
module load <module_name>
For example, to load the Python 3.10 module, you would use:
module load Python/Python3.10
You can check which modules are currently loaded with:
module list
Currently Loaded Modulefiles:
1) Python/Python3.10
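If you want to swap versions or start from a clean environment, the standard Environment Modules commands for unloading and inspecting modules are also available (generic examples, not Wolffe-specific output):

# unload a single module
module unload Python/Python3.10
# unload everything you currently have loaded
module purge
# show what a module does before loading it
module show Python/Python3.10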
Unlike most HPC systems, Wolffe does have git installed by default, so you can use it without loading a module. This means you can clone the workshop repository directly into your home directory:
git clone https://github.com/CRMDS/Intro-HPC-workshop.git
Wolffe uses the Slurm workload manager to manage jobs. To find out what resources are available, you can use the command:
sinfo
This will show you the status of the nodes in the cluster:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up 21-00:00:0 2 mix compute-[002-003]
cpu* up 21-00:00:0 1 alloc compute-001
ampere24 up 7-00:00:00 4 alloc a30-[002-005]
ampere80 up 7-00:00:00 1 alloc a100-100
ampere80 up 7-00:00:00 1 idle a100-101
ampere40 up 7-00:00:00 1 alloc a100-000
ampere40 up 7-00:00:00 2 idle a100-[001-002]
The STATE column indicates the status of each node: alloc means the node is fully allocated, mix means it is partially allocated, and idle means it is available.

To see the jobs currently running and queued on the cluster, you can use:
squeue
This will show you a list of jobs in the queue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18226 ampere24 RLGOAL 30069287 R 2-10:31:01 1 a30-005
18287 ampere24 mat1 18870679 R 21:04:11 1 a30-002
18288 ampere24 mat2 18870679 R 21:03:45 1 a30-003
18289 ampere24 mat3 18870679 R 21:03:13 1 a30-004
18100 ampere40 EA_V4_Mi 30069287 R 4-15:02:23 1 a100-000
18304 ampere80 vec2word 30069287 R 8:52:30 1 a100-100
17962 cpu X12.2_15 30031031 R 5-20:01:58 1 compute-001
17963 cpu X12.2_14 30031031 R 5-20:01:55 1 compute-001
17964 cpu X12.2_14 30031031 R 5-19:57:36 1 compute-001
17965 cpu X12.2_13 30031031 R 5-19:52:40 1 compute-001
17966 cpu X12.2_13 30031031 R 5-19:52:11 1 compute-001
17967 cpu X12.2_12 30031031 R 5-19:35:30 1 compute-001
17968 cpu X12.2_12 30031031 R 5-19:34:30 1 compute-001
17969 cpu X12.2_11 30031031 R 5-19:29:03 1 compute-001
17970 cpu X12.2_11 30031031 R 5-19:25:31 1 compute-001
17971 cpu X12.2_10 30031031 R 5-18:49:38 1 compute-001
17972 cpu X12.2_10 30031031 R 5-18:28:52 1 compute-001
17973 cpu X12.2_09 30031031 R 5-18:24:49 1 compute-001
......
The ST column shows the job state: R for running and PD for pending. squeue can also be queried to show more information about jobs; check the man squeue page for more details.
To see your own jobs, you can use:
squeue -u $USER
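squeue also accepts standard Slurm filtering and formatting flags (not Wolffe-specific), for example to show only your pending jobs or to choose your own columns:

# show only your pending jobs
squeue -u $USER -t PENDING
# customise the columns: job ID, partition, name, state, elapsed time, node list or reason
squeue -u $USER -o "%.10i %.9P %.20j %.8T %.10M %R"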
There are two main ways to run jobs on Wolffe: interactively and in batch mode. Note that you should never run jobs directly on the login node, as this can disrupt other users.

To ask for interactive resources, you can use the sinteractive command:

sinteractive -p cpu --time=0:30:00

This command requests an interactive session on the cpu partition for 30 minutes. You can adjust the partition and time limit as needed. Once the resources are allocated, you will be dropped into a shell on one of the compute nodes, where you can run commands interactively.
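The sinteractive command is not part of core Slurm, so you may not find it on other clusters. On systems without it, a roughly equivalent interactive session can usually be started with standard Slurm commands, for example:

# request a 30-minute interactive shell on a compute node in the cpu partition
srun -p cpu --time=0:30:00 --ntasks=1 --pty bash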
We’ll run a simple neural network example. First, load the Python module:

module load Python/Python3.10

Then, run the script:

python3 -u MLP.py

The -u option forces the output to be unbuffered, which is useful for interactive sessions. This code should take only a few seconds to run, so in this case the -u flag doesn’t make much difference.
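MLP.py comes from the workshop repository and is not reproduced here. As a rough idea of the kind of script being run, a minimal stand-in might look like the following (a sketch only, assuming scikit-learn is available in the loaded Python module; the real script may differ):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# load the digits dataset and split it into train and test sets
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# train a small multi-layer perceptron and report the test accuracy
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))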
Use interactive jobs for testing and debugging your code; for running actual jobs, use batch mode.
To run jobs in batch mode, you need to create a job script that specifies the resources your job needs and the commands to run. For this example, we’ll use the MLP.py script we used in the interactive session.
To submit a batch job, create a script file (e.g., first_script.sh) with the following content:
#! /usr/bin/env bash
#
#SBATCH --job-name=MLP
#SBATCH --output=S-res.txt
#SBATCH --error=S-err.txt
#
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --partition=cpu
# load the module
module load Python/Python3.10
# move to work directory
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/
# do the submission
python3 -u MLP.py
sleep 60
This script does the following:

- #! /usr/bin/env bash: specifies that the script should be run using the bash shell.
- #SBATCH --job-name=MLP: sets the name of the job to “MLP”.
- #SBATCH --output=S-res.txt: specifies the file where the output of the job will be written.
- #SBATCH --error=S-err.txt: specifies the file where any error messages will be written.
- #SBATCH --ntasks=1: requests one task (or process) for the job.
- #SBATCH --time=00:05:00: sets a time limit of 5 minutes for the job.
- #SBATCH --partition=cpu: specifies that the job should run in the cpu partition.
- cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/: changes the working directory to where the script is located.
- python3 -u MLP.py: runs the MLP.py script. The -u option is used to ensure the output is unbuffered, which is useful for batch jobs.
- sleep 60: keeps the job running for an additional 60 seconds after the Python script completes, which is used here so that you can see the job in the queue.

See also the separate note about resource requests.
We can then submit this job script using the sbatch command:

sbatch first_script.sh

You can then check the status of your job using squeue -u $USER. You should see something like this:
(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch first_script.sh
Submitted batch job 18330
(base) [30057355@wolffe 02.Working_on_Wolffe]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18330 cpu MLP 30057355 R 0:07 1 compute-003
Once the job is complete, you can check the output in the S-res.txt file. This file will contain the output of the Python script, including any print statements. You can also check any error output in the S-err.txt file.
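While the job is still running, you can also watch the output file grow as the script writes to it, for example:

# follow the output file as it is written (Ctrl-C to stop watching)
tail -f S-res.txt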
Other useful information:

- You can cancel a job with the scancel command.
- You can view information about completed jobs with the sacct command.
- You can customise the output of squeue and sacct using the --format option.

We leave these items as exercises for you to try out; some starting points are sketched below.
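As a starting point for these exercises, the following standard Slurm commands may be useful (replace <jobid> with an actual job ID from squeue):

# cancel one job, or all of your jobs
scancel <jobid>
scancel -u $USER
# show accounting information for a finished job with a custom set of columns
sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State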
In summary, we:

- submitted a batch job with sbatch,
- monitored it with squeue, and
- checked the results in the S-res.txt and S-err.txt files.

The power of the HPC comes from the ability to run jobs with multiple tasks or processes. For example, when training a machine learning model, you will want to run multiple training jobs with different hyperparameters to find the best model. You could do this in a single script using for loops, but that is inefficient and hard to manage. Instead, you should run the training tasks as separate jobs, each with its own set of resources. This is where the power of HPC systems comes into play, as they can run many jobs in parallel, significantly speeding up computations.
We will modify the previous Python script to accept a command line argument for the random state, and then submit multiple jobs with different random states. To do this, we will create a new script called MLP_pararg.py that accepts a random state as a command line argument.

Test the script by passing a random state on the command line in interactive mode (or in batch mode if you prefer):

python3 MLP_pararg.py --random_state 42

This should run the script, print the test accuracy, and also write the results to a file named res_42.txt.
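The argument handling in MLP_pararg.py is not shown here; the relevant part of such a script typically looks something like this (a sketch only, the repository version may differ):

import argparse

# parse the random state from the command line
parser = argparse.ArgumentParser()
parser.add_argument("--random_state", type=int, default=42)
args = parser.parse_args()

print(f"Using random state {args.random_state}")
# the training code would then use args.random_state for the train/test split and the model,
# and write its accuracy to a per-run file such as res_{args.random_state}.txt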
Next, we will create a job script that submits multiple jobs with different random states. Create a new script called second_script.sh with the following content:
#! /usr/bin/env bash
#
#SBATCH --job-name=MLP
#SBATCH --output=output/S-%a-res.txt
#SBATCH --error=output/S-%a-err.txt
#
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --partition=cpu
#SBATCH --array=1-10 # Array job with 10 tasks
# load the module
module load Python/Python3.10
# move to work directory
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/
data_file='random_state.txt'
# read the i-th line from the file and store it as "n"
n=$(sed -n "${SLURM_ARRAY_TASK_ID}p" $data_file)
echo "Running task ${SLURM_ARRAY_TASK_ID} with random state ${n}"
# do the submission
python3 -u MLP_pararg.py --random_state $n
sleep 60
This script does the following:

- #SBATCH --array=1-10: specifies that this is an array job with 10 tasks, each with a different random state.
- It reads the random_state.txt file, which contains a list of random states, and uses SLURM_ARRAY_TASK_ID to select the appropriate random state for each task.
- It runs the MLP_pararg.py script with the selected random state as a command line argument.
- The output and error files are S-%a-res.txt and S-%a-err.txt, where %a is the array task ID, so each task will have its own output and error files. You can also use %A for the overall job ID. Note that these files will be stored in the output directory, so make sure to create this directory before running the script.

We’ll also need to create the random_state.txt file, which contains a list of random states, one per line.
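One simple way to create this file (an example only; any ten integers will do) is:

# write ten random states, one per line
seq 1 10 > random_state.txt
# or, for less predictable values:
# shuf -i 1-10000 -n 10 > random_state.txt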
You can then submit this job script using the sbatch command:

sbatch second_script.sh

You can check the status of your jobs using squeue -u $USER, and you should see something like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18341_[7-10] cpu MLP 30057355 PD 0:00 1 (Resources)
18341_1 cpu MLP 30057355 R 0:50 1 compute-003
18341_2 cpu MLP 30057355 R 0:50 1 compute-003
18341_3 cpu MLP 30057355 R 0:50 1 compute-002
18341_4 cpu MLP 30057355 R 0:50 1 compute-002
18341_5 cpu MLP 30057355 R 0:50 1 compute-002
18341_6 cpu MLP 30057355 R 0:50 1 compute-002
Each job in the array has its own job ID, made up of the main job ID followed by an underscore and the task number (e.g., 18341_1, 18341_2, etc.). The jobs run in parallel, and you can check the output files in the output directory to see the results of each task. Any tasks that are still pending will have a status of PD (pending), and they are grouped under a single job ID with the task numbers in square brackets (e.g., 18341_[7-10]).
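If you want to limit how many array tasks run at the same time (for example, to be considerate of other users), Slurm's array syntax accepts a throttle after a % sign:

#SBATCH --array=1-10%4   # run at most 4 of the 10 tasks at any one time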
Sometimes you will want to run a series of jobs that depend on each other; for example, you may want to run a job that processes all the accuracies of the MLP training with different random states. In this case, you can use job dependencies to chain jobs together.

To do this, you can use the --dependency option in the sbatch command. For example, if you have a job that processes the results of the MLP training and you want it to run only after all the MLP training jobs have completed, you can submit the processing job with a dependency on the MLP training jobs.
Let’s first create a Python script that gathers all the results of the MLP training into one file, and a job script to run it. We can test that everything runs by submitting the job script:
sbatch collect.sh
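The gathering script itself is not reproduced here. A minimal sketch of what it might do (assuming each run wrote its accuracy to a res_<state>.txt file as above, and using all_results.txt purely as an illustrative output name) is:

import glob

# gather every per-run result file into a single combined file
with open("all_results.txt", "w") as out:
    for path in sorted(glob.glob("res_*.txt")):
        with open(path) as f:
            out.write(f"{path}: {f.read().strip()}\n")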
We then write another Python script that summarises the results, and a job script to run it. We can test that this works by submitting the job script:
sbatch summarise.sh
Waiting for a job to complete before submitting the next one can be tedious, so we can use job dependencies to automate this process. We can do this using the following commands:
(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch --begin=now+120 second_script.sh
Submitted batch job 18374
(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch -d afterok:18374 collect.sh
Submitted batch job 18375
(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch -d afterok:18375 summarise.sh
Submitted batch job 18376
The --begin=now+120 option in the first command specifies that the job should start in 120 seconds, which gives us time to submit the next jobs and set up the dependencies before the job runs.

The -d afterok:18374 option in the second command specifies that the job should only run after job 18374 has completed successfully. The afterok dependency means that the job will only run if the previous job completed without errors. Note that we use the job ID of the whole array job (18374) to set the dependency for the second job (18375), rather than the IDs of the individual array tasks; Slurm will automatically wait for all tasks in the array job to complete before running the dependent job.
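If you prefer not to copy job IDs by hand, sbatch's --parsable flag prints just the job ID, so the whole chain can be scripted (a sketch using the same script names as above):

# submit the array job and capture its job ID
jid=$(sbatch --parsable --begin=now+120 second_script.sh)
# chain the follow-up jobs on that ID
cid=$(sbatch --parsable -d afterok:$jid collect.sh)
sbatch -d afterok:$cid summarise.sh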
Running squeue -u $USER will show you the status of all these jobs in the queue. You should see something like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18374_[7-10] cpu MLP 30057355 PD 0:00 1 (Resources)
18375 cpu collect 30057355 PD 0:00 1 (Dependency)
18376 cpu summaris 30057355 PD 0:00 1 (Dependency)
18374_1 cpu MLP 30057355 R 0:05 1 compute-003
18374_2 cpu MLP 30057355 R 0:05 1 compute-003
18374_3 cpu MLP 30057355 R 0:05 1 compute-002
18374_4 cpu MLP 30057355 R 0:05 1 compute-002
18374_5 cpu MLP 30057355 R 0:05 1 compute-002
18374_6 cpu MLP 30057355 R 0:05 1 compute-002
Or we can watch the jobs move through the queue using the watch command:
(base) [30057355@wolffe 02.Working_on_Wolffe]$ watch squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18386_[1-10] cpu MLP 30057355 PD 0:00 1 (BeginTime)
18387 cpu collect 30057355 PD 0:00 1 (Dependency)
18388 cpu summaris 30057355 PD 0:00 1 (Dependency)
# MLP job is waiting to begin, others are pending due to dependencies
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18386_[7-10] cpu MLP 30057355 PD 0:00 1 (Resources)
18387 cpu collect 30057355 PD 0:00 1 (Dependency)
18388 cpu summaris 30057355 PD 0:00 1 (Dependency)
18386_1 cpu MLP 30057355 R 0:18 1 compute-003
18386_2 cpu MLP 30057355 R 0:18 1 compute-003
18386_3 cpu MLP 30057355 R 0:18 1 compute-002
18386_4 cpu MLP 30057355 R 0:18 1 compute-002
18386_5 cpu MLP 30057355 R 0:18 1 compute-002
18386_6 cpu MLP 30057355 R 0:18 1 compute-002
# Some of the MLP jobs are running, others are pending due to resources, and the collect and summarise jobs are pending due to dependencies
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18387 cpu collect 30057355 PD 0:00 1 (Dependency)
18388 cpu summaris 30057355 PD 0:00 1 (Dependency)
18386_9 cpu MLP 30057355 R 0:03 1 compute-002
18386_10 cpu MLP 30057355 R 0:03 1 compute-002
18386_7 cpu MLP 30057355 R 0:08 1 compute-003
18386_8 cpu MLP 30057355 R 0:08 1 compute-003
# All MLP jobs are running, collect and summarise jobs are pending due to dependencies
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18388 cpu summaris 30057355 PD 0:00 1 (Dependency)
18387 cpu collect 30057355 R 0:00 1 compute-003
# All MLP jobs are finished, collect job is running, summarise job is pending due to dependency
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18388 cpu summaris 30057355 R 0:00 1 compute-003
# collect job is finished, summarise job is running.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# all jobs are done.
Once everything is done, you will want to clean up the output files and any temporary files you created. You can do this by creating a script called cleanup.sh so that you know you’re deleting the right files. We leave this as an exercise.
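As a rough starting point, such a script might look like the following (a sketch only; adjust the patterns to match the files you actually created before running it):

#! /usr/bin/env bash
# cleanup.sh: tidy up the files produced by the workshop examples
# adjust these patterns to match your own output before running
rm -f res_*.txt S-res.txt S-err.txt
rm -rf output/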
Other than the ability to run massively parallel jobs, HPC systems also give you access to powerful GPUs. Wolffe currently has NVIDIA A100 and A30 GPUs available for use. To use a GPU, you need to specify a GPU partition when submitting your job.

You'll first need a Python (or other) script that uses the GPU. For this example, we'll use NN_gpu.py, which trains a neural network on the digits dataset using the GPU. This script uses the PyTorch library, a popular deep learning framework that can take advantage of GPUs for training models.
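The key GPU-related part of such a PyTorch script is usually just the device selection, shown here as a generic sketch (not the exact contents of NN_gpu.py):

import torch

# use the GPU if one is visible to the job, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on {device}")

# the model and the data batches are then moved onto that device, e.g.:
# model = model.to(device)
# inputs, targets = inputs.to(device), targets.to(device)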
We now use a new script to run NN_gpu.py on the GPU. The script is similar to the previous job scripts, but it specifies the GPU partition and requests a GPU:
#! /usr/bin/env bash
#
#SBATCH --job-name=NN_gpu
#SBATCH --output=output/S-gpu-out.txt
#SBATCH --error=output/S-gpu-err.txt
#
#SBATCH --time=00:05:00
#SBATCH --partition=ampere80
#SBATCH --cpus-per-task=1
# load the module
module load PyTorch/Python3.10
# move to work directory
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/
# do the submission
python3 -u NN_gpu.py
sleep 60
This script does the following:

- #SBATCH --partition=ampere80: specifies that the job should run in the ampere80 partition, which is the partition for the A100 GPUs with 80 GB of memory.
- #SBATCH --cpus-per-task=1: requests one CPU core for the job. This is important because the GPU will handle the heavy lifting, but you still need a CPU to manage the job and run the Python script. For large GPU jobs, you can also speed things up by requesting more CPU cores, for example to load data into the GPU faster.

Note that on other HPC systems you may need to use #SBATCH --gres=gpu:1 (gres stands for “generic resources”) to request a GPU, but on Wolffe the --partition=ampere80 option is sufficient to request a GPU.
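For reference, on many other Slurm clusters the GPU must be requested explicitly alongside the partition, for example:

#SBATCH --gres=gpu:1   # request one GPU explicitly (syntax used on many Slurm clusters)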
To make the most of the HPC resources:

- Chain dependent jobs with the --dependency option in the sbatch command.
- Use squeue, sinfo, and sacct to monitor your jobs and the state of the cluster. This can help you identify issues and optimise your job submissions.
- Use the module command: always check which modules are loaded and available. This can help you avoid conflicts and ensure you have the right software for your job.
- Never run jobs on the login node; use sinteractive or sbatch to run jobs on compute nodes.
- Use sacct for job accounting: this command can be used to view the status and resource usage of completed jobs, which can help you analyse performance and optimise future jobs.