Intro-HPC-workshop

Working on Wolffe

Wolffe is the name of the HPC cluster at WSU; the wiki page for Wolffe is available here.

Logging in

To log onto Wolffe, you will need to use the SSH protocol. Open a terminal and type the following command:

ssh <username>@wolffe.cdms.westernsydney.edu.au

Replace <username> with your WSU username. You will be prompted to enter your password. Once logged in, you will be in your home directory on the Wolffe cluster.
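
If you log in often, you can optionally add a host alias to your SSH configuration so that typing ssh wolffe is enough. This is a small convenience sketch; the alias name is arbitrary and <username> is still a placeholder for your own WSU username:

# append a host alias to your SSH config (the alias name "wolffe" is arbitrary)
cat >> ~/.ssh/config <<'EOF'
Host wolffe
    HostName wolffe.cdms.westernsydney.edu.au
    User <username>
EOF

# from now on this is equivalent to the full ssh command above
ssh wolffe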

You’ll also need to be on the CDMS network to access Wolffe. To connect, use the OpenVPN client with the profile you have received. Note that the profile is only valid until the end of the year, so staff should request a new profile each year, and students should request one through their supervisors.

Once logged in, you will see a message like:

For help with HPC matters, see:

https://wiki.cdms.westernsydney.edu.au/index.php?title=HPC_documentation

Last login: xxxxx from xxx.xxx.xxx.xxx

and any message of the day that may be displayed.

How to use any apps on Wolffe

Unlike a standard computer, an HPC system does not have all applications available by default. Instead, applications are made available through modules, which allow you to load and unload software packages as needed.

To see the available modules, you can use the command:

module avail

This will list all the available modules on Wolffe:

---------------------------------- /usr/share/Modules/modulefiles ----------------------------------
dot  module-git  module-info  modules  null  use.own  

-------------------------------------- /usr/share/modulefiles --------------------------------------
mp-x86_64  mpi/openmpi-x86_64  

-------------------------------------- /software/modulefiles ---------------------------------------
10x/cellranger-3.0.2   funtools/funtools              mpi/openmpi-4.0.2           R/4.4.2  
anaconda/conda3        gnu/gcc-7.4.0                  mpi/openmpi-4.1.5           T-RECS   
caffe/caffe-cuda-10.0  gnu/gcc-10                     nccl/nccl-2.26.2-1          
caffe/caffe-fedora     gnu/gcc-10.5.0                 nextflow/nextflow-25.04.6   
casa/casa-5.4.1        gnu/gcc-11.1.0                 PyCharm-community-2023.2.3  
casa/casa-5.6.0        Java/java-24                   Python/Python3.6            
casa/casa-5.7.0.pre    Julialang/Julia-1.6.2          Python/Python3.7.0          
cmake/cmake-3.24.4     karma/karma-1.7.25             Python/Python3.9            
cmake/cmake-3.25.3     lammps/lammps-stable-20230801  Python/Python3.10           
colmap/colmap-3.11     lammps/lammps-stable-20230822  Python/Python3.11           
cuda/cuda-10.0         matlab/matlab2016a             Python/Python3.12.3         
cuda/cuda-10.2         matlab/matlab2018a             PyTorch/Python3.7.0         
cuda/cuda-11.0         matlab/matlab2019a             PyTorch/Python3.9           
cuda/cuda-11.2         miriad/miriad                  PyTorch/Python3.10          
cuda/cuda-11.6         Montage/Montage-6.0            PyTorch/Python3.11          
cuda/cuda-12.6         mpi/openmpi-1.8.8              PyTorch/Python3.12.3        

You can load a module using the command:

module load <module_name>

For example, to load the Python 3.10 module, you would use:

module load Python/Python3.10

You can check which modules are currently loaded with:

module list
Currently Loaded Modulefiles:
 1) Python/Python3.10 
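
Modules can also be inspected before loading, removed when you no longer need them, or cleared altogether. These are standard Environment Modules commands:

module show Python/Python3.10     # see what a module would set up before loading it
module unload Python/Python3.10   # remove a single module from your environment
module purge                      # unload all currently loaded modules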

Unlike most HPC systems, Wolffe does have git installed by default, so you can use it without loading a module. You can therefore clone this repository directly into your home directory:

git clone https://github.com/CRMDS/Intro-HPC-workshop.git

Querying the queue system

Wolffe uses the Slurm workload manager to manage jobs. To find out what resources are available, you can use the command:

sinfo

This will show you the status of the nodes in the cluster:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up 21-00:00:0      2    mix compute-[002-003]
cpu*         up 21-00:00:0      1  alloc compute-001
ampere24     up 7-00:00:00      4  alloc a30-[002-005]
ampere80     up 7-00:00:00      1  alloc a100-100
ampere80     up 7-00:00:00      1   idle a100-101
ampere40     up 7-00:00:00      1  alloc a100-000
ampere40     up 7-00:00:00      2   idle a100-[001-002]

The columns indicate the partition name (the * marks the default partition), whether the partition is available (AVAIL), the maximum run time for jobs in that partition (TIMELIMIT), the number of nodes in each state (NODES), the node state (STATE, e.g. idle, mix for partially allocated, alloc for fully allocated), and the names of those nodes (NODELIST).

To see the jobs currently running and submitted on the cluster, you can use:

squeue

This will show you a list of jobs in the queue:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             18226  ampere24   RLGOAL 30069287  R 2-10:31:01      1 a30-005
             18287  ampere24     mat1 18870679  R   21:04:11      1 a30-002
             18288  ampere24     mat2 18870679  R   21:03:45      1 a30-003
             18289  ampere24     mat3 18870679  R   21:03:13      1 a30-004
             18100  ampere40 EA_V4_Mi 30069287  R 4-15:02:23      1 a100-000
             18304  ampere80 vec2word 30069287  R    8:52:30      1 a100-100
             17962       cpu X12.2_15 30031031  R 5-20:01:58      1 compute-001
             17963       cpu X12.2_14 30031031  R 5-20:01:55      1 compute-001
             17964       cpu X12.2_14 30031031  R 5-19:57:36      1 compute-001
             17965       cpu X12.2_13 30031031  R 5-19:52:40      1 compute-001
             17966       cpu X12.2_13 30031031  R 5-19:52:11      1 compute-001
             17967       cpu X12.2_12 30031031  R 5-19:35:30      1 compute-001
             17968       cpu X12.2_12 30031031  R 5-19:34:30      1 compute-001
             17969       cpu X12.2_11 30031031  R 5-19:29:03      1 compute-001
             17970       cpu X12.2_11 30031031  R 5-19:25:31      1 compute-001
             17971       cpu X12.2_10 30031031  R 5-18:49:38      1 compute-001
             17972       cpu X12.2_10 30031031  R 5-18:28:52      1 compute-001
             17973       cpu X12.2_09 30031031  R 5-18:24:49      1 compute-001
             ......

The columns indicate the job ID (JOBID), the partition the job was submitted to, the job name, the submitting user, the job state (ST, e.g. R for running, PD for pending), the elapsed run time, the number of nodes allocated, and the nodes the job is running on, or the reason it is still waiting (NODELIST(REASON)).

squeue can also be queried for more detailed information about jobs; check the man squeue page for details.
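
For example, the -o option lets you choose which columns to display; this is just one possible format string (see man squeue for the full list of field codes):

squeue -u $USER -o "%.10i %.9P %.20j %.8T %.12M %.6D %R"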

To see your own jobs, you can use:

squeue -u $USER

Running jobs

There are two main ways to run jobs on Wolffe: interactively and in batch mode. Note that you should never run jobs directly on the login node, as this can disrupt other users.

Interactive jobs

To ask for interactive resources, you can use the sinteractive command:

sinteractive -p cpu --time=0:30:00

This command requests an interactive session on the cpu partition for 30 minutes. You can adjust the partition and time limit as needed. Once the resources are allocated, you will be dropped into a shell on one of the compute nodes, where you can run commands interactively.
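
For example, to test GPU code you could request a short session on one of the GPU partitions instead; this assumes sinteractive accepts the same options for those partitions (partition names as listed by sinfo above):

sinteractive -p ampere24 --time=0:30:00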

We’ll run a simple neural network example. First, load the Python module:

module load Python/Python3.10

Then, run the script:

python3 -u MLP.py

Note that the -u option forces the output to be unbuffered, so you see output as soon as it is produced. This script should take only a few seconds to run, so in this case the -u flag makes little difference.
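
MLP.py itself is not reproduced here. As a rough sketch of what such a script might look like, assuming it uses scikit-learn's digits dataset and MLPClassifier (the actual workshop script may differ):

# MLP.py -- minimal sketch: train a small MLP on the digits dataset
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# load the 8x8 handwritten digits dataset
X, y = load_digits(return_X_y=True)

# hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# train a small multi-layer perceptron and report the test accuracy
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))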

Use interactive jobs for testing and debugging your code; for running actual jobs, use batch mode.

Batch jobs

To run jobs in batch mode, you need to create a job script that specifies the resources your job needs and the commands to run. For this example, we’ll use the MLP.py script we used in the interactive session.

To submit a batch job, create a script file (e.g., first_script.sh) with the following content:

#! /usr/bin/env bash
#
#SBATCH --job-name=MLP
#SBATCH --output=S-res.txt
#SBATCH --error=S-err.txt
#
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --partition=cpu

# load the module
module load Python/Python3.10

# move to work directory
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/

# do the submission
python3 -u MLP.py
sleep 60

This script does the following: it names the job MLP and sends its standard output and errors to S-res.txt and S-err.txt, requests a single task for five minutes on the cpu partition, loads the Python module, moves to the working directory, and runs MLP.py, followed by a one-minute sleep.

A note about resource requests: only ask for the resources your job actually needs (tasks, time and partition), since over-requesting makes jobs harder to schedule and ties up resources that other users could be using.

We can then submit this job script using the sbatch command:

sbatch first_script.sh

You can then check the status of your job using squeue -u $USER.

You should see something like this:

(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch first_script.sh 
Submitted batch job 18330
(base) [30057355@wolffe 02.Working_on_Wolffe]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             18330       cpu      MLP 30057355  R       0:07      1 compute-003

Once the job is complete, you can check the output in the S-res.txt file. This file will contain the output of the Python script, including any print statements. You can also check any error outputs in the S-err.txt file.

Other useful information:

We leave these items as exercises for you to try out.

In summary, we:

  1. Created a Python script that trains a simple neural network on the digits dataset.
  2. Created a job script that specifies the resources needed and the commands to run.
  3. Submitted the job script using sbatch.
  4. Checked the status of the job using squeue.
  5. Checked the output and the errors of the job in the S-res.txt and S-err.txt files.

Parallel jobs and workflow management

Parallel jobs

Much of the power of an HPC system comes from its ability to run many tasks at once. For example, when training a machine learning model you will often want to run several training runs with different hyperparameters to find the best model. You could do this in a single script with a for loop, but that is slow and hard to manage. Instead, run each training task as a separate job with its own resources and let the cluster execute them in parallel, which can speed up the overall computation significantly.

We will modify the previous Python script to accept a command line argument for the random state, and then submit multiple jobs with different random states. To do this, we create a new script called MLP_pararg.py.

Test the script in interactive mode (or in batch mode if you prefer) by passing a random state on the command line:

python3 MLP_pararg.py --random_state 42

This should run the script and print the test accuracy, and also output the results to a file named res_42.txt.
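
For reference, a minimal sketch of how MLP_pararg.py could extend the earlier script with argparse; the exact result-file format here is an assumption:

# MLP_pararg.py -- minimal sketch: same MLP, random state taken from the command line
import argparse

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--random_state", type=int, required=True)
args = parser.parse_args()

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=args.random_state)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=args.random_state)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"Test accuracy (random_state={args.random_state}): {acc:.4f}")

# write the result to res_<random_state>.txt, as described above
with open(f"res_{args.random_state}.txt", "w") as f:
    f.write(f"{args.random_state} {acc:.4f}\n")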

Next, we will create a job script that submits multiple jobs with different random states. Create a new script called second_script.sh with the following content:

#! /usr/bin/env bash
#
#SBATCH --job-name=MLP
#SBATCH --output=output/S-%a-res.txt
#SBATCH --error=output/S-%a-err.txt
#
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --partition=cpu
#SBATCH --array=1-10   # Array job with 10 tasks

# load the module
module load Python/Python3.10

# move to work directory
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/

data_file='random_state.txt'
# read the i-th line from the file and store it as "n"
n=$(sed -n "${SLURM_ARRAY_TASK_ID}p" $data_file)

echo "Running task ${SLURM_ARRAY_TASK_ID} with random state ${n}"

# do the submission
python3 -u MLP_pararg.py --random_state $n
sleep 60

This script does the following: it defines an array job with 10 tasks (--array=1-10), writes each task's output and errors to output/S-<task>-res.txt and output/S-<task>-err.txt (%a is replaced by the array task ID), loads the Python module, moves to the working directory, reads the line of random_state.txt that matches its SLURM_ARRAY_TASK_ID, and runs MLP_pararg.py with that value as the random state.

We’ll also need to create the random_state.txt file, which contains a list of random states, one per line.
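
Any ten integers, one per line, will do. One quick way to generate the file is:

shuf -i 1-10000 -n 10 > random_state.txt   # ten random integers between 1 and 10000
cat random_state.txt                       # check the contents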

You can then submit this job script using the sbatch command:

sbatch second_script.sh

You can check the status of your jobs using squeue -u $USER, and you should see something like this:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      18341_[7-10]       cpu      MLP 30057355 PD       0:00      1 (Resources)
           18341_1       cpu      MLP 30057355  R       0:50      1 compute-003
           18341_2       cpu      MLP 30057355  R       0:50      1 compute-003
           18341_3       cpu      MLP 30057355  R       0:50      1 compute-002
           18341_4       cpu      MLP 30057355  R       0:50      1 compute-002
           18341_5       cpu      MLP 30057355  R       0:50      1 compute-002
           18341_6       cpu      MLP 30057355  R       0:50      1 compute-002

Each job in the array has its own job ID, formed from the main job ID followed by an underscore and the task number (e.g., 18341_1, 18341_2, etc.). The tasks run in parallel, and you can check the files in the output directory to see the results of each one. Tasks that are still pending have a status of PD (pending) and are grouped into a single entry with the task numbers in square brackets (e.g., 18341_[7-10]).

Workflow management

Sometimes you will want to run a series of jobs that depend on each other; for example, a job that processes all the accuracies from the MLP training runs with different random states. In this case, you can use job dependencies to chain jobs together.

To do this, you can use the --dependency option in the sbatch command. For example, if you have a job that processes the results of the MLP training and you want it to run only after all the MLP training jobs have completed, you can submit the processing job with a dependency on the MLP training jobs.

Let’s first create a Python script that gathers all the results of the MLP training into one file, and a job script to run it. We can check that everything runs by submitting the job script:

sbatch collect.sh
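
collect.sh follows the same pattern as first_script.sh, just calling the gathering script instead of MLP.py. As a sketch of what that gathering script (here called collect.py, assuming each run wrote a res_<random_state>.txt file as above) might look like:

# collect.py -- minimal sketch: gather all res_*.txt files into a single file
import glob

files = sorted(glob.glob("res_*.txt"))
with open("all_results.txt", "w") as out:
    # each res_<random_state>.txt is assumed to hold "<random_state> <accuracy>"
    for fname in files:
        with open(fname) as f:
            out.write(f.read())

print(f"Collected {len(files)} result files into all_results.txt")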

We then write another Python script that summarises the results, and a job script to run it. We can test this works by submitting the job script:

sbatch summarise.sh
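
Again, summarise.sh can wrap a short script such as this sketch (which assumes the all_results.txt format used above):

# summarise.py -- minimal sketch: report the mean and spread of the collected accuracies
import statistics

accuracies = []
with open("all_results.txt") as f:
    for line in f:
        # each line is assumed to be "<random_state> <accuracy>"
        accuracies.append(float(line.split()[1]))

print(f"Runs: {len(accuracies)}")
print(f"Mean accuracy: {statistics.mean(accuracies):.4f}")
print(f"Std deviation: {statistics.stdev(accuracies):.4f}")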

Waiting for a job to complete before submitting the next one can be tedious, so we can use job dependencies to automate this process. We can do this using the following commands:

(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch --begin=now+120 second_script.sh 
Submitted batch job 18374
(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch -d afterok:18374 collect.sh 
Submitted batch job 18375
(base) [30057355@wolffe 02.Working_on_Wolffe]$ sbatch -d afterok:18375 summarise.sh 
Submitted batch job 18376

The --begin=now+120 option in the first command specifies that the job should start in 120 seconds, which gives us time to submit the next jobs and set up the dependencies before it runs.

The -d afterok:18374 option in the second command specifies that the job should only run after job 18374 has completed successfully; afterok means the dependent job runs only if the previous job finished without errors. Note that we set the dependency for the collect job (18375) using the parent ID of the array job (18374), not the IDs of the individual array tasks: Slurm automatically waits for all tasks in the array to complete before running the dependent job.
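
If you would rather not copy job IDs by hand, sbatch's --parsable option prints just the job ID, so the whole chain can be scripted:

# submit the array job and capture its job ID
jid=$(sbatch --parsable --begin=now+120 second_script.sh)

# chain the dependent jobs automatically
cid=$(sbatch --parsable -d afterok:${jid} collect.sh)
sbatch -d afterok:${cid} summarise.sh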

Running squeue -u $USER will show you the status of all these jobs in the queue. You should see something like this:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      18374_[7-10]       cpu      MLP 30057355 PD       0:00      1 (Resources)
             18375       cpu  collect 30057355 PD       0:00      1 (Dependency)
             18376       cpu summaris 30057355 PD       0:00      1 (Dependency)
           18374_1       cpu      MLP 30057355  R       0:05      1 compute-003
           18374_2       cpu      MLP 30057355  R       0:05      1 compute-003
           18374_3       cpu      MLP 30057355  R       0:05      1 compute-002
           18374_4       cpu      MLP 30057355  R       0:05      1 compute-002
           18374_5       cpu      MLP 30057355  R       0:05      1 compute-002
           18374_6       cpu      MLP 30057355  R       0:05      1 compute-002

Or we can watch the jobs move through the queue using the watch command:

(base) [30057355@wolffe 02.Working_on_Wolffe]$ watch squeue -u $USER
             JOBID PARTITION     NAME     USER ST	TIME  NODES NODELIST(REASON)
      18386_[1-10]	 cpu	  MLP 30057355 PD	0:00	  1 (BeginTime)
             18387	 cpu  collect 30057355 PD	0:00	  1 (Dependency)
             18388	 cpu summaris 30057355 PD	0:00	  1 (Dependency)
# MLP job is waiting to begin, others are pending due to dependencies

             JOBID PARTITION     NAME     USER ST	TIME  NODES NODELIST(REASON)
      18386_[7-10]	 cpu	  MLP 30057355 PD	0:00	  1 (Resources)
             18387	 cpu  collect 30057355 PD	0:00	  1 (Dependency)
             18388	 cpu summaris 30057355 PD	0:00	  1 (Dependency)
           18386_1	 cpu	  MLP 30057355  R       0:18	  1 compute-003
           18386_2	 cpu	  MLP 30057355  R       0:18	  1 compute-003
           18386_3	 cpu	  MLP 30057355  R       0:18	  1 compute-002
           18386_4	 cpu	  MLP 30057355  R       0:18	  1 compute-002
           18386_5	 cpu	  MLP 30057355  R       0:18	  1 compute-002
           18386_6	 cpu	  MLP 30057355  R       0:18	  1 compute-002
# Some of the MLP jobs are running, others are pending due to resources, and the collect and summarise jobs are pending due to dependencies

             JOBID PARTITION     NAME     USER ST	TIME  NODES NODELIST(REASON)
             18387	 cpu  collect 30057355 PD	0:00	  1 (Dependency)
             18388	 cpu summaris 30057355 PD	0:00	  1 (Dependency)
           18386_9	 cpu      MLP 30057355  R       0:03	  1 compute-002
          18386_10	 cpu	  MLP 30057355  R       0:03	  1 compute-002
           18386_7	 cpu	  MLP 30057355  R       0:08	  1 compute-003
           18386_8	 cpu	  MLP 30057355  R       0:08	  1 compute-003
# All MLP jobs are running, collect and summarise jobs are pending due to dependencies

             JOBID PARTITION     NAME     USER ST	TIME  NODES NODELIST(REASON)
             18388	 cpu summaris 30057355 PD	0:00	  1 (Dependency)
             18387	 cpu collect  30057355 R	0:00	  1 compute-003
# All MLP jobs are finished, collect job is running, summarise job is pending due to dependency

             JOBID PARTITION     NAME     USER ST	TIME  NODES NODELIST(REASON)
             18388	 cpu summaris 30057355 R	0:00	  1 compute-003
# collect job is finished, summarise job is running. 

             JOBID PARTITION     NAME     USER ST	TIME  NODES NODELIST(REASON)
# all jobs are done.             

Once everything is done, you will want to clean up the output files and any temporary files you created. A good way to do this is to put the deletions in a script, e.g. cleanup.sh, so that you know exactly which files are being removed. We leave this as an exercise.
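
As a starting point, a cleanup script might look something like this sketch; the file patterns are assumptions, so adjust them to match the files you actually created, and list before you delete:

#! /usr/bin/env bash
# cleanup.sh -- minimal sketch: remove the temporary output files from this workshop
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/

# show what would be removed, then remove it
ls res_*.txt all_results.txt S-res.txt S-err.txt output/S-* 2>/dev/null
rm -f res_*.txt all_results.txt S-res.txt S-err.txt output/S-*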

Using the GPU

Besides the ability to run massively parallel jobs, HPC systems also give you access to powerful GPUs. Wolffe currently has NVIDIA A100 and A30 GPUs available for use. To use a GPU, you need to specify a GPU partition when submitting your job.

To use the GPU, you’ll first need a Python (or other) script that actually uses it. For this example, we’ll use NN_gpu.py, which trains a neural network on the digits dataset using the GPU. This script uses the PyTorch library, a popular deep learning framework that can take advantage of GPUs for training models.
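
NN_gpu.py is not reproduced here either. A minimal sketch of what such a script might look like, assuming a small PyTorch network trained on scikit-learn's digits dataset (the actual workshop script may differ):

# NN_gpu.py -- minimal sketch: train a small network on the digits dataset,
# using the GPU when one is available
import torch
import torch.nn as nn
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# pick the GPU if PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train = torch.tensor(y_train, dtype=torch.long).to(device)
X_test = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test = torch.tensor(y_test, dtype=torch.long).to(device)

# a small fully connected network: 64 input pixels -> 10 digit classes
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimiser.step()

with torch.no_grad():
    accuracy = (model(X_test).argmax(dim=1) == y_test).float().mean().item()
print(f"Test accuracy: {accuracy:.4f}")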

We now use a new script to run the NN_gpu.py script on the GPU. The script is similar to the previous job scripts, but it specifies the GPU partition and requests a GPU:

#! /usr/bin/env bash
#
#SBATCH --job-name=NN_gpu
#SBATCH --output=output/S-gpu-out.txt
#SBATCH --error=output/S-gpu-err.txt
#
#SBATCH --time=00:05:00
#SBATCH --partition=ampere80
#SBATCH --cpus-per-task=1

# load the module
module load PyTorch/Python3.10

# move to work directory
cd ~/Intro-HPC-workshop/02.Working_on_Wolffe/

# do the submission
python3 -u NN_gpu.py
sleep 60

This script does the following: it names the job NN_gpu, writes its output and errors to output/S-gpu-out.txt and output/S-gpu-err.txt, requests five minutes on the ampere80 GPU partition with one CPU per task, loads the PyTorch module, moves to the working directory, and runs NN_gpu.py.

Note that in other HPC systems, you may need to use #SBATCH --gres=gpu:1 (gres for “generic resources”) to request a GPU, but on Wolffe, the --partition=ampere80 option is sufficient to request a GPU.
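
A quick way to confirm that your job really landed on a GPU is to print what the node can see at the start of the job script, before the training command:

# inside the job script, before running the training
nvidia-smi
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"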

Tips and tricks

To make the most of the HPC resources: never run heavy computations on the login node; request only the resources (time, tasks, partition) that your jobs actually need; test your code in an interactive session before submitting large batch jobs; use job arrays and dependencies to organise related work; and clean up output and temporary files once you are finished.