Overview
This section introduces two critical technologies that have transformed modern HPC: containers for reproducible computing environments and workflow management systems for orchestrating complex computational pipelines.
Part 1: Containers in HPC
The Problem: Dependency Hell
Every HPC system is different. Your local workstation has Python 3.9, but the cluster only has Python 3.6. You need a specific version of NumPy, but it conflicts with another user’s requirements. The CUDA version you need isn’t installed, and you don’t have admin rights to install it. Your collaborator’s code works perfectly on their system but crashes on yours with cryptic library errors.
This is “dependency hell” - the nightmare of managing software dependencies across different computing environments. Traditional solutions like environment modules help, but they’re limited and system-specific.
What Are Containers and Why Do They Matter?
Containers package applications with all their dependencies into a portable, lightweight unit that runs consistently across different computing environments.
Key Benefits for HPC:
- Reproducibility: Your code runs the same way everywhere
- Dependency Management: No more “it works on my machine” problems
- Legacy Software: Run older software on modern systems
- Collaboration: Easy sharing of complete computational environments
Introduction to Singularity/Apptainer
Singularity (now maintained as Apptainer) is a container platform designed specifically for HPC environments.
Why Singularity/Apptainer for HPC:
- Designed for multi-user, shared HPC systems
- No root/sudo privileges required to run containers
- Integrates with HPC schedulers (SLURM, PBS, etc.) and also runs on local machines
- Supports GPU access and MPI applications
Alternative: Docker
- Easy to use locally, but typically requires root privileges, making it unsuitable for shared HPC systems
- Has a large online community sharing many pre-built images through Docker Hub
- Docker images can be converted to Singularity/Apptainer images, though this can get messy on non-Linux machines (sorry, Mac users!)
Basic Singularity/Apptainer Concepts
- Definition Files: Recipes for building custom images (see the sketch after this list)
- Images: Read-only templates containing your application and environment
- Containers: Running instances of images
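As a rough sketch of what a definition file looks like (the base image and packages below are placeholders, not a recommended setup):

Bootstrap: docker
From: ubuntu:20.04

%post
    # Runs inside the image at build time, as root
    apt-get update && apt-get install -y python3 python3-numpy

%environment
    export LC_ALL=C

%runscript
    # Default behaviour of "singularity run"
    exec python3 "$@"

Building from the definition file (saved here as my_python.def, a placeholder name) then produces a portable .sif image:

# Building generally needs root, --fakeroot, or a remote build service
singularity build my_python.sif my_python.def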
Common Use Cases
- Running complex software stacks without installation headaches (once the container is set up, that is)
- Ensuring consistent environments across development and production
- Isolating conflicting dependencies between projects
- Distributing complete computational environments with publications
Quick Demo/Examples
# Pull a pre-built container
singularity pull docker://ubuntu:20.04
# Run a command in the container using the exec command
singularity exec ubuntu_20.04.sif cat /etc/os-release
# Run a terminal within the container with the shell command
singularity shell ubuntu_20.04.sif
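Two more commands that come up often on HPC systems (the GPU image name and the bind paths here are hypothetical):

# Enable NVIDIA GPU access inside the container (requires NVIDIA drivers on the host)
singularity exec --nv my_gpu_container.sif nvidia-smi

# Bind-mount a host directory into the container so your data is visible inside it
singularity exec --bind /scratch/myproject:/data ubuntu_20.04.sif ls /data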
Part 2: Workflow Management
The Problem: Computational Chaos
You have a complex analysis with 50 steps: downloading data, pre-processing it, setting up an ML model, optimising over a range of hyper-parameters, re-training the model with the optimum parameters and datasets, making predictions, aggregating the results, and producing wonderful plots. All up, it takes three days to run. Step 47 fails at 2 AM, and you have to start over. Your pipeline works great on your laptop, but running it on an HPC cluster means completely rewriting your job submission scripts, because you no longer have full admin rights over the machine. You want to run the same analysis on 100 datasets, but manually managing all those jobs is a nightmare. Your workflow uses both containers and bare-metal software, different queue systems on different HPC systems (you might have some jobs running on the WSU HPC and some at NCI), and various resource requirements - coordinating all of this manually is error-prone and time-consuming.
This is computational chaos - the challenge of orchestrating complex, multi-step analyses across different computing environments while managing failures, resources, and dependencies.
What Are Computational Workflows?
Workflows are automated sequences of computational tasks that process data through multiple steps, managing dependencies, inputs, outputs, and error handling.
Why Workflows Matter in HPC:
- Complexity Management: Handle multi-step analyses with hundreds of tasks
- Reproducibility: Document and repeat entire computational pipelines
- Portability: Run the same workflow on laptops, clusters, or cloud platforms
- Error Recovery: Resume from failures without starting over (checkpointing)
- Resource Optimisation: Right-size compute resources for each task
- Queue System Abstraction: Same workflow works with SLURM, PBS, or local execution
- Container Integration: Seamlessly mix containerised and native applications
- Parallel Execution: Automatically identify and run independent tasks simultaneously
Introduction to Nextflow
Nextflow is a workflow management system designed for data-intensive computational pipelines, particularly popular in bioinformatics and scientific computing.
Key Nextflow Features:
- Portable: Runs on laptops, clusters, and cloud platforms
- Scalable: Automatically handles parallelisation and resource management
- Container-Native: Built-in support for Docker, Singularity/Apptainer
- Resumable: Continue workflows from where they left off (see the command sketch after this list)
- Language: Nextflow’s scripting language is built on Groovy - a language mostly used for scripting on the Java Virtual Machine - with some workflow-specific tweaks; the syntax also bears some similarity to Python
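As a quick, hedged illustration of the resumable and container-native features, a run might look like this (the pipeline path and image name are placeholders):

# -resume re-uses cached results from tasks that already finished successfully
# -with-singularity runs each task inside the given Singularity/Apptainer image
nextflow run my-pipeline/main.nf -resume -with-singularity ubuntu_20.04.sif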
Workflow Components
- Processes: Individual computational tasks (see the sketch after this list)
- Channels: Data flow between processes
- Executors: How tasks are submitted (local, SLURM, PBS, cloud)
- Configuration: Resource requirements, container settings
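The sketch below shows how these pieces fit together in a toy DSL2 script; the input file pattern, resource values, and container are assumptions for illustration only:

// main.nf - a minimal Nextflow (DSL2) sketch
nextflow.enable.dsl = 2

// A process: one computational task, with its own resources and container
process COUNT_LINES {
    cpus 1
    memory '1 GB'
    container 'ubuntu:20.04'

    input:
    path infile

    output:
    path 'counts.txt'

    script:
    """
    wc -l ${infile} > counts.txt
    """
}

// The workflow block wires channels (data flow) into processes
workflow {
    files_ch = Channel.fromPath('data/*.txt')   // one channel item per matching file
    COUNT_LINES(files_ch)                       // independent items can run in parallel
}

Because each input file becomes its own task, Nextflow can schedule them in parallel through whichever executor is configured.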
Real-World Example Scenarios
- Genomics Pipeline: Quality control → Alignment → Variant calling → Annotation
- Image Processing: Raw data → Preprocessing → Analysis → Visualisation
- Climate Modelling: Data ingestion → Model runs → Post-processing → Visualisation
- Machine Learning Pipeline: Data ingestion → Feature engineering → Model training → Hyperparameter tuning → Model evaluation → Deployment
Integration with HPC Systems
Workflow managers like Nextflow automatically:
- Submit jobs to any cluster scheduler (SLURM, PBS, etc.)
- Manage resource allocation (CPU, GPU, memory, time)
- Handle job queuing and dependencies
- Restart failed tasks from checkpoints
- Aggregate results from distributed tasks
- Switch between different execution environments without code changes (see the configuration sketch after this list)
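Much of this behaviour lives in a configuration file rather than in the workflow code itself. The sketch below is an assumption of what a site-specific nextflow.config might contain - the profile names, queue name, and resource values are placeholders to adapt for your system:

// nextflow.config - execution settings kept separate from the workflow logic
profiles {
    local {
        process.executor = 'local'              // run tasks on the current machine
    }
    slurm {
        process.executor = 'slurm'              // submit each task as a SLURM job
        process.queue    = 'normal'             // hypothetical partition name
        singularity.enabled    = true           // run tasks inside Singularity/Apptainer images
        singularity.autoMounts = true
    }
}

process {
    // Default resources; individual processes can override these
    cpus   = 2
    memory = '4 GB'
    time   = '2h'
}

With something like this in place, the same pipeline can be launched locally with -profile local or on the cluster with -profile slurm, and restarted after a failure with -resume, without touching the workflow code.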
Bringing It All Together
The Power Combination: Containers + Workflows
- Workflows manage the orchestration and scaling
- Containers ensure consistent, reproducible environments
- HPC systems provide the computational power
Getting Started Recommendations
- Start Small: Begin with simple, single-step containers
- Use Existing Resources: Leverage pre-built containers and workflows
- Community Resources: Both technologies have excellent documentation and active communities
Best Practices
- Version control your definition files and workflows
- Test locally before running on HPC systems
- Use meaningful names and documentation
- Plan for data management and storage requirements
- Start with community-tested workflows when possible
Questions and Next Steps
Resources for Continued Learning:
Hands-on Opportunities:
- Many HPC centres offer container and workflow training, for example, NCI
- Online tutorials and workshops
- Start with your current research workflows and containerise them incrementally
Advanced Workshop Topics:
- Working on NCI
- Nextflow in action