Workflows¶
srunx provides a powerful workflow system for orchestrating complex multi-step computational pipelines on SLURM clusters.
Overview¶
Workflows in srunx are defined using YAML files that specify:
- Jobs: Individual computational steps
- Dependencies: Execution order and prerequisites
- Resources: Computational requirements for each task
- Environments: Software environments for execution
Workflow Definition¶
Basic Structure¶
name: workflow_name
description: "Optional workflow description"
jobs:
  - name: task1
    command: ["python", "script1.py"]
    # ... task configuration
  - name: task2
    command: ["python", "script2.py"]
    depends_on: [task1]
    # ... task configuration
Task Configuration¶
Each task supports the following configuration options:
Command and Environment
- name: my_task
  command: ["python", "train.py", "--epochs", "100"]
  environment:
    conda: ml_environment
    # OR
    # venv: /path/to/virtualenv

# Container (can be combined with conda/venv)
- name: containerized_task
  command: ["python", "train.py"]
  environment:
    conda: ml_env
    container:
      runtime: apptainer  # or pyxis, singularity
      image: /path/to/image.sif
      nv: true  # NVIDIA GPU passthrough
      mounts:
        - /data:/data
        - /models:/models

# Simple Pyxis container (default runtime)
- name: pyxis_task
  command: ["python", "inference.py"]
  environment:
    container:
      image: nvcr.io/nvidia/pytorch:24.01-py3
      mounts:
        - /data:/workspace/data
Resource Allocation
- name: gpu_task
  command: ["python", "gpu_training.py"]
  resources:
    nodes: 2
    ntasks_per_node: 1
    cpus_per_task: 8
    gpus_per_node: 2
    memory_per_node: "64GB"
    time_limit: "12:00:00"
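To make the resource request above concrete, the following sketch totals the GPUs across nodes and converts the time limit to seconds. The helper names are hypothetical and not part of srunx; the sketch assumes time_limit uses the H:MM:SS form shown in this guide and that nodes defaults to 1 when omitted.

```python
def total_gpus(resources: dict) -> int:
    """GPUs requested across all nodes (nodes x gpus_per_node)."""
    # Assumption: a missing "nodes" means one node, a missing GPU count means zero.
    return resources.get("nodes", 1) * resources.get("gpus_per_node", 0)


def time_limit_seconds(time_limit: str) -> int:
    """Convert an H:MM:SS string to seconds (assumes exactly that format)."""
    hours, minutes, seconds = (int(part) for part in time_limit.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Mirrors the resources block of gpu_task above.
gpu_task = {
    "nodes": 2,
    "ntasks_per_node": 1,
    "cpus_per_task": 8,
    "gpus_per_node": 2,
    "memory_per_node": "64GB",
    "time_limit": "12:00:00",
}

print(total_gpus(gpu_task))                        # 4
print(time_limit_seconds(gpu_task["time_limit"]))  # 43200
```

A check like this can help when right-sizing a job before submission, since the request is per node rather than per workflow.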
Dependencies
Dependencies¶
Linear Dependencies¶
Simple sequential execution:
name: linear_pipeline
jobs:
  - name: step1
    command: ["python", "step1.py"]
  - name: step2
    command: ["python", "step2.py"]
    depends_on: [step1]
  - name: step3
    command: ["python", "step3.py"]
    depends_on: [step2]
Parallel Dependencies¶
Multiple jobs depending on the same prerequisite:
name: parallel_pipeline
jobs:
  - name: preprocess
    command: ["python", "preprocess.py"]
  - name: train_model_a
    command: ["python", "train_a.py"]
    depends_on: [preprocess]
  - name: train_model_b
    command: ["python", "train_b.py"]
    depends_on: [preprocess]
  - name: ensemble
    command: ["python", "ensemble.py"]
    depends_on: [train_model_a, train_model_b]
Complex Dependencies¶
A longer chain in which the final job depends on two upstream jobs:
name: complex_pipeline
jobs:
  - name: data_download
    command: ["python", "download.py"]
  - name: data_validation
    command: ["python", "validate.py"]
    depends_on: [data_download]
  - name: feature_engineering
    command: ["python", "features.py"]
    depends_on: [data_validation]
  - name: model_training
    command: ["python", "train.py"]
    depends_on: [feature_engineering]
  - name: model_evaluation
    command: ["python", "evaluate.py"]
    depends_on: [model_training]
  - name: report_generation
    command: ["python", "report.py"]
    depends_on: [model_evaluation, data_validation]
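To see how depends_on translates into an execution order, the sketch below groups the jobs of complex_pipeline into waves, where every job in a wave has all of its prerequisites finished. It uses Python's standard graphlib module and is only an illustration of dependency resolution; it is not srunx's scheduler, and execution_waves is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# job name -> jobs it depends on (mirrors depends_on in complex_pipeline)
complex_pipeline = {
    "data_download": [],
    "data_validation": ["data_download"],
    "feature_engineering": ["data_validation"],
    "model_training": ["feature_engineering"],
    "model_evaluation": ["model_training"],
    "report_generation": ["model_evaluation", "data_validation"],
}


def execution_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; each wave only needs jobs from earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


for index, wave in enumerate(execution_waves(complex_pipeline), start=1):
    print(index, wave)
```

Because report_generation also waits on model_evaluation, this pipeline resolves to six single-job waves. Applying the same function to parallel_pipeline would place train_model_a and train_model_b in one wave.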
Workflow Examples¶
Machine Learning Pipeline¶
name: ml_pipeline
jobs:
  - name: data_preprocessing
    command: ["python", "preprocess.py", "--input", "raw_data/"]
    resources:
      nodes: 1
      cpus_per_task: 4
      memory_per_node: "16GB"
      time_limit: "2:00:00"
  - name: feature_selection
    command: ["python", "feature_selection.py"]
    depends_on: [data_preprocessing]
    resources:
      nodes: 1
      cpus_per_task: 8
      memory_per_node: "32GB"
  - name: hyperparameter_tuning
    command: ["python", "hyperopt.py", "--trials", "100"]
    depends_on: [feature_selection]
    resources:
      nodes: 4
      gpus_per_node: 1
      time_limit: "8:00:00"
    environment:
      conda: pytorch_env
  - name: final_training
    command: ["python", "train_final.py"]
    depends_on: [hyperparameter_tuning]
    resources:
      nodes: 2
      gpus_per_node: 2
      time_limit: "12:00:00"
    environment:
      conda: pytorch_env
  - name: model_validation
    command: ["python", "validate.py"]
    depends_on: [final_training]
    resources:
      nodes: 1
      gpus_per_node: 1
    environment:
      conda: pytorch_env
  - name: deployment_prep
    command: ["python", "prepare_deployment.py"]
    depends_on: [model_validation]
    resources:
      nodes: 1
Bioinformatics Pipeline¶
name: genomics_pipeline
jobs:
  - name: quality_control
    command: ["fastqc", "*.fastq.gz"]
    resources:
      nodes: 1
      cpus_per_task: 16
  - name: trimming
    command: ["trim_galore", "--paired", "sample_R1.fastq.gz", "sample_R2.fastq.gz"]
    depends_on: [quality_control]
    resources:
      nodes: 1
      cpus_per_task: 8
  - name: alignment
    command: ["STAR", "--runThreadN", "32", "--genomeDir", "genome_index"]
    depends_on: [trimming]
    resources:
      nodes: 1
      cpus_per_task: 32
      memory_per_node: "64GB"
      time_limit: "4:00:00"
  - name: quantification
    command: ["featureCounts", "-T", "16", "-a", "annotation.gtf"]
    depends_on: [alignment]
    resources:
      nodes: 1
      cpus_per_task: 16
  - name: differential_expression
    command: ["Rscript", "deseq2_analysis.R"]
    depends_on: [quantification]
    resources:
      nodes: 1
      cpus_per_task: 4
    environment:
      conda: r_env
Workflow Execution¶
Running Workflows¶
Execute a workflow:
Validate workflow before execution:
Dry run (show what would be executed):
Monitoring Workflows¶
srunx provides built-in workflow monitoring:
- Progress tracking: See which jobs are running/completed
- Dependency resolution: Automatic job scheduling based on dependencies
- Error handling: Failed jobs don't block independent jobs
- Logging: Comprehensive logging of workflow execution
Workflow Management¶
Error Handling¶
When a job fails:
- Dependent jobs are blocked: Jobs that depend on the failed job, directly or indirectly, will not run
- Independent jobs continue: Other jobs in the workflow continue
- Detailed logging: Error information is captured and logged
- Manual intervention: You can fix issues and restart failed jobs
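The blocking rule above can be illustrated by walking the dependency graph from the failed job to everything downstream of it. This is a minimal sketch using the parallel_pipeline example from this guide; blocked_by_failure is a hypothetical helper, not an srunx function.

```python
from collections import deque


def blocked_by_failure(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every job that transitively depends on the failed job."""
    # Invert the graph: prerequisite -> jobs that directly depend on it.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for job, prerequisites in depends_on.items():
        for prerequisite in prerequisites:
            dependents[prerequisite].append(job)

    # Breadth-first walk over everything downstream of the failure.
    blocked: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for job in dependents[current]:
            if job not in blocked:
                blocked.add(job)
                queue.append(job)
    return blocked


parallel_pipeline = {
    "preprocess": [],
    "train_model_a": ["preprocess"],
    "train_model_b": ["preprocess"],
    "ensemble": ["train_model_a", "train_model_b"],
}

print(sorted(blocked_by_failure(parallel_pipeline, "train_model_a")))
# ['ensemble']
```

Here train_model_b is not in the blocked set, so it keeps running even though train_model_a failed.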
Restart and Recovery¶
srunx supports partial workflow execution:
# Start execution from a specific job (skips dependencies before it)
srunx flow run pipeline.yaml --from job_name
# Stop execution at a specific job (inclusive)
srunx flow run pipeline.yaml --to job_name
# Execute only a single job (ignoring all dependencies)
srunx flow run pipeline.yaml --job job_name
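One plausible reading of these flags, sketched below, is that --from selects the named job plus everything downstream of it, --to selects the named job plus everything upstream of it, and --job selects the named job alone. This interpretation is an assumption made for illustration; select_jobs is a hypothetical helper and srunx's actual selection logic may differ.

```python
def _reach(edges: dict[str, list[str]], start: str) -> set[str]:
    """Return start plus every node reachable from it by following edges."""
    seen = {start}
    stack = [start]
    while stack:
        for nxt in edges.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select_jobs(depends_on, from_job=None, to_job=None, only_job=None) -> set[str]:
    """Pick the jobs to run under the assumed --from / --to / --job semantics."""
    if only_job is not None:
        return {only_job}  # a single job, dependencies ignored
    selected = set(depends_on)
    if from_job is not None:
        # Invert the graph so edges point from a job to its dependents.
        dependents = {name: [] for name in depends_on}
        for job, prerequisites in depends_on.items():
            for prerequisite in prerequisites:
                dependents[prerequisite].append(job)
        selected &= _reach(dependents, from_job)  # the job and its descendants
    if to_job is not None:
        selected &= _reach(depends_on, to_job)  # the job and its ancestors
    return selected


linear_pipeline = {"step1": [], "step2": ["step1"], "step3": ["step2"]}

print(sorted(select_jobs(linear_pipeline, from_job="step2")))  # ['step2', 'step3']
print(sorted(select_jobs(linear_pipeline, to_job="step2")))    # ['step1', 'step2']
print(sorted(select_jobs(linear_pipeline, only_job="step2")))  # ['step2']
```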
Best Practices¶
Workflow Design¶
- Modular jobs: Keep jobs focused and independent when possible
- Resource optimization: Right-size resources for each job
- Checkpointing: Save intermediate results for recovery
- Testing: Test individual jobs before full workflow execution
Dependency Management¶
- Minimize dependencies: Reduce blocking relationships
- Parallel execution: Design for maximum parallelism
- Data dependencies: Ensure data flow matches job dependencies
- Avoid cycles: srunx will detect and reject circular dependencies
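To check a workflow for circular dependencies before submitting it, a sketch like the following can help. It relies on Python's standard graphlib module, which raises CycleError when no valid ordering exists. It shows the general technique only and makes no claim about how srunx implements its own check; find_cycle is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional


def find_cycle(depends_on: dict[str, list[str]]) -> Optional[list[str]]:
    """Return the jobs forming a cycle, or None if the graph is acyclic."""
    try:
        # static_order() is lazy, so consume it to force the check.
        list(TopologicalSorter(depends_on).static_order())
    except CycleError as error:
        # args[1] holds the cycle as a list whose first and last node match.
        return error.args[1]
    return None


valid = {"step1": [], "step2": ["step1"]}
circular = {"step1": ["step2"], "step2": ["step1"]}

print(find_cycle(valid))     # None
print(find_cycle(circular))  # a list such as ['step1', 'step2', 'step1']
```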
Resource Planning¶
- Job profiling: Understand resource needs for each job
- Queue management: Consider cluster queue policies
- Time limits: Set appropriate time limits for each job
- Resource sharing: Balance resource allocation across jobs
Advanced Features¶
Template Variables with args¶
Workflows support Jinja2 template variables via the args section. Variables defined in args are substituted into job fields before parsing:
name: parameterized_workflow
args:
  dataset: "experiment_1"
  epochs: 100
  model_type: resnet50
jobs:
  - name: training
    command: ["python", "train.py", "--dataset", "{{dataset}}", "--epochs", "{{epochs}}", "--model", "{{model_type}}"]
  - name: evaluation
    command: ["python", "evaluate.py", "--dataset", "{{dataset}}"]
    depends_on: [training]
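The effect of substitution on the training command can be previewed with a small stand-in for Jinja2. The sketch below handles only plain {{ name }} placeholders with a regular expression; it is not Jinja2 and not srunx's renderer, and render is a hypothetical helper used purely for illustration.

```python
import re

# Matches {{name}} and {{ name }} with optional surrounding whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text: str, args: dict) -> str:
    """Replace each placeholder with str(args[name]); unknown names raise KeyError."""
    return _PLACEHOLDER.sub(lambda match: str(args[match.group(1)]), text)


args = {"dataset": "experiment_1", "epochs": 100, "model_type": "resnet50"}
command = ["python", "train.py", "--dataset", "{{dataset}}",
           "--epochs", "{{epochs}}", "--model", "{{model_type}}"]

print([render(part, args) for part in command])
# ['python', 'train.py', '--dataset', 'experiment_1', '--epochs', '100', '--model', 'resnet50']
```

Note that the integer 100 becomes the string "100" in the rendered command, which is what a command-line argument needs.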
Shell Jobs¶
Instead of specifying a command, you can point to an existing shell script:
Parameter Sweeps¶
Run the same workflow over a cross-product of hyperparameters without
copying YAML files. srunx expands the matrix into N independent
workflow_runs (cells), so a failing cell does not abort its peers unless
fail_fast is set to true.
YAML declaration¶
Declare the matrix in a sweep block at the workflow root. Job fields
reference matrix values through ordinary {{ arg }} substitution.
name: train
args:
  lr: 0.01
  seed: 1
sweep:
  matrix:
    lr: [0.001, 0.01, 0.1]
    seed: [1, 2, 3]
  fail_fast: false
  max_parallel: 4
jobs:
  - name: train
    command: ["python", "train.py", "--lr", "{{ lr }}", "--seed", "{{ seed }}"]
    gpus_per_node: 1
The example above produces a 3 x 3 cross-product (9 cells). Each cell
runs with its own lr / seed pair and is tracked as a separate
workflow_run under one parent sweep_run.
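The cross-product expansion can be reproduced with itertools.product. The sketch below assumes each cell starts from the workflow's default args and then overrides the swept keys; expand_matrix is a hypothetical helper, not an srunx function.

```python
from itertools import product


def expand_matrix(base_args: dict, matrix: dict) -> list[dict]:
    """Return one args dict per cell of the cross-product of the matrix axes."""
    keys = list(matrix)
    cells = []
    for values in product(*(matrix[key] for key in keys)):
        cell = dict(base_args)          # start from the workflow defaults
        cell.update(zip(keys, values))  # override with this cell's values
        cells.append(cell)
    return cells


cells = expand_matrix(
    {"lr": 0.01, "seed": 1},
    {"lr": [0.001, 0.01, 0.1], "seed": [1, 2, 3]},
)

print(len(cells))  # 9
print(cells[0])    # {'lr': 0.001, 'seed': 1}
print(cells[-1])   # {'lr': 0.1, 'seed': 3}
```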
CLI usage¶
Run a YAML-declared sweep:
Override a single arg without triggering a sweep:
Launch an ad-hoc sweep from the CLI (takes precedence over, or augments,
any YAML sweep block):
Constraints¶
- max_parallel is required (YAML or --max-parallel; Web UI defaults to 4).
- Matrix values must be scalar (str/int/float/bool) -- nested structures are rejected at validation.
- Total cell count is capped at 1000 (safety valve; SLURM MaxSubmitJobs is typically ~4096).
- fail_fast defaults to false -- one cell failing does not cancel peers.
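These constraints can be checked up front with a few lines of Python. The following is a sketch of the documented rules only; validate_sweep, its error messages, and the constant names are hypothetical and do not reflect srunx's internal validation code.

```python
from math import prod

MAX_CELLS = 1000  # documented cap on the total cell count
SCALAR_TYPES = (str, int, float, bool)


def validate_sweep(matrix: dict, max_parallel=None) -> int:
    """Check the documented constraints and return the number of cells."""
    if max_parallel is None:
        raise ValueError("max_parallel is required")
    for key, values in matrix.items():
        for value in values:
            if not isinstance(value, SCALAR_TYPES):
                raise ValueError(f"axis {key!r} has non-scalar value {value!r}")
    # The cell count is the product of the axis lengths.
    cell_count = prod(len(values) for values in matrix.values())
    if cell_count > MAX_CELLS:
        raise ValueError(f"{cell_count} cells exceeds the cap of {MAX_CELLS}")
    return cell_count


print(validate_sweep({"lr": [0.001, 0.01, 0.1], "seed": [1, 2, 3]}, max_parallel=4))  # 9
```

Three axes of eleven values each would give 11 x 11 x 11 = 1331 cells and be rejected under the cap.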
Ad-hoc overrides (--arg / --sweep)¶
- --arg key=value overrides a single workflow arg for every cell (no sweep triggered on its own).
- --sweep key=v1,v2,v3 adds or replaces a matrix axis. Repeat the flag to sweep multiple keys.
- CLI sweep axes merge with the YAML sweep.matrix; CLI wins on conflicting keys.
- --max-parallel N is required when a sweep is active via CLI.
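The merge rule can be sketched as a dictionary update in which CLI axes replace YAML axes of the same name. The helpers below are hypothetical, and the sketch keeps CLI values as strings; whether and how srunx converts them to numbers is not covered here.

```python
def parse_sweep_flag(flag_value: str) -> tuple[str, list[str]]:
    """Split 'key=v1,v2,v3' into ('key', ['v1', 'v2', 'v3'])."""
    key, _, raw_values = flag_value.partition("=")
    if not key or not raw_values:
        raise ValueError(f"expected key=v1,v2,... but got {flag_value!r}")
    return key, raw_values.split(",")


def merge_matrix(yaml_matrix: dict, cli_flags: list[str]) -> dict:
    """Merge --sweep flags into the YAML matrix; CLI wins on conflicting keys."""
    merged = dict(yaml_matrix)
    for flag in cli_flags:
        key, values = parse_sweep_flag(flag)
        merged[key] = values  # replaces a YAML axis or adds a new one
    return merged


yaml_matrix = {"lr": [0.001, 0.01, 0.1], "seed": [1, 2, 3]}
merged = merge_matrix(yaml_matrix, ["lr=0.5,1.0", "batch=32,64"])

print(merged)
# {'lr': ['0.5', '1.0'], 'seed': [1, 2, 3], 'batch': ['32', '64']}
```

In this example the lr axis comes from the CLI, seed is kept from the YAML, and batch is a new axis, so the sweep would have 2 x 3 x 2 = 12 cells.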
Dry-run preview¶
Preview the expanded matrix without submitting any jobs:
The dry-run prints the resolved args for each cell so you can sanity-check the cross-product before launching.
Execution paths¶
CLI and MCP sweeps (without a mount= argument) run cells through the
local Slurm singleton. Web UI sweeps, and MCP sweeps that specify
mount=<profile>, route every cell through a per-sweep
SlurmSSHExecutorPool (capped at min(max_parallel, 8) pooled SSH
connections) which is closed when the background sweep task exits.
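The pool cap described above is a simple minimum, which the sketch below tabulates for a few max_parallel values. The function and constant names are hypothetical and chosen for illustration only.

```python
MAX_POOLED_CONNECTIONS = 8  # documented upper bound on pooled SSH connections


def ssh_pool_size(max_parallel: int) -> int:
    """Pooled SSH connections for a sweep: min(max_parallel, 8)."""
    return min(max_parallel, MAX_POOLED_CONNECTIONS)


for max_parallel in (2, 4, 8, 16):
    print(max_parallel, ssh_pool_size(max_parallel))
# 2 2
# 4 4
# 8 8
# 16 8
```

Raising max_parallel beyond 8 therefore does not open more SSH connections for the sweep.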
See also: MCP sweep recipes and Web UI sweep recipes.