User Guide#

This guide explains how to use srunx for SLURM job management: core concepts, the command-line interface, workflows, programmatic usage, configuration, and troubleshooting.

Core Concepts#

Jobs#

A job in srunx represents a computational task that will be executed on a SLURM cluster. Jobs are defined with:

Command: The command to execute
Resources: CPU, memory, GPU, and time requirements
Environment: Conda, virtual environment, or container setup
Dependencies: Job dependencies for workflow orchestration

Resources#

srunx provides fine-grained control over resource allocation:

--nodes: Number of compute nodes
--tasks-per-node: Tasks per node
--cpus-per-task: CPUs per task
--gpus-per-node: GPUs per node
--memory-per-node: Memory per node
--time-limit: Maximum execution time
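To make the flags above concrete, the following sketch pairs each one with the standard sbatch option it most plausibly corresponds to. The sbatch option names are real SLURM options, but the pairing and the `to_sbatch_directives` helper are assumptions for illustration; the scripts srunx actually renders may differ.

```python
# Hypothetical mapping from srunx resource settings to sbatch options.
# The option names are standard SLURM; the pairing itself is an assumption.
SRUNX_TO_SBATCH = {
    "nodes": "--nodes",
    "tasks_per_node": "--ntasks-per-node",
    "cpus_per_task": "--cpus-per-task",
    "gpus_per_node": "--gpus-per-node",
    "memory_per_node": "--mem",
    "time_limit": "--time",
}


def to_sbatch_directives(resources: dict) -> list[str]:
    """Render '#SBATCH' lines for the resource keys that are set."""
    lines = []
    for key, option in SRUNX_TO_SBATCH.items():
        value = resources.get(key)
        if value is not None:
            lines.append(f"#SBATCH {option}={value}")
    return lines


if __name__ == "__main__":
    requested = {"nodes": 2, "gpus_per_node": 2, "time_limit": "8:00:00"}
    for line in to_sbatch_directives(requested):
        print(line)
```

Settings that are left unset produce no directive, so SLURM's cluster defaults would apply to them.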

Environment Management#

srunx supports conda and virtual environment activation:

Conda: --conda env_name
Virtual Environment: --venv /path/to/venv

Only one of conda or venv can be specified per job.

Container Runtimes#

srunx supports multiple container runtimes for job execution. Containers are independent of conda/venv, so a container can be combined with either one in the same job.

Pyxis (NVIDIA Enroot) — default runtime, uses srun --container-* flags:

srunx submit python train.py --container /path/to/image.sqsh

Apptainer — wraps the command with apptainer exec:

srunx submit python train.py \
  --container "runtime=apptainer,image=/path/to/image.sif,nv=true"

Singularity: same as Apptainer, but the command is wrapped with the singularity binary:

srunx submit python train.py \
  --container "runtime=singularity,image=/path/to/image.sif,nv=true"

The runtime can also be specified with a separate flag:

srunx submit python train.py \
  --container /path/to/image.sif \
  --container-runtime apptainer

Container Options#

The --container flag accepts a key=value format for detailed configuration:

Key	Runtime	Description
`image`	All	Container image path (SIF, sqsh, or Docker URI)
`runtime`	All	`pyxis` (default), `apptainer`, or `singularity`
`mounts` / `bind`	All	Bind mounts (semicolon-separated, e.g. `/data:/data;/scratch:/scratch`)
`workdir`	All	Working directory inside container
`nv`	Apptainer	NVIDIA GPU passthrough (`true`/`false`)
`rocm`	Apptainer	AMD GPU passthrough (`true`/`false`)
`cleanenv`	Apptainer	Start with clean environment (`true`/`false`)
`fakeroot`	Apptainer	Run as fake root (`true`/`false`)
`writable_tmpfs`	Apptainer	Writable tmpfs overlay (`true`/`false`)
`overlay`	Apptainer	Overlay image path
`env`	Apptainer	Container environment variables (`KEY1=VAL1;KEY2=VAL2`)
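The key=value format described in the table can be decomposed mechanically: options are separated by commas, bind mounts and env entries by semicolons. The sketch below shows one way to do that. `parse_container_spec` is a hypothetical helper written for this guide, not srunx's own parser, and it assumes that values never contain a comma.

```python
def parse_container_spec(spec: str) -> dict:
    """Split a --container value into a dict of options.

    A value without any '=' is treated as a bare image path.
    """
    if "=" not in spec:
        return {"image": spec}

    options = {}
    for part in spec.split(","):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        key, value = key.strip(), value.strip()
        if key in ("mounts", "bind"):
            # Bind mounts are semicolon-separated; both spellings map to one key.
            options["mounts"] = value.split(";")
        elif key == "env":
            # KEY1=VAL1;KEY2=VAL2 becomes {"KEY1": "VAL1", "KEY2": "VAL2"}
            options["env"] = dict(item.split("=", 1) for item in value.split(";"))
        elif value in ("true", "false"):
            options[key] = value == "true"
        else:
            options[key] = value
    return options


if __name__ == "__main__":
    print(parse_container_spec("runtime=apptainer,image=pytorch.sif,nv=true"))
```

Splitting each option on the first `=` only is what lets `env=KEY1=VAL1;KEY2=VAL2` survive intact as a single option.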

Example with multiple options:

srunx submit python train.py --container \
  "runtime=apptainer,image=pytorch.sif,nv=true,bind=/data:/data;/models:/models,cleanenv=true"

Container + Conda/Venv#

Containers can be combined with conda or venv. The environment activation runs on the host before the containerized command:

srunx submit python train.py \
  --container "runtime=apptainer,image=pytorch.sif,nv=true,bind=/opt/conda:/opt/conda" \
  --conda ml_env

Note

When using cleanenv=true with Apptainer, host environment variables (including those set by conda/venv activation) are stripped. Pass needed variables explicitly via env= or ensure the relevant paths are bind-mounted.
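As a rough illustration of passing variables explicitly, the sketch below builds an `env=` clause from selected host variables so it can be appended to a `--container` value. `build_env_option` is a hypothetical helper, not part of srunx. It rejects values containing `,` or `;` because those characters act as separators in the key=value format; whether srunx offers an escaping mechanism for them is not covered here.

```python
import os


def build_env_option(names, environ=None):
    """Return an 'env=KEY1=VAL1;KEY2=VAL2' clause for the given variable names.

    Variables that are not set are skipped. Returns '' if none are set.
    """
    environ = os.environ if environ is None else environ
    pairs = []
    for name in names:
        if name not in environ:
            continue  # not set on the host, nothing to forward
        value = environ[name]
        if "," in value or ";" in value:
            # These characters separate options and entries in the spec.
            raise ValueError(f"{name} contains ',' or ';' and cannot be passed via env=")
        pairs.append(f"{name}={value}")
    return "env=" + ";".join(pairs) if pairs else ""


if __name__ == "__main__":
    clause = build_env_option(["CUDA_VISIBLE_DEVICES", "WANDB_MODE"])
    spec = "runtime=apptainer,image=pytorch.sif,nv=true,cleanenv=true"
    print(f"{spec},{clause}" if clause else spec)
```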

Suppressing Default Containers#

If a default container is configured (via SRUNX_DEFAULT_CONTAINER or config file), you can suppress it for individual jobs:

srunx submit python train.py --no-container

Command Line Interface#

Job Submission#

Basic submission:

srunx submit <command>

With resource specification:

srunx submit python train.py \
  --name "training_job" \
  --nodes 2 \
  --gpus-per-node 2 \
  --memory-per-node "64GB" \
  --time-limit "8:00:00" \
  --conda ml_env

Job Monitoring#

Check status:

srunx status 12345

List all jobs:

srunx list

List with GPU allocation info:

srunx list --show-gpus

List in JSON format:

srunx list --format json

Job Control#

Cancel a job:

srunx cancel 12345

Submit a job and wait until it completes:

srunx submit python script.py --wait

Workflows#

Workflow Definition#

Workflows are defined in YAML format with jobs and dependencies:

name: data_pipeline

jobs:
  - name: download_data
    command: ["python", "download.py"]
    resources:
      nodes: 1
      memory_per_node: "8GB"

  - name: preprocess
    command: ["python", "preprocess.py", "--input", "data/raw"]
    depends_on: [download_data]
    resources:
      nodes: 1
      cpus_per_task: 4

  - name: train_model
    command: ["python", "train.py"]
    depends_on: [preprocess]
    resources:
      nodes: 2
      gpus_per_node: 1
      time_limit: "12:00:00"
    environment:
      conda: pytorch_env

  - name: evaluate
    command: ["python", "evaluate.py"]
    depends_on: [train_model]
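The `depends_on` entries form a directed graph, and a valid run order is any topological ordering of it. The sketch below derives such an order with Python's standard `graphlib` module and reports undefined or circular dependencies. It illustrates the idea only and is not how srunx schedules jobs internally; `execution_order` is a hypothetical helper, and the job dicts mirror just the `name` and `depends_on` fields of the YAML above.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(jobs: list[dict]) -> list[str]:
    """Return job names in an order that satisfies every depends_on entry."""
    # Map each job to the set of jobs that must finish before it starts.
    graph = {job["name"]: set(job.get("depends_on", [])) for job in jobs}

    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"depends_on references undefined jobs: {sorted(unknown)}")

    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    pipeline = [
        {"name": "download_data"},
        {"name": "preprocess", "depends_on": ["download_data"]},
        {"name": "train_model", "depends_on": ["preprocess"]},
        {"name": "evaluate", "depends_on": ["train_model"]},
    ]
    print(execution_order(pipeline))
```

A check like this is a cheap way to catch a misspelled job name or a dependency loop before a workflow is submitted.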

Workflow Execution#

Run a workflow:

srunx flow run pipeline.yaml

Validate workflow syntax:

srunx flow validate pipeline.yaml

Preview a run without submitting any jobs (dry run):

srunx flow run pipeline.yaml --dry-run

Advanced Features#

Callbacks and Notifications#

srunx supports job completion callbacks, including Slack notifications:

from srunx.callbacks import SlackCallback
from srunx.client import Slurm
from srunx.models import Job

# Send a Slack notification through an incoming webhook
callback = SlackCallback(webhook_url="https://hooks.slack.com/...")
client = Slurm()

job = Job(
    name="training_job",
    command=["python", "train.py"],
)
# Callbacks are passed at submission time
result = client.submit(job, callbacks=[callback])

Template Customization#

srunx uses Jinja2 templates for SLURM script generation. You can customize templates by:

Copying default templates from srunx/templates/
Modifying them for your needs
Specifying the path to your custom template

Programmatic Usage#

Use srunx from Python code:

from srunx.client import Slurm
from srunx.models import Job, JobResource, JobEnvironment, ContainerResource

# Create client
client = Slurm()

# Define job with conda
job = Job(
    name="my_job",
    command=["python", "script.py"],
    resources=JobResource(
        nodes=2,
        gpus_per_node=1,
        memory_per_node="32GB",
        time_limit="4:00:00"
    ),
    environment=JobEnvironment(conda="ml_env")
)

# Define job with Apptainer container
container_job = Job(
    name="container_job",
    command=["python", "train.py"],
    resources=JobResource(gpus_per_node=2),
    environment=JobEnvironment(
        container=ContainerResource(
            runtime="apptainer",
            image="/path/to/pytorch.sif",
            nv=True,
            mounts=["/data:/data"],
        )
    )
)

# Submit and monitor
result = client.submit(job)
status = client.retrieve(result.job_id)
print(f"Job {result.job_id} status: {status}")
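If a script needs to block until a job finishes, a polling loop is one option. The sketch below is deliberately independent of srunx: `wait_for_terminal_state` is a hypothetical helper that takes any zero-argument callable returning a SLURM state string. The state names are standard SLURM job states, but the set listed here is not exhaustive. How to obtain such a string from the object returned by `client.retrieve` is an assumption left to the caller.

```python
import time

# A subset of SLURM job states after which a job will not run any further.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_for_terminal_state(get_state, poll_interval=30.0, timeout=None, sleep=time.sleep):
    """Call get_state() until it returns a terminal state.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state} after {timeout} seconds")
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated state sequence standing in for real scheduler queries.
    states = iter(["PENDING", "RUNNING", "COMPLETED"])
    print(wait_for_terminal_state(lambda: next(states), poll_interval=0))
```

Keep the poll interval generous on shared clusters, since every status query adds load on the scheduler.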

Best Practices#

Resource Planning#

Right-size your jobs: Don’t over-allocate resources
Use time limits: Prevent runaway jobs
Monitor resource usage: Optimize for future jobs

Environment Management#

Use environment isolation: Conda, venv, or containers
Pin dependencies: Ensure reproducibility
Test environments: Validate before large runs
Prefer Apptainer for reproducibility: SIF files are immutable and portable across clusters
Combine containers with conda: Use containers for system-level dependencies and conda for Python packages

Workflow Design#

Break down jobs: Smaller, focused jobs are easier to debug
Use dependencies wisely: Minimize blocking dependencies
Handle failures: Design for partial workflow recovery

Configuration#

Container Defaults#

Set default container settings via environment variables:

# Default container image
export SRUNX_DEFAULT_CONTAINER=/path/to/default.sif

# Default container runtime (pyxis, apptainer, singularity)
export SRUNX_DEFAULT_CONTAINER_RUNTIME=apptainer

Or in a config file (~/.config/srunx/config.json):

{
  "environment": {
    "container": {
      "runtime": "apptainer",
      "image": "/shared/containers/pytorch.sif"
    }
  }
}

The runtime resolution order (highest priority first):

Explicit --container-runtime CLI flag
runtime= key in --container value
SRUNX_DEFAULT_CONTAINER_RUNTIME environment variable
Config file setting
pyxis (default fallback)
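The resolution order above amounts to "first configured value wins". The sketch below expresses that rule directly. `resolve_runtime` and its parameter names are hypothetical and do not reflect srunx's internal code; only the `SRUNX_DEFAULT_CONTAINER_RUNTIME` variable name and the `pyxis` fallback come from this guide.

```python
import os


def resolve_runtime(cli_flag=None, spec_runtime=None, config_runtime=None, environ=None):
    """Pick the container runtime from the highest-priority source that is set."""
    environ = os.environ if environ is None else environ
    candidates = [
        cli_flag,                                        # --container-runtime
        spec_runtime,                                    # runtime= inside --container
        environ.get("SRUNX_DEFAULT_CONTAINER_RUNTIME"),  # environment variable
        config_runtime,                                  # config file setting
        "pyxis",                                         # default fallback
    ]
    # The final entry is always set, so next() cannot run out of candidates.
    return next(value for value in candidates if value)


if __name__ == "__main__":
    print(resolve_runtime(spec_runtime="singularity", config_runtime="apptainer"))
```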

Troubleshooting#

Common Issues#

Job fails to start

Check resource availability
Verify environment exists
Review SLURM script syntax

Workflow hangs

Check for circular dependencies
Verify all dependencies are satisfiable
Review job logs

Environment errors

Ensure conda/venv paths are correct
Check environment activation
Verify package availability

Container errors

Verify the image path exists and is accessible from compute nodes
For Apptainer, ensure nv=true is set in the --container value when using NVIDIA GPUs
Check that bind mount source paths exist on the host and that the target paths are valid inside the container
If using cleanenv=true, pass required environment variables via env=
Apptainer-specific flags (nv, rocm, etc.) raise an error if used with runtime=pyxis

Debug Mode#

Enable debug logging:

export SRUNX_LOG_LEVEL=DEBUG
srunx submit python script.py

Preview job submission (show summary without submitting):

srunx submit --dry-run python script.py

View rendered SLURM scripts:

srunx flow run pipeline.yaml --debug
