
User Guide

This comprehensive guide covers all aspects of using srunx for SLURM job management.

Core Concepts

Jobs

A job in srunx represents a computational task that will be executed on a SLURM cluster. Jobs are defined with:

  • Command: The command to execute
  • Resources: CPU, memory, GPU, and time requirements
  • Environment: Conda, virtual environment, or container setup
  • Dependencies: Job dependencies for workflow orchestration

Resources

srunx provides fine-grained control over resource allocation:

  • --nodes: Number of compute nodes
  • --tasks-per-node: Tasks per node
  • --cpus-per-task: CPUs per task
  • --gpus-per-node: GPUs per node
  • --memory / --mem: Memory per node
  • --time-limit: Maximum execution time
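To make the flag-to-directive relationship concrete, here is a minimal sketch of how resource options like these could be rendered into `#SBATCH` directives. This is illustrative only, not srunx's actual implementation; the option names and mapping are assumptions for the example.

```python
# Illustrative sketch (not srunx internals): mapping resource options
# onto the corresponding #SBATCH directives in a generated script.
def render_sbatch_directives(opts: dict) -> list:
    # Hypothetical mapping from option names to SLURM directive names.
    mapping = {
        "nodes": "--nodes",
        "tasks_per_node": "--ntasks-per-node",
        "cpus_per_task": "--cpus-per-task",
        "gpus_per_node": "--gpus-per-node",
        "memory": "--mem",
        "time_limit": "--time",
    }
    return [f"#SBATCH {mapping[k]}={v}" for k, v in opts.items() if k in mapping]

directives = render_sbatch_directives(
    {"nodes": "2", "gpus_per_node": "2", "memory": "64GB", "time_limit": "8:00:00"}
)
print("\n".join(directives))
```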

Environment Management

srunx supports conda and virtual environment activation:

  • Conda: --conda env_name
  • Virtual Environment: --venv /path/to/venv

Only one of conda or venv can be specified per job.

Container Runtimes

srunx supports multiple container runtimes for job execution. Containers are orthogonal to conda/venv — they can be used together.

Pyxis (NVIDIA Enroot) — default runtime, uses srun --container-* flags:

srunx sbatch --wrap "python train.py" --container /path/to/image.sqsh
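As a sketch of what "uses srun --container-* flags" means in practice, the snippet below builds the standard Pyxis plugin options (`--container-image`, `--container-mounts`, `--container-workdir`) from a container spec. The function itself is illustrative, not srunx's code.

```python
# Illustrative sketch: composing the Pyxis (NVIDIA Enroot) srun flags
# for a container spec. These three --container-* flag names are the
# standard Pyxis plugin options.
def pyxis_flags(image, mounts=None, workdir=None):
    flags = [f"--container-image={image}"]
    if mounts:
        # Pyxis expects mounts as a comma-separated src:dst list.
        flags.append("--container-mounts=" + ",".join(mounts))
    if workdir:
        flags.append(f"--container-workdir={workdir}")
    return flags

print(pyxis_flags("/path/to/image.sqsh", mounts=["/data:/data"], workdir="/workspace"))
```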

Apptainer — wraps the command with apptainer exec:

srunx sbatch --wrap "python train.py" \
  --container "runtime=apptainer,image=/path/to/image.sif,nv=true"

Singularity — same as Apptainer with singularity binary:

srunx sbatch --wrap "python train.py" \
  --container "runtime=singularity,image=/path/to/image.sif,nv=true"
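The "wraps the command with apptainer exec" behavior can be sketched as follows. The function name and defaults are illustrative, not srunx's API; only the `apptainer exec` / `singularity exec` command shape is taken from the runtimes themselves.

```python
# Illustrative sketch of wrapping a job command for Apptainer/Singularity.
def wrap_exec(command, image, binary="apptainer", nv=False, binds=None):
    argv = [binary, "exec"]
    if nv:
        argv.append("--nv")        # NVIDIA GPU passthrough
    for b in binds or []:
        argv += ["--bind", b]      # host:container bind mounts
    argv.append(image)
    return argv + command

cmd = wrap_exec(["python", "train.py"], "/path/to/image.sif", nv=True)
print(" ".join(cmd))
```

Passing `binary="singularity"` produces the Singularity variant, mirroring the two runtimes above.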

The runtime can also be specified with a separate flag:

srunx sbatch --wrap "python train.py" \
  --container /path/to/image.sif \
  --container-runtime apptainer

Container Options

The --container flag accepts a key=value format for detailed configuration:

Key             Runtime    Description
--------------  ---------  ------------------------------------------------------------------
image           All        Container image path (SIF, sqsh, or Docker URI)
runtime         All        pyxis (default), apptainer, or singularity
mounts / bind   All        Bind mounts (semicolon-separated, e.g. /data:/data;/scratch:/scratch)
workdir         All        Working directory inside container
nv              Apptainer  NVIDIA GPU passthrough (true/false)
rocm            Apptainer  AMD GPU passthrough (true/false)
cleanenv        Apptainer  Start with clean environment (true/false)
fakeroot        Apptainer  Run as fake root (true/false)
writable_tmpfs  Apptainer  Writable tmpfs overlay (true/false)
overlay         Apptainer  Overlay image path
env             Apptainer  Container environment variables (KEY1=VAL1;KEY2=VAL2)

Example with multiple options:

srunx sbatch --wrap "python train.py" --container \
  "runtime=apptainer,image=pytorch.sif,nv=true,bind=/data:/data;/models:/models,cleanenv=true"
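The grammar of this format is worth spelling out: commas separate options, while semicolons separate the multiple values inside `bind`/`mounts` and `env`. The parser below is a sketch of that grammar, not srunx's actual implementation.

```python
# Illustrative parser for the --container key=value syntax (a sketch of
# the documented format, not srunx's real parser).
def parse_container_spec(spec):
    opts = {}
    for token in spec.split(","):
        key, _, value = token.partition("=")
        if key in ("mounts", "bind"):
            opts["mounts"] = value.split(";")
        elif key == "env":
            opts["env"] = dict(kv.split("=", 1) for kv in value.split(";"))
        elif value in ("true", "false"):
            opts[key] = value == "true"
        else:
            opts[key] = value
    return opts

spec = "runtime=apptainer,image=pytorch.sif,nv=true,bind=/data:/data;/models:/models,cleanenv=true"
print(parse_container_spec(spec))
```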

Container + Conda/Venv

Containers can be combined with conda or venv. The environment activation runs on the host before the containerized command:

srunx sbatch --wrap "python train.py" \
  --container "runtime=apptainer,image=pytorch.sif,nv=true,bind=/opt/conda:/opt/conda" \
  --conda ml_env

Note

When using cleanenv=true with Apptainer, host environment variables (including those set by conda/venv activation) are stripped. Pass needed variables explicitly via env= or ensure the relevant paths are bind-mounted.

Suppressing Default Containers

If a default container is configured (via SRUNX_DEFAULT_CONTAINER or config file), you can suppress it for individual jobs:

srunx sbatch --wrap "python train.py" --no-container

Command Line Interface

Job Submission

Basic submission:

srunx sbatch --wrap "<command>"

With resource specification:

srunx sbatch --wrap "python train.py" \
  --name "training_job" \
  --nodes 2 \
  --gpus-per-node 2 \
  --memory "64GB" \
  --time-limit "8:00:00" \
  --conda ml_env

Job Monitoring

Check a specific job's state (active queue):

srunx squeue -j 12345

For finished jobs (srunx state DB):

srunx history -j 12345

List all jobs:

srunx squeue

Filter to a single user (default is all users):

srunx squeue --user alice

Live in-place refresh (like native squeue -i; Ctrl+C to exit):

srunx squeue -i 5

Show additional columns (each --show-* flag adds one; -a enables all):

srunx squeue --show-partition --show-cpus --show-limit --show-nodes
srunx squeue -a

List in JSON format (always includes every field, regardless of --show-*):

srunx squeue --format json

Inspect cluster resources:

srunx gpus                 # GPU aggregate summary across partitions
srunx sinfo                # partition / state / nodelist (native-sinfo parity)

Cluster-side accounting history (real SLURM sacct, requires slurmdbd):

srunx sacct -a -S now-1day                      # all users, last 24h
srunx sacct -j 12345 --show-steps               # include .batch / .extern sub-steps
srunx sacct -s FAILED,TIMEOUT -p gpu            # filter by state / partition

Job Control

Cancel a job:

srunx scancel 12345

Monitor job until completion:

srunx sbatch --wrap "python script.py" --wait
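Conceptually, `--wait` polls the job until it reaches a terminal SLURM state. The sketch below shows that polling loop; the state names are standard SLURM states, but the function and its interval are assumptions for illustration, not srunx's real implementation.

```python
import time

# Standard terminal SLURM job states (illustrative subset).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def wait_for_job(get_state, poll_interval=10.0):
    """Poll get_state() until the job reaches a terminal state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval)

# Simulated job: pending, running twice, then done.
states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
print(wait_for_job(lambda: next(states), poll_interval=0))
```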

Workflows

Workflow Definition

Workflows are defined in YAML format with jobs and dependencies:

name: data_pipeline

jobs:
  - name: download_data
    command: ["python", "download.py"]
    resources:
      nodes: 1
      memory_per_node: "8GB"

  - name: preprocess
    command: ["python", "preprocess.py", "--input", "data/raw"]
    depends_on: [download_data]
    resources:
      nodes: 1
      cpus_per_task: 4

  - name: train_model
    command: ["python", "train.py"]
    depends_on: [preprocess]
    resources:
      nodes: 2
      gpus_per_node: 1
      time_limit: "12:00:00"
    environment:
      conda: pytorch_env

  - name: evaluate
    command: ["python", "evaluate.py"]
    depends_on: [train_model]
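The `depends_on` entries above form a directed acyclic graph, and a valid execution order is its topological sort. The sketch below demonstrates this with Python's standard-library `graphlib`; note this only illustrates the ordering, and a scheduler may instead express the same graph through SLURM dependency flags.

```python
from graphlib import TopologicalSorter

# The dependency graph from the data_pipeline workflow above:
# each job maps to the jobs it depends on.
deps = {
    "download_data": [],
    "preprocess": ["download_data"],
    "train_model": ["preprocess"],
    "evaluate": ["train_model"],
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # every job appears after all of its dependencies
```

`TopologicalSorter` raises `CycleError` on circular dependencies, which is exactly the failure mode that makes a workflow unrunnable.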

Workflow Execution

Run a workflow:

srunx flow run pipeline.yaml

Validate workflow syntax:

srunx flow run --validate pipeline.yaml

Preview the execution plan without submitting any jobs:

srunx flow run pipeline.yaml --dry-run

Advanced Features

Callbacks and Notifications

srunx supports job completion callbacks, including Slack notifications:

from srunx.domain import Job
from srunx.observability.notifications.legacy_slack import SlackCallback
from srunx.slurm.local import Slurm

callback = SlackCallback(webhook_url="https://hooks.slack.com/...")
client = Slurm()

job = Job(
    name="training_job",
    command=["python", "train.py"],
)
result = client.submit(job, callbacks=[callback])

Template Customization

srunx uses Jinja2 templates for SLURM script generation. You can customize templates by:

  1. Copying default templates from srunx/templates/
  2. Modifying them for your needs
  3. Specifying custom template path
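The steps above boil down to rendering a Jinja2 template against a job context. Here is a minimal rendering sketch; the template variables (`job_name`, `nodes`, `command`) are illustrative and not necessarily srunx's actual template context.

```python
from jinja2 import Template

# Minimal Jinja2 rendering sketch for a SLURM batch script.
template = Template(
    "#!/bin/bash\n"
    "#SBATCH --job-name={{ job_name }}\n"
    "#SBATCH --nodes={{ nodes }}\n"
    "{{ command }}\n"
)
script = template.render(job_name="training_job", nodes=2, command="python train.py")
print(script)
```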

Programmatic Usage

Use srunx from Python code:

from srunx.slurm.local import Slurm
from srunx.domain import Job, JobResource, JobEnvironment, ContainerResource

# Create client
client = Slurm()

# Define job with conda
job = Job(
    name="my_job",
    command=["python", "script.py"],
    resources=JobResource(
        nodes=2,
        gpus_per_node=1,
        memory_per_node="32GB",
        time_limit="4:00:00"
    ),
    environment=JobEnvironment(conda="ml_env")
)

# Define job with Apptainer container
container_job = Job(
    name="container_job",
    command=["python", "train.py"],
    resources=JobResource(gpus_per_node=2),
    environment=JobEnvironment(
        container=ContainerResource(
            runtime="apptainer",
            image="/path/to/pytorch.sif",
            nv=True,
            mounts=["/data:/data"],
        )
    )
)

# Submit and monitor
result = client.submit(job)
status = client.retrieve(result.job_id)
print(f"Job {result.job_id} status: {status}")

Best Practices

Resource Planning

  1. Right-size your jobs: Don't over-allocate resources
  2. Use time limits: Prevent runaway jobs
  3. Monitor resource usage: Optimize for future jobs

Environment Management

  1. Use environment isolation: Conda, venv, or containers
  2. Pin dependencies: Ensure reproducibility
  3. Test environments: Validate before large runs
  4. Prefer Apptainer for reproducibility: SIF files are immutable and portable across clusters
  5. Combine containers with conda: Use containers for system-level dependencies and conda for Python packages

Workflow Design

  1. Break down jobs: Smaller, focused jobs are easier to debug
  2. Use dependencies wisely: Minimize blocking dependencies
  3. Handle failures: Design for partial workflow recovery

Configuration

Container Defaults

Set default container settings via environment variables:

# Default container image
export SRUNX_DEFAULT_CONTAINER=/path/to/default.sif

# Default container runtime (pyxis, apptainer, singularity)
export SRUNX_DEFAULT_CONTAINER_RUNTIME=apptainer

Or in a config file (~/.config/srunx/config.json):

{
  "environment": {
    "container": {
      "runtime": "apptainer",
      "image": "/shared/containers/pytorch.sif"
    }
  }
}

The runtime resolution order (highest priority first):

  1. Explicit --container-runtime CLI flag
  2. runtime= key in --container value
  3. SRUNX_DEFAULT_CONTAINER_RUNTIME environment variable
  4. Config file setting
  5. pyxis (default fallback)
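The five-level resolution order above amounts to "first non-empty source wins". A minimal sketch of that precedence logic (illustrative only, not srunx's code):

```python
# Illustrative sketch of the documented runtime resolution order:
# CLI flag > runtime= key > env var > config file > pyxis default.
def resolve_runtime(cli_flag=None, container_spec=None,
                    env_var=None, config_value=None):
    spec_runtime = (container_spec or {}).get("runtime")
    for candidate in (cli_flag, spec_runtime, env_var, config_value):
        if candidate:
            return candidate
    return "pyxis"  # default fallback

print(resolve_runtime(env_var="apptainer"))
print(resolve_runtime(cli_flag="singularity", env_var="apptainer"))
print(resolve_runtime())
```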

Troubleshooting

Common Issues

Job fails to start
  • Check resource availability
  • Verify the environment exists
  • Review SLURM script syntax

Workflow hangs
  • Check for circular dependencies
  • Verify all dependencies are satisfiable
  • Review job logs

Environment errors
  • Ensure conda/venv paths are correct
  • Check environment activation
  • Verify package availability

Container errors
  • Verify the image path exists and is accessible from compute nodes
  • For Apptainer, ensure nv=true is set when using GPUs
  • Check that bind mount paths exist on both host and container
  • If using cleanenv=true, pass required environment variables via env=
  • Apptainer-specific options (nv, rocm, etc.) raise an error if used with runtime=pyxis

Debug Mode

Enable debug logging:

export SRUNX_LOG_LEVEL=DEBUG
srunx sbatch --wrap "python script.py"

Preview job submission (show summary without submitting):

srunx sbatch --wrap "python script.py" --dry-run

View rendered SLURM scripts:

srunx flow run pipeline.yaml --debug