User Guide¶
This guide covers day-to-day use of srunx for SLURM job management: defining jobs, allocating resources, managing environments and containers, orchestrating workflows, and troubleshooting.
Core Concepts¶
Jobs¶
A job in srunx represents a computational task that will be executed on a SLURM cluster. Jobs are defined with:
- Command: The command to execute
- Resources: CPU, memory, GPU, and time requirements
- Environment: Conda, virtual environment, or container setup
- Dependencies: Job dependencies for workflow orchestration
Resources¶
srunx provides fine-grained control over resource allocation:
- --nodes: Number of compute nodes
- --tasks-per-node: Tasks per node
- --cpus-per-task: CPUs per task
- --gpus-per-node: GPUs per node
- --memory / --mem: Memory per node
- --time-limit: Maximum execution time
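For example, a single-node CPU job combining several of these flags (values are illustrative):
srunx sbatch --wrap "python preprocess.py" --nodes 1 --cpus-per-task 8 --memory "32GB" --time-limit "2:00:00"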
Environment Management¶
srunx supports conda and virtual environment activation:
- Conda: --conda env_name
- Virtual Environment: --venv /path/to/venv
Only one of conda or venv can be specified per job.
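For example:
srunx sbatch --wrap "python train.py" --conda ml_env
srunx sbatch --wrap "python train.py" --venv /path/to/venv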
Container Runtimes¶
srunx supports multiple container runtimes for job execution. Containers are orthogonal to conda/venv — they can be used together.
Pyxis (NVIDIA Enroot) — default runtime, uses srun --container-* flags:
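srunx sbatch --wrap "python train.py" --container /path/to/image.sqsh   # pyxis is the default, so no runtime= key is needed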
Apptainer — wraps the command with apptainer exec:
srunx sbatch --wrap "python train.py" \
--container "runtime=apptainer,image=/path/to/image.sif,nv=true"
Singularity — same as Apptainer with singularity binary:
srunx sbatch --wrap "python train.py" \
--container "runtime=singularity,image=/path/to/image.sif,nv=true"
The runtime can also be specified with a separate flag:
srunx sbatch --wrap "python train.py" \
--container /path/to/image.sif \
--container-runtime apptainer
Container Options¶
The --container flag accepts a key=value format for detailed configuration:
| Key | Runtime | Description |
|---|---|---|
| image | All | Container image path (SIF, sqsh, or Docker URI) |
| runtime | All | pyxis (default), apptainer, or singularity |
| mounts / bind | All | Bind mounts (semicolon-separated, e.g. /data:/data;/scratch:/scratch) |
| workdir | All | Working directory inside container |
| nv | Apptainer | NVIDIA GPU passthrough (true/false) |
| rocm | Apptainer | AMD GPU passthrough (true/false) |
| cleanenv | Apptainer | Start with clean environment (true/false) |
| fakeroot | Apptainer | Run as fake root (true/false) |
| writable_tmpfs | Apptainer | Writable tmpfs overlay (true/false) |
| overlay | Apptainer | Overlay image path |
| env | Apptainer | Container environment variables (KEY1=VAL1;KEY2=VAL2) |
Example with multiple options:
srunx sbatch --wrap "python train.py" --container \
"runtime=apptainer,image=pytorch.sif,nv=true,bind=/data:/data;/models:/models,cleanenv=true"
Container + Conda/Venv¶
Containers can be combined with conda or venv. The environment activation runs on the host before the containerized command:
srunx sbatch --wrap "python train.py" \
--container "runtime=apptainer,image=pytorch.sif,nv=true,bind=/opt/conda:/opt/conda" \
--conda ml_env
Note
When using cleanenv=true with Apptainer, host environment variables (including those set by conda/venv activation) are stripped. Pass needed variables explicitly via env= or ensure the relevant paths are bind-mounted.
Suppressing Default Containers¶
If a default container is configured (via SRUNX_DEFAULT_CONTAINER or config file), you can suppress it for individual jobs:
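The exact flag name below is an assumption; check srunx sbatch --help for your version:
srunx sbatch --wrap "python train.py" --no-container   # flag name is an assumption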
Command Line Interface¶
Job Submission¶
Basic submission:
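srunx sbatch --wrap "python train.py"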
With resource specification:
srunx sbatch --wrap "python train.py" \
--name "training_job" \
--nodes 2 \
--gpus-per-node 2 \
--memory "64GB" \
--time-limit "8:00:00" \
--conda ml_env
Job Monitoring¶
Check a specific job's state (active queue):
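srunx squeue -j 12345   # subcommand and -j flag mirror native squeue (assumed); job ID is a placeholder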
For finished jobs (srunx state DB):
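srunx state 12345   # reads srunx's local state DB; subcommand name assumed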
List all jobs:
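srunx squeue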
Filter to a single user (default is all users):
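srunx squeue -u alice   # -u mirrors native squeue (assumed); default shows all users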
Live in-place refresh (like native squeue -i; Ctrl+C to exit):
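srunx squeue -i 5   # refresh every 5 seconds; Ctrl+C to exit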
Show additional columns (each --show-* flag adds one; -a enables all):
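srunx squeue --show-partition   # each --show-* flag adds one column (flag suffix is illustrative)
srunx squeue -a                 # all optional columns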
List in JSON format (always includes every field, regardless of --show-*):
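srunx squeue --json   # flag name assumed; JSON output always includes every field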
Inspect cluster resources:
srunx gpus # GPU aggregate summary across partitions
srunx sinfo # partition / state / nodelist (native-sinfo parity)
Cluster-side accounting history (real SLURM sacct, requires slurmdbd):
srunx sacct -a -S now-1day # all users, last 24h
srunx sacct -j 12345 --show-steps # include .batch / .extern sub-steps
srunx sacct -s FAILED,TIMEOUT -p gpu # filter by state / partition
Job Control¶
Cancel a job:
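srunx cancel 12345   # subcommand name assumed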
Monitor job until completion:
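srunx wait 12345   # hypothetical subcommand; blocks until the job reaches a terminal state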
Workflows¶
Workflow Definition¶
Workflows are defined in YAML format with jobs and dependencies:
name: data_pipeline
jobs:
- name: download_data
command: ["python", "download.py"]
resources:
nodes: 1
memory_per_node: "8GB"
- name: preprocess
command: ["python", "preprocess.py", "--input", "data/raw"]
depends_on: [download_data]
resources:
nodes: 1
cpus_per_task: 4
- name: train_model
command: ["python", "train.py"]
depends_on: [preprocess]
resources:
nodes: 2
gpus_per_node: 1
time_limit: "12:00:00"
environment:
conda: pytorch_env
- name: evaluate
command: ["python", "evaluate.py"]
depends_on: [train_model]
Workflow Execution¶
Run a workflow:
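srunx flow run workflow.yaml   # the flow subcommand name is an assumption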
Validate workflow syntax:
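srunx flow validate workflow.yaml   # validate subcommand assumed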
Run with custom parameters:
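srunx flow run workflow.yaml --var epochs=20   # --var is an assumption for parameter overrides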
Advanced Features¶
Callbacks and Notifications¶
srunx supports job completion callbacks, including Slack notifications:
from srunx.domain import Job
from srunx.observability.notifications.legacy_slack import SlackCallback
from srunx.slurm.local import Slurm
callback = SlackCallback(webhook_url="https://hooks.slack.com/...")
client = Slurm()
job = Job(
name="training_job",
command=["python", "train.py"],
)
result = client.submit(job, callbacks=[callback])
Template Customization¶
srunx uses Jinja2 templates for SLURM script generation. You can customize templates by:
- Copying default templates from srunx/templates/
- Modifying them for your needs
- Specifying a custom template path
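A minimal sketch of the last step, assuming the client accepts a template path (the template_path parameter name is hypothetical):
from srunx.slurm.local import Slurm

# template_path is a hypothetical parameter name; check your version's Slurm() signature
client = Slurm(template_path="/path/to/custom_template.slurm.jinja")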
Programmatic Usage¶
Use srunx from Python code:
from srunx.slurm.local import Slurm
from srunx.domain import Job, JobResource, JobEnvironment, ContainerResource
# Create client
client = Slurm()
# Define job with conda
job = Job(
name="my_job",
command=["python", "script.py"],
resources=JobResource(
nodes=2,
gpus_per_node=1,
memory_per_node="32GB",
time_limit="4:00:00"
),
environment=JobEnvironment(conda="ml_env")
)
# Define job with Apptainer container
container_job = Job(
name="container_job",
command=["python", "train.py"],
resources=JobResource(gpus_per_node=2),
environment=JobEnvironment(
container=ContainerResource(
runtime="apptainer",
image="/path/to/pytorch.sif",
nv=True,
mounts=["/data:/data"],
)
)
)
# Submit and monitor
result = client.submit(job)
status = client.retrieve(result.job_id)
print(f"Job {result.job_id} status: {status}")
Best Practices¶
Resource Planning¶
- Right-size your jobs: Don't over-allocate resources
- Use time limits: Prevent runaway jobs
- Monitor resource usage: Optimize for future jobs
Environment Management¶
- Use environment isolation: Conda, venv, or containers
- Pin dependencies: Ensure reproducibility
- Test environments: Validate before large runs
- Prefer Apptainer for reproducibility: SIF files are immutable and portable across clusters
- Combine containers with conda: Use containers for system-level dependencies and conda for Python packages
Workflow Design¶
- Break down jobs: Smaller, focused jobs are easier to debug
- Use dependencies wisely: Minimize blocking dependencies
- Handle failures: Design for partial workflow recovery
Configuration¶
Container Defaults¶
Set default container settings via environment variables:
# Default container image
export SRUNX_DEFAULT_CONTAINER=/path/to/default.sif
# Default container runtime (pyxis, apptainer, singularity)
export SRUNX_DEFAULT_CONTAINER_RUNTIME=apptainer
Or in a config file (~/.config/srunx/config.json):
{
"environment": {
"container": {
"runtime": "apptainer",
"image": "/shared/containers/pytorch.sif"
}
}
}
The runtime resolution order (highest priority first):
1. Explicit --container-runtime CLI flag
2. runtime= key in the --container value
3. SRUNX_DEFAULT_CONTAINER_RUNTIME environment variable
4. Config file setting
5. pyxis (default fallback)
Troubleshooting¶
Common Issues¶
Job fails to start
- Check resource availability
- Verify environment exists
- Review SLURM script syntax
Workflow hangs
- Check for circular dependencies
- Verify all dependencies are satisfiable
- Review job logs
Environment errors
- Ensure conda/venv paths are correct
- Check environment activation
- Verify package availability
Container errors
- Verify the image path exists and is accessible from compute nodes
- For Apptainer, ensure nv=true is set when using GPUs (srunx passes Apptainer's --nv)
- Check that bind mount paths exist on both host and container
- If using cleanenv=true, pass required environment variables via env=
- Apptainer-specific flags (nv, rocm, etc.) raise an error if used with runtime=pyxis
Debug Mode¶
Enable debug logging:
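export SRUNX_LOG_LEVEL=DEBUG   # variable name is an assumption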
Preview job submission (show summary without submitting):
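srunx sbatch --wrap "python train.py" --dry-run   # flag name assumed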
View rendered SLURM scripts:
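srunx sbatch --wrap "python train.py" --render   # hypothetical flag that prints the generated sbatch script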