User Guide#
This comprehensive guide covers all aspects of using srunx for SLURM job management.
Core Concepts#
Jobs#
A job in srunx represents a computational task that will be executed on a SLURM cluster. Jobs are defined with:
Command: The command to execute
Resources: CPU, memory, GPU, and time requirements
Environment: Conda, virtual environment, or container setup
Dependencies: Job dependencies for workflow orchestration
Resources#
srunx provides fine-grained control over resource allocation:
--nodes: Number of compute nodes
--tasks-per-node: Tasks per node
--cpus-per-task: CPUs per task
--gpus-per-node: GPUs per node
--memory-per-node: Memory per node
--time-limit: Maximum execution time
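These flags correspond to standard SLURM batch directives. The sketch below shows how such options typically map to #SBATCH header lines; the function name and rendering are illustrative, not srunx's actual implementation.

```python
# Sketch: mapping resource options to SLURM #SBATCH directives.
# Illustrative only; srunx's real script generation may differ.

def sbatch_header(nodes=1, tasks_per_node=1, cpus_per_task=1,
                  gpus_per_node=0, memory_per_node=None, time_limit=None):
    lines = [
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
    ]
    if gpus_per_node:
        lines.append(f"#SBATCH --gpus-per-node={gpus_per_node}")
    if memory_per_node:
        lines.append(f"#SBATCH --mem={memory_per_node}")
    if time_limit:
        lines.append(f"#SBATCH --time={time_limit}")
    return "\n".join(lines)

print(sbatch_header(nodes=2, gpus_per_node=2,
                    memory_per_node="64GB", time_limit="8:00:00"))
```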
Environment Management#
srunx supports conda and virtual environment activation:
Conda:
--conda env_name
Virtual Environment:
--venv /path/to/venv
Only one of conda or venv can be specified per job.
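The mutual-exclusivity rule can be expressed as a simple check. This validator is a sketch; the function name is illustrative and not part of srunx's API.

```python
# Sketch of the rule above: a job may specify conda OR venv, never both.
# Hypothetical helper, not srunx's actual validation code.

def check_environment(conda=None, venv=None):
    if conda and venv:
        raise ValueError("Specify only one of --conda or --venv per job")
    return conda or venv

print(check_environment(conda="ml_env"))
```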
Container Runtimes#
srunx supports multiple container runtimes for job execution. Containers are orthogonal to conda/venv — they can be used together.
Pyxis (NVIDIA Enroot) — default runtime, uses srun --container-* flags:
srunx submit python train.py --container /path/to/image.sqsh
Apptainer — wraps the command with apptainer exec:
srunx submit python train.py \
--container "runtime=apptainer,image=/path/to/image.sif,nv=true"
Singularity — same as Apptainer with singularity binary:
srunx submit python train.py \
--container "runtime=singularity,image=/path/to/image.sif,nv=true"
The runtime can also be specified with a separate flag:
srunx submit python train.py \
--container /path/to/image.sif \
--container-runtime apptainer
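Conceptually, the Apptainer runtime wraps the job command as `apptainer exec [--nv] IMAGE CMD...`. The sketch below builds such a command line; it is illustrative only, and srunx's real wrapper may add further flags.

```python
# Sketch of how an Apptainer-style runtime wraps a command.
# --nv and --bind are real Apptainer flags; the helper itself is hypothetical.

def wrap_apptainer(command, image, nv=False, binds=()):
    argv = ["apptainer", "exec"]
    if nv:
        argv.append("--nv")            # NVIDIA GPU passthrough
    for b in binds:
        argv += ["--bind", b]          # host:container bind mounts
    argv.append(image)
    return argv + list(command)

print(wrap_apptainer(["python", "train.py"], "/path/to/image.sif", nv=True))
```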
Container Options#
The --container flag accepts a key=value format for detailed configuration:
| Key | Runtime | Description |
|---|---|---|
| image | All | Container image path (SIF, sqsh, or Docker URI) |
| runtime | All | Container runtime (pyxis, apptainer, or singularity) |
| bind | All | Bind mounts (semicolon-separated, e.g. bind=/data:/data;/models:/models) |
| workdir | All | Working directory inside container |
| nv | Apptainer | NVIDIA GPU passthrough (true/false) |
| rocm | Apptainer | AMD GPU passthrough (true/false) |
| cleanenv | Apptainer | Start with clean environment (true/false) |
| fakeroot | Apptainer | Run as fake root (true/false) |
| writable_tmpfs | Apptainer | Writable tmpfs overlay (true/false) |
| overlay | Apptainer | Overlay image path |
| env | Apptainer | Container environment variables |
Example with multiple options:
srunx submit python train.py --container \
"runtime=apptainer,image=pytorch.sif,nv=true,bind=/data:/data;/models:/models,cleanenv=true"
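A spec like the one above can be parsed by splitting on commas and then splitting each pair once on the first `=` (so values such as env=VAR=value keep their embedded `=`). This is a sketch, not srunx's actual parser.

```python
# Sketch parser for the key=value --container format shown above.
# Hypothetical helper; srunx's real parsing may differ.

def parse_container_spec(spec):
    opts = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")  # split once: value may contain '='
        opts[key.strip()] = value
    return opts

spec = "runtime=apptainer,image=pytorch.sif,nv=true,bind=/data:/data;/models:/models"
print(parse_container_spec(spec))
```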
Container + Conda/Venv#
Containers can be combined with conda or venv. The environment activation runs on the host before the containerized command:
srunx submit python train.py \
--container "runtime=apptainer,image=pytorch.sif,nv=true,bind=/opt/conda:/opt/conda" \
--conda ml_env
Note
When using cleanenv=true with Apptainer, host environment variables (including those set by conda/venv activation) are stripped. Pass needed variables explicitly via env= or ensure the relevant paths are bind-mounted.
Suppressing Default Containers#
If a default container is configured (via SRUNX_DEFAULT_CONTAINER or config file), you can suppress it for individual jobs:
srunx submit python train.py --no-container
Command Line Interface#
Job Submission#
Basic submission:
srunx submit <command>
With resource specification:
srunx submit python train.py \
--name "training_job" \
--nodes 2 \
--gpus-per-node 2 \
--memory-per-node "64GB" \
--time-limit "8:00:00" \
--conda ml_env
Job Monitoring#
Check status:
srunx status 12345
List all jobs:
srunx list
List with GPU allocation info:
srunx list --show-gpus
List in JSON format:
srunx list --format json
Job Control#
Cancel a job:
srunx cancel 12345
Monitor job until completion:
srunx submit python script.py --wait
Workflows#
Workflow Definition#
Workflows are defined in YAML format with jobs and dependencies:
name: data_pipeline
jobs:
- name: download_data
command: ["python", "download.py"]
resources:
nodes: 1
memory_per_node: "8GB"
- name: preprocess
command: ["python", "preprocess.py", "--input", "data/raw"]
depends_on: [download_data]
resources:
nodes: 1
cpus_per_task: 4
- name: train_model
command: ["python", "train.py"]
depends_on: [preprocess]
resources:
nodes: 2
gpus_per_node: 1
time_limit: "12:00:00"
environment:
conda: pytorch_env
- name: evaluate
command: ["python", "evaluate.py"]
depends_on: [train_model]
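The depends_on fields define a directed acyclic graph, and the submission order is a topological sort of that graph. srunx resolves this for you; the sketch below just illustrates the ordering the pipeline above implies, using the standard library's graphlib.

```python
# Topological ordering implied by the depends_on graph above.
from graphlib import TopologicalSorter

deps = {
    "download_data": [],
    "preprocess": ["download_data"],
    "train_model": ["preprocess"],
    "evaluate": ["train_model"],
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # → ['download_data', 'preprocess', 'train_model', 'evaluate']
```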
Workflow Execution#
Run a workflow:
srunx flow run pipeline.yaml
Validate workflow syntax:
srunx flow validate pipeline.yaml
Preview execution without submitting:
srunx flow run pipeline.yaml --dry-run
Advanced Features#
Callbacks and Notifications#
srunx supports job completion callbacks, including Slack notifications:
from srunx.callbacks import SlackCallback
from srunx.client import Slurm
from srunx.models import Job
callback = SlackCallback(webhook_url="https://hooks.slack.com/...")
client = Slurm()
job = Job(
name="training_job",
command=["python", "train.py"],
)
result = client.submit(job, callbacks=[callback])
Template Customization#
srunx uses Jinja2 templates for SLURM script generation. You can customize templates by:
Copying the default templates from srunx/templates/
Modifying them for your needs
Specifying a custom template path
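To illustrate the substitution idea, the sketch below uses the stdlib's string.Template as a stand-in for Jinja2 (whose syntax differs); the placeholder names are hypothetical, not srunx's actual template variables.

```python
# Sketch of template-based SLURM script generation.
# srunx uses Jinja2; string.Template is used here purely for illustration.
from string import Template

script = Template("""#!/bin/bash
#SBATCH --job-name=$name
#SBATCH --nodes=$nodes
$command
""")
print(script.substitute(name="my_job", nodes=2, command="python train.py"))
```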
Programmatic Usage#
Use srunx from Python code:
from srunx.client import Slurm
from srunx.models import Job, JobResource, JobEnvironment, ContainerResource
# Create client
client = Slurm()
# Define job with conda
job = Job(
name="my_job",
command=["python", "script.py"],
resources=JobResource(
nodes=2,
gpus_per_node=1,
memory_per_node="32GB",
time_limit="4:00:00"
),
environment=JobEnvironment(conda="ml_env")
)
# Define job with Apptainer container
container_job = Job(
name="container_job",
command=["python", "train.py"],
resources=JobResource(gpus_per_node=2),
environment=JobEnvironment(
container=ContainerResource(
runtime="apptainer",
image="/path/to/pytorch.sif",
nv=True,
mounts=["/data:/data"],
)
)
)
# Submit and monitor
result = client.submit(job)
status = client.retrieve(result.job_id)
print(f"Job {result.job_id} status: {status}")
Best Practices#
Resource Planning#
Right-size your jobs: Don’t over-allocate resources
Use time limits: Prevent runaway jobs
Monitor resource usage: Optimize for future jobs
Environment Management#
Use environment isolation: Conda, venv, or containers
Pin dependencies: Ensure reproducibility
Test environments: Validate before large runs
Prefer Apptainer for reproducibility: SIF files are immutable and portable across clusters
Combine containers with conda: Use containers for system-level dependencies and conda for Python packages
Workflow Design#
Break down jobs: Smaller, focused jobs are easier to debug
Use dependencies wisely: Minimize blocking dependencies
Handle failures: Design for partial workflow recovery
Configuration#
Container Defaults#
Set default container settings via environment variables:
# Default container image
export SRUNX_DEFAULT_CONTAINER=/path/to/default.sif
# Default container runtime (pyxis, apptainer, singularity)
export SRUNX_DEFAULT_CONTAINER_RUNTIME=apptainer
Or in a config file (~/.config/srunx/config.json):
{
"environment": {
"container": {
"runtime": "apptainer",
"image": "/shared/containers/pytorch.sif"
}
}
}
The runtime resolution order (highest priority first):
1. Explicit --container-runtime CLI flag
2. runtime= key in the --container value
3. SRUNX_DEFAULT_CONTAINER_RUNTIME environment variable
4. Config file setting
5. pyxis (default fallback)
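The resolution order described above amounts to a first-non-empty lookup. The sketch below mirrors that logic; the function name and argument structure are illustrative, not srunx internals.

```python
# Sketch of the runtime resolution order (highest priority first).
# Hypothetical helper, not srunx's actual implementation.

def resolve_runtime(cli_flag=None, container_spec=None, env=None, config=None):
    spec = container_spec or {}
    return (
        cli_flag                                               # 1. --container-runtime flag
        or spec.get("runtime")                                 # 2. runtime= key in --container
        or (env or {}).get("SRUNX_DEFAULT_CONTAINER_RUNTIME")  # 3. environment variable
        or (config or {}).get("runtime")                       # 4. config file setting
        or "pyxis"                                             # 5. default fallback
    )

print(resolve_runtime(container_spec={"runtime": "apptainer"}))  # apptainer
print(resolve_runtime())  # pyxis
```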
Troubleshooting#
Common Issues#
- Job fails to start
Check resource availability
Verify environment exists
Review SLURM script syntax
- Workflow hangs
Check for circular dependencies
Verify all dependencies are satisfiable
Review job logs
- Environment errors
Ensure conda/venv paths are correct
Check environment activation
Verify package availability
- Container errors
Verify the image path exists and is accessible from compute nodes
For Apptainer, ensure --nv is set when using GPUs
Check bind mount paths exist on both host and container
If using --cleanenv, pass required environment variables via env=
Apptainer-specific flags (nv, rocm, etc.) raise an error if used with runtime=pyxis
Debug Mode#
Enable debug logging:
export SRUNX_LOG_LEVEL=DEBUG
srunx submit python script.py
Preview job submission (show summary without submitting):
srunx submit --dry-run python script.py
View rendered SLURM scripts:
srunx flow run pipeline.yaml --debug