Orchestrate SLURM jobs like code.
A Python toolkit for submitting, monitoring, and chaining compute jobs on HPC clusters — with a web UI, an MCP server, and YAML workflows that feel like CI.
01 / Capabilities
-
Simple submission
One-line SLURM submits with conda, venv, Apptainer, and Pyxis wiring included.
-
Resource control
Declare nodes, GPUs, memory, and partitions — or let the defaults ride.
-
Workflows as YAML
Typed jobs with depends_on, retry, and Jinja-templated args.
-
Parameter sweeps
Matrix cross-product over hyperparameters. Per-cell tracking, bounded SSH pool, Web UI progress.
-
Live monitoring
Poll state, fan out Slack deliveries, and snapshot GPU utilization.
-
rsync project sync
Delta-transfer your repo to any cluster via ProxyJump-aware SSH.
-
Web UI
Browser dashboard for queue, DAG visualization, run history, and sweep detail pages.
-
MCP server
Claude Code and other MCP clients drive srunx over stdio, including run_workflow(sweep=..., mount=...).
-
Jinja templates
Every sbatch script is a rendered template you can override.
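To make the "matrix cross-product" behind parameter sweeps concrete, here is a minimal standard-library sketch of how a set of hyperparameter axes expands into individual cells. It is an illustration only, not srunx's implementation or API: the helper name `expand_matrix` and the example axes `lr` and `batch_size` are assumptions made for this example.

```python
from itertools import product


def expand_matrix(matrix):
    """Return one dict per cell of the cross-product of the matrix axes.

    `expand_matrix` is a hypothetical helper written for this sketch;
    it is not part of srunx.
    """
    names = list(matrix)
    axes = (matrix[name] for name in names)
    return [dict(zip(names, values)) for values in product(*axes)]


# Hypothetical hyperparameter axes, chosen only for illustration.
matrix = {"lr": [0.01, 0.001], "batch_size": [32, 64]}

cells = expand_matrix(matrix)
for index, cell in enumerate(cells):
    # Each cell would correspond to one independently tracked job.
    print(index, cell)
```

Two axes with two values each yield four cells, so the number of jobs in a sweep grows multiplicatively with every axis added.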
02 / In action
03 / Documentation
-
Tutorials
Start here. Install, submit your first job, tour the Web UI, and set up MCP.
-
How-to guides
Task-oriented recipes for workflows, monitoring, sync, and MCP usage.
-
Reference
Python API, Web UI endpoints, and the full MCP tool surface.
-
Explanation
Architectural decisions, design trade-offs, and how the pieces fit together.