# Architecture
This document explains the internal architecture and design decisions of srunx.
## Overview
srunx is organized as a modular Python library with clear separation of concerns:
```text
src/srunx/
├── models.py      # Data models and validation (Pydantic)
├── client.py      # Local SLURM client (subprocess-based)
├── runner.py      # Workflow execution engine
├── callbacks.py   # Notification system (Slack, etc.)
├── config.py      # Configuration management
├── exceptions.py  # Custom exceptions
├── formatters.py  # Output formatting utilities
├── history.py     # Job execution history tracking
├── logging.py     # Centralized logging (Loguru)
├── template.py    # SLURM script template rendering
├── utils.py       # Utility functions
├── cli/           # Command-line interface (Typer)
├── monitor/       # Job and resource monitoring
├── ssh/           # SSH integration for remote SLURM
├── sync/          # rsync-based file synchronization
├── templates/     # SLURM script templates (Jinja2)
└── web/           # Web UI (FastAPI + React)
```
## Two Execution Paths
srunx has two independent paths for interacting with SLURM:
- **Local execution** (`srunx.client.Slurm`)
  Calls `sbatch` directly via `subprocess`. Used when the CLI runs on the same
  machine (or a login node) where SLURM is available.
- **Remote/SSH execution** (`srunx.ssh.core.client.SSHSlurmClient`)
  Connects to a remote SLURM server via Paramiko SSH and executes `sbatch`
  remotely. Supports ProxyJump for multi-hop connections.
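Whichever transport is used, submission boils down to running `sbatch` and reading the job ID back from its output. A minimal sketch of that shared shape (the helpers `parse_job_id` and `submit_local` are illustrative names, not srunx's actual API):

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def submit_local(script_path: str) -> int:
    """Local path: run sbatch via subprocess on this machine."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)

# The SSH path would instead run the same command over a Paramiko channel,
# e.g. stdin, stdout, stderr = ssh.exec_command(f"sbatch {path}"),
# then feed stdout.read().decode() to the same parse_job_id().
```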
## File Transfer Strategy
The SSH path uses two complementary file transfer mechanisms:
- **SFTP (Paramiko)** — for ephemeral single-file staging.
  `upload_file()` transfers a single script to `/tmp/srunx/` on the remote
  server; `_write_remote_file()` writes in-memory script content. Both use
  Paramiko's SFTP subsystem.
- **rsync (subprocess)** — for project directory synchronization.
  `sync_project()` uses the `RsyncClient` to sync an entire project directory
  to `~/.config/srunx/workspace/{repo_name}/` on the remote server. This
  enables scripts that import local modules to work correctly on the remote
  side. rsync provides delta transfers, exclude patterns, and works through
  ProxyJump via the `-e` flag.
## Web UI Architecture
The Web UI adds a third execution path: a browser-based interface that runs locally and connects to SLURM clusters via SSH.
```mermaid
flowchart LR
    subgraph local["Local Machine"]
        Browser["Browser :8000"]
        FastAPI["FastAPI\n(srunx-web)"]
        Adapter["SlurmSSHAdapter"]
        Browser --> FastAPI --> Adapter
    end
    subgraph remote["DGX / SLURM Server"]
        squeue["squeue / sinfo"]
        sbatch["sbatch / scancel"]
        sacct["sacct"]
    end
    Adapter -- SSH --> squeue
    Adapter -- SSH --> sbatch
    Adapter -- SSH --> sacct
```
Key design decisions:
- **REST polling over WebSocket**
  SSH adds latency that makes real-time WebSocket updates impractical. The
  frontend polls REST endpoints at configurable intervals (3–15 seconds)
  depending on the page and job state.
- **SlurmSSHAdapter**
  A thin adapter wrapping `SSHSlurmClient` that adds missing operations
  (`list_jobs`, `cancel_job`, `get_resources`), using
  `_execute_slurm_command()` for SLURM path resolution and environment setup.
  The adapter manages SSH reconnection and keep-alive.
- **No core modifications**
  The Web UI is implemented entirely in `src/srunx/web/` without modifying
  existing core modules. It accesses `SSHSlurmClient`'s private
  `_execute_slurm_command()` method for proper SLURM path handling.
- **Input validation**
  All user-supplied identifiers (user names, partition names, workflow
  filenames) are validated against a strict regex pattern before being
  interpolated into SSH commands, to prevent command injection.
- **Frontend architecture**
  React + TypeScript with Vite. Pages use a custom `useApi` hook for data
  fetching with automatic polling. React Flow provides DAG visualization for
  workflow dependencies.
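The input-validation decision above can be implemented as a small allow-list check applied before any interpolation. A sketch (the exact pattern and length limit srunx uses may differ):

```python
import re

# Conservative allow-list: letters, digits, dot, underscore, hyphen.
_IDENTIFIER_RE = re.compile(r"^[A-Za-z0-9._-]{1,64}$")


def validate_identifier(value: str) -> str:
    """Reject anything that could smuggle shell metacharacters into an SSH command."""
    if not _IDENTIFIER_RE.fullmatch(value):
        raise ValueError(f"invalid identifier: {value!r}")
    return value
```

An allow-list is preferable to escaping here: rejecting suspicious input outright is easier to reason about than trying to quote it safely for a remote shell.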
## DAG Builder Architecture
The DAG builder is an interactive workflow editor that lets users construct SLURM pipelines visually instead of writing YAML by hand.
- **ReactFlow canvas and state management**
  The builder page renders a ReactFlow canvas where each job is a custom node
  (`BuilderJobNode`) with source and target handles for edge connections. The
  `useWorkflowBuilder` hook encapsulates all builder state: a ReactFlow node
  array, an edge array, and a `Map<string, BuilderJob>` ref for O(1) job
  lookups. Node positions, edge connections, and job property edits all flow
  through this single hook, keeping the page component thin.
- **Job property panel**
  Clicking a node opens the `JobPropertyPanel` sidebar, which exposes every
  field from the `BuilderJob` type: name, command, resources (nodes, GPUs,
  memory, time limit, partition), environment (conda, venv, container, env
  vars), working directory, log directory, and retry settings. Changes
  propagate immediately to the ReactFlow node data, updating the canvas in
  real time.
- **Client-side validation**
  Before saving, `useWorkflowBuilder.validate()` checks four rules: every job
  must have a non-empty name, every job must have a non-empty command, job
  names must be unique, and the graph must be acyclic (detected via DFS with
  an explicit recursion-stack set). Validation errors are displayed in a
  banner above the canvas.
- **Serialization and persistence**
  `useWorkflowBuilder.serialize()` converts the ReactFlow graph into a
  `WorkflowCreateRequest` payload: job names, commands (split on whitespace),
  `depends_on` lists (derived from incoming edges, with optional dependency
  type prefixes like `afternotok:preprocess`), resources, and environment
  settings. The payload is POSTed to `/api/workflows/create`, which validates
  it again server-side using Pydantic models and `Workflow.validate()`,
  serializes it to YAML, and writes it to the workflow directory on disk.
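The acyclicity check described above (DFS with an explicit recursion-stack set) can be sketched in a few lines. The real implementation lives in the TypeScript frontend; this Python version shows the same algorithm:

```python
def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """DFS with a recursion-stack set; returns True if the graph has a cycle."""
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    visited: set[str] = set()
    on_stack: set[str] = set()  # nodes on the current DFS path

    def visit(node: str) -> bool:
        if node in on_stack:  # back edge -> cycle
            return True
        if node in visited:
            return False
        visited.add(node)
        on_stack.add(node)
        found = any(visit(child) for child in graph[node])
        on_stack.discard(node)
        return found

    return any(visit(node) for node in graph)
```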
```mermaid
flowchart TD
    subgraph frontend["Frontend (React)"]
        Canvas["ReactFlow Canvas"]
        Hook["useWorkflowBuilder\n(nodes, edges, jobMap)"]
        Panel["JobPropertyPanel"]
        Canvas -- node click --> Panel
        Panel -- updateJob --> Hook
        Canvas -- onConnect --> Hook
        Hook -- nodes/edges --> Canvas
    end
    subgraph save["Save Flow"]
        Validate["Client-side\nvalidation"]
        Serialize["serialize()\n→ WorkflowCreateRequest"]
        API["POST /api/workflows/create"]
        ServerValidate["Server-side validation\n(Pydantic + cycle detection)"]
        YAML["Serialize to YAML"]
        Disk["Write to disk"]
        Validate --> Serialize --> API --> ServerValidate --> YAML --> Disk
    end
    Hook -- Save Workflow --> Validate
```
- **Mount-based path resolution**
  The file browser bridges local development directories and remote SLURM
  paths. Mount points are stored in the SSH profile configuration and define a
  `local` path (on the developer's machine) and a `remote` path (on the SLURM
  cluster). When the user browses files, the backend reads the local
  filesystem under the mount root and returns entries with their computed
  remote paths. The frontend never sees local filesystem paths.

  The translation is straightforward: for a mount with
  `local=/home/user/project` and `remote=/scratch/user/project`, a file at
  `local + /src/train.py` maps to `remote + /src/train.py`. The rsync sync
  operation pushes local contents to the remote side so that selected paths
  are valid when the workflow executes.
- **Security model**
  The file browser enforces strict containment:
  - **Path traversal prevention** — the resolved path must be relative to the
    mount root. Attempts to escape via `../` return `403 Forbidden`.
  - **Symlink containment** — symlinks are resolved and checked against the
    mount boundary. Links pointing outside are marked `accessible: false` and
    cannot be followed in the browser.
  - **Local path isolation** — the API response includes only the remote
    prefix, mount name, and entry metadata. Local filesystem paths are never
    sent to the frontend.
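The path translation and traversal check can be sketched with `pathlib` (`to_remote_path` is an illustrative helper, not srunx's API):

```python
from pathlib import Path


def to_remote_path(local_root: str, remote_root: str, requested: str) -> str:
    """Map a browsed path under the mount's local root to its remote
    equivalent, rejecting any path that escapes the mount (e.g. via '../')."""
    root = Path(local_root).resolve()
    # resolve() collapses '..' components and follows symlinks, so an escaped
    # or symlinked-out path fails the relative_to() check below.
    candidate = (root / requested).resolve()
    try:
        relative = candidate.relative_to(root)
    except ValueError:
        # In the web backend this surfaces as 403 Forbidden.
        raise PermissionError(f"path escapes mount root: {requested!r}")
    return str(Path(remote_root) / relative)
```

Doing the containment check on the fully resolved path (rather than string-matching the raw input) is what makes both the `../` and the symlink cases fall out of the same comparison.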
## Workflow Execution Pipeline
When the user clicks Run Workflow, the backend orchestrates a multi-phase pipeline that bridges the local development environment and the remote SLURM cluster.
**Phase 1: Mount resolution and sync**

The backend inspects each job's `work_dir` and matches it against the mounts'
`remote` paths using longest-prefix matching. For example, if a job has
`work_dir=/scratch/user/ml-project/experiments` and a mount maps
`/scratch/user/ml-project` to `~/projects/ml-project`, that mount is
selected. If the workflow has a `default_project` field, that mount is
included as well. Each matched mount is synced via rsync before any jobs are
submitted, ensuring the remote cluster has the latest source files.
**Phase 2: Script rendering**

Each `Job` in the workflow is rendered through the `advanced.slurm.jinja`
template to produce a complete sbatch script. `ShellJob` instances use
their script content directly. Rendering happens in a temporary directory and
is purely CPU-bound (no network I/O).
**Phase 3: Topological submission**

Jobs are submitted in BFS topological order. For each job, the backend
constructs a `--dependency` flag from the SLURM job IDs of its parent jobs,
using the dependency type specified on each edge (`afterok`, `after`,
`afterany`, or `afternotok`). This delegates scheduling entirely to SLURM:
the backend does not wait between submissions.
**Phase 4: Background monitoring**

After all jobs are submitted, a background anyio task polls each job's
status via `sacct` every 10 seconds. When all jobs reach a terminal state
(`COMPLETED`, `FAILED`, `CANCELLED`, or `TIMEOUT`), the run is marked
accordingly. If the backend loses contact with SLURM for 30 consecutive
polling failures (~5 minutes), the run is marked as failed.
```mermaid
flowchart LR
    subgraph phase1["1. Sync"]
        Resolve["Resolve mounts\n(longest prefix)"]
        Rsync["rsync to remote"]
        Resolve --> Rsync
    end
    subgraph phase2["2. Render"]
        Template["Jinja2 template\n→ sbatch script"]
    end
    subgraph phase3["3. Submit"]
        Topo["BFS topological\nordering"]
        Sbatch["sbatch\n--dependency=afterok:ID"]
        Topo --> Sbatch
    end
    subgraph phase4["4. Monitor"]
        Poll["Poll sacct\nevery 10s"]
        Terminal["All terminal?\n→ mark run"]
        Poll --> Terminal
    end
    phase1 --> phase2 --> phase3 --> phase4
```
The entire pipeline is tracked by an in-memory `RunRegistry` that stores the
run ID, status, per-job SLURM IDs, and per-job statuses. The frontend polls
`GET /api/workflows/runs/{run_id}` to reflect live progress in the DAG view.
## Settings UI Architecture
The Settings page is a tab-based configuration interface that exposes the full srunx config surface through the Web UI.
- **Stateless per-request reads**
  The `ConfigManager` is instantiated fresh on every API request, reading
  config from disk each time. This avoids stale in-memory state when the user
  modifies config files outside the Web UI (e.g. via the CLI). The tradeoff is
  slightly higher I/O per request, but SSH profile configs are small JSON
  files and the overhead is negligible.
- **Mount-based project model**
  Projects are not derived from the current working directory (which is
  meaningless for a web server). Instead, each SSH profile mount defines a
  project: the mount's `local` directory is scanned for `.srunx.json`. This
  makes project configuration remote-friendly — the same mount mapping used
  for file browsing and rsync sync also drives per-project settings.
- **Read-only environment variables**
  The `/api/config/env` endpoint exposes active `SRUNX_*` variables for
  inspection but does not allow modification. Environment variables are set at
  server startup and cannot be safely mutated at runtime. The UI makes this
  explicit by rendering the tab as read-only.
## File Explorer Architecture
The file explorer is a VS Code-style tree panel integrated into the Web UI sidebar for browsing project files and submitting scripts to SLURM.
- **Lazy directory loading**
  The explorer does not fetch the entire file tree upfront. Each directory is
  loaded on demand when the user clicks to expand it, calling
  `GET /api/files/browse?mount={name}&path={relative}` per expansion. This
  keeps the initial load fast and avoids transferring large directory trees.
- **Mount-scoped state isolation**
  Each mount maintains its own expanded-directory state in the frontend.
  Internally, paths are tracked as `{mountName}:{fullPath}` to prevent
  collisions between mounts that may have similarly named subdirectories.
  Syncing, refreshing, and expanding are all per-mount operations.
- **Right-click submission flow**
  The context menu identifies submittable files by extension (`.sh`, `.slurm`,
  `.sbatch`, `.bash`). On submit, the frontend reads the script content via
  `GET /api/files/read` and posts it to `POST /api/jobs` with the script body
  and a user-editable job name. This two-step flow lets the user preview the
  script before submission.
- **Security model**
  The file explorer enforces the same containment rules as the DAG builder
  file browser:
  - **Path traversal prevention** — resolved paths must stay within the mount
    root. Attempts to escape via `../` return `403 Forbidden`.
  - **Symlink containment** — symlinks are resolved and checked against the
    mount boundary. Links outside are marked `accessible: false`.
  - **Local path isolation** — the browse and read APIs include only remote
    prefixes and entry metadata. Local filesystem paths are not sent to the
    frontend. (The config/project management APIs intentionally expose local
    paths for administration purposes.)
  - **File size limit** — `GET /api/files/read` rejects files over 1 MB.
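The extension gating and size limit are simple predicates; a sketch mirroring the values stated above (helper names are illustrative):

```python
from pathlib import Path

SUBMITTABLE_EXTENSIONS = {".sh", ".slurm", ".sbatch", ".bash"}
MAX_READ_BYTES = 1024 * 1024  # 1 MB cap on GET /api/files/read


def is_submittable(filename: str) -> bool:
    """True if the context menu should offer 'Submit to SLURM' for this file."""
    return Path(filename).suffix.lower() in SUBMITTABLE_EXTENSIONS


def check_read_size(size_bytes: int) -> None:
    """Reject oversized files before streaming their content to the frontend."""
    if size_bytes > MAX_READ_BYTES:
        raise ValueError(
            f"file too large: {size_bytes} bytes (limit {MAX_READ_BYTES})"
        )
```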
## Configuration Hierarchy
Configuration is loaded in order of precedence (lowest to highest):
1. System-wide: `/etc/srunx/config.json`
2. User-wide: `~/.config/srunx/config.json`
3. Project-wide: `.srunx.json` or `srunx.json` in the working directory
4. Environment variables: `SRUNX_DEFAULT_*`
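Applied in order, each layer overrides keys set by the one before it. A sketch of the merge (how `SRUNX_DEFAULT_*` variable names map onto config keys is an assumption in this illustration):

```python
import json
import os
from pathlib import Path


def load_effective_config(project_dir: str = ".") -> dict:
    """Merge config layers lowest-to-highest precedence; later layers win."""
    merged: dict = {}
    layers = [
        Path("/etc/srunx/config.json"),             # system-wide
        Path.home() / ".config/srunx/config.json",  # user-wide
        Path(project_dir) / ".srunx.json",          # project-wide
        Path(project_dir) / "srunx.json",
    ]
    for path in layers:
        if path.exists():
            merged.update(json.loads(path.read_text()))
    # Highest precedence: environment variables. Mapping SRUNX_DEFAULT_FOO
    # onto the config key "foo" is assumed here for illustration.
    for key, value in os.environ.items():
        if key.startswith("SRUNX_DEFAULT_"):
            merged[key.removeprefix("SRUNX_DEFAULT_").lower()] = value
    return merged
```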