# MCP Tool Reference

Complete reference for all 14 tools exposed by the srunx MCP server.
The server is started with `uvx --from 'srunx[mcp]' srunx-mcp` (or the
plain `srunx-mcp` binary after `uv tool install --with 'mcp[cli]' srunx`)
and communicates over stdio using the Model Context Protocol. See
MCP Setup for registration details.

All tools return a JSON object with a `success` boolean. On success,
additional fields carry the result data. On failure, an `error` string
describes what went wrong.
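For example, a failed call returns the shared envelope with the error message inline (the message below is an illustrative instance of a documented run_workflow error):

{
  "success": false,
  "error": "Mount 'data' not found in profile 'dgx'"
}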
## Job Management

### submit_job
Submit a SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `command` | `str` | Yes | | Shell command to execute (e.g. `"python train.py --epochs 100"`) |
| `name` | `str` | No | `"job"` | Job name for identification in the SLURM queue |
| `nodes` | `int` | No | `1` | Number of compute nodes to allocate |
| `gpus_per_node` | `int` | No | `0` | Number of GPUs per node (0 for CPU-only) |
| `ntasks_per_node` | `int` | No | `1` | Number of tasks per node |
| `cpus_per_task` | `int` | No | `1` | Number of CPUs per task |
| `memory_per_node` | `str \| null` | No | `null` | Memory per node (e.g. `"32GB"`, `"64G"`) |
| `time_limit` | `str \| null` | No | `null` | Wall time limit (e.g. `"4:00:00"`, `"1-00:00:00"`) |
| `partition` | `str \| null` | No | `null` | SLURM partition name (e.g. `"gpu"`, `"cpu"`) |
| `nodelist` | `str \| null` | No | `null` | Specific nodes to use (e.g. `"node001,node002"`) |
| `conda` | `str \| null` | No | `null` | Conda environment name to activate before running |
| `venv` | `str \| null` | No | `null` | Path to Python virtual environment to activate |
| `env_vars` | `dict \| null` | No | `null` | Additional environment variables as key-value pairs |
| `log_dir` | `str` | No | `"logs"` | Directory for stdout/stderr log files |
| `work_dir` | `str \| null` | No | `null` | Working directory for the job (defaults to cwd; required when `use_ssh=true`) |
| `use_ssh` | `bool` | No | `false` | Submit via SSH to remote SLURM cluster |
Return value:
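The full payload is not reproduced in this reference; a minimal sketch consistent with the shared `success` contract and the job fields used by the other job tools (the exact field set is an assumption):

{
  "success": true,
  "job_id": "12345",
  "name": "training",
  "status": "PENDING"
}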
Example:
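A representative arguments object for a GPU training job (values are illustrative, reusing examples from elsewhere in this reference):

{
  "command": "python train.py --epochs 100",
  "name": "training",
  "nodes": 1,
  "gpus_per_node": 2,
  "memory_per_node": "32GB",
  "time_limit": "4:00:00",
  "partition": "gpu",
  "conda": "ml_env"
}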
### list_jobs

List the current user's SLURM jobs in the queue.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `use_ssh` | `bool` | No | `false` | Query jobs via SSH on remote cluster |
Return value:
{
"success": true,
"jobs": [
{
"name": "training",
"job_id": "12345",
"status": "RUNNING",
"partition": "gpu",
"nodes": "1"
}
],
"count": 1
}
### get_job_status
Get the status of a specific SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `job_id` | `str` | Yes | | SLURM job ID to check (numeric, e.g. `"12345"` or `"12345_1"`) |
| `use_ssh` | `bool` | No | `false` | Query via SSH on remote cluster |
Return value:
{
"success": true,
"job_id": "12345",
"name": "training",
"status": "RUNNING",
"partition": "gpu",
"nodes": "1"
}
### cancel_job
Cancel a running or pending SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `job_id` | `str` | Yes | | SLURM job ID to cancel |
| `use_ssh` | `bool` | No | `false` | Cancel via SSH on remote cluster |
Return value:
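The full payload is not reproduced in this reference; a minimal sketch, assuming the tool echoes the cancelled `job_id` alongside the shared `success` flag:

{
  "success": true,
  "job_id": "12345"
}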
### get_job_logs
Get stdout/stderr logs for a SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `job_id` | `str` | Yes | | SLURM job ID |
| `job_name` | `str \| null` | No | `null` | Job name to help locate log files |
| `use_ssh` | `bool` | No | `false` | Fetch logs via SSH from remote cluster |
Return value:
{
"success": true,
"job_id": "12345",
"stdout": "Epoch 1/10: loss=0.45 ...",
"stderr": "",
"log_files": ["logs/training-12345.out"]
}
Note
The `log_files` field is only present for local (non-SSH) queries.
## Resources

### get_resources
Get current GPU and node resource availability on the SLURM cluster.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `partition` | `str \| null` | No | `null` | Specific partition to check (`null` for all partitions) |
| `use_ssh` | `bool` | No | `false` | Query resources via SSH on remote cluster |
Return value (local):
{
"success": true,
"partition": "gpu",
"total_gpus": 32,
"gpus_in_use": 24,
"gpus_available": 8,
"gpu_utilization": 0.75,
"jobs_running": 12,
"nodes_total": 8,
"nodes_idle": 2,
"nodes_down": 0
}
Return value (SSH):
When using SSH mode, the return includes a `raw_output` field with the raw
`sinfo` output instead of parsed metrics:
{
"success": true,
"partition": "gpu",
"raw_output": "node001 gpu:4 idle gpu*\nnode002 gpu:4 mixed gpu*"
}
## Workflows

### create_workflow

Create a SLURM workflow YAML file. Generates a workflow definition that can
be executed with `run_workflow`.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | `str` | Yes | | Workflow name for identification |
| `jobs` | `list[dict]` | Yes | | List of job definitions (see schema below) |
| `output_path` | `str` | Yes | | File path to write the YAML (e.g. `"workflow.yaml"`) |
| `args` | `dict \| null` | No | `null` | Template variables for Jinja2 templating in job definitions |
| `default_project` | `str \| null` | No | `null` | Default SSH mount name for file syncing |
Job definition schema:
Each entry in the `jobs` list is a dict with these fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | `str` | Yes | Job identifier (must be unique within the workflow) |
| `command` | `str \| list[str]` | Yes* | Command to execute (*required for regular jobs) |
| `script_path` | `str` | Yes* | Path to shell script (*required for shell jobs; mutually exclusive with `command`) |
| `depends_on` | `list[str]` | No | Job names this job depends on. Supports dependency types: `"afterok:preprocess"`, `"after:job_a"`, `"afterany:job_a"`, `"afternotok:job_a"` |
| `retry` | `int` | No | Number of retry attempts on failure (default 0) |
| `retry_delay` | `int` | No | Seconds between retries (default 60) |
| `resources` | `dict` | No | Resource allocation: `nodes`, `gpus_per_node`, `ntasks_per_node`, `cpus_per_task`, `memory_per_node`, `time_limit`, `partition`, `nodelist` |
| `environment` | `dict` | No | Environment setup: `conda`, `venv`, `env_vars`, `container` |
| `log_dir` | `str` | No | Log directory path |
| `work_dir` | `str` | No | Working directory path |
Return value:
{
"success": true,
"path": "/absolute/path/to/workflow.yaml",
"name": "ml_pipeline",
"job_count": 3,
"job_names": ["preprocess", "train", "evaluate"],
"message": "Workflow 'ml_pipeline' created at workflow.yaml"
}
### validate_workflow

Validate a workflow YAML file for correctness. Checks YAML syntax, job structure, and dependency resolution, and detects circular dependencies.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `yaml_path` | `str` | Yes | | Path to the YAML workflow file to validate |
Return value:
{
"success": true,
"name": "ml_pipeline",
"valid": true,
"job_count": 3,
"jobs": [
{"name": "preprocess", "depends_on": [], "command": "python preprocess.py"},
{"name": "train", "depends_on": ["preprocess"], "command": "python train.py"},
{"name": "evaluate", "depends_on": ["train"], "command": "python evaluate.py"}
]
}
### run_workflow

Execute a SLURM workflow from a YAML file. Jobs are executed in dependency order: independent jobs run in parallel, dependent jobs wait for their prerequisites to complete.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `yaml_path` | `str` | Yes | | Path to the YAML workflow file |
| `from_job` | `str \| null` | No | `null` | Start execution from this job (skip earlier jobs) |
| `to_job` | `str \| null` | No | `null` | Stop execution at this job (skip later jobs) |
| `single_job` | `str \| null` | No | `null` | Execute only this specific job, ignoring dependencies |
| `dry_run` | `bool` | No | `false` | Show what would be executed without running |
| `args` | `dict \| null` | No | `null` | Mapping merged over the YAML `args` section before Jinja rendering. `python:`-prefixed values are rejected |
| `sweep` | `dict \| null` | No | `null` | Sweep spec (see schema below). When present, the workflow is routed through `SweepOrchestrator` and the return shape changes |
| `mount` | `str \| null` | No | `null` | Active-SSH-profile mount name. When set, the run is routed through the configured cluster adapter with mount-aware path translation for `work_dir` / `log_dir`. Default (`null`) keeps execution on the local SLURM client |
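A representative invocation that previews the tail of a pipeline with a template override (the `epochs` key is a made-up template variable):

{
  "yaml_path": "workflow.yaml",
  "from_job": "train",
  "dry_run": true,
  "args": {"epochs": 50}
}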
`sweep` schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `matrix` | `dict[str, list[scalar]]` | Yes | | Axis name → list of scalar values (`str`, `int`, `float`, `bool`). Nested structures are rejected |
| `fail_fast` | `bool` | No | `false` | Stop launching new cells on first failure (peers already launched continue to terminal state) |
| `max_parallel` | `int` | No | `4` | Maximum concurrent cells. Required in YAML; MCP defaults to 4 |
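An illustrative `sweep` spec producing a 3x3 grid of nine cells, matching the `cell_count` in the sweep return below (axis names and values are assumptions):

{
  "matrix": {
    "lr": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64]
  },
  "fail_fast": true,
  "max_parallel": 4
}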
Return value (normal execution):
{
"success": true,
"workflow": "ml_pipeline",
"results": {
"preprocess": {"job_id": "12345", "status": "COMPLETED"},
"train": {"job_id": "12346", "status": "COMPLETED"},
"evaluate": {"job_id": "12347", "status": "COMPLETED"}
},
"all_completed": true
}
Return value (sweep execution):
When `sweep` is provided, the tool blocks until every cell reaches a
terminal state and returns the aggregate counters instead of the
per-job `results` map:
{
"success": true,
"sweep_run_id": 42,
"status": "completed",
"cell_count": 9,
"cells_completed": 8,
"cells_failed": 1,
"cells_cancelled": 0
}
Return value (dry run):
{
"success": true,
"dry_run": true,
"workflow": "ml_pipeline",
"jobs_to_execute": [
{"name": "preprocess", "depends_on": [], "command": "python preprocess.py"},
{"name": "train", "depends_on": ["preprocess"], "command": "python train.py"}
],
"count": 2
}
Note
`sweep` and `dry_run` take distinct code paths: `dry_run` previews a
non-sweep workflow, while `sweep` always executes. The sweep's parent
`workflow_runs` rows record `submission_source="mcp"` (automatically
stamped), and cells record `triggered_by="mcp"` after the V4 migration
widens the `workflow_runs.triggered_by` CHECK allowlist.
Error responses:

Errors are returned as `{"success": false, "error": "..."}`:

- Mount not present in the active SSH profile → `"Mount '<name>' not found in profile '<profile_name>'"`
- No current SSH profile selected but `mount=` was passed → `"mount requires a current SSH profile; configure one via srunx ssh profile add and select it with srunx ssh profile use"`
- A `ShellJob`'s `script_path` resolves outside every mount's `local` root → `"Script path '<path>' is outside allowed directories"`
- Any `python:` prefix in `args` or `sweep.matrix` → `"<source> at '<path>' contains 'python:' prefix which is not allowed: '<value>'"`
- Invalid sweep spec (non-dict matrix, bad types) → `"invalid sweep spec: <detail>"` or `"sweep.matrix must be a mapping"`
Constraints:

- MCP sweeps are blocking: the tool call returns only after every cell reaches a terminal state. This differs from the Web UI's `POST /api/workflows/{name}/run`, which returns `202` immediately and exposes progress through `GET /api/sweep_runs/{id}`.
- `max_parallel` is required in YAML and defaults to `4` when the MCP caller omits it.
- When `mount=` is combined with `sweep`, every cell is submitted through a shared `SlurmSSHExecutorPool` (`size=min(max_parallel, 8)`) so concurrent cells reuse a small set of SSH sessions against the cluster. The pool is closed when the tool call returns.
- Matrix values must be scalar (`str`, `int`, `float`, `bool`); nested lists or dicts are rejected at load time.
- The total cell count is capped at 1000 as a safety valve.
### list_workflows

List workflow YAML files in a directory. Scans for YAML files that contain a
valid srunx workflow structure (must have `name` and `jobs` keys). Skips
hidden directories, `node_modules`, `.venv`, and `__pycache__`.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
directory |
str | No | "." |
Directory to search for workflow files |
Return value:
{
"success": true,
"workflows": [
{
"path": "/home/user/project/workflows/ml_pipeline.yaml",
"name": "ml_pipeline",
"job_count": 3,
"job_names": ["preprocess", "train", "evaluate"]
}
],
"count": 1
}
### get_workflow
Read and parse a workflow YAML file, returning its full structure including resource and environment configuration for each job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `yaml_path` | `str` | Yes | | Path to the YAML workflow file |
Return value:
{
"success": true,
"name": "ml_pipeline",
"args": null,
"default_project": null,
"jobs": [
{
"name": "train",
"depends_on": ["preprocess"],
"retry": 0,
"retry_delay": 60,
"command": "python train.py",
"resources": {
"nodes": 1,
"gpus_per_node": 2,
"ntasks_per_node": 1,
"cpus_per_task": 1,
"memory_per_node": "32GB",
"time_limit": "8:00:00",
"partition": null,
"nodelist": null
},
"environment": {
"conda": "ml_env",
"venv": null,
"env_vars": {}
}
}
],
"raw_yaml": "name: ml_pipeline\njobs:\n ..."
}
## File Sync

### sync_files

Sync files between the local machine and a remote SLURM cluster using `rsync`. Supports two modes: mount-based (using a named mount from the SSH profile) or path-based (using explicit local and remote paths).
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `profile_name` | `str \| null` | No | `null` | SSH profile name (uses current profile if not specified) |
| `mount_name` | `str \| null` | No | `null` | Mount point name from the SSH profile to sync |
| `local_path` | `str \| null` | No | `null` | Local directory path (alternative to `mount_name`) |
| `remote_path` | `str \| null` | No | `null` | Remote directory path (alternative to `mount_name`) |
| `dry_run` | `bool` | No | `false` | Show what would be transferred without actually syncing |
Note
You must provide either `mount_name` or `local_path`. When using
`mount_name`, the local and remote paths are read from the SSH profile
configuration. When using `local_path` without `remote_path`, a
default remote path is derived.
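A representative mount-based call matching the return example below; passing `dry_run: true` first is a cheap way to preview the transfer:

{
  "profile_name": "myserver",
  "mount_name": "ml-project",
  "dry_run": true
}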
Return value:
{
"success": true,
"profile": "myserver",
"mount": "ml-project",
"local": "/home/user/projects/ml-project",
"remote": "/home/researcher/projects/ml-project",
"dry_run": false,
"output": "sending incremental file list\nsrc/train.py\n..."
}
## Configuration

### get_config
Get the current srunx configuration including resource defaults and environment settings.
Parameters: None.
Return value:
{
"success": true,
"resources": {
"nodes": 1,
"gpus_per_node": 0,
"ntasks_per_node": 1,
"cpus_per_task": 1,
"memory_per_node": null,
"time_limit": null,
"partition": null,
"nodelist": null
},
"environment": {
"conda": null,
"venv": null,
"env_vars": {}
},
"log_dir": "logs",
"work_dir": null
}
### list_ssh_profiles
List all configured SSH connection profiles for remote SLURM clusters. Shows profile names, hostnames, and configured mount points.
Parameters: None.
Return value:
{
"success": true,
"profiles": [
{
"name": "dgx",
"hostname": "dgx.example.com",
"username": "researcher",
"port": 22,
"description": "Main DGX cluster",
"is_current": true,
"mounts": [
{
"name": "ml-project",
"local": "/home/user/projects/ml-project",
"remote": "/home/researcher/projects/ml-project"
}
]
}
],
"current": "dgx",
"count": 1
}