# MCP Tool Reference

Complete reference for all 14 tools exposed by the srunx MCP server.
The server is started with `uvx --from 'srunx[mcp]' srunx-mcp` (or the
plain `srunx-mcp` binary after `uv tool install --with 'mcp[cli]' srunx`)
and communicates over stdio using the Model Context Protocol. See
MCP Setup for registration details.

All tools return a JSON object with a `success` boolean. On success,
additional fields carry the result data. On failure, an `error` string
describes what went wrong.
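For example, a failed call returns the shared envelope with the error message inline (the message below is an illustrative instance of a documented run_workflow error):

{
  "success": false,
  "error": "Mount 'data' not found in profile 'dgx'"
}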
## Job Management

### submit_job
Submit a SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `command` | `str` | Yes | | Shell command to execute (e.g. `"python train.py --epochs 100"`) |
| `name` | `str` | No | `"job"` | Job name for identification in the SLURM queue |
| `nodes` | `int` | No | `1` | Number of compute nodes to allocate |
| `gpus_per_node` | `int` | No | `0` | Number of GPUs per node (0 for CPU-only) |
| `ntasks_per_node` | `int` | No | `1` | Number of tasks per node |
| `cpus_per_task` | `int` | No | `1` | Number of CPUs per task |
| `memory_per_node` | `str \| null` | No | `null` | Memory per node (e.g. `"32GB"`, `"64G"`) |
| `time_limit` | `str \| null` | No | `null` | Wall time limit (e.g. `"4:00:00"`, `"1-00:00:00"`) |
| `partition` | `str \| null` | No | `null` | SLURM partition name (e.g. `"gpu"`, `"cpu"`) |
| `nodelist` | `str \| null` | No | `null` | Specific nodes to use (e.g. `"node001,node002"`) |
| `conda` | `str \| null` | No | `null` | Conda environment name to activate before running |
| `venv` | `str \| null` | No | `null` | Path to Python virtual environment to activate |
| `env_vars` | `dict \| null` | No | `null` | Additional environment variables as key-value pairs |
| `log_dir` | `str` | No | `"logs"` | Directory for stdout/stderr log files |
| `work_dir` | `str \| null` | No | `null` | Working directory for the job (defaults to cwd; required when `use_ssh=true`) |
| `use_ssh` | `bool` | No | `false` | Submit via SSH to remote SLURM cluster |
Return value:
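The full payload is not reproduced in this reference; a minimal sketch consistent with the shared `success` contract and the job fields used by the other job tools (the exact field set is an assumption):

{
  "success": true,
  "job_id": "12345",
  "name": "training",
  "status": "PENDING"
}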
Example:
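A representative arguments object for a GPU training job (values are illustrative, reusing examples from elsewhere in this reference):

{
  "command": "python train.py --epochs 100",
  "name": "training",
  "nodes": 1,
  "gpus_per_node": 2,
  "memory_per_node": "32GB",
  "time_limit": "4:00:00",
  "partition": "gpu",
  "conda": "ml_env"
}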
### list_jobs

List the current user's SLURM jobs in the queue.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `use_ssh` | `bool` | No | `false` | Query jobs via SSH on remote cluster |
Return value:
{
"success": true,
"jobs": [
{
"name": "training",
"job_id": "12345",
"status": "RUNNING",
"partition": "gpu",
"nodes": "1"
}
],
"count": 1
}
### get_job_status
Get the status of a specific SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `job_id` | `str` | Yes | | SLURM job ID to check (numeric, e.g. `"12345"` or `"12345_1"`) |
| `use_ssh` | `bool` | No | `false` | Query via SSH on remote cluster |
Return value:
{
"success": true,
"job_id": "12345",
"name": "training",
"status": "RUNNING",
"partition": "gpu",
"nodes": "1"
}
### cancel_job
Cancel a running or pending SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `job_id` | `str` | Yes | | SLURM job ID to cancel |
| `use_ssh` | `bool` | No | `false` | Cancel via SSH on remote cluster |
Return value:
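The full payload is not reproduced in this reference; a minimal sketch, assuming the tool echoes the cancelled `job_id` alongside the shared `success` flag:

{
  "success": true,
  "job_id": "12345"
}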
### get_job_logs
Get stdout/stderr logs for a SLURM job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `job_id` | `str` | Yes | | SLURM job ID |
| `job_name` | `str \| null` | No | `null` | Job name to help locate log files |
| `use_ssh` | `bool` | No | `false` | Fetch logs via SSH from remote cluster |
Return value:
{
"success": true,
"job_id": "12345",
"stdout": "Epoch 1/10: loss=0.45 ...",
"stderr": "",
"log_files": ["logs/training-12345.out"]
}
Note
The `log_files` field is only present for local (non-SSH) queries.
## Resources

### get_resources
Get current GPU and node resource availability on the SLURM cluster.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `partition` | `str \| null` | No | `null` | Specific partition to check (`null` for all partitions) |
| `use_ssh` | `bool` | No | `false` | Query resources via SSH on remote cluster |
Return value (local):
{
"success": true,
"partition": "gpu",
"total_gpus": 32,
"gpus_in_use": 24,
"gpus_available": 8,
"gpu_utilization": 0.75,
"jobs_running": 12,
"nodes_total": 8,
"nodes_idle": 2,
"nodes_down": 0
}
Return value (SSH):
When using SSH mode, the return includes a `raw_output` field with the raw
`sinfo` output instead of parsed metrics:
{
"success": true,
"partition": "gpu",
"raw_output": "node001 gpu:4 idle gpu*\nnode002 gpu:4 mixed gpu*"
}
## Workflows

### create_workflow

Create a SLURM workflow YAML file. Generates a workflow definition that can
be executed with `run_workflow`.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | `str` | Yes | | Workflow name for identification |
| `jobs` | `list[dict]` | Yes | | List of job definitions (see schema below) |
| `output_path` | `str` | Yes | | File path to write the YAML (e.g. `"workflow.yaml"`) |
| `args` | `dict \| null` | No | `null` | Template variables for Jinja2 templating in job definitions |
| `default_project` | `str \| null` | No | `null` | Default SSH mount name for file syncing |
Job definition schema:
Each entry in the `jobs` list is a dict with these fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | `str` | Yes | Job identifier (must be unique within the workflow) |
| `command` | `str \| list[str]` | Yes* | Command to execute (*required for regular jobs) |
| `script_path` | `str` | Yes* | Path to shell script (*required for shell jobs; mutually exclusive with `command`) |
| `depends_on` | `list[str]` | No | Job names this job depends on. Supports dependency types: `"afterok:preprocess"`, `"after:job_a"`, `"afterany:job_a"`, `"afternotok:job_a"` |
| `retry` | `int` | No | Number of retry attempts on failure (default 0) |
| `retry_delay` | `int` | No | Seconds between retries (default 60) |
| `resources` | `dict` | No | Resource allocation: `nodes`, `gpus_per_node`, `ntasks_per_node`, `cpus_per_task`, `memory_per_node`, `time_limit`, `partition`, `nodelist` |
| `environment` | `dict` | No | Environment setup: `conda`, `venv`, `env_vars`, `container` |
| `log_dir` | `str` | No | Log directory path |
| `work_dir` | `str` | No | Working directory path |
Return value:
{
"success": true,
"path": "/absolute/path/to/workflow.yaml",
"name": "ml_pipeline",
"job_count": 3,
"job_names": ["preprocess", "train", "evaluate"],
"message": "Workflow 'ml_pipeline' created at workflow.yaml"
}
### validate_workflow

Validate a workflow YAML file for correctness. Checks YAML syntax, job structure, and dependency resolution, and detects circular dependencies.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `yaml_path` | `str` | Yes | | Path to the YAML workflow file to validate |
Return value:
{
"success": true,
"name": "ml_pipeline",
"valid": true,
"job_count": 3,
"jobs": [
{"name": "preprocess", "depends_on": [], "command": "python preprocess.py"},
{"name": "train", "depends_on": ["preprocess"], "command": "python train.py"},
{"name": "evaluate", "depends_on": ["train"], "command": "python evaluate.py"}
]
}
### run_workflow

Execute a SLURM workflow from a YAML file. Jobs are executed in dependency order: independent jobs run in parallel, dependent jobs wait for their prerequisites to complete.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `yaml_path` | `str` | Yes | | Path to the YAML workflow file |
| `from_job` | `str \| null` | No | `null` | Start execution from this job (skip earlier jobs) |
| `to_job` | `str \| null` | No | `null` | Stop execution at this job (skip later jobs) |
| `single_job` | `str \| null` | No | `null` | Execute only this specific job, ignoring dependencies |
| `dry_run` | `bool` | No | `false` | Show what would be executed without running |
| `args` | `dict \| null` | No | `null` | Mapping merged over the YAML `args` section before Jinja rendering. `python:`-prefixed values are rejected |
| `sweep` | `dict \| null` | No | `null` | Sweep spec (see schema below). When present, the workflow is routed through `SweepOrchestrator` and the return shape changes |
| `mount` | `str \| null` | No | `null` | Active-SSH-profile mount name. When set, the run is routed through the configured cluster adapter with mount-aware path translation for `work_dir` / `log_dir`. Default (`null`) keeps execution on the local SLURM client |
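A representative invocation that previews the tail of a pipeline with a template override (the `epochs` key is a made-up template variable):

{
  "yaml_path": "workflow.yaml",
  "from_job": "train",
  "dry_run": true,
  "args": {"epochs": 50}
}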
`sweep` schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `matrix` | `dict[str, list[scalar]]` | Yes | | Axis name → list of scalar values (`str`, `int`, `float`, `bool`). Nested structures are rejected |
| `fail_fast` | `bool` | No | `false` | Stop launching new cells on first failure (peers already launched continue to terminal state) |
| `max_parallel` | `int` | No | `4` | Maximum concurrent cells. Required in YAML; MCP defaults to 4 |
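An illustrative `sweep` spec producing a 3x3 grid of nine cells, matching the `cell_count` in the sweep return below (axis names and values are assumptions):

{
  "matrix": {
    "lr": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64]
  },
  "fail_fast": true,
  "max_parallel": 4
}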
Return value (normal execution):
{
"success": true,
"workflow": "ml_pipeline",
"results": {
"preprocess": {"job_id": "12345", "status": "COMPLETED"},
"train": {"job_id": "12346", "status": "COMPLETED"},
"evaluate": {"job_id": "12347", "status": "COMPLETED"}
},
"all_completed": true
}
Return value (sweep execution):
When `sweep` is provided, the tool blocks until every cell reaches a
terminal state and returns the aggregate counters instead of the
per-job `results` map:
{
"success": true,
"sweep_run_id": 42,
"status": "completed",
"cell_count": 9,
"cells_completed": 8,
"cells_failed": 1,
"cells_cancelled": 0
}
Return value (dry run):
{
"success": true,
"dry_run": true,
"workflow": "ml_pipeline",
"jobs_to_execute": [
{"name": "preprocess", "depends_on": [], "command": "python preprocess.py"},
{"name": "train", "depends_on": ["preprocess"], "command": "python train.py"}
],
"count": 2
}
Note
`sweep` and `dry_run` take distinct code paths: `dry_run` previews a
non-sweep workflow, while `sweep` always executes. The sweep's parent
`workflow_runs` rows record `submission_source="mcp"` (automatically
stamped), and cells record `triggered_by="mcp"` after the V4 migration
widens the `workflow_runs.triggered_by` CHECK allowlist.
Error responses:

Errors are returned as `{"success": false, "error": "..."}`:

- Mount not present in the active SSH profile → `"Mount '<name>' not found in profile '<profile_name>'"`
- No current SSH profile selected but `mount=` was passed → `"mount requires a current SSH profile; configure one via srunx ssh profile add and select it with srunx ssh profile use"`
- A `ShellJob`'s `script_path` resolves outside every mount's `local` root → `"Script path '<path>' is outside allowed directories"`
- Any `python:` prefix in `args` or `sweep.matrix` → `"<source> at '<path>' contains 'python:' prefix which is not allowed: '<value>'"`
- Invalid sweep spec (non-dict matrix, bad types) → `"invalid sweep spec: <detail>"` or `"sweep.matrix must be a mapping"`
Constraints:

- MCP sweeps are blocking: the tool call returns only after every cell reaches a terminal state. This differs from the Web UI's `POST /api/workflows/{name}/run`, which returns `202` immediately and exposes progress through `GET /api/sweep_runs/{id}`.
- `max_parallel` is required in YAML and defaults to `4` when the MCP caller omits it.
- When `mount=` is combined with `sweep`, every cell is submitted through a shared `SlurmSSHExecutorPool` (`size=min(max_parallel, 8)`) so concurrent cells reuse a small set of SSH sessions against the cluster. The pool is closed when the tool call returns.
- Matrix values must be scalar (`str`, `int`, `float`, `bool`); nested lists or dicts are rejected at load time.
- The total cell count is capped at 1000 as a safety valve.
### list_workflows

List workflow YAML files in a directory. Scans for YAML files that contain a
valid srunx workflow structure (must have `name` and `jobs` keys). Skips
hidden directories, `node_modules`, `.venv`, and `__pycache__`.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
directory |
str | No | "." |
Directory to search for workflow files |
Return value:
{
"success": true,
"workflows": [
{
"path": "/home/user/project/workflows/ml_pipeline.yaml",
"name": "ml_pipeline",
"job_count": 3,
"job_names": ["preprocess", "train", "evaluate"]
}
],
"count": 1
}
### get_workflow
Read and parse a workflow YAML file, returning its full structure including resource and environment configuration for each job.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `yaml_path` | `str` | Yes | | Path to the YAML workflow file |
Return value:
{
"success": true,
"name": "ml_pipeline",
"args": null,
"default_project": null,
"jobs": [
{
"name": "train",
"depends_on": ["preprocess"],
"retry": 0,
"retry_delay": 60,
"command": "python train.py",
"resources": {
"nodes": 1,
"gpus_per_node": 2,
"ntasks_per_node": 1,
"cpus_per_task": 1,
"memory_per_node": "32GB",
"time_limit": "8:00:00",
"partition": null,
"nodelist": null
},
"environment": {
"conda": "ml_env",
"venv": null,
"env_vars": {}
}
}
],
"raw_yaml": "name: ml_pipeline\njobs:\n ..."
}
## File Sync

### sync_files

Sync files between the local machine and a remote SLURM cluster using `rsync`. Supports two modes: mount-based (using a named mount from the SSH profile) or path-based (using explicit local and remote paths).
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `profile_name` | `str \| null` | No | `null` | SSH profile name (uses current profile if not specified) |
| `mount_name` | `str \| null` | No | `null` | Mount point name from the SSH profile to sync |
| `local_path` | `str \| null` | No | `null` | Local directory path (alternative to `mount_name`) |
| `remote_path` | `str \| null` | No | `null` | Remote directory path (alternative to `mount_name`) |
| `dry_run` | `bool` | No | `false` | Show what would be transferred without actually syncing |
Note
You must provide either `mount_name` or `local_path`. When using
`mount_name`, the local and remote paths are read from the SSH profile
configuration. When using `local_path` without `remote_path`, a
default remote path is derived.
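A representative mount-based call matching the return example below; passing `dry_run: true` first is a cheap way to preview the transfer:

{
  "profile_name": "myserver",
  "mount_name": "ml-project",
  "dry_run": true
}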
Return value:
{
"success": true,
"profile": "myserver",
"mount": "ml-project",
"local": "/home/user/projects/ml-project",
"remote": "/home/researcher/projects/ml-project",
"dry_run": false,
"output": "sending incremental file list\nsrc/train.py\n..."
}
## Configuration

### get_config
Get the current srunx configuration including resource defaults and environment settings.
Parameters: None.
Return value:
{
"success": true,
"resources": {
"nodes": 1,
"gpus_per_node": 0,
"ntasks_per_node": 1,
"cpus_per_task": 1,
"memory_per_node": null,
"time_limit": null,
"partition": null,
"nodelist": null
},
"environment": {
"conda": null,
"venv": null,
"env_vars": {}
},
"log_dir": "logs",
"work_dir": null
}
### list_ssh_profiles
List all configured SSH connection profiles for remote SLURM clusters. Shows profile names, hostnames, and configured mount points.
Parameters: None.
Return value:
{
"success": true,
"profiles": [
{
"name": "dgx",
"hostname": "dgx.example.com",
"username": "researcher",
"port": 22,
"description": "Main DGX cluster",
"is_current": true,
"mounts": [
{
"name": "ml-project",
"local": "/home/user/projects/ml-project",
"remote": "/home/researcher/projects/ml-project"
}
]
}
],
"current": "dgx",
"count": 1
}