Job and Resource Monitoring#
srunx provides comprehensive monitoring capabilities for SLURM jobs and GPU resources.
Overview#
The monitoring system consists of three main components:
Job Monitoring: Track SLURM job state transitions
Resource Monitoring: Monitor GPU availability on partitions
Scheduled Reporting: Periodic cluster status reports via Slack
Monitor Commands#
The srunx monitor command has three subcommands:
srunx monitor jobs # Monitor job state changes
srunx monitor resources # Monitor GPU resource availability
srunx monitor cluster # Scheduled periodic reporting
Job Monitoring#
Monitor single or multiple SLURM jobs until completion with automatic state transition detection.
Basic Usage#
# Monitor single job until completion
srunx monitor jobs 12345
# Monitor multiple jobs
srunx monitor jobs 12345 12346 12347
# Monitor all your jobs
srunx monitor jobs --all
Continuous Monitoring#
Monitor job state changes continuously with real-time notifications:
# Continuous monitoring (Ctrl+C to stop)
srunx monitor jobs 12345 --continuous
# With custom poll interval (default: 60s)
srunx monitor jobs 12345 --continuous --interval 30
# With Slack notifications
srunx monitor jobs 12345 --continuous --notify $WEBHOOK_URL
Monitoring Options#
| Option | Description |
|---|---|
| `--continuous` | Monitor continuously instead of until completion |
| `--interval` | Polling interval (default: 60s, minimum: 1s) |
| `--timeout` | Maximum monitoring duration (default: no timeout) |
| `--notify` | Slack webhook URL for notifications |
| `--all` | Monitor all user jobs |
State Transitions#
The job monitor detects and reports the following state transitions (a minimal message-mapping sketch follows the list):
PENDING → RUNNING: Job started execution
RUNNING → COMPLETED: Job finished successfully
RUNNING → FAILED: Job failed with error
RUNNING → CANCELLED: Job was cancelled
RUNNING → TIMEOUT: Job exceeded time limit
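These transitions map directly onto notification messages. As a minimal illustration (not srunx's internal detection logic), a lookup keyed on the (previous, current) state pair is enough to turn a detected change into a readable message:

```python
# Illustrative only: map (previous, current) SLURM states to the messages above.
TRANSITION_MESSAGES = {
    ("PENDING", "RUNNING"): "Job started execution",
    ("RUNNING", "COMPLETED"): "Job finished successfully",
    ("RUNNING", "FAILED"): "Job failed with error",
    ("RUNNING", "CANCELLED"): "Job was cancelled",
    ("RUNNING", "TIMEOUT"): "Job exceeded time limit",
}


def describe_transition(previous: str, current: str) -> str:
    """Return a readable description of a job state transition."""
    return TRANSITION_MESSAGES.get((previous, current), f"{previous} -> {current}")
```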
Resource Monitoring#
Monitor GPU resource availability on SLURM partitions with threshold-based waiting.
Wait for Resources#
The monitor resources command requires --min-gpus to specify a threshold:
# Wait for 4 GPUs to become available
srunx monitor resources --min-gpus 4
# Wait for resources on specific partition
srunx monitor resources --min-gpus 2 --partition gpu
# With timeout (default: no timeout)
srunx monitor resources --min-gpus 4 --timeout 7200
Note
To display current resources without waiting, use srunx resources instead (see User Guide).
Continuous Resource Monitoring#
Monitor resource availability changes continuously:
# Continuous monitoring with state change notifications
srunx monitor resources --min-gpus 2 --continuous
# With Slack notifications
srunx monitor resources --min-gpus 4 --continuous --notify $WEBHOOK_URL
Resource Monitoring Options#
| Option | Description |
|---|---|
| `--min-gpus` | Minimum GPUs required (required) |
| `--partition` | Monitor specific partition (default: all partitions) |
| `--continuous` | Monitor continuously with state change notifications |
| `--interval` | Polling interval (default: 60s, minimum: 1s; intervals below 5s trigger a warning) |
| `--timeout` | Maximum monitoring duration |
| `--notify` | Slack webhook URL for notifications |
Resource Information#
The resource monitor provides the following metrics (see the formatting sketch after this list):
Total GPUs: Total GPU count across all available nodes
GPUs in Use: GPUs currently allocated to running jobs
GPUs Available: GPUs available for new jobs
GPU Utilization: Percentage of GPUs in use
Nodes Total: Total nodes in partition
Nodes Idle: Idle nodes available for jobs
Nodes Down: Unavailable nodes (DOWN/DRAIN/MAINT/RESERVED)
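When consuming these metrics programmatically, they come from a `ResourceSnapshot` (see Programmatic Usage below). The sketch assumes attribute names `gpus_total` and `gpus_in_use` alongside the documented `gpus_available`; verify them against `srunx.monitor.types` in your installed version.

```python
from srunx.monitor.types import ResourceSnapshot


def summarize_gpus(snapshot: ResourceSnapshot) -> str:
    """Format the GPU metrics listed above from a snapshot.

    `gpus_total` and `gpus_in_use` are assumed attribute names; only
    `gpus_available` appears elsewhere in this guide.
    """
    total = snapshot.gpus_total    # assumed attribute
    in_use = snapshot.gpus_in_use  # assumed attribute
    utilization = 100.0 * in_use / total if total else 0.0
    return f"{snapshot.gpus_available}/{total} GPUs free ({utilization:.1f}% in use)"
```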
Node State Handling#
The resource monitor automatically filters out unavailable nodes:
DOWN: Node is down
DRAIN/DRAINING: Node is being drained
MAINT: Node is in maintenance mode
RESERVED: Node is reserved
Only GPUs from available nodes are counted in the availability calculation.
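As a simplified sketch of that filtering rule (not the library's actual implementation), a node-state check might look like this:

```python
# States that exclude a node from the availability calculation, as listed above.
UNAVAILABLE_STATES = {"DOWN", "DRAIN", "DRAINING", "MAINT", "RESERVED"}


def node_is_available(state: str) -> bool:
    """Return False for nodes whose SLURM state marks them unavailable.

    SLURM can append flags such as '*' or '~' to a state (e.g. 'down*'),
    so strip them before comparing.
    """
    return state.upper().rstrip("*~+") not in UNAVAILABLE_STATES
```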
Scheduled Reporting#
Send periodic SLURM cluster status reports via Slack webhooks.
Basic Usage#
# Send hourly reports
srunx monitor cluster --schedule 1h --notify $WEBHOOK_URL
# Send reports every 30 minutes
srunx monitor cluster --schedule 30m --notify $WEBHOOK_URL
# Send daily reports at 9 AM
srunx monitor cluster --schedule "0 9 * * *" --notify $WEBHOOK_URL
Schedule Formats#
Interval Format (simple intervals):
- `1h`, `2h`, etc. - Hours
- `30m`, `45m`, etc. - Minutes (minimum: 1m)
- `1d`, `7d`, etc. - Days
Cron Format (precise scheduling):
"0 9 * * *"- Every day at 9:00 AM"0 */6 * * *"- Every 6 hours"0 9 * * 1-5"- Weekdays at 9:00 AM"*/30 * * * *"- Every 30 minutes
Report Contents#
The scheduled report includes:
- Job Statistics:
Pending jobs count
Running jobs count
Completed jobs (if available)
Failed jobs (if available)
- GPU Resources:
Total GPUs in partition
GPUs in use
GPUs available
Utilization percentage
- Node Statistics:
Total nodes
Idle nodes
Down/unavailable nodes
- User Job Information (optional):
Your pending jobs
Your running jobs
Your completed/failed jobs
- Running Jobs List (optional):
Job ID and name
User
Status
Partition
Runtime (for running jobs)
GPU allocation
Report Options#
| Option | Description |
|---|---|
| `--schedule` | Schedule specification (interval or cron) |
| `--notify` | Slack webhook URL (required) |
| `--partition` | Monitor specific partition |
| `--include` | Report sections: jobs, resources, user, running |
| `--user` | User for user-specific statistics (default: current user) |
| | Timeframe for job aggregation (default: 24h) |
| | Run as background daemon (default: daemon) |
Include Sections#
Customize report contents with --include:
# Include all sections (default)
srunx monitor cluster --schedule 1h --notify $WEBHOOK_URL \
--include jobs,resources,user,running
# Only job and resource statistics
srunx monitor cluster --schedule 1h --notify $WEBHOOK_URL \
--include jobs,resources
# Only running jobs list
srunx monitor cluster --schedule 30m --notify $WEBHOOK_URL \
--include running
Available sections:
- `jobs` - Overall job queue statistics
- `resources` - GPU resource availability
- `user` - User-specific job statistics
- `running` - List of running jobs with details
Programmatic Usage#
Job Monitoring#
from srunx import Slurm
from srunx.monitor import JobMonitor
from srunx.monitor.types import MonitorConfig, WatchMode
from srunx.callbacks import SlackCallback
client = Slurm()
# Submit a previously constructed Job (construction omitted here)
job = client.submit(job)
# Monitor until completion
monitor = JobMonitor(
job_ids=[job.job_id],
config=MonitorConfig(poll_interval=30, timeout=3600)
)
monitor.watch_until() # Blocks until job completes or timeout
# Continuous monitoring with callbacks
slack_callback = SlackCallback(webhook_url="your_webhook_url")
monitor = JobMonitor(
job_ids=[job.job_id],
config=MonitorConfig(
poll_interval=10,
mode=WatchMode.CONTINUOUS,
notify_on_change=True
),
callbacks=[slack_callback]
)
monitor.watch_continuous() # Ctrl+C to stop
Resource Monitoring#
from srunx.monitor import ResourceMonitor
from srunx.monitor.types import MonitorConfig, WatchMode
from srunx.callbacks import SlackCallback
# Wait for resources
monitor = ResourceMonitor(
min_gpus=4,
partition="gpu",
config=MonitorConfig(poll_interval=60, timeout=7200)
)
monitor.watch_until() # Blocks until 4 GPUs available
# Continuous monitoring with notifications
callback = SlackCallback(webhook_url="your_webhook_url")
monitor = ResourceMonitor(
min_gpus=2,
partition="gpu",
config=MonitorConfig(
poll_interval=60,
mode=WatchMode.CONTINUOUS,
notify_on_change=True
),
callbacks=[callback]
)
monitor.watch_continuous()
Scheduled Reporting#
from srunx import Slurm
from srunx.callbacks import SlackCallback
from srunx.monitor.scheduler import ScheduledReporter
from srunx.monitor.report_types import ReportConfig
client = Slurm()
callback = SlackCallback(webhook_url="your_webhook_url")
# Configure report
config = ReportConfig(
schedule="1h", # Every hour
partition="gpu",
include=["jobs", "resources", "user", "running"],
max_jobs=10,
user=None # Use current user
)
# Start reporter
reporter = ScheduledReporter(client, callback, config)
reporter.run() # Blocking execution, Ctrl+C to stop
Custom Callbacks#
Create custom callbacks by extending the Callback base class:
from srunx.callbacks import Callback
from srunx.models import Job
from srunx.monitor.types import ResourceSnapshot
class CustomCallback(Callback):
def on_job_completed(self, job: Job) -> None:
print(f"Job {job.job_id} completed!")
def on_job_failed(self, job: Job) -> None:
print(f"Job {job.job_id} failed!")
def on_resources_available(self, snapshot: ResourceSnapshot) -> None:
print(f"{snapshot.gpus_available} GPUs now available!")
def on_resources_exhausted(self, snapshot: ResourceSnapshot) -> None:
print(f"Only {snapshot.gpus_available} GPUs remaining!")
# Use custom callback
monitor = JobMonitor(
job_ids=[job_id],
callbacks=[CustomCallback()]
)
monitor.watch_continuous()
Configuration#
MonitorConfig#
Configure monitoring behavior:
from srunx.monitor.types import MonitorConfig, WatchMode
config = MonitorConfig(
poll_interval=60, # Poll every 60 seconds
timeout=3600, # Timeout after 1 hour
mode=WatchMode.UNTIL_CONDITION, # or WatchMode.CONTINUOUS
notify_on_change=True # Send notifications on state changes
)
Configuration parameters:
- `poll_interval` - Seconds between polls (default: 60, minimum: 1)
- `timeout` - Maximum monitoring duration in seconds (default: None)
- `mode` - `UNTIL_CONDITION` or `CONTINUOUS`
- `notify_on_change` - Enable state change notifications (default: True)
Note
Polling intervals below 5 seconds are considered aggressive and will trigger a warning. Choose appropriate intervals based on cluster load.
Best Practices#
Polling Intervals#
Job Monitoring: 10-60 seconds for active monitoring
Resource Monitoring: 60+ seconds recommended (intervals below 5s trigger a warning)
Scheduled Reporting: 30 minutes to 1 hour for regular updates (minimum: 1 minute)
Choose appropriate intervals based on:
Cluster size and load
Monitoring urgency
SLURM system performance
Timeouts#
Always set timeouts for blocking operations:
# Good: Set reasonable timeout
srunx monitor jobs 12345 --timeout 3600
# Bad: No timeout (could block indefinitely)
srunx monitor jobs 12345
Error Handling#
The monitoring system includes automatic error recovery (a reusable sketch of the pattern follows this list):
SLURM command timeouts (30s per command)
Graceful handling of node state variations
Automatic retry on transient failures
Safe handling of malformed SLURM output
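If you shell out to SLURM yourself around these monitors, the same defensive pattern is easy to reproduce. The following is an illustrative sketch of per-command timeouts plus retry with backoff, not srunx's internal code:

```python
import subprocess
import time


def run_slurm_command(args: list[str], retries: int = 3, timeout: int = 30) -> str:
    """Run a SLURM CLI command with a per-command timeout and simple retries."""
    for attempt in range(1, retries + 1):
        try:
            result = subprocess.run(
                args, capture_output=True, text=True, timeout=timeout, check=True
            )
            return result.stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # back off before retrying


# Example: list the current user's queued jobs
# output = run_slurm_command(["squeue", "--me", "--noheader"])
```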
Signal Handling#
All monitoring operations support graceful shutdown (see the sketch after this list):
Ctrl+C (SIGINT) - Clean exit with status summary
SIGTERM - Graceful shutdown for process management
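When the monitors run inside your own long-lived script or service, you can route SIGTERM through the same KeyboardInterrupt path that Ctrl+C uses, so both exits behave identically. A minimal sketch:

```python
import signal

from srunx.monitor import JobMonitor


def _handle_sigterm(signum, frame):
    # Reuse the Ctrl+C path so SIGTERM shuts the monitor down the same way.
    raise KeyboardInterrupt


signal.signal(signal.SIGTERM, _handle_sigterm)

monitor = JobMonitor(job_ids=[12345])  # configure as in the examples above

try:
    monitor.watch_continuous()
except KeyboardInterrupt:
    print("Monitoring stopped, exiting cleanly.")
```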
Troubleshooting#
Common Issues#
- High CPU usage:
Increase poll interval
Reduce number of monitored jobs
Use `--timeout` to limit duration
- Missing notifications:
Verify Slack webhook URL
Check network connectivity
Review logs for callback errors
- Inaccurate GPU counts:
Verify partition name
Check node states (DOWN/DRAIN nodes are excluded)
Ensure SLURM commands are accessible
Logging#
Enable debug logging for troubleshooting:
from loguru import logger
logger.add("monitor.log", level="DEBUG")
Or via environment variable:
export LOGURU_LEVEL=DEBUG
srunx monitor jobs 12345
Examples#
Complete Workflow Integration#
from srunx import Slurm, Job
from srunx.monitor import ResourceMonitor, JobMonitor
from srunx.monitor.types import MonitorConfig
from srunx.callbacks import SlackCallback
client = Slurm()
callback = SlackCallback(webhook_url="your_webhook_url")
# Wait for resources
print("Waiting for 4 GPUs...")
resource_monitor = ResourceMonitor(
min_gpus=4,
config=MonitorConfig(poll_interval=60, timeout=3600)
)
resource_monitor.watch_until()
# Submit a previously constructed Job (construction omitted here)
job = client.submit(job)
print(f"Job {job.job_id} submitted")
# Monitor job with notifications
job_monitor = JobMonitor(
job_ids=[job.job_id],
config=MonitorConfig(poll_interval=30),
callbacks=[callback]
)
job_monitor.watch_until()
print("Job completed!")
Automated Job Queue Management#
#!/bin/bash
# automated_queue.sh
# Wait for resources
srunx monitor resources --min-gpus 8 --timeout 7200
# Submit batch of jobs
for experiment in exp1 exp2 exp3; do
srunx submit python train.py --name $experiment
done
# Monitor all submitted jobs
srunx monitor jobs --all --notify $SLACK_WEBHOOK
Cluster Status Dashboard#
# Start scheduled reporting for cluster dashboard
srunx monitor cluster \
--schedule "*/15 * * * *" \
--notify $SLACK_WEBHOOK \
--include jobs,resources,running