AWS Parallel Computing Service Adds Support for Slurm 25.11 and New Monitoring Endpoints

The latest update to AWS Parallel Computing Service integrates Slurm version 25.11 to enhance workload orchestration and observability for high-performance computing clusters. This release enables a Prometheus-compatible OpenMetrics endpoint, allowing administrators to collect cluster metrics directly using standard monitoring tools. These additions provide deeper visibility into scheduler performance and job lifecycle management.
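
To illustrate what a monitoring tool receives from a Prometheus-compatible endpoint, the following minimal sketch parses a small text-format payload using only the Python standard library. The metric names in `SAMPLE` are hypothetical placeholders, not names confirmed by the source, and the parser is deliberately simplified (it ignores timestamps and does not handle escaped characters inside label values).

```python
from typing import Dict

# Hypothetical sample payload; the metric names actually exposed by the
# scheduler endpoint may differ and are NOT taken from the source.
SAMPLE = """\
# HELP demo_jobs_pending Number of pending jobs (illustrative name).
# TYPE demo_jobs_pending gauge
demo_jobs_pending 12
# TYPE demo_nodes gauge
demo_nodes{state="idle"} 3
demo_nodes{state="allocated"} 5
# EOF
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus optional labels) to its float value."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (# HELP, # TYPE, # EOF).
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: keep everything up to the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        # The first token after the series is the value; a trailing
        # timestamp, if present, is ignored.
        samples[series] = float(rest[0])
    return samples


metrics = parse_metrics(SAMPLE)
```

In practice a Prometheus server does this parsing itself; the sketch only shows the shape of the data an administrator can expect to collect.
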

Comparison

| Aspect | Before | With this release |
|---|---|---|
| Slurm Version | Older versions such as 23.11 or 24.05 | Version 25.11 |
| Metric Collection | Custom scripts or indirect log parsing | Native Prometheus-compatible OpenMetrics endpoint |
| Audit Capabilities | Standard job and system logging | Dedicated scheduler audit logs for compliance |
| Job Handling | Standard re-queueing mechanisms | Introduction of expedited re-queue functionality |

Action Checklist

- Verify compatibility of existing Slurm plugins with version 25.11. Custom plugins may require recompilation or code updates.
- Configure the OpenMetrics endpoint in your Prometheus monitoring stack. Ensure network access between the Prometheus server and the PCS controller.
- Enable scheduler audit logs in the AWS Management Console. Check CloudWatch Logs storage costs for high-volume clusters.
- Test job re-queueing behavior in a development environment. Validate how the expedited re-queue affects priority scheduling.
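
The network-access item in the checklist can be checked with a small reachability probe run from the Prometheus host. This is a sketch under stated assumptions: the host name and port in the usage comment are hypothetical, since the source does not say where the metrics endpoint listens, and a successful TCP connection only shows that routing and security groups allow traffic, not that metrics are being served.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (hypothetical host and port, replace with your own values):
# can_reach("controller.example.internal", 9090)
```
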

Source: AWS What's New
This page summarizes the original source. Check the source for full details.


