Architecting SLURM HPC for Enterprise

Introduction

In the last article, we have seen about HPC capacity planning on cloud, in this article we will look at the architecture and features of SLURM workload manager that is popularly used in HPC (High Performance Computing) use cases. We will look at how AWS adopted SLURM as part of the AWS ParallelCluster solution for HPC workload and resource management. We will also discuss the best practices adopted in the Enterprise HPC systems using AWS ParallelCluster to setup and configure SLURM for workload management.

Why SLURM?

SLURM, which stands for Simple Linux Utility for Resource Management, is an industry standard open source resource management and job scheduling system used in high-performance computing (HPC) environments.

SLURM is designed to efficiently handle computing clusters of varying sizes, from small setups to massive supercomputers hosting thousands of nodes.

SLURM provides a range of features and customisation options to address diverse computing needs. Administrators can fine-tune the system with different job scheduling algorithms, resource management policies, and job prioritisation strategies based on their requirements.

SLURM is very stable and reliable. It has been extensively tested and used in production environments, demonstrating high uptime and robust performance even under heavy workloads.

SLURM provides reliable resource management capabilities, including job scheduling, job accounting, resource monitoring and job checkpointing. This enables efficient utilisation of computing resources and helps administrators track resource usage for billing or reporting purposes.

SLURM has a large and active user community with ongoing development, regular updates and support for the software. Provides good documentation and user resources.

SLURM Architecture

slurmctld

Slurm has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure.

slurmd

Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. The slurmd daemons provide fault-tolerant hierarchical communications.

slurmdbd

There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record accounting information for multiple Slurm-managed clusters in a single database.

slurmrestd

There is an optional slurmrestd (Slurm REST API Daemon) which can be used to interact with Slurm through its REST API.

Tools to run and monitor jobs

User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs, and sacct to get information about jobs and job steps that are running or have completed. The sview commands graphically reports system and job status including network topology. There is an administrative tool scontrol available to monitor and/or modify configuration and state information on the cluster. The administrative tool used to manage the database is sacctmgr. It can be used to identify the clusters, valid users, valid bank accounts, etc. APIs are available for all functions. Slurm architecture is referred below:

Source: https://slurm.schedmd.com/overview.html

Why AWS ParallelCluster?

AWS ParallelCluster is an open source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS Cloud.

AWS ParallelCluster uses a simple graphical user interface (GUI) or text file to model and provision the resources needed for your HPC applications in an automated and secure manner. Provides a CLI(command line interface) and configuration files to automate cluster creation and management. Users can define cluster configurations using the text file and customise the instance types, network settings, various storage options and required software packages.

The configuration file allows to configure job submission queues with single and multiple instance types and job schedulers like AWS Batch and Slurm.

AWS ParallelCluster is built on the popular open source CfnCluster project and is released via installing the GUI through an AWS CloudFormation template or the Python Package Index (PyPI). ParallelCluster’s source code is hosted on the Amazon Web Services repository on GitHub. AWS ParallelCluster is available at no additional charge, and you pay only for the AWS resources needed to run your applications.

Reference: https://aws.amazon.com/hpc/parallelcluster/

ParallelCluster seamlessly integrates with other AWS services, such as Amazon EC2 for compute instances, Amazon S3 for object storage, and Amazon VPC for networking. This integration simplifies cluster management and enables users to leverage additional AWS features and services as needed.

ParallelCluster supports highly scalable cluster deployments, adds and removes compute nodes dynamically based on workloads and enables large scale simulations.

AWS ParallelCluster incorporates security best practices, takes advantage of AWS’s robust security features, such as encryption, identity and access management (IAM), and network security controls, to protect the data and infrastructure.

AWS Services used by AWS ParallelCluster: https://docs.aws.amazon.com/parallelcluster/latest/ug/aws-services-v3.html

*Fig 2: Reference Architecture of AWS ParallelCluster*

SLURM Architecture – Design Considerations and Best Practices

In this section, we will discuss SLURM architecture and its components and how AWS ParallelCluster enabled setting up the cluster with ease.

The key components of the SLURM architecture are

HeadNode

In the HPC cluster, the purpose of HeadNode is to manage and coordinate the resources within the cluster.

HeadNode schedules the jobs submitted by the users or applications.
Based on the availability of resources, priority of jobs and other conditions, schedules the jobs.
Head Node runs the slurm scheduler, slurmdb and slurmrestd daemons.
Supports Queues where the jobs are lined up and executed. Manages fair allocation of resources among the jobs.
Monitors the cluster’s health and performance, tracking resource usage, job execution times, and other relevant metrics.
Slurm scheduler daemon does not store job data permanently. This can be solved by enabling Slurm accounting that uses an external database. Slurm scheduler can read and write data from and to an external database. The Slurm daemon writes all the accounting records to the permanent data store.
Handles failover and fault-tolerance to handle node failures and job failures.
Slurmrestd exposes API endpoints to interact with the Slurm scheduler and database.

Selection of Instance type

As Head Node does all of the above mentioned points and also runs some of the remote solver applications like RSM server, choosing the right instance type is critical.

Please refer to our blog HPC Capacity Planning On Clou d to know more about the choice of instance type for HeadNode and other components as well.

Cluster Size

The Head Node orchestrates all the scaling activities in the cluster and it takes care of adding new nodes to the scheduler. To scale up and down a cluster that has a large number of nodes, provide the head node some extra compute capacity. The cluster size is determined by some of the constraints mentioned as part of the Queue and QOS configurations in the cluster. We will see more details about the Queue and QOS configurations in the next few sections.

Custom Scripts

AWS ParallelCluster supports custom scripts for the head node based on the custom actions.

Below are the available options in the custom scripts for the Head Node. All the custom actions can be provided as shell scripts stored in the S3.

OnNodeStart: Actions are called before any node deployment bootstrap action is started
OnNodeConfigured: Actions are called after the node bootstrap processes are complete.

OnNodeUpdated: Actions are called after the head node update is completed and the scheduler and shared storage are aligned with the latest cluster configuration changes.

HeadNode configuration using AWS ParallelCluster config

Head Node configuration for the AWS ParallelCluster where we have used c6in.32xlarge as Head Node and local storage of 1000GB.

HeadNode:  
 InstanceType: c6in.32xlarge  
 Networking:     
 SubnetId: subnet-01234abc  
 Ssh:    
 KeyName: <Key Name>    
 AllowedIps: 10.0.0.0/8  
 LocalStorage:    
 RootVolume:      
 VolumeType: gp3      
 Size: 1000      
 Encrypted: true  
 Iam:    
 S3Access:      
 - EnableWriteAccess: true        
 BucketName: s3bucket    
 AdditionalIamPolicies:      
 - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore  
 CustomActions:    
 OnNodeConfigured:      
 Script: s3://s3bucket/abc/Headnode/headnode-wrapper.sh    
 OnNodeUpdated:      
 Script: s3://s3bucket/abc/Headnode/nodeupdate/slurmdbconfig_update.sh  
 DisableSimultaneousMultithreading: false

Compute Nodes

Compute nodes are the instances responsible for executing the jobs submitted by users or applications. These nodes form the backbone of the cluster and are where the actual computation takes place.

Compute nodes work together to run large simulations by distributing job tasks across multiple nodes and coordinating their execution. The slurmd daemon installed on each compute node communicates with the SLURM HeadNode to receive job assignments, report node availability, and update resource status. Compute nodes with different configurations like the number of CPUs and type of CPU, the memory and number of GPUs.

When a user submits a job to the SLURM scheduler, SLURM determines the appropriate resources needed for the job and assigns it to one or more compute nodes for execution. The compute nodes then carry out the computation and return the results upon completion.

Compute node configurations are provided as part of the QUEUE configurations. Capacity planning is done based on different design simulation use cases that the customer wants to run.

Queue

Slurm supports multiple queues, create necessary queues based on the requirements. Each queue has many configurations and all these can be configured through the AWS ParallelCluster configuration.

Selection of Instance type

Please refer to our blog HPC Capacity Planning On Cloud to know more about the choice of instance type for Compute Nodes.

Single and Multiple instance types

Slurm queue supports single or multiple instance types. When selecting the multiple instances ensure each instance type has a similar core, same number of accelerators of the same manufacturers. If EFA is set to true, all the instances must have EFA support.

Single AZ and Multi AZ

AWS ParallelCuster supports instances across multiple AZs from a single HPC Cluster. Loosely coupled jobs can be used to span instances across multiple AZs , whereas tightly coupled workloads need instances to run in a single availability zone.

Key Considerations:

Elastic Fabric Adapter (EFA) cannot be enabled in queues than span Availability Zones.
Cluster Placement Groups are not supported in queues than span Availability Zones.
Network traffic between Availability Zones is subject to higher latency and incur charges

On-demand and Spot instances

AWS ParallelCluster supports on demand and spot instances based on the configuration options. When you are using Spot instances, you will probably want to optimize the chances that your jobs will run to completion instead of being interrupted. This is especially the case for workloads where it may be quite expensive to checkpoint and re-start work in progress. You can configure a ParallelCluster queue with this optimization by adding an AllocationStategy key to the queue and setting it to capacity-optimized, rather than its default value of lowest-price.

EFA and Placement groups

EFA brings the scalability, flexibility, and elasticity of the cloud to tightly coupled high performance computing (HPC) applications. With EFA, tightly coupled HPC applications have access to lower and more consistent latency and higher throughput than traditional TCP channels, enabling them to scale better. EFA support can be enabled dynamically, on demand on any supported EC2 instance without pre-reservation, giving you the flexibility to respond to changing business and workload priorities.

EFA is a network interface that provides low-latency, high-bandwidth communication between instances in a placement group. It is specifically designed to accelerate communication for MPI (Message Passing Interface) applications commonly used in HPC. If your HPC applications heavily rely on MPI, using instances that support EFA can greatly improve performance.

Using placement groups in an HPC (High-Performance Computing) cluster can improve the performance and communication between instances, leading to better auto scaling capabilities. Placement groups help ensure that EC2 instances are placed in close physical proximity, reducing network latency and improving inter-instance communication. Placement groups can be specified either in Auto scaling launch configuration or Auto scaling group settings.

Compute Node Configuration

Compute node scaling features are crucial for ensuring optimal resource utilization, efficient job scheduling, and cost-effectiveness. Here are some key compute node scaling features commonly found in HPC cluster environments.

Auto Scaling allows the HPC cluster to automatically add or remove compute nodes based on predefined scaling policies and thresholds. When workload increases, new compute nodes are automatically provisioned to handle the additional tasks. Conversely, nodes are terminated during low-demand periods to save costs.

SlurmQueues:
  - Name: queue1
    Networking:
        SubnetIds:
           - subnet-12345
    ComputeResources:
    - Name: queue1
      Instances:
           - InstanceType: c6i.32xlarge
      MinCount: 0
      MaxCount: 50
      DisableSimultaneousMultithreading: true
      Efa:
           Enabled: true
      Networking:
          PlacementGroup:
               Enabled: true
   ComputeSettings:
      LocalStorage:
         RootVolume:
               VolumeType: gp3
               Encrypted: true
               Size: 50
    CustomActions:
        OnNodeConfigured:
               Script: s3:/s3bucket/abc/ComputeNode/computenode-wrapper.sh
    Iam:
        AdditionalIamPolicies:
              - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
        S3Access:
              - EnableWriteAccess: true
                BucketName: s3bucket
   Tags:
             - Key: custom:queue_name
               Value:queue1

Accounting DB

The Slurm job scheduler can collect accounting information for each job (and job step) that runs on your HPC cluster into a relational database. By default, in AWS ParallelCluster, job information is only persisted while the job is running. In order to persist job information after completion we need to enable Slurm accounting. Slurm DB configuration: Amazon Aurora MySQL is recommended as a standard accounting DB as part of the AWS ParallelCluster setup. Importantly, you can also use Slurm accounting to meter and limit consumption. And, you can use it to run detailed post-hoc usage reports, which can be helpful for billing and usage efficiency analysis.

SLURM Accounting DB configuration using AWS ParallelCluster:

Scheduling:
 Scheduler: slurm
    SlurmSettings:
    Database:
       Uri: <RDS Endpoint :3306>
       UserName: dbadmin
       PasswordSecretArn:  <Secretsmanager ARN>
ScaledownIdletime: 10

Storage – Shared file systems

In the Head Node and Compute Nodes NFS is used to share the files so as to ensure enough bandwidth is available and also ensure that the cluster is able to handle the artefacts that need to be shared between the compute nodes and head node.

Amazon FSx for NetApp ONTAP is recommended to use as the shared storage. Amazon FSx for NetApp ONTAP offers a highly available and durable file system, designed to meet the needs of enterprise workloads. It provides support for advanced features like data deduplication, data compression, thin provisioning, and SnapMirror data replication. FSx shared storage volume will be mounted on cluster nodes as well as on the workstations to enable users to share the data files. Amazon FSx NetApp ONTAP can support both Windows and Linux operating systems.

Capacity planning for Storage is explained in detail as part of this blog HPC Capacity Planning On Cloud.

AWS ParallelCluster provides the following config for Storage configuration:

SharedStorage:
 - MountDir: /apps
   Name: app
   StorageType: FsxOntap
   FsxOntapSettings:
      VolumeId: fsvol-123456
 - MountDir: /shared
   Name: data
   StorageType: FsxOntap
   FsxOntapSettings:
      VolumeId: fsvol-123456

SLURM Rest API

Slurm provides a REST API daemon named slurmrestd. This daemon is designed to allow clients to communicate with Slurm via a REST API (in addition to the command line interface (CLI) or C API). Slurmrestd is stateless as it does not cache or save any state between requests. Each request is handled in a thread and then all of that state is discarded. Any request to slurmrestd is completely synchronous with the Slurm controller (slurmctld or slurmdbd) and is only considered complete once the HTTP response code has been sent to the client. Slurmrestd will hold a client connection open while processing a request. Slurm database commands are committed at the end of every request, on the success of all API calls in the request.

Job Scheduling and Management

Slurm HPC Job Schedulers

In Slurm, the scheduler organizes jobs into queues and performs resource allocation based on the specifications of the jobs and the available resources in the HPC cluster. Each queue represents a set of jobs with similar characteristics, and the scheduler determines the order in which the jobs are scheduled for execution. Different queues may have distinct scheduling policies, job priorities, and resource limits. Here’s how Slurm scheduler handles queues and resource allocation:

Queue Configurations

Administrators can define different queues with specific properties in the Slurm configuration. Each queue is associated with certain characteristics, such as maximum runtime, priority, and allowed resources. For example, there may be queues for short jobs, long jobs, GPU jobs, memory-intensive jobs, etc. There are multiple queues required to support different use cases. The following is the recommended queue configuration based on the requirements provided for the use cases.

Some of the basic queue configurations listed below are based on the capacity requirements estimated for running different types of CFD (Computational Fluid Dynamics) use cases.
The details of the capacity planning for queue configurations is mentioned in detail in this blog HPC Capacity Planning On Cloud.

Example-1:

Queue	hpccompute1
Instance Type	c6in.32xlarge
MinCount	0
MaxCount	50
DisableSimultaneousMultithreading	True
Efa.Enabled	True
Networking.PlacemenGroup.Enabled	True

Example-2:

Queue	hpccompute2
Instance Type	c6in.16xlarge
MinCount	0
MaxCount	50
DisableSimultaneousMultithreading	True
Efa.Enabled	False
Networking.PlacemenGroup.Enabled	True

QOS

QOS stands for Quality of Service. QOS in SLURM allows administrators to prioritise and allocate resources such as CPU time, memory, and nodes among different users or user groups based on defined policies.

SLURM’s QOS feature enables administrators to manage resources more effectively by ensuring that critical jobs or users get the necessary resources first, while still allowing fair access for others. QOS settings can include parameters like job priority, resource limits, access restrictions, and scheduling policies. This flexibility helps optimise resource utilisation and improve overall system performance.

Overall, QOS improves productivity and user satisfaction by enhancing the resource management, prioritisation, fairness, efficiency, customization, and control.

Sample QOS Configuration

QOS is defined as a resource in SLURM. It is defined in the SLURM database by adding or modifying a QOS with specific flags that affect the job scheduling priority, job preemption and job limits.

Once defined QOS can be associated with a specific user, user group(Account), a partition/Queue or a job when the job is submitted.

All QOS operations are done using the “sacctmgr” command. Multiple QOS can be defined with different sets of priority and limitations. By default a ‘normal’ QOS is added when SLURM is installed.

Some key configurations that are generally used.

Priority – Setting the priority of the job

QOS: low-priority

sacctmgr add qos low-priority
sacctmgr modify qos low-priority set priority=3

QOS: medium-priority

sacctmgr add qos medium-priority
sacctmgr modify qos medium-priority set priority=2

QOS: high-priority

sacctmgr add qos high-priority
sacctmgr modify qos high-priority set priority=1

MaxSubmitjobs: Max Submitted/Queued jobs per user

This QOS setting defines the maximum number of jobs that can be submitted by a user at a time.

sacctmgr add qos low-qos
sacctmgr modify qos low-qos set MaxSubmitJobs=1

MaxJobsPerUser: Max number of running jobs per user

This QOS setting defines the maximum number of jobs that can be running at a time, submitting by a single user.

sacctmgr add qos medium-qos
sacctmgr modify qos medium-qos set MaxJobsPerUser=2

MaxCPUsPerUser: Max number of CPUs that can be used per user

This QOS setting defines the maximum number of jobs that can be running at a time, submitting by a single user. This QOS allows up to 512 CPUs per user.

sacctmgr add qos high-qos
sacctmgr modify qos high-qos set MaxCPUsPerUser=512

Additional SLURM cluster design best practices

Slurm Login Node

In an enterprise Slurm-based HPC cluster, it’s likely you interact with your cluster using a login node. It’s the portal through which you access your cluster’s vast computational resources. You’ve probably used one to browse your files, submit jobs (and check on them) and compile your code. You can do all these things using the headnode, too, but when a cluster is shared among multiple users in an enterprise or lab, someone compiling their code on the headnode can hamper other users trying to submit jobs, or just doing their own work. Some AWS ParallelCluster customers have worked around this limitation by manually creating login nodes for their users, but this involved a lot of undocumented steps and forced their admins to know about ParallelCluster’s internals.

AWS ParallelCluster 3.7 supports adding login nodes to your cluster, out of the box. Refer to this post to understand how you can up an HPC cluster with login nodes.

Slurm Connectivity Options

User Connectivity: The on-premises network is connected to AWS cloud via a SD-WAN and Transit Gateway. SD-WAN is currently provisioned to establish a flexible, scalable and secure network connectivity between on-premises locations to AWS cloud.
Transit Gateway is a fully managed service that simplifies network connectivity between Amazon Virtual Private Clouds (VPCs) and on-premises networks. It acts as a hub that can connect multiple VPCs and VPN connections, making it an ideal choice for multi-VPC and multi-account AWS environments.
VPN Connection – Users can connect using VPN connections between the SD-WAN appliance at the on-premises data centre and the Transit Gateway in AWS. The VPN connections ensure encrypted and secure communication between on-premises and cloud resources
Once the VPN connection is established, users can use RDP to connect directly to the Windows instance in the private subnet from an on-premises machine.

Multi User Enablement

AWS ParallelCluster supports multi-user management through AD integration. User authentication and authorization will be done by using Active Directory. The cluster nodes and the workstations will be integrated with AD seamlessly

By default the home directory of the user will be on the Head Node but it can be moved to the shared file storage by mentioning the path in the override_homedir. Moving it to a shared file system will reduce the load on the Head Node and will be easy for the management.

DirectoryService:
 DomainName: DC=example,DC=com
 DomainAddr: ldaps://10.0.0.1,ldaps://10.0.0.2
 PasswordSecretArn: <Secretsmanager ARN>
 DomainReadOnlyUser: CN=_svc_aws_hpc,CN=users,DC=example,DC=com
 GenerateSshKeysForUsers: True
 LdapTlsReqCert: true
 AdditionalSssdConfigs:
   debug_level: "0xFFF0"
   override_homedir: /data/home/%u

Tagging

Tagging is very important for the HPC Operations and Cost Calculation when it comes to HeadNode and ComputeNode. AWS ParallelCluster propagates the tags when it is mentioned under the Tags section. Please refer to the section as below in the configuration section.

If there is need for Queue level custom tags, they can be added under the Queue configuration under “Scheduling / SlurmQueues / Tags” . When Compute nodes are created all the tags including the Queue level tags will be added to the EC2 instance resource. And any matching Queue level tags will override the generic tag values configured under the “Tags” section of the AWS ParallelCluster configuration.

Tags:
 - Key: Name
   Value: <Name>
 - Key: Department
   Value: <Department>
 - Key: OU
   Value: <OU>
 - Key: Environment
   Value: PRD
 - Key: Application_Name
   Value: <Parallel Cluster>
 - Key: Project
   Value: <Project>
 - Key: OS
   Value: Centos 7.9
 - Key: Application_Owner
   Value: <Application_Owner>
 - Key: Application_Type
   Value: AWS Parallel Cluster
 - Key: CostCenter
   Value: "12345"

Custom Scripts

AWS ParallelCluster supports custom scripts for the head node and compute nodes based on the custom actions.

Below are the available options in the custom scripts for the Head Node and Compute Nodes. All the custom actions can be provided as shell scripts stored in the S3.

OnNodeStart: Actions are called before any node deployment bootstrap action is started
OnNodeConfigured: Actions are called after the node bootstrap processes are complete.
OnNodeUpdated: Actions are called after the head node update is completed and the scheduler and shared storage are aligned with the latest cluster configuration changes.

SLURM Rest API enablement, SLURM email notifications, package installation etc is handled through the custom scripts

SLURM Email Notification

Slurm has an option to send emails when your job changes status. This is useful to get notifications when your job completes or submitted. The emails can be customised to include useful information such as stdout, stderr, runtime etc,

The email configuration consists of the SMTP server configuration with the necessary email attributes. The configuration is set up using custom scripts and scripts are run on the HeadNode when the head node is configured.

Custom AMI for AWS ParallelCluster

Every version of ParallelCluster release, AWS provides prebuilt AMIs support for Ubuntu, CentOS and other linux flavours In the Enterprise scenario, there is a need to use the AMI which is approved by the internal security team. AWS ParallelCluster provides a option to specify custom AMI during the cluster creation which will be used by Head and Compute Nodes

There are three alternative ways to use a custom AWS ParallelCluster AMI, two of them require to build a new AMI that will be available under your AWS account and one does not require to build anything in advance:

modify an AWS ParallelCluster AMI, when you want to install your software on top of an official AWS ParallelCluster AMI
build a custom AWS ParallelCluster AMI using pcluster cli, when you have an AMI with customization and software already in place, and want to build an AWS ParallelCluster AMI on top of it
use a Custom AMI at runtime, when you don’t want to create anything in advance, AWS ParallelCluster will install everything it needs at runtime (during cluster creation time and scale-up time)

Security

Key considerations in the security for ParallelCluster on AWS

AWS ParallelCluster uses roles to access AWS resources like S3, etc. Roles are created automatically and attached to the machines based on the configuration values.
Passwords used for the database, ldap server should be stored in the secrets manager and referred during the deployment.
Enforce TLS encryption between slurmdbd and the database server, By default it is enabled on the AWS ParallelCluster configuration
Enforce LDAPS (TLS/SSL) is used for the SSSD service authentication. AWS ParallelCluster configuration has the options to provide the certificates
Ensure all the EBS volumes have encryption enabled and leverage the latest generation of gp3 volumes.
Ensure Instance Metadata Service Version 2 is enabled on all the instances and it is driven through the configuration.
Open necessary ports in the Security Groups for Head Node, Computer nodes and Storage filesystems.
Enforce encryption at rest and transit for the Storage filesystems.

Observability for SLURM

Monitoring is an important part of maintaining the reliability, availability, and performance of the SLURM cluster. AWS ParallelCluster leverages CloudWatch Logs for logs and CloudWatch for the metrics.

Amazon CloudWatch

AWS ParallelCluster configures the Cloudwatch agent for metrics collection on the Head Node and creates custom dashboards automatically during the AWS ParallelCluster deployment. Dashboards have the metrics of the Head Node and Cluster status.

AWS ParallelCluster also creates the CloudWatch Alarms for the head node related to disk usage, etc.

Amazon CloudWatch Logs

AWS ParallelCluster configures the Cloudwatch agent for logs collection on the Head Node and Compute nodes. On the Head and Compute Node it captures the logs related to Slurm components and pushes them to log streams. Log rotation configuration can be done during the cluster creation process. Provides a default CloudWatch dashboards for monitoring the cluster, master node detail, compute node details.

Sample Dashboards showing HeadNode, FSx Storage and Cluster Health metrics

Custom HPC-Specific Metrics Monitoring

Slurm is configured with a specific set of queues that determines where the job is executed. Capturing metrics for Queues and Jobs are very critical. Custom metrics can be captured from the Slurm and pushed to AWS CloudWatch for the visualisation by implementing a custom solution built by Invisibl Cloud using AWS SDK.

A custom job level metric dashboard will be implemented to visualise the job metrics.

Multiple Queue Custom Dashboard: (instance level data for each queue)
Queue State [count of which queue have how many nodes in which state]
CPU Used
Memory Used %
Disk used %
Single Queue Custom Dashboard:
Jobs CPU Utilization Percent
Jobs Memory Utilization Percent
Jobs Max Disk Read Bytes
Jobs Max Disk Write Bytes
Jobs CPU Usage Seconds· Jobs Memory Usage Bytes

Custom HPC-Specific Logging

AWS ParallelCluster by default sends Slurm service logs to CloudWatch logs. Fluent-Bit can be configured on the Head Node with the required regex pattern to automatically detect and push the HPC job logs to Amazon CloudWatch logs.

License Management

License management is critical for effective and fair share usage of HPC applications across multiple HPC teams and users. SLURM provides Local and Remote license management to enable license control at the cluster level.

Local license management uses a static set of license details setup as a resource managed within SLURM. This is suitable for simple HPC setup and for teams which do not use multiple HPC application software.

Remote license management is recommended for large enterprises with many HPC applications using their own License managers. A custom script can be implemented that periodically synchronises the license availability, in-use and freely available details for all HPC solvers between the application license manager and SLURM license manager.

With Remote license management, licenses can be shared by multiple teams within the enterprise by allocating a certain number of licenses per cluster. With this the license usage can be checked and controlled at the time of job submission on the cluster itself.

LMSTAT is the command used to connect to the remote license manager and get the usage details for all the HPC solvers.

sacctmgr CLI utility is used to add and update license resource in a SLURM cluster

sacctmgr -i add resource name=ansys cluster=prod count=3000 allowed=50 \ server=flex_host servertype=flexlm flags=absolute

sacctmgr -i update resource ansys set lastconsumed=512

Prolog and Epilog Scripts

The prolog scripts run before the user’s job and are run with root permissions. Prolog scripts have several benefits like mentioned below.

Configuring the user’s environment for running their application
Clearing up any unwanted files or data from the previous users
Setting up specific storage directories on specific storage systems
Copying the user’s input to the storage directories, and copying back any output data to the user’s /home directory or group storage directory

The epilog scripts run with root permissions after a user’s job has completed. Typically, the script is run on all nodes used in the job, Epilog scripts have the following uses.

Cleaning up after the user. As an example, if directories were created for the user, then a prolog can copy all the user data to a specific location (e.g., the user’s /home directory) – if all goes well, without going over quota.
Cleaning up the node to get ready for the next user.
Running health checks to make sure the node is healthy

AWS ParallelCluster Example Configuration

The following is a reference ParallelCluster SLURM configuration that shows all the attributes that are configured for each of the major components discussed in this article.

Region: eu-west-1
Imds:
  ImdsSupport: v2.0
Image:
   Os: centos7
SharedStorage:
 - MountDir: /apps
   Name: app
   StorageType: FsxOntap
   FsxOntapSettings:
      VolumeId: fsvol-123456
 - MountDir: /shared 
   Name: data 
   StorageType: FsxOntap 
   FsxOntapSettings: 
       VolumeId: fsvol-123456 
HeadNode: 
   InstanceType: c6in.32xlarge 
   Networking: SubnetId: 
       subnet-01234abc 
   Ssh: 
       KeyName:  <Key Name>
       AllowedIps: 10.0.0.0/8 
   LocalStorage: 
       RootVolume: 
           VolumeType: gp3 
           Size: 1000 
           Encrypted: true 
   Iam: 
       S3Access: 
         - EnableWriteAccess: true
           BucketName: s3bucket
       AdditionalIamPolicies:
        - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
   CustomActions:
       OnNodeConfigured:
          Script: s3://s3bucket/abc/Headnode/headnode-wrapper.sh
       OnNodeUpdated:
          Script: s3://s3bucket/abc/Headnode/nodeupdate/slurmdbconfig_update.sh
   DisableSimultaneousMultithreading: false
Scheduling:
   Scheduler: slurm
     SlurmQueues:
        - Name: queue1 Networking: SubnetIds: subnet-12345
     ComputeResources:
        - Name: queue1 
          Instances: 
             - InstanceType: c6i.32xlarge
         MinCount: 0
         MaxCount: 50
         DisableSimultaneousMultithreading: true
         Efa:
           Enabled: true
         Networking:
           PlacementGroup:
             Enabled: true
      ComputeSettings:
         LocalStorage:
           RootVolume:
              VolumeType: gp3
              Encrypted: true
              Size: 50
      CustomActions:
         OnNodeConfigured:
              Script: s3:/s3bucket/abc/ComputeNode/computenode-wrapper.sh
      Iam:
        AdditionalIamPolicies:
            - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
        S3Access:
            - EnableWriteAccess: true
              BucketName: s3bucket
     Tags:
           - Key: custom:queue_name
             Value:queue1
     SlurmSettings:
     Database:
          Uri: <RDS Endpoint :3306>
          UserName: dbadmin
          PasswordSecretArn:
       ScaledownIdletime: 10
DirectoryService:
  DomainName: DC=example,DC=com
  DomainAddr: ldaps://10.0.0.1,ldaps://10.0.0.2
  PasswordSecretArn: <Secretsmanager ARN>
  DomainReadOnlyUser: CN=_svc_aws_hpc,CN=users,DC=example,DC=com
  GenerateSshKeysForUsers: True
  LdapTlsReqCert: true
  AdditionalSssdConfigs:
       debug_level: "0xFFF0"
       override_homedir: /data/home/%u
Monitoring:
  Dashboards:
       CloudWatch:
           Enabled: true
  DetailedMonitoring: false
  Logs:
       CloudWatch:
           DeletionPolicy: Retain
           Enabled: true
           RetentionInDays: 14
Tags:
        - Key: Name
          Value: HeadNode
        - Key: Department
          Value: <Department>
        - Key: OU
          Value: <OU>
        - Key: Environment
          Value: PRD
        - Key: Application_Name
          Value: <Parallel Cluster>
        - Key: Project
          Value: <Project>
        - Key: OS
          Value: Centos 7.9
        - Key: Application_Owner
          Value: <Application_Owner>
        - Key: Application_Type
          Value: AWS Parallel Cluster
        - Key: CostCenter
          Value: "12345"

Conclusion

This article described in detail, the architecture best practices for setting up an enterprise grade SLURM as an HPC workload manager including the selection of Instance types, cluster size, custom scripts for HeadNode, Queue Configurations, Selection of Instance types for compute nodes, Single AZ and Multi AZ considerations, On demand Vs spot EC2 considerations, EFA and Placement Groups, Accounting DB, Storage, Rest API, Job scheduling and Management, QOS, Login Nodes, Connectivity options, Multi User Enablement, Tagging, Customer AMI for AWS ParallelCluster, Security, Observability for Slurm and License Management. There are many such design decisions required to launch and run a HPC cluster. The platform admins/infrastructure team are required to spend a lot of hours designing and setting up the HPC systems as per enterprise standards. In order to help improve the platform admin/infrastructure team’s productivity we have designed and built a self-service platform called Tachyon. The platform makes life easier for the infrastructure team in provisioning and managing the HPC clusters. We will discuss more about the self-service platform in the future articles.

Architecting SLURM HPC for Enterprise

Introduction

Why SLURM?

SLURM Architecture

Why AWS ParallelCluster?

SLURM Architecture – Design Considerations and Best Practices

HeadNode

Selection of Instance type

Cluster Size

Custom Scripts

HeadNode configuration using AWS ParallelCluster config

Compute Nodes

Queue

Selection of Instance type

Single and Multiple instance types

Single AZ and Multi AZ

On-demand and Spot instances

EFA and Placement groups

Compute Node Configuration

Accounting DB

Storage – Shared file systems

SLURM Rest API

Job Scheduling and Management

Slurm HPC Job Schedulers

Queue Configurations

QOS

Sample QOS Configuration

Priority – Setting the priority of the job

MaxSubmitjobs: Max Submitted/Queued jobs per user

MaxJobsPerUser: Max number of running jobs per user

MaxCPUsPerUser: Max number of CPUs that can be used per user

Additional SLURM cluster design best practices

Slurm Login Node

Slurm Connectivity Options

Multi User Enablement

Tagging

Custom Scripts

SLURM Email Notification

Custom AMI for AWS ParallelCluster

Security

Observability for SLURM

Amazon CloudWatch

Amazon CloudWatch Logs

Custom HPC-Specific Metrics Monitoring

Custom HPC-Specific Logging

License Management

Prolog and Epilog Scripts

AWS ParallelCluster Example Configuration

Conclusion

Leave a ReplyCancel reply

Recent Posts

Discover more from

Products

Services

Resources

Company