HPC Capacity Planning on Cloud

Introduction

Enterprise manufacturing companies need to plan capacity carefully when running HPC workloads on the cloud. HPC engineers, researchers, and scientists can maximize productivity and reduce time to results by running their HPC workloads on the cloud.

HPC system owners can deploy a cloud-based cluster environment in a matter of minutes, incorporating a range of cloud services designed specifically for demanding HPC applications such as computational fluid dynamics (CFD), weather and climate modeling, seismic and reservoir modeling, and structural simulations.

This capacity planning exercise was based on the HPC workload requirements of an enterprise manufacturing customer whose workloads were running on on-premises infrastructure and being migrated to the cloud. Capacity planning for running enterprise-grade HPC workloads on AWS Cloud should carefully consider the following:

  1. Compute Instance based on HPC use cases
  2. Amazon EC2 instance types and selection
  3. HPC HeadNode
  4. HPC ComputeNodes
  5. Availability of resources in the region
  6. Availability of resources in the availability zones
  7. ComputeNode Scaling Configuration
  8. Network
  9. Placement Groups
  10. Storage
  11. Database
  12. Remote visualization
  13. Infrastructure cost
  14. License cost

HPC High Level Architecture on AWS Cloud

AWS ParallelCluster SLURM HPC architecture

Compute Instance based on HPC use cases

Amazon EC2 lets you choose from a variety of compute instance types that can be configured to suit your needs. Instances come in different families and sizes to offer a wide variety of capabilities.

Some instance families target specific workloads, for example compute-, memory-, or GPU-intensive workloads, while others are general purpose. Both the targeted-workload and general-purpose instance families are useful for CFD applications, depending on the step in the CFD process.

When considering CFD steps, different instance types can be targeted for pre-processing, solving, and post-processing. In pre-processing, the geometry step can benefit from an instance with a GPU, while the mesh generation stage may require a higher memory-to-core ratio, such as general-purpose or memory-optimized instances. When solving CFD cases, evaluate your case size and cluster size.

Hyperthreading

AWS enables simultaneous multithreading (SMT), commonly referred to as hyperthreading on Intel processors, by default on supported processors. Hyperthreading improves performance for some workloads by allowing multiple threads to run simultaneously on a single core.

Most CFD (HPC) applications do not benefit from hyperthreading, so disabling it tends to be preferable. Hyperthreading is easily disabled in Amazon EC2. Unless an application has been tested with hyperthreading enabled, it is recommended to disable hyperthreading and to launch and pin processes to individual cores.
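
For example, hyperthreading can be disabled at launch time through the EC2 CpuOptions setting. The boto3 sketch below is illustrative only; the AMI and subnet IDs are placeholders, and AWS ParallelCluster can achieve the same result for compute nodes via its DisableSimultaneousMultithreading option.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Launch a c6i.16xlarge (32 physical cores) with one thread per core,
# i.e. hyperthreading disabled for this instance.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder AMI
    InstanceType="c6i.16xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # placeholder subnet
    CpuOptions={"CoreCount": 32, "ThreadsPerCore": 1},
)
```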

Amazon EC2 Instance Types and Selection

HPC use cases fall into two categories based on how the parallel processes interact with each other during the simulation: shared-memory parallelism (SMP) and distributed-memory parallelism (DMP).

Shared-memory parallelism is achieved by executing multiple threads at a time across many cores within a single node. These are loosely coupled workloads that run independently and do not interact; the processes can run in any order at any time during the simulation. Examples of simulations that use shared-memory parallelism include Electronic Design Automation (EDA), genomics analysis, and image processing.

Distributed-memory parallelism is achieved by executing the simulation processes across multiple nodes. The parallel processes depend on each other to carry out the calculation and are tightly coupled: they run over multiple iterations, must communicate with each other, and rely on the Message Passing Interface (MPI) for inter-process communication.
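
To make the distinction concrete, here is a minimal tightly coupled example using mpi4py (not the customer's workload): every rank contributes to a global reduction on each iteration, so the ranks must exchange data over MPI rather than run independently.

```python
# Run with, for example: mpirun -np 128 python allreduce_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([float(rank)])   # each rank's partial result
total = np.zeros(1)

# Every iteration requires all ranks to communicate - this is what makes
# the workload tightly coupled and latency sensitive.
for _ in range(10):
    comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print("global sum:", total[0])
```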

Based on these different categories of HPC applications, we propose purpose-built HPC queues for Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD) and other HPC workloads.

In order to select the right instance type, we needed to run a few use cases to measure the performance on different instance types.

Let us take the CFD (Computational Fluid Dynamics) use case.

The objective is to perform an automated CFD workflow on the cloud HPC platform using a regular 90-degree bend pipe. There are many commercial CFD packages available for running CFD simulations; we chose ANSYS FLUENT as it was the package used by the customer.

CFD simulations involve three main stages, performed by a pre-processor, a solver, and a post-processor.

Computational Fluid Dynamics (CFD) on AWS Cloud HPC

The following are the primary steps to execute the CFD workflow, simulating a steady laminar flow through a 90-degree bend pipe.

Step 1: Use ANSYS Workbench to create the geometry for Fluid Flow (FLUENT). This runs on the virtual workstation.

Step 2: Create a mesh using the HPC cluster. The mesh size is 13.5 million cells.

Step 3: Solve the steady laminar flow using FLUENT on the HPC cluster.

Step 4: Automated post-processing on the HPC cluster.

Step 5: Visualisation of the results on the virtual workstation.

To execute this case, we set up the necessary virtual workstation and HPC cluster; the details are below. The customer's users were in the AWS EU West region, which was one of the key considerations for capacity planning.

For Steps 1 and 5

The selection of the instance type for the virtual workstation was based on the following factors:

  • HPC users would connect from their on-premises desktops or laptops to the remote workstations using NICE DCV clients, so the remote workstation needs to run the NICE DCV server.
  • Multiple users can share sessions on the remote workstation to collaborate on HPC research.
  • Creation of the 3D geometry using ANSYS Workbench.
  • Visualisation of the results.
  • Network bandwidth to transfer input and output files between the workstation and the storage.
  • Regional availability on the cloud.
  • Cost and performance.

The hardware requirements of the workstation are as follows:

We compared the features of the G4dn, P4d, and P3 instance types, which are recommended by AWS for these use cases.

We tested the geometry step on all of these instance types and found the performance acceptable in each; the major differentiator was the cost of the G4dn.8xlarge instance. The visualisation of the results also met the expected performance, so the G4dn.8xlarge instance type was chosen as the standard for all virtual workstations.

For Steps 2 to 4

To execute this case with multiple instance types, we used AWS ParallelCluster to set up the HPC cluster with SLURM as the job scheduler. The HPC cluster setup consists of a HeadNode, an accounting database, network-shared storage, and compute nodes. In this section we look at the HeadNode and compute nodes.

HPC HeadNode

The selection of the instance type for the HeadNode was based on the following factors:

  • The HeadNode runs the SLURM job scheduler, which performs the following functions:
    • Runs the SLURM controller daemons.
    • Runs the SLURM database daemon.
    • Allocates resources to the jobs submitted by users.
    • Manages queues to prioritise jobs.
  • High network bandwidth to transfer input and output files to and from the compute nodes.
  • Adequate compute and memory.
  • ANSYS RSM runs on the HeadNode to submit jobs to the SLURM scheduler, which also requires good compute.

The hardware requirements for the HeadNode are as follows:

We compared the features of the c6i.16xlarge and c6in.16xlarge instance types, which are recommended by AWS for HPC use cases.

The c6i.32xlarge was chosen as the suitable instance type considering the hardware requirements, network requirements, and regional availability, as it offers a balance of CPU/memory capacity and EFA networking bandwidth.

HPC ComputeNodes

The selection of the instance type for the compute nodes was based on the following factors:

  • The SLURM workload manager launches compute nodes and allocates them for jobs. The compute nodes are launched based on the number of cores requested by the job. The CFD use case was tested with 128, 256 and 512 cores.
  • A 1:4 CPU-to-memory ratio is preferred.
  • Network bandwidth to transfer input and output files to and from the compute nodes.

The hardware requirements for the CFD use case are as follows:

We compared the specifications of the c6i.16xlarge, c6i.32xlarge, and c6in.32xlarge instance types, which are recommended by AWS for HPC use cases.
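
As a quick sizing sanity check, the short sketch below shows how many nodes each candidate needs per tested job size and whether the 1:4 core-to-memory ratio holds. The core and memory figures are the published specifications for these instance types; verify them for your Region and instance generation.

```python
import math

cores_tested = [128, 256, 512]        # job sizes from the CFD runs above
candidates = {                        # physical cores, memory in GiB
    "c6i.16xlarge": (32, 128),
    "c6i.32xlarge": (64, 256),
    "c6in.32xlarge": (64, 256),
}

for itype, (cores, mem_gib) in candidates.items():
    gib_per_core = mem_gib / cores    # want >= 4 GiB per core (1:4 ratio)
    nodes_per_job = {c: math.ceil(c / cores) for c in cores_tested}
    print(f"{itype}: {gib_per_core:.0f} GiB/core, nodes per job {nodes_per_job}")
```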

CFD Benchmarking results on AWS Cloud

Based on our observations from the CFD use case executed with different settings on different instance types, we designed the Slurm queues along the lines of the following example.

The following example shows a use case-based queue capacity configuration for HPC workloads.
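
The sketch below is illustrative only, not the customer's actual configuration. It shows how use case-based queues could be expressed in the Scheduling section of an AWS ParallelCluster 3 configuration, with a tightly coupled CFD queue (EFA, placement group, SMT disabled) and a loosely coupled queue. The instance types, counts, and subnet ID are assumptions; verify the keys against the ParallelCluster version you deploy. Requires PyYAML.

```python
import yaml

slurm_queues = [
    {
        "Name": "cfd",                                    # tightly coupled MPI jobs
        "ComputeResources": [{
            "Name": "cfd-c6i32",
            "InstanceType": "c6i.32xlarge",
            "MinCount": 0,
            "MaxCount": 63,                               # ~4,000 cores at 64 physical cores per node
            "Efa": {"Enabled": True},
            "DisableSimultaneousMultithreading": True,    # one thread per physical core
        }],
        "Networking": {
            "SubnetIds": ["subnet-0123456789abcdef0"],    # single AZ (placeholder)
            "PlacementGroup": {"Enabled": True},
        },
    },
    {
        "Name": "eda",                                    # loosely coupled, memory-heavy jobs
        "ComputeResources": [{
            "Name": "eda-r6i16",
            "InstanceType": "r6i.16xlarge",               # memory-optimized (assumption)
            "MinCount": 0,
            "MaxCount": 20,
        }],
        "Networking": {"SubnetIds": ["subnet-0123456789abcdef0"]},
    },
]

print(yaml.safe_dump({"Scheduling": {"Scheduler": "slurm", "SlurmQueues": slurm_queues}}, sort_keys=False))
```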

There are also three different types of HPC-optimized instances available on AWS Cloud:

  1. Amazon EC2 Hpc7g instances
  2. Amazon EC2 Hpc7a instances
  3. Amazon EC2 Hpc6id instances

We also want to highlight why we did not choose the HPC-optimized instances on AWS Cloud.

Hpc6a and Hpc7a instance types were not available in the EU West region at the time the project was implemented. In addition, the purpose-built HPC instance types had more cores per node than required. Because the HPC workloads had been benchmarked on on-premises infrastructure using all the cores in a node, we decided to use instances that matched the CPU core requirements.

Our customer needed between 3,500 and 4,000 cores to run various simulations, executed concurrently by different users. The availability of instance types in large numbers is therefore a key consideration, which we discuss in the next section.

Availability of Resources in the region

The selection of compute instance types for the HPC cluster depends on where the users are located and the latency requirements for connecting to the HPC cluster. It is recommended to set up the HPC cluster in the AWS Region nearest to the users.

In some cases, the users are spread across multiple regions, which requires the HPC cluster to be set up in multiple regions. The compute instance types are then selected based on the availability of particular instance types in each AWS Region.

This can be optimised by working with AWS HPC solutions architects and the account management team, who are responsible for recommending resources and helping customers obtain them.

For instance, in the case of our customer described in the previous section, all the users were in the EU West region, so one of the key considerations in selecting the instance type was availability in the eu-west-1 region.

For all users to run simulations without waiting for resources or having their jobs queued for a long time, the AWS Region should have enough capacity to scale up to 4,000 cores across the multiple instance types detailed in the previous section.

In many cases, the instance types suited to particular use cases may not be available in the required numbers. In those situations, it is better to work with the AWS account team to design and plan the capacity accordingly.
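
Whether a given instance type is offered in a Region and its Availability Zones can be checked programmatically before committing to a design. The boto3 sketch below uses the EC2 DescribeInstanceTypeOfferings API; note that it reports whether a type is offered in a location, not whether capacity is available at a given moment, and the instance types listed are just examples.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

resp = ec2.describe_instance_type_offerings(
    LocationType="availability-zone",
    Filters=[{"Name": "instance-type",
              "Values": ["c6i.32xlarge", "c6in.32xlarge", "hpc7a.96xlarge"]}],
)

# Map each instance type to the AZs in which it is offered.
offered = {}
for o in resp["InstanceTypeOfferings"]:
    offered.setdefault(o["InstanceType"], []).append(o["Location"])

for itype, azs in sorted(offered.items()):
    print(f"{itype}: {sorted(azs)}")
```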

Availability of Resources in the AZs (Availability Zones)

Most tightly coupled HPC use cases that require low-latency communication must use compute nodes that reside within the same Availability Zone.

So when users run multiple jobs that span several nodes using distributed-memory parallelism, the HPC infrastructure should be able to scale horizontally within the Availability Zone without jobs waiting or failing due to unavailable compute.

This is one of the major factors in choosing the compute instance type. Multiple Availability Zones are preferred when failure tolerance is required: if the use cases are loosely coupled and do not require low-latency communication, the ability to deploy across multiple Availability Zones helps improve failure tolerance.

As seen in the CFD scenario, the simulation runs as a tightly coupled parallel model and requires a great deal of low-latency communication between instances. This requires the entire simulation to run within a single cluster placement group and Availability Zone, which also addressed failure tolerance, since all the nodes reside in the same AZ.

Compute Node Scaling Configuration

Compute node scaling features are crucial for ensuring optimal resource utilization, efficient job scheduling, and cost-effectiveness. Here are some key compute node scaling features commonly found in HPC cluster environments:

  • Slurm resource manager: Slurm allows you to partition the cluster into logical subsets of nodes, each with its own set of resources and scheduling policies. Partitions can be configured based on criteria such as hardware characteristics, user groups, or job types. For example, partitions can be configured for CPU-only nodes and GPU-accelerated nodes, or dedicated to specific research projects or departments.
  • AWS ParallelCluster supports Slurm's methods to dynamically scale clusters by using Slurm's power saving plugin.
  • Dynamic node provisioning: Slurm supports dynamic node provisioning, allowing compute nodes to be added or removed from the cluster on the fly based on workload demand. Dynamic node provisioning is implemented through cloud bursting, where additional compute resources are provisioned from the cloud provider.
  • Multiple instance type allocation in ParallelCluster 3.3.0: this feature enables you to specify multiple instance types to use when scaling up the compute resources for a Slurm job queue, giving your HPC workloads more paths to the EC2 capacity they need (see the sketch after this list).
  • Job-level resume or job-level scaling: AWS ParallelCluster 3.8.0 uses job-level resume (job-level scaling) as the default dynamic node allocation strategy. AWS ParallelCluster scales up the cluster based on the requirements of each job, the number of nodes allocated to the job, and which nodes need to be resumed, reading this information from the SLURM_RESUME_FILE environment variable.
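
A hedged sketch of what a flexible compute resource might look like is shown below. The keys follow the ParallelCluster 3.3+ schema for multiple instance types; the specific types and counts are assumptions, so check them against your configuration.

```python
# Fragment of a SlurmQueues -> ComputeResources entry; ParallelCluster may launch
# any of the listed instance types when scaling this compute resource.
compute_resource = {
    "Name": "cfd-flex",
    "Instances": [
        {"InstanceType": "c6i.32xlarge"},
        {"InstanceType": "c6in.32xlarge"},
    ],
    "MinCount": 0,
    "MaxCount": 63,
    "Efa": {"Enabled": True},   # both listed types support EFA
}
```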

Network

HPC workloads deployed on AWS Cloud require an appropriate network solution, which varies based on parameters such as latency, bandwidth, and throughput requirements.

Tightly coupled HPC applications require the lowest latency possible for network connections between compute nodes. For moderately sized tightly coupled workloads, it is possible to select large instance types with many cores so that the application fits entirely within the instance without crossing the network at all. 

Alternatively, some applications are network bound and require high network performance. EC2 instances with higher network performance can be selected for these applications. Higher network performance is usually obtained with the largest instance type in a family. 

AWS provides options for high-bandwidth, low-latency networking with Elastic Fabric Adapter (EFA), which attaches to EC2 compute instances and delivers up to 200 Gbps of connectivity, coupled with AWS low-latency drivers and OS bypass, enabling demanding Message Passing Interface (MPI) applications to run without constraint.

  • Elastic Fabric Adapter (EFA) – a network interface custom built by AWS that enables HPC customers to run applications with low-latency, high-throughput inter-node communication at scale.
  • AWS Auto Scaling – enables your HPC cluster to grow and shrink on demand.
  • Cluster placement groups – ensure compute resources are deployed in close physical proximity to reduce network hops and therefore latency.
  • Enhanced networking – provides higher packet-per-second performance and lower latencies between instances.
  • Amazon VPC – enables you to provision logically isolated sections of the AWS Cloud so you can launch AWS resources in a virtual network that you define.
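
EFA support for a given instance type can also be verified with the EC2 DescribeInstanceTypes API; a short boto3 sketch (the instance types queried are just examples):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

resp = ec2.describe_instance_types(
    InstanceTypes=["c6i.32xlarge", "c6in.32xlarge", "c6i.16xlarge"]
)

# Print EFA support and the advertised network performance for each type.
for it in resp["InstanceTypes"]:
    info = it["NetworkInfo"]
    print(it["InstanceType"],
          "EFA:", info.get("EfaSupported", False),
          "bandwidth:", info["NetworkPerformance"])
```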

VPC CIDR

The CIDR for the VPC is arrived at based on current and future requirements, and subnets need to be created in a minimum of two AZs to allow for scaling and instance availability. The Slurm HeadNode and compute nodes run in the same AZ, and the choice of AZ is decided based on EC2 availability guidance from the AWS team.

In this use case, the initial requirement is to onboard 50 users with a maximum of 512 cores per job. Considering the scenario where a user runs two jobs at the same time, and factoring in future scaling requirements, the number of IPs can be arrived at as shown in the table below.
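
Since the IP count is simple arithmetic, a rough sketch of the sizing logic is shown below. The node count, headroom factor, per-node core count, and the 10.0.0.0 base are illustrative assumptions, not the values from the actual design table.

```python
import ipaddress
import math

peak_concurrent_cores = 4000      # upper end of the 3,500-4,000 core requirement
cores_per_node = 64               # physical cores on c6i.32xlarge (assumption)
compute_nodes = math.ceil(peak_concurrent_cores / cores_per_node)           # 63 nodes

other_ips = 50 + 16               # DCV workstations for 50 users, HeadNode, FSx ENIs, endpoints (assumption)
headroom = 2.0                    # room for future scaling (assumption)
ips_needed = math.ceil((compute_nodes + other_ips) * headroom) + 5           # +5 AWS-reserved addresses per subnet

prefix = 32 - math.ceil(math.log2(ips_needed))
subnet = ipaddress.ip_network(f"10.0.0.0/{prefix}")
print(compute_nodes, "compute nodes,", ips_needed, "IPs =>", subnet, f"({subnet.num_addresses} addresses)")
```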

Placement Groups

Using placement groups in an HPC (High-Performance Computing) cluster can improve the performance and communication between instances, leading to better auto scaling capabilities. Placement groups help ensure that EC2 instances are placed in close physical proximity, reducing network latency and improving inter-instance communication.

Placement groups can be specified either in the Auto Scaling launch configuration or in the Auto Scaling group settings. Large, tightly coupled applications require multiple instances with low latency between them.

On AWS, this is achieved by launching compute nodes into a Cluster Placement Group (CPG), which is a logical grouping of instances within an Availability Zone. A CPG provides non-blocking, non-oversubscribed connectivity, including full bisection bandwidth between instances. Use CPGs for latency-sensitive, tightly coupled applications spanning multiple instances.
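
AWS ParallelCluster creates and manages the placement group automatically when PlacementGroup is enabled on a queue; for completeness, the underlying EC2 calls look roughly like the boto3 sketch below (the AMI, group name, and instance counts are placeholders).

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# A cluster placement group packs instances close together within one AZ.
ec2.create_placement_group(GroupName="cfd-cpg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # placeholder AMI
    InstanceType="c6i.32xlarge",
    MinCount=8,
    MaxCount=8,                             # e.g. a 512-core job at 64 cores per node
    Placement={"GroupName": "cfd-cpg"},
)
```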

Elastic Fabric Adapter (EFA) uses a custom-built operating system bypass technique to enhance the performance of inter-instance communications, which is critical to scaling HPC applications.

With EFA, HPC applications using popular HPC technologies like Message Passing Interface (MPI) can scale to tens of thousands of CPU cores. EFA supports industry-standard libfabric APIs, so applications that use a supported MPI library can be migrated to AWS with little to no modification.

EFA is available as an EC2 networking feature that you can enable on supported EC2 instances at no additional cost.

Availability Zones – To achieve low latency and high throughput, a single Availability Zone (AZ) is used for HPC workloads on AWS Cloud, while multi-AZ deployment is considered for high availability and fault tolerance. The general recommendation is multi-AZ for most production workloads, given the high availability and durability it provides, with single-AZ deployment as a cost-efficient option for test and development workloads.

Storage

Once you have selected the most appropriate EC2 instances on AWS Cloud to meet the needs of your workload, your next consideration is which storage option best meets your requirements. You will want to select both storage attached directly to your chosen compute instances and storage presented as a file system attached to your cloud-based HPC cluster. AWS provides the following options for storage:

  • Amazon Elastic Block Store (EBS), which can be attached directly to EC2 compute instances, providing local scratch storage or access to a POSIX file system should your applications require it.
  • Amazon Elastic File System (EFS).
  • Amazon FSx, a suite of fully managed services from AWS designed to help customers deploy and manage file systems in the cloud:
    • Amazon FSx for Lustre, a managed service that provides a cost-effective, performant solution for HPC architectures requiring a high-performance Lustre parallel file system.
    • Amazon FSx for OpenZFS, a fully managed file storage service that provides a ZFS file system.
    • Amazon FSx for NetApp ONTAP and Amazon FSx for Windows File Server.

Why have we chosen Amazon FSx for NetApp ONTAP as the storage option?

Support for Multi-protocol: Amazon FSx for NetApp ONTAP provides access to shared file storage over all versions of the Network File System (NFS) and Server Message Block (SMB) protocols, and also supports multi-protocol access (i.e. concurrent NFS and SMB access) to the same data. As a result, you can access Amazon FSx for NetApp ONTAP from virtually any Linux, Windows, or macOS client. Amazon FSx for NetApp ONTAP also provides shared block storage over the iSCSI protocol.

SMB (Samba) access: Samba provides an easy way to connect to Linux storage systems. Users can view, copy, edit, and delete any files they have access to. From Windows, the share can simply be mapped as a network drive and accessed via Windows Explorer.

Access from on premises and AWS compute services: Amazon FSx for NetApp ONTAP provides shared storage for up to thousands of simultaneous clients running on Amazon EC2 and on-premises infrastructure. Users need the shared storage to be mounted on their existing desktops or laptops running on the local on-premises network.

Performance: Amazon FSx for NetApp ONTAP is designed to deliver fast, predictable, and consistent performance. It provides up to tens of GB/s of throughput per file system, and millions of IOPS per file system. To get the right performance for your workload, you choose a throughput level for your file system and scale this throughput level up or down at any time.

Low-latency access: Amazon FSx for NetApp ONTAP is built to deliver consistent sub-millisecond latencies when accessing data on SSD storage, and tens of milliseconds of latency when accessing data in capacity pool storage. It provides fast, consistent performance for latency- and performance-sensitive workloads.

Scalability: Each Amazon FSx for NetApp ONTAP file system scales to petabytes in size, allowing you to store virtually unlimited data in a single namespace. Scale-out file systems deliver the performance of multiple file systems in one by automatically spreading customers’ workloads across multiple file servers.

Support for HPC workloads: With sub-millisecond latencies and scalability to up to millions of IOPS per file system, Amazon FSx for NetApp ONTAP provides highly-available shared file storage for your high-performance computing workloads. It also supports common database features such as application-consistent snapshots (using NetApp SnapCenter), FlexClone (a data cloning feature), Continuously Available (CA) SMB shares, and Instant File Initialization.

Elastic capacity pool tiering: Each Amazon FSx for NetApp ONTAP file system has two storage tiers: primary storage and capacity pool storage. Primary storage is provisioned, scalable, high-performance SSD storage that’s purpose-built for the active portion of your data set.

Capacity pool storage is a fully elastic storage tier that can scale to petabytes in size and is cost-optimized for infrequently-accessed data. Amazon FSx for NetApp ONTAP automatically tiers data from SSD storage to capacity pool storage based on your access patterns, allowing you to achieve SSD levels of performance for your workload while only paying for SSD storage for a small fraction of your data.

Capacity pool storage automatically grows and shrinks as you tier data to it, providing elastic storage for the portion of your data set that grows over time without the need to plan or provision capacity for this data.

Multi-AZ deployments: Amazon FSx offers a multi-Availability Zone (AZ) deployment option, designed to provide continuous availability to data even in the event that an AZ is unavailable.

Multi-AZ file systems include an active and standby file server in separate AZs, and any changes written to disk in your file system are synchronously replicated across AZs to the standby.

During planned maintenance, or in the event of a failure of the active file server or its AZ, Amazon FSx automatically fails over to the standby so you can resume file system operations without a loss of availability to your data.  

On-premises caching: Amazon FSx for NetApp ONTAP fully supports NetApp's Global File Cache and FlexCache solutions, which you can deploy on premises to provide low-latency access to your most frequently read data for on-premises clients and workstations.

Identity-based authentication: Amazon FSx for NetApp ONTAP supports identity-based authentication over NFS or SMB if you join your file system to an Active Directory (AD). Your users can then use their existing AD-based user identities to authenticate themselves, access the file system, and control access to individual files and folders. You can access your FSx for ONTAP file systems from on premises using AWS VPN and AWS Direct Connect.

We performed a network latency benchmark for data access (uploads and downloads) between the HPC environment deployed on AWS Cloud and the on-premises environment; the results for an upload/download between on premises and the HPC cluster on the cloud are shown below:
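
A generic way to measure single-stream upload or download throughput to the shared file system is sketched below. This is not the exact methodology used in our benchmark, and the paths are placeholders for a local scratch directory and an FSx for ONTAP NFS mount.

```python
import os
import time

def copy_throughput(src_path: str, dst_path: str, chunk_mb: int = 16) -> float:
    """Copy src to dst and return the observed throughput in MB/s (single stream)."""
    size = os.path.getsize(src_path)
    start = time.perf_counter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(chunk_mb * 1024 * 1024):
            dst.write(chunk)
        dst.flush()
        os.fsync(dst.fileno())               # make sure the data has actually reached the share
    return size / (1024 * 1024) / (time.perf_counter() - start)

# Example: "upload" from local scratch to the shared NFS mount (placeholder paths)
print(copy_throughput("/scratch/case.cas.h5", "/fsx/shared/case.cas.h5"), "MB/s")
```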

Database

The HPC cluster setup is based on the Slurm scheduler, which is set up with a database to store the Slurm configuration, job accounting records, and other transactions.

The database can be a relational (RDBMS) database; MySQL or similar databases are suitable for the scheduler.

The capacity requirement for the database is minimal, as the Slurm database interactions are not heavy. The database should run inside the same subnet so that latency is also kept to a minimum.

The natural choice was Aurora MySQL Serverless, as AWS provides a managed service with Multi-AZ support, reliability, and backup features by default. A capacity of 2-10 ACUs was enough for this setup.
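
In AWS ParallelCluster, the Slurm accounting database is wired in through the SlurmSettings section. A hedged sketch is shown below: the Aurora endpoint and secret ARN are placeholders, and the keys follow the ParallelCluster 3.3+ schema, so verify them against the version you deploy.

```python
# Fragment of the Scheduling section of a ParallelCluster configuration.
slurm_settings = {
    "SlurmSettings": {
        "Database": {
            "Uri": "cfd-slurm-acct.cluster-abc123.eu-west-1.rds.amazonaws.com:3306",  # placeholder Aurora endpoint
            "UserName": "slurm_admin",
            "PasswordSecretArn": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:slurm-db-pwd",  # placeholder
        }
    }
}
```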

Remote Visualization

Visualization of results is an important aspect of an HPC workflow. Remote visualization helps accelerate turnaround times for HPC scientists and engineers because users no longer need to download data from AWS Cloud to analyze the HPC jobs’ results.

Similarly, HPC users can also benefit from remote visualization techniques to prepare their inputs directly on AWS Cloud, instead of working on fat local clients and then uploading files to the cluster to start their simulations. NICE DCV is a remote visualization technology that enables users to securely connect to graphics-intensive 3D applications hosted on a remote server.

Infrastructure Cost

Optimizing costs depends on selecting the appropriate instances and resources on AWS Cloud. The choice of instance can significantly impact the overall cost of running an HPC workload.

For instance, opting for a cluster of smaller servers may result in a longer runtime for tightly coupled HPC workloads, while using fewer, larger servers may compute the result faster, but at a higher cost per hour.

Similarly, selecting the right storage solution can affect cost considerations. It’s crucial to balance job turnaround time with cost optimization by experimenting with different instance sizes and storage options.

AWS provides various pricing options:

  • On-Demand Instances offer flexibility, allowing you to pay for compute capacity by the hour with no minimum commitments.
  • Reserved Instances enable you to reserve capacity upfront, offering savings compared to On-Demand pricing.
  • Spot Instances leverage unused EC2 capacity, providing further cost savings relative to On-Demand pricing.

HPC checkpointing: saving the state of an HPC computation so that it can be resumed later. This can save the cost of running compute resources and is particularly useful when choosing Spot Instances to run simulations, as Spot Instances may be interrupted by AWS when the capacity is needed elsewhere.
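
Solvers such as FLUENT have their own autosave and checkpoint mechanisms; for custom or scripted workloads, a minimal checkpointing pattern on the shared file system looks like the sketch below. The path and interval are assumptions.

```python
import os
import pickle

CKPT = "/fsx/shared/checkpoints/job_state.pkl"   # placeholder path on shared storage

def save_checkpoint(state: dict) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)        # atomic rename so an interruption never leaves a half-written checkpoint

def load_checkpoint() -> dict:
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0}

state = load_checkpoint()
for i in range(state["iteration"], 10_000):
    # ... advance the solver by one iteration ...
    if i % 100 == 0:             # checkpoint every 100 iterations (tunable)
        save_checkpoint({"iteration": i})
```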

License Cost

Generally, the license cost for HPC applications is calculated based on the number of cores used to run the simulations.

To get a better memory-to-core ratio, and hence better performance, the simulations have to be run with a single thread per core. This also optimises the number of licenses used per simulation.

The licenses are checked out from the license server based on the number of CPUs requested for the task by the HPC application.

Licenses are bound to MAC addresses and may be counted twice if the compute nodes are not stable during the simulation. Higher reliability is required for the compute nodes to make sure licenses are not wasted.

This is typically not an issue when On-Demand instances are used, but it may occur with Spot Instances.

Conclusion

This article describes best practices for HPC cluster capacity planning on AWS Cloud, focusing on the high-level architecture and the necessary capacity considerations for compute, network, storage, Region, Availability Zones, and more. Following the best practices presented in this article allows you to architect, design, and plan your capacity in an AWS Cloud environment for CFD workloads.
