We are running an HPC blog series to share our experience building large-scale HPC systems on the cloud for large enterprise customers. Through it we share some of the solutions and best practices we used while designing HPC systems on the cloud.
Previously, we discussed HPC Capacity Planning on Cloud, Architecting SLURM HPC for Enterprise and Remote Workstations for HPC on Cloud.
This article, the fourth in the series, discusses the challenges and the choice of storage options for HPC workloads.
Storage
The main factors to consider when setting up storage for HPC (High Performance Computing) systems are performance, scalability, data durability and cost.
Here we discuss our approach to designing a storage solution for HPC workloads on AWS, along with the key factors we considered while choosing a storage service and designing the solution.
Understanding the workload
- Gather the size of the data generated by the HPC applications for input, processing and output.
- Determine how frequently the data is accessed from storage.
- Identify whether the workload is read-heavy or write-heavy.
Choice of storage
There are multiple storage options available on AWS. We will look at some of the file storage services.
- Amazon EFS (Elastic File System): Suitable for workloads that require shared access to files. EFS is a fully managed NFS (Network File System) service that can scale automatically based on demand.
- Amazon FSx for Lustre: If your HPC workloads require high-performance parallel file systems, FSx for Lustre can provide low-latency access to data for compute-intensive applications.
- Amazon FSx for Windows File Server: If you’re running Windows-based HPC workloads, FSx for Windows File Server provides fully managed Windows file shares.
- Amazon FSx for NetApp ONTAP: If your HPC system requires access to data from both Windows and Linux systems, FSx for NetApp ONTAP provides a scalable, high-performance, flexible and cost-effective storage solution.
Performance
- To achieve the best performance for simulation and rendering based HPC workloads, consider SSD-backed storage volumes using EFS, FSx for Lustre with SSD storage, or FSx for NetApp ONTAP.
Scalability
- Design the storage architecture to scale seamlessly as your HPC workload grows. Use Amazon EFS or the FSx services, which can automatically scale capacity and performance based on demand.
- Implement sharding or partitioning strategies to distribute data across multiple storage resources for better performance and scalability.
Durable Data
- Implement appropriate data protection mechanisms such as data replication, snapshots, and backups to ensure data durability and availability.
Security
- Configure access controls and encryption to secure your data both in transit and at rest.
Monitoring
- Continuously monitor the performance of your storage infrastructure using Amazon CloudWatch metrics and other monitoring tools.
Performance Tuning
- Use performance tuning techniques such as adjusting I/O sizes, optimising file system parameters, and implementing caching mechanisms to improve storage performance.
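As an illustrative example, adjusting NFS I/O sizes is commonly done through client mount options. The sketch below assumes a Linux client mounting an ONTAP NFS export; the DNS name, junction path and option values are placeholders, not tuned recommendations:

```shell
# Hypothetical sketch: mount an FSx for ONTAP NFS export with tuned I/O sizes.
# Larger rsize/wsize values increase per-request transfer size; nconnect opens
# multiple TCP connections to the server (supported on recent Linux kernels).
sudo mount -t nfs \
  -o nfsvers=4.1,rsize=262144,wsize=262144,hard,timeo=600,nconnect=8 \
  svm-data.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com:/vol_data /data
```

Benchmark with your own workload before settling on values; the optimal I/O size depends on file sizes and access patterns.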
Cost Optimization
- Consider data tiering and data lifecycle management policies to optimise cost.
- Leverage AWS Cost Explorer and AWS Budgets to monitor and manage your storage costs effectively.
For many customers' HPC workloads, we recommended FSx for NetApp ONTAP when they needed flexibility, performance, scalability and cost-effectiveness. We discuss the rationale behind this choice in the next section.
Why FSx for NetApp ONTAP?
Amazon FSx for NetApp ONTAP provides several advantages for HPC (High-Performance Computing) workloads.
Enterprise-Grade Features
NetApp ONTAP brings enterprise-class data management capabilities to AWS, including advanced data deduplication, compression, thin provisioning, snapshots, and data replication. These features are crucial for managing large datasets efficiently in HPC environments.
High Performance
FSx for NetApp ONTAP offers high-performance file and block storage optimised for HPC workloads. It provides low-latency access to data, high throughput and low variability in performance, making it suitable for demanding compute-intensive applications like simulations, modelling and analytics.
File System Flexibility (Windows and Linux mounting)
NetApp ONTAP supports both NAS (Network Attached Storage) and SAN (Storage Area Network) protocols, providing flexibility in accessing data for different types of HPC workloads. It allows seamless integration with existing NFS (Network File System) and SMB (Server Message Block) environments, enabling easy migration of HPC applications to AWS.
Data Management and Protection
NetApp ONTAP offers robust data management and protection capabilities, including snapshots, clones, data replication, and data encryption. These features ensure data integrity, availability, and security, which are critical requirements for HPC workloads.
Scalability
FSx for NetApp ONTAP can scale both capacity and performance dynamically to accommodate growing HPC workloads. It supports incremental capacity expansion without disrupting operations, allowing organisations to scale their storage infrastructure seamlessly as their computational needs evolve.
Integration with AWS Services
Being a native AWS service, FSx for NetApp ONTAP integrates seamlessly with other AWS services and features, such as AWS Direct Connect, AWS CloudFormation, AWS Backup, and AWS IAM (Identity and Access Management). This enables organisations to leverage the full power of the AWS ecosystem for their HPC workflows.
Cost-Effective
FSx for NetApp ONTAP offers a pay-as-you-go pricing model, allowing organisations to align storage costs with actual usage. It eliminates the need for upfront hardware investments and provides predictable pricing, making it a cost-effective solution for HPC workloads on AWS.
Architecture
This section discusses the high-level architecture of FSx for ONTAP and shows how it fits into the overall HPC architecture.
The diagram shows the key components of the FSx for ONTAP storage setup.

File System
The file system serves as the central resource in FSx for ONTAP, similar to an on-premises NetApp ONTAP cluster. NetApp provides CLI commands that can be used to connect to, manage and troubleshoot the file system.
Storage Virtual Machine (SVM)
A Storage Virtual Machine (SVM) functions as an independent virtual file server that provides management and data access endpoints. The coordination between FSx for ONTAP and an Active Directory domain happens at the SVM level. When there are Active Directory-related errors, the admin can troubleshoot at the SVM level.
Volumes
Volumes are the virtual layer used to organise data. They act as containers where the data resides, using the physical storage within the file system. Volumes are created inside an SVM, and can be configured with tiering policies, optimising both performance and cost.
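These three layers (file system, SVM, volume) can be provisioned with the AWS CLI. The sketch below is a minimal illustration; the subnet IDs, names, sizes and throughput values are placeholders, not recommendations:

```shell
# Hypothetical sketch: provision the three FSx for ONTAP layers with the AWS CLI.

# 1. The file system (the central resource):
aws fsx create-file-system \
  --file-system-type ONTAP \
  --storage-capacity 1024 \
  --subnet-ids subnet-0123456789abcdef0 \
  --ontap-configuration DeploymentType=SINGLE_AZ_1,ThroughputCapacity=256

# 2. A Storage Virtual Machine inside the file system:
aws fsx create-storage-virtual-machine \
  --file-system-id fs-0123456789abcdef0 \
  --name svm-hpc

# 3. A volume inside the SVM, mounted at the junction path /vol_data:
aws fsx create-volume \
  --volume-type ONTAP \
  --name vol_data \
  --ontap-configuration StorageVirtualMachineId=svm-0123456789abcdef0,JunctionPath=/vol_data,SizeInMegabytes=512000
```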
FSx for NetApp ONTAP in HPC Architecture
The following diagram shows how FSx for NetApp ONTAP storage is integrated with the HPC cluster and the virtual workstations in the HPC architecture.
The storage volumes are provisioned on ONTAP and mounted on the HPC cluster nodes and workstations.

The “/data” and “/apps” volumes provisioned within FSx for NetApp ONTAP are strategically designed to accommodate the unique demands of HPC (High-Performance Computing) workloads:
/data Volume
- The “/data” volume serves as a dedicated repository for storing essential data sets and resources integral to HPC workflows.
- It provides a centralised location for housing various data types, including simulation inputs, output results, research datasets, and project-specific files.
- HPC administrators typically allocate this volume to facilitate efficient data access and management for computational tasks, ensuring seamless execution of HPC workloads.
/apps Volume
- The “/apps” volume is specifically configured to cater to the storage requirements of HPC applications and associated resources.
- It offers a specialised environment optimised for hosting application binaries, libraries, dependencies, and configuration files necessary for HPC software stacks.
- This volume plays a crucial role in facilitating the deployment, execution, and scalability of HPC applications, providing a reliable storage platform for critical application components.
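The two volumes above can be mounted on each cluster node and workstation over NFS. A minimal sketch, with the SVM DNS name and junction paths as placeholders:

```shell
# Hypothetical sketch: mount the /data and /apps volumes on a cluster node.
SVM=svm-hpc.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
sudo mkdir -p /data /apps
sudo mount -t nfs "${SVM}:/vol_data" /data
sudo mount -t nfs "${SVM}:/vol_apps" /apps

# To persist across reboots, add entries like this to /etc/fstab:
#   svm-hpc....amazonaws.com:/vol_data  /data  nfs  nfsvers=4.1,_netdev  0 0
#   svm-hpc....amazonaws.com:/vol_apps  /apps  nfs  nfsvers=4.1,_netdev  0 0
```

In practice these mounts would be baked into the compute node image or applied by the cluster's configuration management so every node sees the same paths.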
User Level Storage Quota
In an HPC (High-Performance Computing) environment leveraging FSx for NetApp ONTAP, user-level storage quotas are allocated to individual users to govern their access to storage resources at a granular level.
- Each user is identified and authenticated within the HPC system using unique user accounts managed by the system or by integrated identity providers such as Active Directory.
- Storage quotas are assigned to individual users based on their specific needs, roles, or project affiliations.
- FSx for NetApp ONTAP offers features to enforce user-level storage quotas at the file system level.
- Utilisation is monitored, and the administrator can ensure that usage stays under the limit and storage is not over-utilised.
- Quotas can be changed over a period of time depending on the user requirements.
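On ONTAP, user quotas are managed through quota policy rules. A sketch of the corresponding ONTAP CLI commands (run over SSH to the file system's management endpoint); the SVM, volume and user names are placeholders:

```shell
# Hypothetical sketch: limit user "jdoe" to 100 GB on vol_data.
volume quota policy rule create -vserver svm-hpc -policy-name default \
  -volume vol_data -type user -target jdoe -disk-limit 100GB

# Activate quota enforcement on the volume:
volume quota on -vserver svm-hpc -volume vol_data

# Report per-user utilisation so administrators can track usage against limits:
volume quota report -vserver svm-hpc -volume vol_data
```

Changing a user's quota later is a matter of modifying the rule and resizing the quota on the volume.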
Multi Region
While FSx for NetApp ONTAP does not offer native multi-region support, implementing data replication between instances in different regions allows you to achieve multi-region redundancy and disaster recovery for your file storage.
- Provision FSx for NetApp ONTAP instances in the AWS regions where redundancy and disaster recovery are needed.
- NetApp ONTAP offers the SnapMirror feature to replicate data between FSx for NetApp ONTAP instances in different regions.
- Set up SnapMirror relationships to replicate data between source and destination FSx for NetApp ONTAP volumes on a schedule that meets your recovery objectives.
- Use VPC peering, VPN connections, or AWS Direct Connect to facilitate communication between regions.
- Regularly review replication metrics and perform maintenance tasks such as updating replication policies as needed.
- Create a disaster recovery plan for failover and failback in the event of region-wide outages or data loss incidents.
AD Authentication
Many enterprise customers use AD (Active Directory)-based identity management systems for authenticating and securing access to their IT infrastructure. The AD environment can be integrated with your cloud infrastructure and with the storage system.
To enable Active Directory (AD) authentication for FSx for NetApp ONTAP, you need to integrate your FSx for NetApp ONTAP environment with your existing Active Directory infrastructure.
- You need appropriate permissions in your Active Directory environment to create computer objects and join them to the domain. This is configured through a service account resource.
- Configure DNS resolution in your network environment to ensure that the FSx for NetApp ONTAP file system’s DNS name can be resolved by domain-joined clients.
- From the FSx for NetApp ONTAP management console or CLI, join the FSx for NetApp ONTAP instance to your Active Directory domain.
- Provide the necessary Active Directory domain information, including domain name, organisational unit (OU), and credentials with permissions to join computers to the domain.
- Once FSx for NetApp ONTAP is joined to the domain, configure Active Directory authentication settings within the ONTAP management interface.
- Specify the Active Directory domain and domain controller(s) to use for authentication.
- Configure the appropriate LDAP settings, including LDAP servers, base DN (Distinguished Name), and bind credentials.
- Ensure that users can authenticate using their Active Directory credentials and access files and directories based on their permissions.
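The domain join and LDAP configuration steps above map to ONTAP CLI commands along the following lines; all domain, OU, server and account names are illustrative:

```shell
# Hypothetical sketch: join the SVM to an AD domain from the ONTAP CLI.
vserver cifs create -vserver svm-hpc -cifs-server HPCFILES \
  -domain corp.example.com -ou "OU=HPC,DC=corp,DC=example,DC=com"

# Configure an LDAP client against the same domain for UNIX user lookups:
vserver services name-service ldap client create -vserver svm-hpc \
  -client-config ad_ldap -ad-domain corp.example.com \
  -schema AD-IDMU -bind-dn cn=svc-fsx,cn=Users,dc=corp,dc=example,dc=com

# Attach the LDAP client configuration to the SVM:
vserver services name-service ldap create -vserver svm-hpc -client-config ad_ldap
```

The `vserver cifs create` step prompts for credentials of an AD account with permission to join computers to the domain, per the service-account requirement mentioned earlier.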
Windows and Linux Permissions (NTFS, SMB)
Many of our customers needed the storage to support both Windows and Linux machines. FSx for NetApp ONTAP satisfies this requirement: it supports both NTFS (New Technology File System) security semantics and the SMB/CIFS protocol, allowing seamless integration with Windows-based environments as well as Linux and other Unix-like operating systems.
NTFS
- FSx for NetApp ONTAP supports the NTFS file system, which is the default file system for Windows operating systems.
- File and directory permissions, ACLs, file shares, encryption, compression and symbolic links are fully supported.
SMB
- SMB (Server Message Block) is the standard protocol for sharing files and printers between Windows and Unix/Linux systems; Samba is its common open-source implementation on Linux.
- FSx for NetApp ONTAP implements the SMB protocol to ensure seamless interoperability with Windows-based clients, Linux systems and other devices that support SMB.
- Files and directories created or modified by Windows-based clients using NTFS permissions are fully accessible to SMB clients on Linux, and vice versa.
- This seamless interoperability ensures that users can collaborate and share files across diverse operating environments without compatibility issues.
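For illustration, the same ONTAP volume could be accessed from both operating systems roughly as follows; the server name, share name and credentials are placeholders:

```shell
# Hypothetical sketch: access one ONTAP SMB share from Linux and Windows.

# Linux: mount the SMB share (requires the cifs-utils package):
sudo mkdir -p /mnt/data
sudo mount -t cifs //svm-hpc.corp.example.com/data /mnt/data \
  -o username=jdoe,domain=CORP,vers=3.0

# Windows (run in cmd.exe): map the same share to a drive letter:
#   net use Z: \\svm-hpc.corp.example.com\data
```

Linux clients could equally mount the same volume over NFS; ONTAP's multiprotocol support keeps permissions consistent across both paths.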
Caching
By leveraging caching mechanisms such as read caching, write caching, metadata caching, and adaptive caching, FSx for NetApp ONTAP enhances the performance of file system operations, reduces latency, and improves overall responsiveness, delivering a high-performance and scalable file storage solution for enterprise workloads.
SnapMirror
For some of our customers we have implemented network storage in multiple regions, as their HPC users are spread across regions and want to share data files with colleagues elsewhere. For users to access files across regions, latency has to be as low as possible, since file sizes may range from GBs to TBs.
SnapMirror is a data replication feature in FSx for NetApp ONTAP that enables scheduled, incremental replication of data between ONTAP storage systems. It allows you to create and manage replication relationships that efficiently replicate data across different locations, such as within the same AWS Region or across AWS Regions, for purposes such as disaster recovery, data migration and data distribution.
- SnapMirror replicates data at the volume level, allowing you to replicate entire file systems or individual volumes between ONTAP storage systems.
- It employs techniques such as incremental data transfer and block-level change tracking to replicate only the changed data blocks since the last replication cycle, reducing the amount of data transferred over the network and the time required for replication.
- SnapMirror supports various replication topologies, including cascade, fan-out, and multi-directional replication, allowing you to replicate data between multiple ONTAP storage systems in complex deployment scenarios.
- You can configure one-to-one, one-to-many, or many-to-one replication relationships to meet your specific replication requirements.
- You can configure policies to specify replication intervals, retention policies, and other replication settings.
- You can monitor replication status, track replication lag, and view replication performance metrics using the ONTAP management interface or command-line interface (CLI).
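A volume-level SnapMirror relationship can be sketched in the ONTAP CLI as follows, assuming cluster and SVM peering between the two file systems is already in place; names, sizes and the aggregate are placeholders:

```shell
# Hypothetical sketch: replicate vol_data to a DR file system with SnapMirror.

# On the destination, create a data-protection (DP) volume to receive the data:
volume create -vserver svm-dr -volume vol_data_dst -aggregate aggr1 \
  -size 500GB -type DP

# Create the relationship with a policy and an hourly schedule:
snapmirror create -source-path svm-hpc:vol_data \
  -destination-path svm-dr:vol_data_dst \
  -policy MirrorAllSnapshots -schedule hourly

# Run the baseline transfer, then monitor status and lag:
snapmirror initialize -destination-path svm-dr:vol_data_dst
snapmirror show -destination-path svm-dr:vol_data_dst
```

After the baseline, only changed blocks are transferred on each scheduled update, which keeps cross-region traffic proportional to the change rate rather than the volume size.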
Data Migration
Most enterprises that start with HPC systems on-premises will have their data files in shared storage systems within their corporate network. When migrating HPC workloads to the cloud, the data files need to be migrated to the cloud-based network storage as well. As HPC data sets can run to many TBs, we have recommended that our customers use AWS DataSync to transfer the files.
AWS DataSync is a data transfer service designed to streamline, automate, and expedite the process of moving and replicating data between on-premises storage systems and AWS storage services via the internet or AWS Direct Connect. With DataSync, transferring your file system data along with associated metadata, including ownership, timestamps, and access permissions, becomes effortless and efficient.
You can use DataSync to transfer files between two FSx for ONTAP file systems, and also move data to a file system in a different AWS Region or AWS account.
Transferring files from a source to a destination using DataSync involves the following basic steps:
- Download and deploy an agent in your environment and activate it (not required if transferring between AWS services).
- Create a source and destination location.
- Create a task.
- Run the task to transfer files from the source to the destination.
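These steps can be sketched with the AWS CLI; all ARNs are placeholders, and the source location (for example, an on-premises NFS location served by a DataSync agent) is assumed to exist already:

```shell
# Hypothetical sketch of the DataSync steps with the AWS CLI.

# Create the destination location pointing at the FSx for ONTAP SVM:
aws datasync create-location-fsx-ontap \
  --storage-virtual-machine-arn arn:aws:fsx:us-east-1:111122223333:storage-virtual-machine/fs-0123456789abcdef0/svm-0123456789abcdef0 \
  --security-group-arns arn:aws:ec2:us-east-1:111122223333:security-group/sg-0123456789abcdef0 \
  --protocol NFS={MountOptions={Version=NFS3}}

# Create a task linking the existing source location to the new destination:
aws datasync create-task \
  --source-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-source \
  --destination-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-dest

# Run the task to start transferring files:
aws datasync start-task-execution \
  --task-arn arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0
```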
Backup and Restore
Data files can be deleted by human error or lost due to a file system issue in a particular zone or region. To ensure the resilience and reliability of the data stored in the file system, it is necessary to back up the files periodically.
There are multiple features and methods available within NetApp ONTAP for backup and restore.
Snapshot-Based Backup and Restore
- NetApp ONTAP supports snapshot-based backup and restore operations, allowing you to create point-in-time snapshots of your file system data and restore data from snapshots when needed.
- To perform a backup, you can create snapshots of your file systems using the NetApp ONTAP management interface or command-line interface (CLI). Snapshots capture the state of your file system at a specific point in time.
- To restore data from a snapshot, you can use the snapshot restore feature to roll back your file system to a previous snapshot, effectively reverting changes made since the snapshot was taken.
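A sketch of these snapshot operations in the ONTAP CLI, with SVM, volume, snapshot and file names as placeholders:

```shell
# Hypothetical sketch: snapshot-based backup and restore with the ONTAP CLI.

# Create a point-in-time snapshot of vol_data:
volume snapshot create -vserver svm-hpc -volume vol_data -snapshot pre_run_backup

# List available snapshots on the volume:
volume snapshot show -vserver svm-hpc -volume vol_data

# Restore a single file from the snapshot:
volume snapshot restore-file -vserver svm-hpc -volume vol_data \
  -snapshot pre_run_backup -path /results/run42.dat

# Or roll the entire volume back to the snapshot, reverting all later changes:
volume snapshot restore -vserver svm-hpc -volume vol_data -snapshot pre_run_backup
```

Snapshots are also exposed to NFS/SMB clients through the hidden `.snapshot` directory, so users can often recover accidentally deleted files themselves without administrator involvement.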
SnapMirror Replication
- As discussed in the SnapMirror section above, SnapMirror is the data replication feature in NetApp ONTAP that enables scheduled replication of data between ONTAP storage systems, which also serves as a backup mechanism.
Backup to Cloud Storage
- You can configure NetApp ONTAP to back up your file system data to Amazon S3 buckets using features like Cloud Backup, which allows you to create backups of your file system data in S3 for long-term retention and archival.
- To restore data from a backup stored in Amazon S3, you can use NetApp ONTAP to retrieve the backup data from S3 and restore it to your file system.
Conclusion
AWS provides multiple storage solutions, and in particular high-performance options through the FSx family of services. By following the key design considerations and best practices described in this article, HPC systems can benefit from high-performance storage. These storage solutions are also easy to migrate to, and the data migration can be done using standard tools. All the important goals, such as performance, scalability, durability, security, backup and restore, and cost-effectiveness, can be achieved at enterprise scale.