Remote Workstation for HPC workloads on Cloud

We are running an HPC blog series to share our experience in building large-scale HPC systems on the cloud for large enterprise customers. In it, we share some of the solutions and best practices we used while designing HPC systems on the cloud.

Previously, we discussed HPC Capacity Planning on Cloud and Architecting SLURM HPC for Enterprise.

This article, the third in the series, discusses the challenges and choice of workstations used for HPC workloads.

Why Remote Workstations?

Our previous blogs showed how to design an enterprise-grade HPC system on the cloud. Any research user who wants to run experiments such as CFD (Computational Fluid Dynamics), computational biology, financial risk analysis, seismic imaging, genomics research or basic scientific research can get access to the HPC systems on the cloud and run those experiments.

The experiments consist of multiple steps such as pre-processing, simulation and post-processing. Traditionally, on premises, R&D users would run the modelling steps on their desk-side workstations and run the simulations on the HPC clusters. Sometimes they would run smaller simulations on their workstations as well. Finally, they would visualise the results on the workstations at their desks. This requirement forces the enterprise to buy expensive workstations, which means significant upfront investment, maintenance cost and other overheads.

Moving to remote workstations on the cloud brings a lot of advantages in terms of cost, choice of capacity in CPUs and GPUs, flexibility, reliability and availability of resources. The users do not need to have powerful desktop workstations anymore. All the users need is just a regular laptop or workstation to connect to the special purpose remote workstations and perform their experiments.

Workstations on AWS

A common practice on AWS is to use Amazon EC2 instances as workstations. They bring many benefits out of the box. Amazon EC2 offers a wide range of instance types with varying CPU, memory, storage, and GPU capabilities. This allows you to choose the instance type that best suits your specific HPC workload requirements, whether it is for pre-processing and visualisation of results or even running a small-scale simulation.

EC2 also provides full control over the operating system and the configuration of the instances. Although both Windows and Linux based instances are available, in most cases HPC users require a Windows based workstation to run HPC applications such as ANSYS, COMSOL, MATLAB and PowerFlow tools.

The instances can be provisioned in no time, can be stopped and restarted as per the need and also can be terminated if they are no longer needed. Based on the number of R&D users the EC2 workstations can be launched and terminated. This allows optimisation of cost by scaling instances as needed.

The R&D users can connect to the EC2 Windows workstations using remote desktop protocol (RDP). AWS also provides a better remote visualisation option called NiceDCV which uses the DCV protocol. We will see more details about NiceDCV in the following sections.

The remote EC2 workstation can be set up on the same VPC as the HPC cluster nodes and can connect to the cluster in order to submit jobs.

How does the Workstation fit into HPC architecture?

As we discussed above the EC2 workstation can be launched and configured to connect to the HPC cluster. The R&D users from the enterprise network can connect to the AWS EC2 workstation instances using RDP or NiceDCV protocols. The following architecture diagram shows a high level view of how the EC2 workstation, HPC cluster and the users connect with each other.

Fig1: HPC Workstations Architecture

AWS recommends multiple instance types that are suitable for remote workstations like Amazon EC2 G3, G4dn, G4ad, or G5 instance types. These instance types offer GPUs that support hardware-based OpenGL and GPU sharing. For more information, see Amazon EC2 G3 Instances, Amazon EC2 G4 instances, and Amazon EC2 G5 Instances.

The selection of instance type for virtual workstation was based on the following factors

  • HPC users would connect from their desktops or laptops on premise to the remote workstations using NiceDCV clients and hence the remote workstation needs to run the NiceDCV server.
  • Multiple users can share sessions on the remote workstation to collaborate on the HPC research.
  • Hyper-Threading
  • Choose a GPU instance type based on the specific requirements of your HPC workload. AWS offers instances with NVIDIA GPUs optimised for various tasks, including general-purpose computing, machine learning and graphics rendering.
  • Number of GPUs and Memory
  • The creation of the 3D geometry using ANSYS Workbench.
  • Visualisation of the results.
  • Network bandwidth to transfer input and output files between the workstation and the storage.
  • Cloud regional availability.
  • Cost and Performance.

The hardware requirement of the workstation is as follows,


We compared the features of G4dn/P4d and P3 instance types which are recommended by AWS for HPC use cases. 

We tested the geometry step in all the instance types and found that the performance was acceptable in all of them. But the major differentiator was the cost of the G4dn.8xlarge instance. Similarly, the visualisation of the results matched the expected performance and the G4dn.8xlarge instance type was chosen as the standard for all virtual workstations.
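
The selection process described above can be sketched as a simple filter over candidate instance types. This is only an illustration: the hardware figures follow the public EC2 specifications, but the `relative_cost` values are hypothetical placeholders (not real prices), and the class and function names are our own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceOption:
    name: str
    vcpus: int
    memory_gib: int
    gpus: int
    gpu_memory_gib: int   # memory per GPU
    relative_cost: float  # hypothetical cost units per hour, NOT real prices


# relative_cost is an illustrative placeholder; look up current prices
# for your region before making a decision.
CANDIDATES = [
    InstanceOption("g4dn.8xlarge", 32, 128, 1, 16, 1.0),
    InstanceOption("p3.2xlarge", 8, 61, 1, 16, 1.4),
    InstanceOption("p4d.24xlarge", 96, 1152, 8, 40, 15.0),
]


def shortlist(candidates, min_vcpus, min_memory_gib, min_gpu_memory_gib):
    """Return the options that meet all minimums, cheapest first."""
    eligible = [
        c for c in candidates
        if c.vcpus >= min_vcpus
        and c.memory_gib >= min_memory_gib
        and c.gpu_memory_gib >= min_gpu_memory_gib
    ]
    return sorted(eligible, key=lambda c: c.relative_cost)


if __name__ == "__main__":
    for option in shortlist(CANDIDATES, 16, 64, 16):
        print(option.name, option.relative_cost)
```

Factors that are hard to express as numbers, such as regional availability or the measured responsiveness of the geometry and visualisation steps, still need to be validated by testing, as we did.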

Why NiceDCV?

NICE DCV is a high-performance remote display protocol that provides HPC users with a secure way to deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions. With NICE DCV and Amazon EC2, HPC users can run graphics-intensive applications remotely on EC2 instances, and stream their user interface to simpler client machines (thin clients), eliminating the need for expensive dedicated workstations. HPC users across a broad range of HPC workloads use NICE DCV for their remote visualisation requirements. 

The following are the key benefits of using NICE DCV as a remote visualisation solution for HPC environments.

GPU-Accelerated Performance

NiceDCV harnesses GPU acceleration to deliver high-performance graphics rendering and visualisation. This enables smooth interaction with complex 3D models, simulations, and virtual environments.

Multi-User Scalability

NiceDCV efficiently scales to support multiple users accessing and interacting with remote applications simultaneously. It ensures that each user receives optimal performance and responsiveness, even during peak usage periods.

Cross-Platform Compatibility

NiceDCV is compatible with a wide range of client devices and operating systems, including Windows, macOS, Linux, and mobile platforms. This versatility allows users to access their applications from various devices and environments.

Adaptive Streaming Technology

NiceDCV incorporates adaptive streaming technology to dynamically adjust streaming quality based on network conditions and client device capabilities. This ensures a consistent user experience, even in challenging network environments with variable bandwidth.

Client-Side Rendering Optimization

NiceDCV optimises rendering tasks by offloading certain processing tasks to the client device when feasible. This reduces the workload on the server side and improves overall performance, particularly in scenarios with limited server resources.

Security Features

NiceDCV prioritises security and implements encryption and authentication mechanisms to protect data in transit and ensure secure remote access. It integrates seamlessly with AWS security services and features, providing a secure environment for sensitive workloads.

Specialized Workload Support

NiceDCV is optimised for a wide range of specialised workloads, including engineering simulations, scientific visualisation, automobile design and many other High Performance Computing workloads. It provides the performance and flexibility needed to support these demanding applications effectively.

In later sections we will discuss how to set up a workstation image and automate the launch of workstations.

Workstation with NiceDCV vs AWS Workspace

We have seen the features and advantages of the NiceDCV in the previous section.

  • AWS WorkSpaces is a fully managed desktop-as-a-service (DaaS) solution. It provides users with a cloud-based virtual desktop environment accessible from anywhere, using any supported device.
  • WorkSpaces offers a standard desktop experience with access to commonly used productivity tools like Microsoft Office, web browsers, email clients, etc.
  • While WorkSpaces does support basic graphics capabilities, it’s not optimised for high-performance visualisation tasks like those required in CAE or scientific simulations.
  • WorkSpaces is ideal for scenarios where users need a general-purpose virtual desktop environment for everyday tasks, remote work, collaboration, and accessing corporate applications.

In summary, NiceDCV is tailored for demanding graphics applications that require high-performance visualisation capabilities, while AWS WorkSpaces is a more general-purpose virtual desktop solution suitable for a wide range of business use cases.

How to setup a Workstation image?

Setting up EC2 Windows instances with NiceDCV involves several steps to configure both the EC2 instance and the NiceDCV server.

Launch EC2 Windows Instance

  • Log in to the AWS Management Console and navigate to the EC2 dashboard.
  • Click on “Launch Instance” to start the instance creation wizard.
  • Choose a Windows Server AMI (Amazon Machine Image) with appropriate Windows version.
  • Select an instance type with sufficient CPU, memory, and optionally GPU resources for your workload.
  • Configure instance details such as network settings, storage, and security groups.
  • Add any additional configurations or user data scripts as needed.
  • Review and launch the instance.

Configure NiceDCV on EC2 Instance

  • Once the instance is running, connect to it using Remote Desktop Protocol (RDP).
  • Download and install the NiceDCV server software on the Windows instance.
  • Configure NiceDCV settings, including authentication, display resolution, and network settings, according to your requirements.
  • Ensure that the necessary firewall rules and security group settings allow inbound connections to the NiceDCV server port (default is TCP port 8443).
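
A quick way to confirm that the firewall and security group rules are in place is to test whether the NiceDCV port accepts TCP connections from a client machine. The following is a minimal sketch using only the Python standard library; the host name in the usage comment is a hypothetical placeholder.

```python
import socket

DCV_DEFAULT_PORT = 8443  # default NiceDCV server port mentioned above


def port_is_open(host, port=DCV_DEFAULT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


# Example usage (hypothetical host name):
#   port_is_open("workstation.example.internal")
```

A successful TCP connection only shows that the port is reachable; it does not prove that the NiceDCV server is correctly configured behind it.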

Join EC2 Instances to the Active Directory Domain

  • Once the EC2 instances are running, connect to each instance using Remote Desktop Protocol (RDP).
  • Open the Server Manager, navigate to “Local Server” settings, and click on “Workgroup” to change the system settings.
  • Click on “Change” and enter the name of your Active Directory domain. You’ll be prompted to provide domain administrator credentials.
  • After joining the domain, restart the EC2 instances for the changes to take effect.

Configure AD Authentication on EC2 Instances

  • After joining the domain, you can configure AD authentication for user access to the EC2 instances.
  • Log in to each EC2 instance using domain user credentials to verify that AD authentication is working correctly.
  • You can now manage user access and permissions on the EC2 instances using Active Directory group policies and permissions.

Connect to NiceDCV Session

  • Once NiceDCV is configured on the EC2 instance, you can connect to it using a NiceDCV client application installed on your local machine.
  • Launch the NiceDCV client and enter the public IP address or DNS name of your EC2 instance.
  • Enter the AD authentication credentials and connect to the NiceDCV session.

Optimise NiceDCV Performance

  • Fine-tune NiceDCV settings to optimise performance based on your specific workload requirements.
  • Adjust settings such as image quality, frame rate, and compression level to achieve the desired balance between performance and visual fidelity.
  • Consider leveraging GPU resources for graphics-intensive workloads by installing GPU drivers and configuring NiceDCV to utilise GPU acceleration.

Test and Validate

  • Test the NiceDCV session by running your applications or workloads on the EC2 instance.
  • Validate performance, responsiveness, and visual quality to ensure that NiceDCV meets your expectations.
  • Monitor resource utilisation on the EC2 instance to identify any potential bottlenecks or performance issues.

How to automate the Workstation deployment?

Once the workstation setup is validated, we automated the deployment of EC2 instances with NiceDCV using AWS CloudFormation.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for EC2 Windows instance with NiceDCV'

Resources:
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: g4dn.2xlarge
      ImageId: ami-xxxxxxxxxxxxxxxxx # Specify a Windows Server AMI
      KeyName: MyKeyPair
      SecurityGroupIds:
        - !GetAtt InstanceSecurityGroup.GroupId
      UserData:
        Fn::Base64: !Sub |
          <powershell>
          # Install NiceDCV prerequisites
          Install-WindowsFeature -Name Server-Media-Foundation -IncludeAllSubFeature -IncludeManagementTools
          # Download NiceDCV installer
          # Replace <architecture> with x86 or x64
          $url = "https://d1uj6qtbmh3dt5.cloudfront.net/NiceDCV-2021.2-<architecture>-Setup.exe"
          $output = "C:\NiceDCV-Setup.exe"
          Invoke-WebRequest -Uri $url -OutFile $output
          # Install NiceDCV
          Start-Process -FilePath $output -ArgumentList "/S" -Wait
          </powershell>
      Tags:
        - Key: Name
          Value: EC2WithNiceDCV

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow inbound RDP and NiceDCV traffic
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 3389
          ToPort: 3389
          CidrIp: 0.0.0.0/0 # Allow RDP from anywhere (for demonstration purposes only; restrict in production)
        - IpProtocol: tcp
          FromPort: 8443
          ToPort: 8443
          CidrIp: 0.0.0.0/0 # Allow NiceDCV traffic from anywhere (for demonstration purposes only; restrict in production)

CloudFormation templates can be used to launch new workstations for R&D HPC users. The workstation IP can then be shared with the user to whom the workstation is assigned.
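
Launching one stack per user can be scripted. The sketch below shows one possible approach with boto3; the naming scheme, the `Owner` tag and the helper names are our own assumptions rather than part of the template above. The request is built in a plain function so it can be inspected without AWS access.

```python
import re


def stack_name_for(username, prefix="hpc-workstation"):
    """Derive a CloudFormation-safe stack name (letters, digits, hyphens)."""
    safe = re.sub(r"[^A-Za-z0-9-]+", "-", username).strip("-")
    if not safe:
        raise ValueError("username has no usable characters")
    return f"{prefix}-{safe}"[:128]  # stack names are limited to 128 characters


def create_stack_request(username, template_body):
    """Build the keyword arguments for CloudFormation create_stack."""
    return {
        "StackName": stack_name_for(username),
        "TemplateBody": template_body,
        # Hypothetical tag used to record who the workstation belongs to.
        "Tags": [{"Key": "Owner", "Value": username}],
    }


def launch_workstation(username, template_body):
    """Create the stack and wait until it is complete. Requires AWS credentials."""
    import boto3  # imported lazily so the helpers above work without the SDK

    cfn = boto3.client("cloudformation")
    response = cfn.create_stack(**create_stack_request(username, template_body))
    cfn.get_waiter("stack_create_complete").wait(StackName=response["StackId"])
    return response["StackId"]
```

Tagging each stack with its owner also makes it easier to attribute cost and to find workstations that are no longer needed.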

How to connect to the remote Workstation?

This section describes the steps to connect to the remote workstation using Nice DCV. 

Two options are available

  • Connection via Nice DCV client app
  • Connection via a web browser.

The recommended option (described first) is to connect via Nice DCV client app.

Nice DCV Session via Client App 

  1. Open NICE DCV Client 
  2. Enter the IP address of the virtual workstation 
  3. Click Trust and Connect, on the Server Identity check page 
  4. Enter the Username and Password and click Login
    • Username: <AD Domain>\username 
  5. Wait for the connection. Allow the Nice DCV client to establish a connection with the remote server. This may take a moment, depending on your internet connection and the server’s performance. 

Nice DCV Session via Web Browser

If you cannot install the Nice DCV client, you can use the web browser option.

  1. Launch your preferred web browser on your machine. 
  2. Navigate to the URL. In the address bar of the web browser, enter the web address provided by your system administrator. It typically looks like [https://:8443].
  3. Click the “Advanced” button on the non-secure connection warning page. 
  4. Click Proceed to IP address (unsafe).
  5. You will be directed to the Nice DCV web portal. Log in with your credentials, including your username and password. E.g. <AD Domain>\username 
  6. Click sign in. 
  7. Allow the Nice DCV client to establish a connection with the remote server. This may take a moment, depending on your internet connection and the server’s performance. 

In the next set of sections we will look at some of the key features that are useful for HPC teams.

Multiple Users Session Sharing

NICE DCV users can collaborate on the same session, enabling screen and mouse sharing. Users can join authorised sessions, while session owners can disconnect users from any session collaboration. To take advantage of this feature, users must join the same session, identified by the same session ID. By default, the only user that can connect to a NICE DCV session is the owner of that session. For other users to collaborate on the same session, the active permissions applied to the session need to be updated to include the display parameter. For more information on editing the permissions file, see Configuring NICE DCV authorization.
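
As an illustration, the permissions content for a shared session could be generated as follows. This is a sketch based on our reading of the NICE DCV permissions file format (a `[permissions]` section with one `identity allow feature` rule per line); the user names are hypothetical, and the exact syntax should be verified against the NICE DCV authorization documentation before use.

```python
def render_permissions(collaborators, features=("display",)):
    """Render permissions text that keeps the owner's default rights and
    grants the listed features to each collaborator."""
    lines = [
        "[permissions]",
        "%owner% allow builtin",  # assumed default rule for the session owner
    ]
    for user in collaborators:
        lines.append(f"user:{user} allow {' '.join(features)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical collaborators who may view the owner's session.
    print(render_permissions(["alice", "bob"]))
```

Granting only `display` lets collaborators view the session; further features would need to be added explicitly if they should also be able to interact with it.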

Workstation Lifecycle & Automation

Enterprises can save cost on remote workstations launched by R&D users by making sure any unused workstations are stopped or terminated depending on the usage patterns. Although it looks like a simple process, it can save a lot of cost over long periods of time considering multiple users running multiple workstations. 

The AWS Instance Scheduler is a solution that enables you to automatically start and stop Amazon EC2 instances on a schedule. This helps you save costs by ensuring that instances are only running when they are needed.

  1. The Instance Scheduler allows you to define start and stop schedules for your EC2 instances based on your organisation’s usage patterns and requirements.
  2. You can schedule instances to start, stop, or both at specific times of the day, week, or month, depending on your workload and operational needs.
  3. You can configure the Instance Scheduler using AWS CloudFormation templates provided by AWS. These templates create the necessary AWS resources, such as AWS Lambda functions, Amazon CloudWatch Events rules, and Amazon DynamoDB tables, to manage instance schedules.
  4. The Instance Scheduler uses tags on EC2 instances to determine which instances to include in the scheduling process. You can specify tags to identify instances and associate them with specific schedules.
  5. It uses Events rules to trigger Lambda functions based on predefined schedules, which then perform actions such as starting or stopping instances.

The following simplified template illustrates the same pattern with a single Lambda function and a scheduled rule.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS Instance Scheduler'

Resources:
  InstanceSchedulerLambdaFunction:
    Type: 'AWS::Lambda::Function'
    Properties:
      Handler: 'index.handler'
      Role: !GetAtt InstanceSchedulerLambdaRole.Arn
      Runtime: 'python3.12'
      Code:
        ZipFile: |
          import boto3

          def handler(event, context):
              # Initialize the EC2 client
              ec2 = boto3.client('ec2')

              # Retrieve EC2 instances based on tags
              instances = ec2.describe_instances(Filters=[{'Name': 'tag:InstanceScheduler', 'Values': ['true']}])

              # The scheduled rule passes the action ('start' or 'stop') as its input
              action = event['action']

              # Start or stop instances based on schedule
              for reservation in instances['Reservations']:
                  for instance in reservation['Instances']:
                      instance_id = instance['InstanceId']
                      if action == 'start':
                          ec2.start_instances(InstanceIds=[instance_id])
                      elif action == 'stop':
                          ec2.stop_instances(InstanceIds=[instance_id])

  InstanceSchedulerLambdaRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: 'lambda.amazonaws.com'
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: 'EC2InstanceSchedulerPolicy'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'ec2:DescribeInstances'
                  - 'ec2:StartInstances'
                  - 'ec2:StopInstances'
                Resource: '*'

  InstanceSchedulerEventRuleStart:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: 'Schedule EC2 instance start'
      EventBusName: 'default'
      ScheduleExpression: 'cron(0 8 * * ? *)' # Start instances at 8:00 AM UTC daily
      State: 'ENABLED'
      Targets:
        - Arn: !GetAtt InstanceSchedulerLambdaFunction.Arn
          Id: 'InstanceSchedulerStart'
          Input: '{"action": "start"}' # Delivered to the handler as event['action']

  # Allows the scheduled rule to invoke the Lambda function
  InstanceSchedulerInvokePermissionStart:
    Type: 'AWS::Lambda::Permission'
    Properties:
      FunctionName: !Ref InstanceSchedulerLambdaFunction
      Action: 'lambda:InvokeFunction'
      Principal: 'events.amazonaws.com'
      SourceArn: !GetAtt InstanceSchedulerEventRuleStart.Arn

  InstanceSchedulerEventRuleStop:
    # Definition not shown; it mirrors the start rule with its own schedule
    # and the input '{"action": "stop"}'.

The above description of workstation lifecycle management is based on AWS native services.
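
One refinement worth considering for a scheduler like this is to act only on instances whose current state makes the action meaningful, since calls against instances in other states (for example terminated ones) can fail. The helper below is a sketch of that idea; the function name is our own, and it operates on the dictionary structure returned by the EC2 `describe_instances` call.

```python
def instance_ids_for_action(describe_response, action):
    """Return IDs of instances that should receive the given action.

    'start' applies only to stopped instances and 'stop' only to running ones.
    """
    wanted_state = {"start": "stopped", "stop": "running"}.get(action)
    if wanted_state is None:
        raise ValueError(f"unsupported action: {action!r}")
    return [
        instance["InstanceId"]
        for reservation in describe_response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
        if instance["State"]["Name"] == wanted_state
    ]


if __name__ == "__main__":
    # Hand-written sample in the shape of a describe_instances response.
    sample = {
        "Reservations": [
            {"Instances": [
                {"InstanceId": "i-aaa", "State": {"Name": "running"}},
                {"InstanceId": "i-bbb", "State": {"Name": "stopped"}},
            ]},
        ]
    }
    print(instance_ids_for_action(sample, "stop"))
```

The resulting list can be passed in a single `start_instances` or `stop_instances` call instead of one call per instance.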

We have implemented workstation lifecycle management as part of our HPC self-service platform called Tachyon. We will be sharing more information on the product in a series of upcoming articles.

Workstation Modifications

In order to make the remote workstations enterprise grade secure workstations, we recommend changing some of the configurations as mentioned here.

  1. Install necessary software like productivity tools and development environments.
  2. Configure graphics drivers and install NiceDCV for remote visualisation.
  3. Ensure security with firewall settings, user permissions, and antivirus software.
  4. Set up networking, RDP for remote access, and performance optimization.
  5. Implement backup solutions, monitoring tools, and document configurations.

Workstation Usage / Utilisation Patterns

There are a couple of ways to track usage and utilisation of your EC2 workstation for HPC workloads

Using Amazon CloudWatch

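CloudWatch is the monitoring service offered by AWS that provides detailed insights into your EC2 instances. Basic EC2 metrics are available at no additional charge, while detailed monitoring, custom metrics and alarms may incur cost.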

Monitor CPU and GPU Utilisation

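  • CloudWatch allows you to monitor CPU utilisation (the CPUUtilization metric) out of the box. GPU utilisation is not a default EC2 metric; it can be published as custom metrics, for example through the CloudWatch agent, and the available metrics vary with the GPU type. This helps identify whether your workloads are maxing out the resources.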
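
To turn raw metrics into a usage pattern, the datapoints can be summarised per workstation. The sketch below assumes datapoints in the shape returned under `Datapoints` by the CloudWatch `get_metric_statistics` call when the `Average` statistic is requested; the idle threshold of 5% is an arbitrary assumption.

```python
from statistics import mean


def summarise_cpu(datapoints, idle_threshold=5.0):
    """Summarise CPU datapoints (dicts with an 'Average' value in percent)."""
    values = [point["Average"] for point in datapoints]
    if not values:
        return {"samples": 0, "mean": None, "peak": None, "idle_fraction": None}
    idle_samples = sum(1 for value in values if value < idle_threshold)
    return {
        "samples": len(values),
        "mean": mean(values),
        "peak": max(values),
        # Share of samples below the threshold: a hint that the workstation
        # could have been stopped during that time.
        "idle_fraction": idle_samples / len(values),
    }
```

A consistently high idle fraction suggests that a workstation is a candidate for scheduled stops or for a smaller instance type.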

Track Network Traffic

  • Monitor network traffic metrics (like NetworkIn and NetworkOut) to understand data transfer between the workstation and external sources.

Disk Usage

  • With the CloudWatch agent installed on the instance, CloudWatch offers metrics for disk space usage to identify if your storage is nearing capacity. Disk space usage is not reported by default.

Set up Alarms

  • Define CloudWatch alarms based on these metrics. For example, set an alarm to trigger if CPU utilisation goes above a certain threshold, indicating potential bottlenecks.
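
As an example, a high-CPU alarm for a workstation could be defined with boto3 along these lines. This is a minimal sketch: the alarm name, threshold and evaluation window are assumptions to adapt, and no notification action is attached.

```python
def cpu_alarm_request(instance_id, threshold=90.0, periods=3, period_seconds=300):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when average CPU utilisation stays above `threshold`
    for `periods` consecutive periods of `period_seconds` each.
    """
    return {
        "AlarmName": f"workstation-high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_cpu_alarm(instance_id, **overrides):
    """Create the alarm. Requires AWS credentials and permissions."""
    import boto3  # imported lazily so the builder above works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**cpu_alarm_request(instance_id, **overrides))
```

To be notified when the alarm fires, an `AlarmActions` entry (for example an SNS topic ARN) can be added to the request.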

We have implemented workstation observability that can track CPU, memory, network traffic, disk usage and alerts as part of our HPC self-service platform called Tachyon. We will be sharing more information on the product in a series of upcoming articles.

Workstation CPU / GPU Considerations

Choosing the right CPU and GPU for your EC2 workstation for HPC workloads depends on the specific needs of your applications.

CPU
  • vCPUs (virtual CPUs): HPC workloads often benefit from high core counts for parallel processing. Look for EC2 instances with a high vCPU count, such as the C6g or R6g instances. Note that these are Arm-based Graviton instances, which run Linux rather than Windows.
  • Clock Speed: While core count is important, clock speed also matters. CPUs with faster clock speeds will improve single-threaded performance, which can benefit certain HPC tasks. Consider a balance between vCPUs and clock speed based on your workload.
GPU
  • Memory: HPC workloads involving large datasets can benefit from GPUs with ample memory. Instances like P4d with NVIDIA A100 GPUs offer significant memory for complex computations.
  • Processing Power: Different GPU architectures are suitable for different types of workloads. When HPC workloads involve graphics-intensive applications, G4dn instances with NVIDIA T4 GPUs are more suitable.
Additional Considerations
  • Cost: GPU instances can be expensive. There are other cost-effective options like G5g instances with Graviton2 processors and NVIDIA T4G GPUs for workloads that don’t require the top-tier performance of G4dn instances.
  • Storage: Depending on your data size, choose an instance with sufficient storage capacity or consider attaching EBS volumes for additional storage. Usually a shared file system such as NFS-based storage is preferred for long-term storage of HPC workload input and output files. Amazon FSx provides multiple options, such as FSx for Lustre and FSx for NetApp ONTAP.

Workstation Authentication

Users can use AD credentials to log in to the remote workstation when connecting through either RDP or NiceDCV. The details of how AD integration can be configured are given in the image creation section above. This section summarises the key steps involved.

Here are the important configurations to enable AD authentication.

Join EC2 Instance to the Domain

  1. Connect to the EC2 Windows instance using Remote Desktop Protocol (RDP).
  2. Open “System Properties” by right-clicking on “This PC” and selecting “Properties.”
  3. Click on “Change settings” under “Computer name, domain, and workgroup settings.”
  4. Select “Change” and enter the domain name. Provide domain admin credentials when prompted.
  5. Restart the instance to apply changes.

Configure AD Authentication

  1. Once the instance is joined to the domain, users can log in using their Active Directory credentials.
  2. Ensure that domain users have the necessary permissions and access rights on the instance.

Security Group Configuration

  1. Adjust security group settings to allow communication with the Active Directory domain controllers on the required ports (e.g., TCP/UDP 389 for LDAP).
  2. Ensure that necessary ports for AD authentication are open in the Windows Firewall.
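
LDAP on port 389 is only one of the ports a domain member typically needs. The sketch below lists ports that are commonly documented for Active Directory communication and turns them into rule definitions in the `IpPermissions` shape used by the EC2 security group APIs. Treat the port list as a starting point to confirm with your directory team; the CIDR in the usage example is a placeholder.

```python
# (protocol, from_port, to_port, purpose); commonly documented AD ports
AD_PORTS = [
    ("tcp", 53, 53, "DNS"), ("udp", 53, 53, "DNS"),
    ("tcp", 88, 88, "Kerberos"), ("udp", 88, 88, "Kerberos"),
    ("udp", 123, 123, "NTP"),
    ("tcp", 135, 135, "RPC endpoint mapper"),
    ("tcp", 389, 389, "LDAP"), ("udp", 389, 389, "LDAP"),
    ("tcp", 445, 445, "SMB"),
    ("tcp", 464, 464, "Kerberos password change"),
    ("udp", 464, 464, "Kerberos password change"),
    ("tcp", 636, 636, "LDAPS"),
    ("tcp", 3268, 3269, "Global Catalog"),
    ("tcp", 49152, 65535, "Dynamic RPC"),
]


def ad_ip_permissions(peer_cidr):
    """Build IpPermissions entries allowing AD traffic to or from peer_cidr."""
    return [
        {
            "IpProtocol": protocol,
            "FromPort": from_port,
            "ToPort": to_port,
            "IpRanges": [{"CidrIp": peer_cidr, "Description": f"AD {purpose}"}],
        }
        for protocol, from_port, to_port, purpose in AD_PORTS
    ]


if __name__ == "__main__":
    # 10.0.0.0/24 is a placeholder for the workstation or domain controller subnet.
    for rule in ad_ip_permissions("10.0.0.0/24"):
        print(rule["IpProtocol"], rule["FromPort"], rule["ToPort"])
```

The resulting list can be passed as `IpPermissions` to `authorize_security_group_ingress` on the domain controllers' security group, or to `authorize_security_group_egress` on the workstation's group if outbound traffic is restricted.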

DNS Configuration

  1. Ensure that the EC2 instance’s DNS settings are configured to point to the Active Directory domain controllers.
  2. Update the DNS server settings in the instance’s network adapter properties.

After all the configuration is done, connect to the remote workstation from your laptop/desktop using RDP or NiceDCV and test the AD authentication by logging in with domain user credentials.

Workstation Costing

Here’s a comparison of the EC2 G4dn, P3, and P2 instance types, focusing on GPUs and costs.

GPU
  • G4dn: Features NVIDIA Tesla T4 GPUs with 16 GB of GDDR6 memory. These GPUs are well-suited for machine learning inference and cost-effective small-scale training.
  • P3: Uses NVIDIA Tesla V100 GPUs with 16 GB of memory each (32 GB on the p3dn.24xlarge size). The newer P4d family uses NVIDIA A100 GPUs with 40 GB each. P3 instances offer higher performance than G4dn for various workloads, including deep learning training and inference.
  • P2: Equipped with older generation NVIDIA Tesla K80 GPUs with 12 GB of GDDR5 memory. P2 instances are the least expensive option among the three but offer the lowest performance.
Cost
  • G4dn: Generally the most cost-effective option, especially for workloads that don’t require top-tier performance.
  • P3: More expensive than G4dn due to the more powerful GPUs.
  • P2: The least expensive option but may not be cost-effective if your workload requires significant processing power.
  • Use the Amazon EC2 pricing page (https://aws.amazon.com/ec2/pricing/) or the AWS Pricing Calculator to get the latest on-demand pricing for different instance types and regions.
  • Consider AWS Reserved Instances or Savings Plans for significant cost savings if you plan to use the EC2 workstation for extended periods.
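
The effect of stopping workstations outside working hours can be estimated with simple arithmetic. In the sketch below, the hourly rate is a made-up placeholder rather than a real price, and 730 is used as the average number of hours in a month.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(hourly_rate, hours):
    """On-demand compute cost for the given number of running hours."""
    return hourly_rate * hours


def savings_fraction(hours_per_day, days_per_month):
    """Share of the always-on compute cost avoided by running on a schedule."""
    scheduled_hours = hours_per_day * days_per_month
    return 1 - scheduled_hours / HOURS_PER_MONTH


if __name__ == "__main__":
    rate = 2.0  # hypothetical hourly rate in USD, for illustration only
    always_on = monthly_cost(rate, HOURS_PER_MONTH)
    scheduled = monthly_cost(rate, 10 * 22)  # 10 hours a day, 22 working days
    print(f"always on: {always_on:.2f}, scheduled: {scheduled:.2f}")
    print(f"savings: {savings_fraction(10, 22):.0%}")
```

With 10 hours a day on 22 working days, the instance runs 220 of 730 hours, so roughly 70% of the compute cost is avoided. Storage attached to a stopped instance, such as EBS volumes, continues to be billed and is not covered by this estimate.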

Workstation Catalog

A catalog of workstations can be created from different EC2 AMIs that are tailor-made for specific HPC applications and use cases. Tachyon, our self-service platform, provides a workstation catalog feature that HPC users can use to dynamically request new workstations. The users can then use and manage the workstations through the Tachyon platform.

Conclusion

The AWS EC2 service provides many choices of instance types that are suitable for remote workstations used for HPC CAE workloads. By following the key design considerations and best practices described in this article, HPC systems can provide flexibility, scalability and availability to R&D users. Enterprises can also provision and manage remote workstations in a cost-effective manner.
