Getting HPC/HTC Workloads To Their Destination Safely and Timely

Author: Corin Kockenower, Senior Software Engineer

Has configuration drift in your HPC/HTC cluster been the root cause of inconsistent job runtimes, failures, or even non-deterministic results? Whether your job is to cure cancer or to administer an HPC/HTC cluster, configuration management and automation tools give both HPC/HTC cluster users and administrators invaluable means to set up, monitor (via an event bus), troubleshoot, and tear down the environment in which they operate. These tools become essential, and are well worth the added time and expense, when your workload and schedule are less tolerant of failure.

How tolerant are your workflows to failure? Do you perform a minimal inspection of your vehicle before you get in and drive away? How often do you check tire pressure, lights, and fluid levels? How likely are you to leave these items unchecked before a long trip? I would venture that most people become less fault tolerant when traveling longer distances or into unfamiliar territory. Wouldn’t you have more confidence and peace of mind if you performed a pre-ride check on the nodes allocated to your mission-critical HPC/HTC workload? Doing so provides greater confidence that your scheduled, allocated nodes (heterogeneous or homogeneous) are in the exact state (software and hardware) they need to be in before your workload even has a chance to depart on its journey.

Regardless of the distance and variables involved, at a minimum you have a plan, and that plan includes a schedule and some resources (your car and your route). Your frequent commute to the office may require less planning than a long road trip, but you have a plan nonetheless. With minimal planning, your 20-minute commute to the office could easily turn into a 60-minute commute if weather or traffic conditions go unchecked and you take your traditional route. With some additional planning using tools integrated with your vehicle and mobile devices, a slightly more sophisticated plan may get you to work on time using a different, slightly longer route, or you may conclude that telecommuting until traffic and/or weather clear up would be more efficient.

The transportation industry has recognized that many people do not perform essential pre-ride checks. With the advances in technology, many vehicle components and sub-systems are now monitored automatically by micro-controllers and sensors. Furthermore, GPS systems capable of receiving live traffic and weather feeds are now integrated into vehicles and mobile devices. Are travelers taking full advantage of these advances? With a little guidance, all travelers would benefit greatly from such technology. Is your HPC/HTC cluster always ready to get your workload to its destination in a safe and timely manner?


Adaptive Computing encourages customers to perform regular health (pre-ride) checks on compute nodes. Torque supports the use of Compute Node Health Check scripts to accomplish this task. These scripts provide the scheduler (such as Moab) invaluable information for optimizing HPC/HTC job scheduling and placement, and they are akin to the onboard diagnostic (OBD) systems in your vehicle. When a health check fails, a message can be associated with the node and routed to the scheduler. The scheduler can then convey this information to administrators by way of scheduler triggers and diagnostic commands, and it can automatically mark the node down until it passes its health checks.
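To make this concrete, here is a minimal sketch of what such a health check script might look like. It assumes Torque's documented convention that a line beginning with ERROR in the script's output marks the check as failed; the specific checks, paths, and thresholds are illustrative assumptions for this example, not anything required by Torque.

#!/usr/bin/env python3
# node_health.py -- hypothetical Compute Node Health Check script.
# Torque runs the script on the MOM and treats output beginning with
# "ERROR" as a failed health check; the message is attached to the node.
import os
import shutil

MIN_TMP_FREE_GB = 5          # illustrative threshold, tune for your site
SCRATCH_MOUNT = "/scratch"   # illustrative shared filesystem mount point

def main():
    free_gb = shutil.disk_usage("/tmp").free / 2**30
    if free_gb < MIN_TMP_FREE_GB:
        print("ERROR /tmp has only %.1f GB free" % free_gb)
        return
    if not os.path.ismount(SCRATCH_MOUNT):
        print("ERROR %s is not mounted" % SCRATCH_MOUNT)
        return
    # No ERROR output: the node is reported as healthy.

if __name__ == "__main__":
    main()

A script like this is typically wired into the MOM configuration (mom_priv/config) using the $node_check_script and $node_check_interval parameters so that it runs periodically; consult the Torque documentation for the exact options supported by your version.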

It is entirely up to you to define which checks are performed and how. For instance, configuration management tools like Salt, Puppet, Chef, or Ansible can be used in your Compute Node Health Check scripts to ensure each node allocated to your job(s) is in a consistent, expected state. If a node is not in the expected state, you can either fail fast by adjusting the node’s state or use your chosen configuration management tool to bring the node(s) back into the expected state.
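As an illustrative sketch (the playbook path, local-inventory trick, and drift-handling policy below are assumptions for this example, not a prescribed setup), a health check script could invoke Ansible in check mode to detect drift and then either report it or remediate it:

#!/usr/bin/env python3
# drift_check.py -- hypothetical drift check using Ansible in check mode.
import subprocess

PLAYBOOK = "/etc/cluster/node_baseline.yml"  # hypothetical site baseline playbook
REMEDIATE = False  # False = fail fast and report; True = converge the node in place

def run_playbook(extra_args):
    # Run the playbook against the local node only ("localhost," inventory).
    return subprocess.run(
        ["ansible-playbook", "--connection=local", "-i", "localhost,", *extra_args, PLAYBOOK],
        capture_output=True, text=True)

def main():
    # --check is a dry run: tasks that would change something indicate drift.
    result = run_playbook(["--check"])
    drifted = result.returncode != 0 or "changed=0" not in result.stdout  # simple recap heuristic
    if not drifted:
        return  # node matches the baseline; stay silent
    if REMEDIATE:
        run_playbook([])  # apply the baseline to bring the node back into line
    else:
        print("ERROR configuration drift detected; node does not match baseline")

if __name__ == "__main__":
    main()

Whether you remediate automatically or simply let the node be marked down is a site policy decision: failing fast keeps a drifted node out of the schedule, while remediating keeps capacity online at the cost of a longer health check.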

The overhead of ensuring that your nodes are healthy and your job’s environment is in the expected pre-ride state, so workloads run consistently and predictably, has the potential to pay huge dividends. If your jobs run consistently and reliably, you will have more time to spend in the lab curing cancer or testing the next-generation cluster instead of troubleshooting and fixing configuration drift problems.