Job Schedulers and Malleable/Evolving Jobs – 02

This entry is part 2 of 4 in the series Malleable and Evolving Jobs

Author: Gary D. Brown, Adaptive Computing HPC Product Manager

Introduction

In Part 1 of this 4-part blog series, we discussed the scalability problems associated with ever-larger HPC systems and compute nodes, the newer programming models and runtime environments that allow faster computation and better system utilization, and the taxonomy of jobs (rigid, moldable, malleable, evolving, adaptive) and their characteristics, and then took an in-depth look at the rigid and moldable job types. If you missed Part 1, click here to read it.

This blog, Part 2, discusses the malleable, evolving, and adaptive job types, all of which are dynamic in nature with regard to resource allocations during a job’s lifetime.

Malleable Jobs

Unlike a rigid or moldable job that has a static resource allocation during the job’s lifetime, a malleable job can respond to modification requests by the job scheduler for changes in the job’s resource allocation, and the job, or more precisely the application and/or runtime environment (RTE), dynamically adapts to the new allocation while it executes.

This “malleability” capability permits the scheduler to expand or contract the job’s resource allocation in response to conditions external to the job, such as other jobs completing and making their compute nodes available to other jobs, other higher-priority jobs queuing and needing compute nodes to run, etc.

Figure 1 shows a space/time diagram illustrating a scheduler modifying a malleable job’s compute node allocation five times during the job’s lifetime so it actually experiences six different node allocations.

Figure 1 – Malleable Job Space/Time Diagram

Now let’s see how a scheduler can take advantage of a job’s malleability capability.

Figure 2 illustrates multiple rigid jobs in a space/time diagram, jobs "A" through "G", and a single malleable job, job "M", whose node allocation the scheduler alters four times during its lifetime. The diagram denotes each change in job M's node allocation with a subscript number, so M1 shows job M's initial node allocation and duration, M2 identifies job M's second node allocation and its duration after the scheduler's first modification of the node allocation, and so on.

Note the diagram starts with rigid jobs A, B, and C already executing and job M starting its execution. Users submit additional jobs “D” through “H” at the times indicated by their respective circles atop the space/time diagram.

Figure 2 – Malleable Job with Rigid Jobs Space/Time Diagram

It is easy to see the advantage of having at least one malleable job running in an HPC system. Compute nodes that would be idle with only rigid jobs running are usable by a malleable job, thus permitting the HPC system to complete more work in the same amount of time.

The following job submission in IBM Platform LSF syntax (TORQUE's job submission syntax does not support malleable jobs) indicates the job is auto-resizable, i.e., malleable (-ar), requires a minimum of 100 and a maximum of 200 "processors" (-n 100,200), and has its "resize notification command" located at "<path>" (-rnc <path>).

bsub -ar -n 100,200 -rnc <path> …

The scheduler (and sometimes the user, and oftentimes an administrator) can "expand" a job's resource allocation so the job has more resources, or "contract" its resource allocation so the job has fewer. This blog uses the terms "expand" and "contract" to identify the malleable operations that add and take away resources, respectively.

Expand Operation

When the scheduler wants to give a malleable job more resources, it tentatively allocates the resources to the job and then notifies the job of the additional resources (LSF uses the “resize notification command” located at “<path>” to “grow” a resizable job). If the notification succeeds, the scheduler commits the resources to the job’s allocation; otherwise, it releases them for use by other jobs.
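To make the handshake concrete, here is a minimal Python sketch of the scheduler side of an expand operation; the notify_resize() helper and the dictionary/list structures are purely hypothetical stand-ins, not any real scheduler's API.

def notify_resize(job, op, nodes):
    # Stand-in for invoking the job's resize notification mechanism;
    # returns True if the job adapted to the proposed allocation change.
    print(f"job {job['name']}: {op} {nodes}")
    return True

def expand_job(free_nodes, job, count):
    tentative = free_nodes[:count]               # tentatively allocate idle nodes
    if notify_resize(job, "expand", tentative):  # ask the job to absorb them
        job["nodes"] += tentative                # commit them to the job's allocation
        del free_nodes[:count]
    # if the notification fails, the nodes simply remain free for other jobs

job_m = {"name": "M", "nodes": ["n01", "n02"]}
free = ["n03", "n04", "n05"]
expand_job(free, job_m, 2)
print(job_m["nodes"], free)                      # ['n01','n02','n03','n04'] ['n05']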

Contract Operation

When the scheduler wants to take resources from a malleable job, it notifies the job of the resources it wants to take (LSF uses the “resize notification command” located at “<path>” to “shrink” a resizable job). The job then must cease using the resources. If the notification returns success, the scheduler deallocates the resources from the job’s resource allocation, thus making them available to other jobs; otherwise, it keeps the resources in the job’s allocation.
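The contract operation is the mirror image. A sketch, reusing the hypothetical notify_resize() stub and objects from the expand example above:

def contract_job(free_nodes, job, count):
    victims = job["nodes"][-count:]              # nodes the scheduler wants back
    if notify_resize(job, "contract", victims):  # the job must first cease using them
        job["nodes"] = job["nodes"][:-count]     # deallocate from the job...
        free_nodes.extend(victims)               # ...making them available to other jobs
    # if the notification fails, the nodes stay in the job's allocation

contract_job(free, job_m, 1)
print(job_m["nodes"], free)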

Evolving Jobs

Unlike a rigid job with a static resource allocation during the job's lifetime or a malleable job whose allocation the scheduler controls, an evolving job can request changes in its own resource allocation from the scheduler, and when the request is granted, the job, or more precisely the application and/or runtime environment (RTE), dynamically adapts to the new allocation it receives while it executes.

This “evolution” capability permits the job to request permission from the scheduler to grow or shrink its own resource allocation in response to conditions internal to the job.

Figure 3 shows a space/time diagram illustrating an evolving job requesting a scheduler to modify its compute node allocation three times during the job’s lifetime so it actually experiences four different node allocations.

Figure 3 – Evolving Job Space/Time Diagram

Now let’s see how a scheduler can take advantage of a job’s evolution capability.

Figure 4 has the same jobs as Figure 2, except Job D is actually an evolving-capable job that starts with one compute node for the job's first 30 minutes and finishes with seven compute nodes for its last 30 minutes; Figure 2 showed it as a rigid job because that scheduler could not handle evolving jobs. Note also that Job D's part D1 lasts 45 minutes, not 30 minutes, because the scheduler could not give Job D the additional six nodes it requested until Job C completed.

Figure 4 – Malleable and Evolving Jobs with Rigid Jobs Space/Time Diagram

Use Case Examples

To illustrate why jobs may evolve, consider that various algorithmic approaches and/or programming models can change conditions within a job so that it needs to grow or shrink its resource allocation. Here are a couple of examples.

Adaptive Mesh Refinement (AMR) is one technique that can speed time-to-solution by starting out with a coarse granularity and then successively refining the granularity until the solution attains the desired accuracy. Performing the entire job at the finest granularity from the start can take much more time and many more compute resources than the refinement approach requires.

In this technique, illustrated in Figure 5, the application divides the problem space into just a few data domains and then, using perhaps only one or a few compute nodes, computes its algorithms on the data at coarse granularity (Few). As the solution progresses to where it requires additional granularity to increase accuracy, the job evolves to "grow" its resource allocation so it can divide the current data domains into more, finer-grained domains, which requires more compute nodes (Some). As the solution nears its final accuracy, the application once again evolves to divide the current data domains into many more domains at the finest granularity, which requires the most compute nodes (Many). Once the application computes a solution with the required accuracy, it writes out the solution and the job terminates. This is an example of a "growing" evolving job.
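A highly simplified Python sketch of this coarse-to-fine pattern follows; the node counts are illustrative, and the request_grow() call is a hypothetical placeholder for whatever evolving-job interface a scheduler actually exposes.

def request_grow(current_nodes, wanted_nodes):
    # Hypothetical scheduler call: blocks until the additional nodes are granted.
    print(f"asking scheduler to grow from {current_nodes} to {wanted_nodes} nodes")
    return wanted_nodes

# refinement levels and the compute nodes each one needs (illustrative numbers)
levels = [("coarse", 2), ("medium", 8), ("fine", 32)]

nodes = 0
for name, needed in levels:
    if needed > nodes:
        nodes = request_grow(nodes, needed)   # evolve: grow the allocation
    print(f"refining mesh at {name} granularity on {nodes} nodes")
print("required accuracy reached; write solution and terminate")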

Figure 5 – Evolving Job "Grow" Model – Adaptive Mesh Refinement

Another example is a job with many tasks, [MPI] ranks, or processes that execute on many compute nodes. The job computes the solution, and then the first rank, task, or process executing on the first compute node allocated to the job collects the data from the job's other compute nodes in order to assemble the results and write them out. If outputting the results takes a long time to complete, the job can "shrink" its resource allocation so it has only the first node, thus freeing the other compute nodes for use by other jobs. This is an example of a "shrinking" evolving job.
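Sketched in Python with mpi4py, the pattern might look roughly like this; the gather is standard MPI, while request_shrink() is a hypothetical placeholder for the scheduler-specific shrink request.

from mpi4py import MPI

def request_shrink(keep_ranks):
    # Hypothetical placeholder: ask the scheduler to take back the nodes used
    # by every rank except those in keep_ranks.
    print(f"asking scheduler to release all nodes except ranks {keep_ranks}")

comm = MPI.COMM_WORLD
local_result = comm.Get_rank() ** 2            # stand-in for each rank's piece of the solution
results = comm.gather(local_result, root=0)    # rank 0 collects all the pieces

if comm.Get_rank() == 0:
    request_shrink(keep_ranks=[0])             # only the first node is needed from here on
    print("assembling and writing results:", results)  # lengthy output phase on one node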

Grow Operation

To grow, an evolving application must ask the scheduler to give it additional resources; the scheduler must schedule this request and then, when resources are available, tentatively allocate them to the job. The scheduler informs the job, or more specifically the application or runtime environment, of the additional requested resources. The job then notifies the scheduler it has successfully grown into the new resources, at which point the job uses them and the scheduler commits them to the job's resource allocation. If for some reason the job cannot grow into its new resources, it notifies the scheduler of the failure and the scheduler releases the resources for use by other jobs.
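From the scheduler's side, the grow handshake can be sketched as follows (again with hypothetical names and data structures, not any particular scheduler's implementation):

def inform_job(job, nodes):
    # Stand-in for the scheduler-to-job notification; returns the job's reply
    # indicating whether it grew into the offered nodes.
    print(f"offering {nodes} to job {job['name']}")
    return True

def handle_grow_request(free_nodes, job, count):
    if len(free_nodes) < count:
        return False                      # cannot satisfy the request yet; keep it pending
    tentative = free_nodes[:count]        # tentatively allocate the requested nodes
    if inform_job(job, tentative):
        job["nodes"] += tentative         # job confirmed: commit the allocation
        del free_nodes[:count]
        return True
    return False                          # job could not grow: the nodes remain free

job_e = {"name": "E", "nodes": ["n01"]}
free = ["n02", "n03", "n04"]
print(handle_grow_request(free, job_e, 3), job_e["nodes"])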

Shrink Operation

To shrink, an evolving application must cease using some of its resources and then ask the scheduler to take them back. The scheduler removes the resources from the job's resource allocation, thus freeing them for use by other jobs, and then notifies the job that the requested deallocation succeeded.
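The scheduler side of a shrink is simpler still; a sketch with the same hypothetical structures:

def handle_shrink_request(free_nodes, job, released):
    # The job has already ceased using the 'released' nodes before asking.
    job["nodes"] = [n for n in job["nodes"] if n not in released]
    free_nodes.extend(released)            # immediately available to other jobs
    print(f"job {job['name']}: deallocation of {released} succeeded")  # notify the job

job_e = {"name": "E", "nodes": ["n01", "n02", "n03"]}
free = []
handle_shrink_request(free, job_e, ["n02", "n03"])
print(job_e["nodes"], free)                # ['n01'] ['n02', 'n03']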

Adaptive Jobs

An adaptive job is simply a job that is both malleable and evolving, which means it can respond to scheduler-initiated malleable operations and can also initiate evolving operations with the scheduler.
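As a rough Python sketch (the event tuples and operation names are hypothetical), an adaptive application effectively handles both directions in one loop:

def adaptive_job(events):
    for source, op, count in events:
        if source == "scheduler":   # malleable side: react to expand/contract requests
            print(f"adapting to scheduler-initiated {op} of {count} nodes")
        else:                       # evolving side: initiate grow/shrink requests
            print(f"asking the scheduler for a {op} of {count} nodes")

adaptive_job([
    ("scheduler", "expand", 4),     # the scheduler offers four more nodes
    ("job", "grow", 8),             # later, the application asks for eight more itself
    ("job", "shrink", 10),          # ...and gives ten back near the end of its run
])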

Figure 6 shows a space/time diagram illustrating an adaptive job requesting the scheduler to modify its compute node allocation three times during the job's lifetime, and the scheduler requesting the job to accept one more node allocation modification of its own, so the job actually experiences five different node allocations.

 

Figure 6 – Adaptive Job Space/Time Diagram

 

Operations Naming Convention

Because of adaptive jobs and the race conditions they make possible, this blog names the scheduler-initiated malleable operations "expand" and "contract" and the job-initiated evolving operations "grow" and "shrink". To illustrate with an adaptive job: if the scheduler initiated an "expand" operation and at the same time the job initiated a "grow" operation, it would be difficult to discuss the race condition and keep track of who is doing what if the malleable and evolving operations used the same terms (e.g., grow and shrink).

Benefits of Scheduling Malleable and Evolving Jobs

The next blog will discuss the benefits that accrue to an HPC site and its users when job schedulers can schedule malleable and evolving (and adaptive) applications. Stay tuned!
