FOR SALE: Condos

Author: Nick Ihli, Director of Field Services

In many academic and research organizations, HPC systems are purchased through grants that different professors and researchers have obtained. These principal investigators (PIs) help craft the policies for the HPC system, determining whose work is most important and who is given priority. Determining these policies is a challenge for administrators, as competing PIs require their groups to have guaranteed access to the number of nodes they purchased with their grants. It is easy enough to silo off the system for these different PI groups, but then you lose the benefits of having a shared cluster; you might as well separate the system into multiple small clusters. In addition, often not all of the system is owned by PIs, so there is some general-population space for anyone to access.


Sharing the system while providing the proper SLAs and delivering high utilization amid competing political interests is Moab’s forte. We call this the Condo Model.

Use Case:
A new system has been purchased that contains 25 nodes. These nodes were paid for through grants from 5 PI groups. Each group requires that the number of nodes it purchased always be available for jobs from its members. The five groups are: joneslab, reynoldslab, moorelab, taylorlab, and thracelab. The leftover nodes are available to any user.

This is a simple request that can be satisfied using Moab’s reservation capability; however, using just a basic reservation can lead to a lack of system efficiency. Below I will take you through the configuration needed to make reservations play the balancing act necessary for an efficiently shared system that still delivers the required SLAs.

Reservations are the way to segment off resources for different groups. A reservation requires three components: WHAT resources to reserve, WHO gets access to those resources, and WHEN the reservation is in place.

A reservation is signified by the “SRCFG[]” parameter. The brackets contain the name of that reservation.

The reserved resources can be tied directly to specific nodes, or Moab can choose from a pool of nodes and create the reservation on the total number of nodes required. The benefit of the latter is that if a node goes down, Moab can replace it with another node from the pool.

#WHAT resources are reserved
SRCFG[joneslab]       HOSTLIST=node1,node2,node3,node4

#WHO can access the reservation – the group “joneslab”
SRCFG[joneslab]       GROUPLIST=joneslab

#WHEN is the reservation in place – infinity, or all the time
SRCFG[joneslab]       PERIOD=INFINITY
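
As an alternative to pinning the reservation to specific hosts, the same reservation can float over a pool of nodes. A minimal sketch, assuming the PI-purchased nodes carry a node feature named “jones” (an illustrative name, not from the original configuration); note that, depending on how a task is defined on your system, TASKCOUNT may count processors rather than whole nodes and may need adjusting:

#WHAT resources are reserved – 4 tasks drawn from nodes with the “jones” feature
SRCFG[joneslab]       TASKCOUNT=4 NODEFEATURES=jones
SRCFG[joneslab]       GROUPLIST=joneslab
SRCFG[joneslab]       PERIOD=INFINITY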

 

Because PI groups’ “owned” resources sometimes go unused, the utilization of the system goes down, sometimes for long stretches of time. SLAs are preserved, but at the sacrifice of overall system utilization. The solution is to share the unused “owned” resources when the PIs are not using them. This is accomplished in Moab using the Hard Policy Enable (HPEnable) Access Control List (ACL) modifier on a reservation.

ACL Modifiers:
ACL modifiers are a way for sites to change the default behavior of ACL processing. By default, a reservation can be accessed if one or more of its ACLs are met by the requestor. A modifier changes how those ACLs are handled. HPEnable allows other users to access the reservation, but only on the second scheduling pass or the backfill scheduling pass. If there are no eligible jobs from users with normal ACL access, then users with HPEnable access can run their jobs, provided a job will start and finish before a normal ACL user’s job would be able to start. HPEnable is signified using the tilde sign (~).

SRCFG[joneslab]       HOSTLIST=node1,node2,node3,node4

#group joneslab gets priority access to the reservation; the normal group has secondary access
SRCFG[joneslab]       GROUPLIST=joneslab,~normal

SRCFG[joneslab]       PERIOD=INFINITY

Affinity:
Jobs, by default, have positive affinity to a reservation, meaning a job that has ACL access to a reservation will be attracted toward that reservation. This ensures that reserved resources don’t go unused while jobs that could access the reservation run on unreserved resources, blocking non-reservation jobs. In the condo model, we recommend using negative affinity to push away the jobs of secondary (HPEnable) users. This directs those jobs to use other resources first and to use the PI’s resources only as a last resort. Negative affinity helps maximize the availability of PI resources for the PIs while still allowing secondary users access to those resources if needed. Positive affinity is signified by the + sign, negative affinity by the – sign, and neutral affinity by the = sign.

SRCFG[joneslab]       HOSTLIST=node1,node2,node3,node4

#Moab will try to run normal jobs on other resources first before it tries this reservation
SRCFG[joneslab]       GROUPLIST=joneslab,~normal-

SRCFG[joneslab]       PERIOD=INFINITY

Short queues and MAXTIME:
The agreements made with PIs largely determine who the “secondary users” are. One recommendation is to limit secondary users to jobs shorter than a certain period of time, for example 4 hours. With this, a PI knows the longest they will have to wait for their resources to become free of non-PI jobs is 4 hours. This is accomplished either by filtering jobs shorter than 4 hours into a “short” queue or by using the MAXTIME ACL. With routing queues (remapping classes), a job is directed to a specific queue based on its walltime, so jobs shorter than 4 hours automatically go to the short queue. The MAXTIME reservation option is an ACL that limits jobs that don’t meet other ACL requirements to using the reservation only if their walltime is less than the MAXTIME value.
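
For the routing-queue side, one option is Moab’s class remapping. A rough sketch, assuming classes named “batch”, “short”, and “long” already exist in the resource manager (the class names and the 4-hour cutoff are illustrative):

#Jobs submitted to “batch” are remapped to the first listed class whose limits they satisfy
REMAPCLASS            batch
REMAPCLASSLIST        short,long

#“short” only accepts jobs requesting 4 hours or less; longer jobs fall through to “long”
CLASSCFG[short]       MAX.WCLIMIT=4:00:00

The reservation below then grants the short class (or, equivalently, jobs under the MAXTIME limit) secondary, negative-affinity access.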

SRCFG[joneslab]       HOSTLIST=node1,node2,node3,node4

#Jobs in the short queue have negative affinity and secondary access to the reservation,
#or, using MAXTIME, jobs of 1 hour or less have negative affinity and secondary access
SRCFG[joneslab]       GROUPLIST=joneslab    CLASSLIST=~short- MAXTIME=~1:00:00-

SRCFG[joneslab]       PERIOD=INFINITY

Preemption:
Sometimes even waiting a few minutes for their resources is too long for a PI. If that is the case, then enabling preemption within the reservation is the answer. The PI group is marked as the owner of the reservation, and that owner is given preemption capability over non-owner jobs in the reservation. Sometimes immediate preemption makes sense, but our recommendation is instead to combine HPEnable, negative affinity, and an owner-preemption setting that allows the owner to preempt, but only after a PI job has been queued for a period of time. By utilizing all three capabilities, we maximize owned-resource availability for PIs, maximize utilization of the resources, and maximize delivery of the agreed-upon SLA between stakeholders and PIs.

SRCFG[joneslab]       HOSTLIST=node1,node2,node3,node4

#group joneslab gets priority access; the normal group has secondary access with negative affinity
SRCFG[joneslab]       GROUPLIST=joneslab,~normal-

#group joneslab is the owner of the reservation; if a joneslab job has been queued for an hour, it is allowed to preempt a normal group job in the reservation
SRCFG[joneslab]       OWNER=GROUP:joneslab FLAGS=OWNERPREEMPT OWNERPREEMPTQT=1:00:00

SRCFG[joneslab]       PERIOD=INFINITY
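
One piece the reservation lines above do not show: what happens to a preempted job is governed by Moab’s cluster-wide preemption policy. A minimal sketch, assuming requeueing preempted normal-group jobs is acceptable (and that those jobs are restartable on your system):

#Preempted jobs are requeued to run again later rather than cancelled
PREEMPTPOLICY         REQUEUE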

 

Here is an example of what the full configuration would look like: each PI has a reservation that jobs from the normal group can access on a secondary basis, where they could be subject to preemption. For more information on the parameters detailed here, see this documentation page:

https://docs.adaptivecomputing.com/9-0-0/basic/MWM/help.htm#topics/moabWorkloadManager/topics/resourceAccess/managingreservations.html

SRCFG[joneslab]       HOSTLIST=node1,node2,node3,node4
SRCFG[joneslab]       GROUPLIST=joneslab,~normal-
SRCFG[joneslab]       OWNER=GROUP:joneslab FLAGS=OWNERPREEMPT OWNERPREEMPTQT=1:00:00
SRCFG[joneslab]       PERIOD=INFINITY

SRCFG[reynoldslab]    HOSTLIST=node5,node6
SRCFG[reynoldslab]    GROUPLIST=reynoldslab,~normal-
SRCFG[reynoldslab]    OWNER=GROUP:reynoldslab FLAGS=OWNERPREEMPT OWNERPREEMPTQT=1:00:00
SRCFG[reynoldslab]    PERIOD=INFINITY

SRCFG[moorelab]       HOSTLIST=node7,node8,node9,node10,node11,node12
SRCFG[moorelab]       GROUPLIST=moorelab,~normal-
SRCFG[moorelab]       OWNER=GROUP:moorelab FLAGS=OWNERPREEMPT OWNERPREEMPTQT=1:00:00
SRCFG[moorelab]       PERIOD=INFINITY

SRCFG[taylorlab]      HOSTLIST=node13,node14,node15
SRCFG[taylorlab]      GROUPLIST=taylorlab,~normal-
SRCFG[taylorlab]      OWNER=GROUP:taylorlab FLAGS=OWNERPREEMPT OWNERPREEMPTQT=1:00:00
SRCFG[taylorlab]      PERIOD=INFINITY

SRCFG[thracelab]      HOSTLIST=node16,node17,node18
SRCFG[thracelab]      GROUPLIST=thracelab,~normal-
SRCFG[thracelab]      OWNER=GROUP:thracelab FLAGS=OWNERPREEMPT OWNERPREEMPTQT=1:00:00
SRCFG[thracelab]      PERIOD=INFINITY
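
Once Moab has been restarted (or recycled) to pick up the new moab.cfg, the standing reservations can be checked from the command line, for example:

#Show reservation diagnostics, including ACLs and the nodes each reservation covers
mdiag -r

#Summarize active reservations
showres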