Resource Request 2.0 Job Placement

Hopefully, you have read my previous entry describing the new resource request syntax coming out in the next few weeks. I’d like to take some time to explain how the job is placed on each node. Obviously, Moab determines where the job gets launched through its policy-rich scheduling algorithm, so for this entry, placement describes how the job is laid out on the node’s cores, threads, and/or other internal contents. pbs_server decides how the job is placed inside each node and communicates this information to the moms through the mother superior for the job.

Jobs Using the Place Option

For jobs using the place option (-L tasks=X:place=node|socket|numanode|core|thread), placement is relatively straightforward. For each task on the host, an instance of the requested placement object is used only if it is completely open and meets the task’s requirements. For example, if you request one socket per task and Moab assigns that task to node001, then pbs_server will skip over any socket that is partially used and select one that is currently unused by any job. That socket is then off-limits to any other job, even if it has 16 cores and the task will only use one of them. The placement option guarantees a larger boundary for the job. Also, if the job requests 1 GPU and there are only GPUs on socket 1, then the task will use socket 1 and skip socket 0. MICs are treated the same as GPUs in this regard.
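The selection logic above can be sketched in a few lines. This is a hypothetical, simplified model for illustration only; the names (Socket, pick_socket) are invented, and the real pbs_server logic is C++ and tracks far more state:

```python
# Simplified, invented model of choosing a socket for a place=socket task.

class Socket:
    def __init__(self, index, cores, gpus=0):
        self.index = index
        self.cores = cores        # total cores on the socket
        self.gpus = gpus          # GPUs attached to this socket
        self.used_cores = 0       # cores already assigned to any job

    def is_free(self):
        return self.used_cores == 0

def pick_socket(sockets, gpus_needed=0):
    """Return the first completely unused socket that satisfies the task."""
    for s in sockets:
        # A partially used socket is skipped entirely, even if the task
        # would only need one of its cores.
        if s.is_free() and s.gpus >= gpus_needed:
            return s
    return None  # no valid placement on this node

sockets = [Socket(0, cores=16, gpus=0), Socket(1, cores=16, gpus=2)]
sockets[0].used_cores = 1                         # socket 0 is partially busy
print(pick_socket(sockets).index)                 # 1: socket 0 is in use
sockets[0].used_cores = 0
print(pick_socket(sockets, gpus_needed=1).index)  # 1: only socket 1 has GPUs
```

The two prints mirror the examples in the text: a partially used socket is skipped even though it has room, and a GPU request steers the task to the socket that actually has GPUs.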

With place=socket and place=numanode, lprocs will be interpreted as cores unless the requested lprocs exceed the number of cores, and the cores will be spread as evenly as possible inside the object. So, if there’s a numa node with 4 cores and 8 threads, place=numanode:lprocs=2 will get the first and third cores. If lprocs were 6, then threads would be used out of necessity. If lprocs were 4, then each core would be added to the cgroup.
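The core-versus-thread decision can be sketched as follows. This is an illustrative guess at the selection rule, not the actual TORQUE code; select_cpus and its return format are invented:

```python
# Invented sketch: mapping lprocs onto a numa node's cores or threads.

def select_cpus(num_cores, threads_per_core, lprocs):
    """Pick execution slots for one task inside a numa node.

    If lprocs fits within the physical cores, spread the chosen cores as
    evenly as possible across the node; otherwise fall back to threads.
    """
    if lprocs <= num_cores:
        # Spread: step through the cores at the widest even stride.
        stride = num_cores // lprocs
        return [("core", i * stride) for i in range(lprocs)]
    # More lprocs than cores: hardware threads must be used.
    total_threads = num_cores * threads_per_core
    return [("thread", i) for i in range(min(lprocs, total_threads))]

# The 4-core / 8-thread numa node from the text:
print(select_cpus(4, 2, 2))   # [('core', 0), ('core', 2)] -- first and third cores
print(select_cpus(4, 2, 4))   # every core joins the cgroup
print(len(select_cpus(4, 2, 6)))  # 6 -- threads used out of necessity
```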

Jobs Not Specifying a Placement Option

Please refer to the diagram to the right. Legend:

req – one -L request is one req.

task – if the req value for tasks is > 1, then each of these is one task.

The TL;DR explanation of the diagram is that the default is compression. By default, pbs_server will fit the tasks assigned to a host into as small a space as possible: first trying to fit them all into one numa node, then one socket; then, if there are leftovers, trying to fit each individual task in the same order; and finally spreading out if tasks must cross those boundaries.
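That fallback chain can be summarized in code. The sketch below is a rough, invented condensation of the compression order described above; capacities are counted in lprocs, and real pbs_server tracks much more state per node:

```python
# Invented sketch of the default "compression" placement order.

def place_compressed(tasks, numa_free, socket_free):
    """Return the smallest boundary the host's share of the job fits in.

    tasks       -- lprocs needed by each task assigned to this host
    numa_free   -- free lprocs in the emptiest numa node
    socket_free -- free lprocs in the emptiest socket
    """
    total = sum(tasks)
    if total <= numa_free:
        return "all tasks in one numa node"
    if total <= socket_free:
        return "all tasks in one socket"
    if max(tasks) <= numa_free:
        return "each task in its own numa node"
    if max(tasks) <= socket_free:
        return "each task in its own socket"
    return "tasks spread across boundaries"

# 8 free lprocs per numa node, 16 per socket:
print(place_compressed([2, 2], numa_free=8, socket_free=16))
# -> all tasks in one numa node
print(place_compressed([6, 6], numa_free=8, socket_free=16))
# -> all tasks in one socket
print(place_compressed([10, 10], numa_free=8, socket_free=16))
# -> each task in its own socket
```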

Since placement is evaluated at the task level, tasks requesting GPUs or MICs will have their computational resources placed close to those devices, provided the task isn’t required to cross numa node or socket boundaries, even though they will not be given exclusive use of the numa node or socket in this situation.

Note: when cgroups are configured, jobs using the old (-l) syntax will be placed the same as jobs which use -L but do not specify a placement option.

To Place or Not To Place

The main goal in offering this new syntax is more control. Obviously, choosing how a job should be placed depends on the needs of the user as well as the performance characteristics of the application. Used judiciously, specifying placement can dramatically speed up jobs and make execution times far more consistent. The main drawback is that, used carelessly, it can reserve a lot of resources and decrease the number of jobs that can execute concurrently on the cluster. In clusters with large variations in node size, this risk can be greatly diminished by routing jobs of varying sizes to nodes of corresponding sizes. As with all things, knowledge is power, and hopefully this article sheds some light on how things work under the hood, enabling better use of the new syntax.