Data Staging

Author: Tech Support

In HPC, data often needs to be processed as part of the job workflow, and often that data does not reside on the compute node and must be “staged” in. With traditional Torque-only job submission, the job has to wait until the data is moved into place, holding the node(s) hostage while the copy takes place.

With Adaptive data staging, Moab creates a system job to stage the data in while leaving the node relatively free to run other workload.

Adaptive data staging supports the following cases:

  1. Staging data to or from a shared file system
  2. Staging data to or from local node storage on a single compute node
  3. Staging data to or from a shared file system on an unspecified cluster (resolved at job migration) in a grid configuration

To configure Moab for data staging, you set up generic metrics on your cluster partitions, job templates to automate the staging system jobs, and a data staging submit filter that applies scheduling, throttling, and policy to staging requests.

So let’s get into a simple use case. In this example, you have a dataset residing on a node in the cluster (node name: dsnode). It is a 1 GB file consisting of the numbers 1 through 1 million. Our task is to get that file onto a compute node in the cluster and then reverse-sort it. Once that task is completed, the file is sent back to dsnode. One of the requirements is to allow other jobs to run on the cluster nodes while the data is copied to and from the compute node.
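
If you want to reproduce the setup, an input file along these lines can be created on dsnode ahead of time. This is only a sketch: the ~/ds path and the file name dataset are taken from the submission example further down, and a single run of seq produces only a few megabytes, so repeat or pad the output if you want to approach the ~1 GB size described here.

# Run on dsnode as the example user (fred): create the input file.
# A single seq pass is only a few MB; loop or repeat it to approach ~1 GB.
mkdir -p ~/ds
seq 1 1000000 > ~/ds/dataset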

To configure this scenario, modify $MOABHOMEDIR/moab.cfg to include the following:

SUBMITFILTER /opt/moab/tools/data-staging/ds_filter
JOBCFG[ds] TEMPLATEDEPEND=AFTEROK:dsin TEMPLATEDEPEND=BEFORE:dsout SELECT=TRUE
PARCFG[pbs] GMETRIC[DATASTAGINGBANDWIDTH_MBITS_PER_SEC]=58

NODECFG[GLOBAL] GRES=bandwidth:10

JOBCFG[dsin] DATASTAGINGSYSJOB=TRUE
JOBCFG[dsin] GRES=bandwidth:2
JOBCFG[dsin] FLAGS=GRESONLY
JOBCFG[dsin] TRIGGER=EType=start,AType=exec,Action="/opt/moab/tools/data-staging/ds_move_rsync --stagein",Flags=attacherror:objectxmlstdin:user

JOBCFG[dsout] DATASTAGINGSYSJOB=TRUE
JOBCFG[dsout] GRES=bandwidth:1
JOBCFG[dsout] FLAGS=GRESONLY
JOBCFG[dsout] TRIGGER=EType=start,AType=exec,Action="/opt/moab/tools/data-staging/ds_move_rsync --stageout",Flags=attacherror:objectxmlstdin:user
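
Roughly, this configuration works as follows: the ds_filter submit filter processes the staging options at submission time; the ds template chains a dsin system job before the user job (AFTEROK) and a dsout system job after it (BEFORE); the GMETRIC on the pbs partition gives Moab an assumed staging bandwidth (here 58 Mbit/s) it can use when estimating transfer times; and the floating bandwidth generic resource (10 units globally, with dsin consuming 2 and dsout consuming 1) throttles how many staging jobs may run at once. Because the staging templates use FLAGS=GRESONLY, they consume only the generic resource and no processors, which is why the staging jobs show 0 procs later on. After editing moab.cfg, Moab needs to pick up the changes. The following is a minimal sketch, assuming a standard Moab installation with mdiag and mschedctl on the path (exact commands may vary by version):

# Hedged example: sanity-check the configuration, then restart the scheduler.
mdiag -C       # report obvious problems with moab.cfg parameters
mschedctl -R   # recycle (restart) Moab so it rereads the configuration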


With the dataset file residing in the /home/fred/ds directory on dsnode, create a job script (ds.sh) that reverse-sorts the file when the job runs:

#!/bin/bash
#MSUB -l nodes=1,flags=allprocs,walltime=100
tac ~/dataset > ~/dataset.out

To show data staging actually working, create a simple script (testping.sh) that can run while the data is being staged in:

#!/bin/bash
#PBS -l nodes=1,flags=allprocs,walltime=60
ping -c 60 mgmtnode

Now submit the staging job and the ping job:
msub --stagein=dsnode:~/ds/dataset%fred@mgmtnode:~/dataset --stageinsize=1000 --stageout=mgmtnode:~/dataset.out%fred@dsnode:~/ds/ --stageoutsize=1000 ds.sh ; msub testping.sh
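
The submission line is dense, so here is the same command wrapped for readability (purely cosmetic). Each staging specification has the form source%user@destination; the size options are hints used when scheduling the transfer, and the value of 1000 appears to be in megabytes, consistent with the ~1 GB file. Note how the stage-in destination (~/dataset) matches the file ds.sh reads, and ~/dataset.out, which ds.sh writes, is what --stageout copies back to dsnode:

# Same submission as above, wrapped across lines for readability.
msub --stagein=dsnode:~/ds/dataset%fred@mgmtnode:~/dataset \
     --stageinsize=1000 \
     --stageout=mgmtnode:~/dataset.out%fred@dsnode:~/ds/ \
     --stageoutsize=1000 \
     ds.sh
msub testping.sh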

You will see that the jobs are both active, but the data staging job is using no procs:

[Screenshot: mdiag -t output showing both jobs active]

Notice that the dsin job is using 0 procs; it is not consuming compute resources. Only the Moab.2 (testping.sh) job is. This shows the strength of data staging: the staging job acts as a placeholder in the queue rather than tying up resources until the copy is finished. Once the Moab.1.dsin job completes, job 0 (ds.sh) becomes eligible to run, and after it finishes, the dsout job runs to move the manipulated data back to the staging server.

Check out the trigger progress:

[Screenshot: showq output showing the trigger progress]

Looking at the trigger output, you’ll see that Trigger ID 3 is Blocked until the data is moved. Once the data is in place, the trigger launches.
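
The same information should also be available from the command line through Moab’s trigger diagnostics. A minimal sketch (output columns vary by version):

mdiag -T    # list triggers and their current state (for example, Blocked or Active)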

That’s it. You have allowed a ping job to run while the data files were still being staged, instead of leaving the nodes idle during the copy, which is a significant efficiency gain for the cluster.