Data Staging
Moab Workload Manager® for Grids

17.12 Data Staging

Moab allows sites to manage job data staging requirements so as to minimize resource inefficiencies and maximize system utilization. Without scheduler-controlled data staging, a job must handle its own data staging. This leads to inefficiencies as a job does not use its assigned compute resources while waiting for its data to be staged. Moab's data staging facilities prevent the loss of compute resources due to data blocking and can significantly improve cluster performance.

NOTE: Moab Workload Manager can only schedule data-staging operations if the involved resource managers are using Moab as their primary scheduler.  If this is not the case, then the capabilities described below will not be functional.

17.12.1 Data Staging Models

Moab supports, or plans to support, four different data staging models:

  1. Verified Data Staging (External)
  2. Prioritized Data Staging (Loose)
  3. Fully-Scheduled Data Staging (Tight)
  4. Data Staging to Allocated Nodes (Local)

The DATASTAGEMODEL parameter is used to configure Moab to use one of these models.

In each model, Moab handles data staging using a storage resource manager interface. This interface is configured using the RMCFG parameter. To actually drive the storage resource manager, a number of RM interface attributes must be set. The TYPE, RESOURCETYPE, and SYSTEMQUERYURL attributes must always be set. In addition, other attributes will be required depending on the data staging model used. Then, job submission resource managers can use this storage interface to stage data by specifying it with the DATARM attribute.

  • TYPE - must be NATIVE in all cases
  • RESOURCETYPE - must be set to STORAGE in all cases
  • SYSTEMQUERYURL - specifies method of determining file attributes such as size, ownership, etc.
  • CLUSTERQUERYURL - specifies method of determining current and configured storage manager resources such as available disk space, etc.
  • SYSTEMMODIFYURL - specifies method of initiating file creation, file deletion, and data migration
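
The always-required attributes listed above can be checked mechanically before Moab is restarted. The following is a minimal sketch in Python, not part of Moab: it assumes the configuration uses only the RMCFG[<name>] KEY=VALUE syntax shown in this section's examples, and the helper names parse_rmcfg and storage_problems are hypothetical.

```python
import re

# Values this section says are always required for a storage resource manager.
REQUIRED = {"TYPE": "NATIVE", "RESOURCETYPE": "STORAGE"}
RMCFG_LINE = re.compile(r"^RMCFG\[(?P<name>[^\]]+)\]\s+(?P<attrs>.*)$")


def parse_rmcfg(text):
    """Collect KEY=VALUE attributes for each RMCFG[<name>] entry."""
    managers = {}
    for raw in text.splitlines():
        match = RMCFG_LINE.match(raw.strip())
        if not match:
            continue  # skip "...", DATASTAGEMODEL, and blank lines
        attrs = managers.setdefault(match.group("name"), {})
        for token in match.group("attrs").split():
            key, sep, value = token.partition("=")
            if sep:
                attrs[key.upper()] = value
    return managers


def storage_problems(attrs):
    """Return a list of problems found in one storage manager entry."""
    problems = []
    for key, expected in REQUIRED.items():
        if attrs.get(key, "").upper() != expected:
            problems.append(f"{key} must be {expected}")
    if not attrs.get("SYSTEMQUERYURL"):
        problems.append("SYSTEMQUERYURL must be set")
    return problems


sample = """
RMCFG[torque] TYPE=PBS DATARM=data
DATASTAGEMODEL EXTERNAL
RMCFG[data] TYPE=NATIVE  RESOURCETYPE=STORAGE
RMCFG[data] SYSTEMQUERYURL=exec://$TOOLSDIR/system.query.dstage.pl
"""
managers = parse_rmcfg(sample)
storage_name = managers["torque"]["DATARM"]  # storage RM named by DATARM
print(storage_name, storage_problems(managers[storage_name]))
```

Attributes that depend on the data staging model (for example SYSTEMMODIFYURL) are deliberately not checked here.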

Moab is pre-packaged with several interface scripts that will work for many situations. These scripts are located in the tools directory (those beginning with dstage) and may be customized to fit your particular needs. To use these scripts, simply define a resource manager with the needed URL attribute pointing to the appropriate script.

17.12.2 Verified Data Staging (External)

In this model, an external data server entity is responsible for staging needed job data. Moab has no control or influence over the timing or execution of data staging decisions. It can only determine that a job has data staging requirements and avoid starting the job until it can verify that these requirements are met. This data staging model eliminates the situation where a job is assigned resources it is unable to immediately use.

To determine when the stage-in operation is complete, Moab uses a storage resource manager SYSTEMQUERYURL interface to retrieve information about the files being staged (see below for more information). Optionally, Moab will provide diagnostic information about the storage resource manager if the CLUSTERQUERYURL interface is specified.

To take advantage of Verified Data Staging, a job must be submitted with an indication of its stage-in data requirements. The resource manager extension STAGEIN is used to indicate a job's stage-in data files. This extension can be used directly by the user or inserted via a portal or submit filter. For an example, see the TORQUE submission filter page.

Example (w/TORQUE)

moab.cfg
...
RMCFG[torque] TYPE=PBS DATARM=data

DATASTAGEMODEL EXTERNAL
RMCFG[data] TYPE=NATIVE  RESOURCETYPE=STORAGE
RMCFG[data] SYSTEMQUERYURL=exec://$TOOLSDIR/system.query.dstage.pl
...

qsub
> qsub -W x="STAGEIN:file:///home/jsmith/big301.dat" job.cmd

1435.jupiter submitted 
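
A portal or submit wrapper could assemble the same submission programmatically. The sketch below is an illustration only: the function name qsub_with_stagein is hypothetical, it covers just the single-file form used in the example above, and it builds the command without running it.

```python
import shlex


def qsub_with_stagein(file_url, job_script):
    """Build the qsub argument list for a job with one stage-in file."""
    return ["qsub", "-W", f"x=STAGEIN:{file_url}", job_script]


argv = qsub_with_stagein("file:///home/jsmith/big301.dat", "job.cmd")
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(argv))
```

Passing the list form to a process launcher, rather than a single shell string, avoids having to quote the URL by hand.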

Diagnostics

Moab displays information about data staging in:

Checkjob

The checkjob command reports information on both input and output data stage requests. This information includes the following:

  • stage type - input or output
  • file name - reports destination file only
  • status - pending, active, or complete
  • file size - size of file to transfer
  • data transferred - for active transfers, reports number of bytes already transferred

Example

checkjob
$ checkjob -v 412
job 412 (RM job '412.geophys.icluster')

State: Idle
Creds:  user:test2  group:test2  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:16:40
SubmitTime: Mon Jun  6 15:11:24
  (Time Queued  Total: 00:00:56  Eligible: 00:00:39)

StageIn:  File=$HOME/data14.txt  Size=91 MB  Status=complete
...

Checknode

The checknode command will report information on storage managers' pending, active, and completed data stage requests as well as cluster resources dedicated to these requests. This information includes the following:

  • active and max storage manager data staging operations
  • dedicated and max storage manager disk usage
  • file name - reports destination file only
  • status - pending, active, or complete
  • file size - size of file to transfer
  • data transferred - for active transfers, reports number of bytes already transferred

Example

checknode
$ checknode -v storage.koa
node storage.koa

State:      Idle  (in current state for 00:01:59)
Configured Resources: DISK: 71G  dsop: 8
Utilized   Resources: DISK: 25G
Dedicated  Resources: ---
Active Data Staging Operations:  1 (limit: 8)
  job              410  complete (3091 bytes)  ($HOME/test.dat)
  job              411  complete (42 MB)  ($SCRATCH/modeldata.3)
  job              414  complete (813 MB)  ($SCRATCH/phys.john13)
  job              415  complete (16544 bytes)  ($SCRATCH/iolist.ng)
  job              419  complete (91 bytes)  ($HOME/data37.txt)
  job              422    active (37 of 83 MB)  ($SCRATCH/modeldata.4)

Dedicated Storage Manager Disk Usage:  938 of 73057 MB (Target=18264 MB)
Cluster Query URL:  exec:///$HOME/tools/dsquery.pl
Partition:  ALL  Rack/Slot:  ---
Flags:      rmdetected
RM[storage]:    TYPE=NATIVE:AGFULL

Total Time: 3:01:01:08  Up: 3:01:01:08 (100.00%)  Active: 00:00:00 (0.00%)

Reservations:  ---

...

17.12.3 Prioritized Data Staging (Loose)

In this model, Moab is assumed to have influence over the order in which data staging operations are executed. Moab still doesn't have full control over the staging, but is responsible for initiating the data staging operations for each job. Also, Moab assumes that the data server is unable to provide an accurate estimate of when a data migration request will be complete.

To allow Moab to initiate a data staging operation, a storage manager must be configured with the SYSTEMMODIFYURL and the SYSTEMQUERYURL attributes set. Further, if data manager throttling is desired, the CLUSTERQUERYURL attribute should be set to allow Moab to monitor data resource usage and prevent possible data cache thrashing.

If Moab detects a job with data stage-in requirements, it first checks that the job's assigned resource manager has a storage manager associated with it. If this is the case and the storage manager has the SYSTEMMODIFYURL attribute set, Moab attempts to stage the data through the interface defined by SYSTEMMODIFYURL and blocks the job until the staging operation is complete. Because this model allows Moab to explicitly request data migration actions, Moab can control when each request is made and, to some degree, have data staged according to batch system job prioritization and compute resource availability constraints. Consequently, Moab can seek to maximize the use of the data manager so as to optimize cluster performance and minimize response times for the most important jobs.

As mentioned above, if the CLUSTERQUERYURL attribute is set, Moab will monitor and control the disk usage on the storage resource manager. In addition to setting this attribute, you must create or modify the $MOABHOMEDIR/dataspaces.tab file to include any data space areas that you would like Moab to monitor. Multiple locations on remote nodes can be monitored for availability and disk space. (See the example below for syntax.)

Example 1: Prioritized Data Staging with Data Cache Constraints (w/TORQUE)

moab.cfg
...
RMCFG[torque] TYPE=PBS DATARM=data

DATASTAGEMODEL LOOSE
RMCFG[data] TYPE=NATIVE RESOURCETYPE=STORAGE
RMCFG[data] SYSTEMQUERYURL=exec://$TOOLSDIR/system.query.dstage.pl
RMCFG[data] CLUSTERQUERYURL=exec://$TOOLSDIR/cluster.query.dstage.pl
RMCFG[data] SYSTEMMODIFYURL=exec://$TOOLSDIR/system.modify.dstage.pl
...

dataspaces.tab
# FORMAT: <protocol>://<host>/<remote_path> STATE=active

scp://head_node/home/ STATE=active
scp://data_node/cluster/users/storage/ STATE=active

qsub
> qsub -W x="STAGEIN:file:///tmp/big01.dat|file:///tmp/big02.dat,file:///home/test/" chembio.cmd

1455.jupiter submitted 
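
The dataspaces.tab format shown in this example is simple enough to read with a few lines of code, which can be useful when auditing which locations Moab has been asked to monitor. This is a sketch only: it assumes every entry follows the <protocol>://<host>/<remote_path> STATE=<state> layout from the FORMAT comment, and parse_dataspaces is a hypothetical helper, not a Moab tool.

```python
from urllib.parse import urlsplit


def parse_dataspaces(text):
    """Parse dataspaces.tab lines into (protocol, host, path, state) tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        parts = urlsplit(fields[0])
        state = None
        for token in fields[1:]:
            key, sep, value = token.partition("=")
            if sep and key.upper() == "STATE":
                state = value
        entries.append((parts.scheme, parts.netloc, parts.path, state))
    return entries


sample = """\
# FORMAT: <protocol>://<host>/<remote_path> STATE=active

scp://head_node/home/ STATE=active
scp://data_node/cluster/users/storage/ STATE=active
"""
for entry in parse_dataspaces(sample):
    print(entry)
```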

Example 2: Prioritized Data Staging with Data Cache and Transfer Agent Constraints

A given site uses a hierarchical storage manager (HSM) in conjunction with a single large SMP system. Preliminary monitoring indicates that only 25% of SMP to HSM traffic is input file based and 75% is output file based. The site also currently manages data stageback using a homegrown solution which stages data back after job completion. Consequently, in order to free up compute resources at the earliest time possible, Moab needs to prestage input data intelligently while ensuring that the total staged data does not exceed 25% of total SMP disk resources.

In addition, the HSM system is known to perform best with 8 or fewer active data transfer agents. When this value is exceeded, some level of thrashing appears and performance is reduced. In the following configuration, the MAXDSOP attribute is used to prevent more than 8 simultaneous stage-in requests and the TARGETUSAGE attribute prevents more than 25% of available disk resources from being consumed by input data staging requests.

moab.cfg
...
RMCFG[smp] DATARM=hsm

DATASTAGEMODEL LOOSE
RMCFG[hsm] TYPE=NATIVE RESOURCETYPE=STORAGE
RMCFG[hsm] TARGETUSAGE=25%  MAXDSOP=8
RMCFG[hsm] SYSTEMQUERYURL=exec://$TOOLSDIR/system.query.dstage.pl
RMCFG[hsm] CLUSTERQUERYURL=exec://$TOOLSDIR/cluster.query.dstage.pl
RMCFG[hsm] SYSTEMMODIFYURL=exec://$TOOLSDIR/system.modify.dstage.pl
...
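
The effect of the two limits in this example can be illustrated with a small admission check. This is a sketch of the idea only, not Moab's internal algorithm: the function name can_start_stage_in and its parameters are hypothetical, and the 25% budget is the figure given in the description above.

```python
def can_start_stage_in(active_ops, max_dsop, dedicated_mb, total_mb,
                       target_fraction, request_mb):
    """Return True if one more stage-in fits both limits."""
    if active_ops >= max_dsop:
        return False  # transfer agent limit (MAXDSOP) reached
    target_mb = int(total_mb * target_fraction)  # disk budget for staged input
    return dedicated_mb + request_mb <= target_mb


# Figures borrowed from the checknode sample output earlier in this section:
# 938 MB dedicated of 73057 MB total, 1 active operation, limit of 8.
print(int(73057 * 0.25))                                  # disk budget in MB
print(can_start_stage_in(1, 8, 938, 73057, 0.25, 813))    # room for 813 MB?
print(can_start_stage_in(8, 8, 938, 73057, 0.25, 813))    # all agents busy
```

With a 73057 MB cache, a 25% budget works out to 18264 MB, which matches the Target value shown in the checknode sample output.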

Example 3: Grid Data Staging

moab.cfg
...
SCHEDCFG[source] MODE=NORMAL SERVER=gridhead:5353
ADMINCFG[1] USERS=sys

RMCFG[base] TYPE=PBS

RMCFG[cluster3] SERVER=moab://gridcluster3:5353 DATARM=c3storage

DATASTAGEMODEL LOOSE
RMCFG[c3storage] TYPE=NATIVE RESOURCETYPE=STORAGE
RMCFG[c3storage] SYSTEMQUERYURL=exec://$TOOLSDIR/dstage-ssh.systemquery.pl
RMCFG[c3storage] CLUSTERQUERYURL=exec://$TOOLSDIR/dstage-ssh.clusterquery.pl
RMCFG[c3storage] SYSTEMMODIFYURL=exec://$TOOLSDIR/dstage-ssh.systemmodify.pl
...

17.12.4 Fully-Scheduled Data Staging (Tight)

With fully-scheduled data staging, Moab is able to tightly control how needed data files are managed and can schedule and guarantee their availability to batch jobs. In this model, data migration time estimates are provided by the data manager for each transfer request or are calculated internally by Moab. With this information, Moab is able to schedule jobs with satisfied data requirements immediately, and to reserve compute resources for other jobs for the time when their needed data will be ready.

From a Moab configuration point of view, there are no changes to the configuration. However, the interfaces called via the URL attributes must provide additional information. The response of the CLUSTERQUERYURL must return configured and available disk cache space as well as optionally report a default data transfer rate.

In addition, the response of the SYSTEMQUERYURL must report the size of the queried file and, if a default data transfer rate is not reported via the CLUSTERQUERYURL interface, SYSTEMQUERYURL must report an estimated stage time.
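
The arithmetic behind such an estimate can be sketched as follows. This is an illustration under stated assumptions, not Moab's scheduler logic: the function names are hypothetical, times are plain seconds, the transfer rate is in KB/s, and 1 KB is taken to be 1024 bytes.

```python
def stage_seconds(size_bytes, rate_kb_per_s):
    """Estimate transfer time; return None when no usable rate is known."""
    if rate_kb_per_s <= 0:
        return None
    return size_bytes / (rate_kb_per_s * 1024)


def earliest_start(now, compute_free_at, size_bytes, rate_kb_per_s):
    """Earliest time at which both the compute resources and the data are ready."""
    seconds = stage_seconds(size_bytes, rate_kb_per_s)
    if seconds is None:
        return None  # an explicit stage time estimate is needed instead
    return max(compute_free_at, now + seconds)


size = 83 * 1024 * 1024  # an 83 MB input file
rate = 10240.0           # hypothetical default rate: 10 MB/s expressed in KB/s
print(stage_seconds(size, rate))
print(earliest_start(0, 5, size, rate))   # data arrives after the nodes free up
print(earliest_start(0, 60, size, rate))  # nodes free up after the data arrives
```

When no default rate is available, the sketch returns None, mirroring the case in which the system query interface has to supply the stage time itself.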

With this information, Moab orchestrates data and compute resource activity to eliminate bottlenecks and maximize cluster performance. Most activity occurs behind the scenes with no user or admin involvement. If failures are encountered, they are reported via standard Moab notification mechanisms and can also be viewed via mdiag -R.

17.12.5 Data Staging to Allocated Nodes (Local)

The final data staging model planned to be implemented in Moab assumes that data is to be scheduled directly to the local disk located on the allocated compute nodes. As such, many additional considerations must be taken into account so as to prevent interruption of service to other jobs, maximize cluster utilization, and minimize response time. From a scheduling point of view, the cost of data migration must be determined in terms of I/O, memory, CPU, and network usage and this migration must be throttled so as to not interfere with active workload already running on the cluster.

To guide this scheduling, and after this feature is fully implemented, site administrators may need to tune Moab parameters so as to indicate what resources are consumed in order to migrate data and what usage levels are acceptable. From an end-user's point of view, the only change is that the data stage-in URL must be set with $LOCAL as the destination host. It is planned that by default, with $LOCAL set, all data will be staged directly to the master node allocated to the job.

17.12.6 Interface Scripts for a Storage Resource Manager

Moab's data staging capabilities can utilize up to 3 different native resource manager interfaces. The use of these interfaces is detailed below.

17.12.6.1 Cluster Query Interface

In data staging, the cluster query interface is used to obtain information about the data resources including configured storage (disk) space, available storage (disk) space, and possibly data transfer rate. If the CLUSTERQUERYURL points to a script, the script will be passed no arguments and will return a WIKI based resource line describing available storage resources. The Moab tools directory contains a sample data staging cluster query script which can be customized to meet site specific needs.
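
A cluster query script along these lines might look like the sketch below. Treat it as a starting point to compare against the sample script in the tools directory: the node name "storage", the "/" cache location, and the use of the STATE, CDISK, and ADISK attribute names (configured and available disk in MB) are assumptions made for illustration, and no data transfer rate is reported.

```python
import shutil


def wiki_storage_line(name, total_bytes, free_bytes):
    """Format one resource line with configured and available disk in MB."""
    mb = 1024 * 1024
    return f"{name} STATE=Idle CDISK={total_bytes // mb} ADISK={free_bytes // mb}"


if __name__ == "__main__":
    # The script receives no arguments; it inspects the cache location itself.
    usage = shutil.disk_usage("/")  # hypothetical cache location
    print(wiki_storage_line("storage", usage.total, usage.free))
```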

17.12.6.2 System Query Interface

The system query interface is used to query generic compute resource objects. In the case of data staging, it is used to obtain information about data staging files. In particular, this interface returns information regarding file access, file size, and possibly estimated data staging time. If the SYSTEMQUERYURL points to a script, the script will be passed 5 arguments as described in the table below. The script may use or ignore these arguments as needed according to site specific needs.

Argument Number | Argument Name | Description
1 | SubCommand | Describes type of query to be performed. For data staging, this argument is either filesize or stagetime.
2 | User | Username under which this script's results will be calculated (usually owner of the job).
3 | Source File URL | URL pointing to the source file, containing file access protocol information, etc.
4 | Destination File URL (used only with stagetime subcommand) | URL pointing to the destination file, containing file access protocol information, etc.
5 | TransferRate (used only with stagetime subcommand) | Approximate rate of data transfer as reported via the cluster query interface or configured within Moab. This rate is reported in KB/s and, if not specified, will be reported as 0.0.

The output of this interface depends on the given subcommand. For filesize it is a string in the format <FILESIZE>, giving the size of the file in bytes. For stagetime it is a string in the format <STAGETIME>[,<FILESIZE>]. If stagetime is unknown, the interface should return a value of -1. If filesize is unknown, no value should be reported.  The Moab tools directory contains a sample data staging system query script that can be customized to meet site specific needs.
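
The argument handling and output formats described above can be sketched as follows. This is not the packaged sample script: it handles only file:// URLs on the local file system, assumes the stage time is expressed in whole seconds, and the function names are hypothetical.

```python
import math
import os
import sys
from urllib.parse import urlsplit


def local_path(url):
    """Return the path of a file:// URL, or None for other protocols."""
    parts = urlsplit(url)
    return parts.path if parts.scheme == "file" else None


def system_query(subcommand, user, source_url, dest_url="", rate_kb_s="0.0"):
    """Return the reply string for the filesize or stagetime subcommand."""
    # user and dest_url are accepted but ignored in this sketch.
    path = local_path(source_url)
    size = os.path.getsize(path) if path and os.path.isfile(path) else None
    if subcommand == "filesize":
        return "" if size is None else str(size)  # unknown size: report nothing
    if subcommand == "stagetime":
        rate = float(rate_kb_s)  # KB/s; 0.0 means no rate was supplied
        if size is None:
            return "-1"
        if rate <= 0:
            return f"-1,{size}"  # stage time unknown, size known
        return f"{math.ceil(size / (rate * 1024))},{size}"
    raise ValueError(f"unsupported subcommand: {subcommand}")


if __name__ == "__main__" and len(sys.argv) > 1:
    print(system_query(*sys.argv[1:6]))
```

Invoked with the filesize subcommand, the sketch prints the size in bytes; with stagetime, it prints the estimate followed by the size, or -1 when no estimate can be made.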

17.12.6.3 System Modify Interface

The system modify interface is used to manipulate generic compute resource objects. In the case of data staging, it is used to request that the data server initiate data staging and to clean up stale data files. If the SYSTEMMODIFYURL points to a script, the script will be passed 4 arguments as described in the table below. The script may utilize or ignore these arguments as needed according to site specific needs.

Argument Number | Argument Name | Description
1 | SubCommand | Describes type of query to be performed. For data staging, this argument will always be set to either stage or remove.
2 | User | Username under which this script's actions will be performed (usually owner of the job).
3 | Source File URL | For the command stage, this is the URL for the source file. For the remove command, this is the URL for the file to be removed.
4 | Destination File URL (used only with stage subcommand) | For the command stage, this is the URL for the destination file.

The output of this interface is identical to the output of the system query interface. See that section for details. The Moab tools directory contains a sample data staging system modify script which can be customized to meet site specific needs.
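
A system modify script handling the two subcommands might be sketched as below. It is an illustration only: it supports just file:// URLs on the local file system, ignores the User argument, and assumes that a successful stage replies with the staged file's size in bytes and that remove replies with nothing. Check the sample script in the tools directory for the reply Moab actually expects.

```python
import os
import shutil
from urllib.parse import urlsplit


def local_path(url):
    """Return the path of a file:// URL; other protocols are not handled."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"only file:// URLs are handled in this sketch: {url}")
    return parts.path


def system_modify(subcommand, user, source_url, dest_url=""):
    """Carry out a stage or remove request and return the reply string."""
    if subcommand == "stage":
        dest = local_path(dest_url)
        shutil.copyfile(local_path(source_url), dest)
        return str(os.path.getsize(dest))  # assumed reply: staged size in bytes
    if subcommand == "remove":
        os.remove(local_path(source_url))
        return ""  # assumed reply: nothing to report
    raise ValueError(f"unsupported subcommand: {subcommand}")
```

A real script would replace the local copy with the site's transfer mechanism and would run the transfer under the requested user.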

17.12.7 Submitting Jobs which Request Data Staging Services

Jobs submitted directly by end users, from grid schedulers, or via application or user portals may request intelligent data staging of input files (stage-in) by using the STAGEIN resource manager extension.  This alerts Moab that the job cannot start until the input data files are staged in by the SYSTEMMODIFYURL interface.

All jobs submitted to a resource manager that has an associated storage manager may also take advantage of implicit stage-out of standard out and standard error files. If the associated storage manager is configured with a SYSTEMMODIFYURL interface, then when the job completes successfully, the standard out and error files will be transferred automatically back to the user's home directory. This feature is especially useful for disparate clusters in a grid environment. (NOTE: This implicit stage-out feature is not currently available for all resource managers.)