TORQUE/Moab Integration with Altix CPU Sets

1.0 SGI Altix CPUSet

A cpuset is a named set of CPUs, which may be defined as restricted (EXCLUSIVE) or open. A restricted cpuset allows only processes that are members of the cpuset to run on its CPUs. An open cpuset allows any process to run on its CPUs, but a process that is a member of the cpuset can run only on the CPUs belonging to the cpuset.

2.0 TORQUE Integration

With TORQUE 1.2.0p5 and higher, no special integration should be required.  Moab and TORQUE should automatically detect and properly manage Altix cpusets.

   TORQUE 1.2.0p5 does not support cpusets that span multiple hosts and does not yet support static cpusets, only dynamic cpusets.

   With TORQUE 1.2.0p5 and higher, there is no need to add a submit filter/qsub wrapper to use cpusets.

   Maui Scheduler should also work with TORQUE 1.2.0p5 and higher with no additional configuration.

2.1 Altix CPUSet Implementation for TORQUE

The initial implementation for native cpusets on SGI Altix requires that the user request processors using ncpus when submitting a job, rather than using the nodes=X notation.

# bad qsub
> qsub -l nodes=1:ppn=4

# good qsub
> qsub -l ncpus=4 

   The 'ppn' limitation is temporary and will be addressed in TORQUE 1.2.0p6.

The pbs_mom will read the ncpus request and create a cpuset as large as the request. The pbs_mom will then ensure that all processes spawned by the job are attached to the cpuset for that job.  At the end of the job, pbs_mom will delete the cpuset and release any attached processes.

2.2 Compiling CPUSet in torque-1.2.0p5

To compile TORQUE with Altix cpuset support, add "#define PENABLE_DYNAMIC_CPUSETS 1" to src/include/pbs_config.h. The string "-lcpuset" will also need to be added to the MOMLIBS line in src/resmom/Makefile.  This process will be handled via configure in future releases.

2.3 Verification of CPUSet Create on Batch Node

To verify that the cpuset was correctly created on a batch node, use the SGI-provided command "cpuset", which can list all the active cpusets.

The command "cpuset -Q" should show something like the following:

cpuset -Q
[root@host root]# cpuset -Q
CPUSET Queues:
   QUEUE[re313635]
   QUEUE[jhc13622]
   QUEUE[jco13641]

Use "cpuset -q <ID> -Q" to show the processors attached to a cpuset.

cpuset -q X -Q
[root@host root]# cpuset -q re313635 -Q
CPUSET Queue[re313635] contains 4 CPUs:
   CPU[300]
   CPU[301]
   CPU[302]
   CPU[303]

Use "cpuset -q <ID> -l" to show the processes attached to a cpuset.

cpuset -q X -l
[root@host root]# cpuset -q re313635 -l
Processes Attached to CPUSET Queue [re313635]:
-----------------------------------------
 15001
 15160
 15320
 15337
 15340
 15341
 15342
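
When many jobs are running, it can be convenient to check these listings with a script. The following is a minimal Python sketch that extracts the queue names and CPU numbers from text in the format shown above. The sample strings are copied from the output in this section; the function names parse_queues and parse_cpus are hypothetical, and the sketch assumes the output format does not differ from the examples here.

```python
import re

# Sample text copied from the "cpuset -Q" output shown in this section.
SAMPLE_QUEUES = """\
CPUSET Queues:
   QUEUE[re313635]
   QUEUE[jhc13622]
   QUEUE[jco13641]
"""

# Sample text copied from the "cpuset -q re313635 -Q" output shown in this section.
SAMPLE_CPUS = """\
CPUSET Queue[re313635] contains 4 CPUs:
   CPU[300]
   CPU[301]
   CPU[302]
   CPU[303]
"""


def parse_queues(text):
    """Return the cpuset names found in 'QUEUE[name]' entries."""
    return re.findall(r"QUEUE\[([^\]]+)\]", text)


def parse_cpus(text):
    """Return the CPU numbers found in 'CPU[n]' entries, one per line."""
    return [int(n) for n in re.findall(r"^\s*CPU\[(\d+)\]", text, flags=re.MULTILINE)]


if __name__ == "__main__":
    print(parse_queues(SAMPLE_QUEUES))
    print(parse_cpus(SAMPLE_CPUS))
```

Comparing the number of CPUs returned by parse_cpus with the ncpus value the job requested is one way to confirm that pbs_mom created a cpuset of the expected size.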

3.0 PBSPro Integration

3.1 PBSPro CPUSet Primer

When PBSPro is configured to support SGI Altix cpusets, some caution must be taken when using Moab. PBSPro will create a cpuset based on the number of processors and the amount of memory requested. In some environments, each processor on the compute machines has at least 2GB of resident memory. During job submission, if the user requests more memory than (95% * 2GB * number of processors requested), PBSPro will dedicate additional processors to meet the user's memory requirement.

For example, suppose a user requests 2 processors and 16GB of memory. Two processors account for only 4GB of the needed 16GB. In a time-shared environment, it is undesirable to have more than one job using a given location in memory. To prevent jobs from trampling on each other's memory and caches, PBSPro will allocate an additional six processors to account for the 12GB of memory still needed to fulfill the user's request. Since the additional processors are part of the exclusive cpuset, only the processes associated with the user's job can access the memory on those additional processors. This is a crude implementation of memory fencing. Since the cpuset is exclusive, no other jobs on the machine can use the resources associated with this job. Any attempt to migrate another user's processes to this cpuset will result in an error.

In this case, there will be idle processors that are used only to satisfy memory requirements and perform no computation. Locking memory within an SGI cpuset works well in a time-shared environment since it protects resources from others. However, PBSPro does not report this behavior through any channel, command, or API. The scheduler assumes that the example job above is using only two processors when it actually resides on eight. The scheduler therefore thinks it has six processors free for scheduling other jobs. When the scheduler attempts to schedule jobs on those six processors, PBSPro will reject the job, stating that there are no free resources available.

3.2 PBSPro Submission Filter

To get around the PBSPro limitation, a workaround is required to notify Moab of the correct number of utilized processors. A qsub wrapper / submission filter can be configured to calculate the "true processor count" needed to fulfill a large memory request and append the result to the job as a generic resource called "cpuset=". The generic resource acts as a hint that notifies Moab of the actual memory-adjusted processor count. The wrapper does not look at memory alone, since some jobs do not request large amounts of memory. To account for most cases, the number of processors needed to fulfill the memory request is calculated and compared to the number of processors actually requested by the user. The higher number is used as the value for the generic resource, "cpuset=". For the example job above, the actual string would look something like "-l cpuset=8". (Example CPUSet Wrapper Snippet)
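
The calculation described above can be sketched in Python as follows. This is an illustration only, not the wrapper used at any particular site: the function names are hypothetical, and the rounding rule (one processor per 2GB of requested memory once the 95% threshold is exceeded) is an assumption chosen to reproduce the 2-processor, 16GB example from section 3.1. The exact rule PBSPro applies should be confirmed against the PBSPro documentation.

```python
import math

GB_PER_CPU = 2.0   # from the text: about 2GB of resident memory per processor
THRESHOLD = 0.95   # from the text: the 95% factor applied to the per-processor memory


def true_processor_count(ncpus, mem_gb):
    """Return the memory-adjusted processor count for a job.

    Assumption: once the memory request exceeds 95% of the memory owned by
    the requested processors, one processor is needed per 2GB of memory.
    """
    if mem_gb > THRESHOLD * GB_PER_CPU * ncpus:
        mem_cpus = math.ceil(mem_gb / GB_PER_CPU)
    else:
        mem_cpus = ncpus
    # The higher of the two counts becomes the value of the generic resource.
    return max(ncpus, mem_cpus)


def cpuset_option(ncpus, mem_gb):
    """Build the resource string a submission filter would append."""
    return "-l cpuset=%d" % true_processor_count(ncpus, mem_gb)


if __name__ == "__main__":
    # The example job from section 3.1: 2 processors and 16GB of memory.
    print(cpuset_option(2, 16))
```

For the example job, the sketch yields "-l cpuset=8". A job that requests 4 processors and only 1GB of memory keeps its requested count, giving "-l cpuset=4".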

3.3 PBSPro resourcedef

In order for PBSPro to understand the cpuset generic resource, put the following entry into the /var/spool/pbs/server_priv/resourcedef file. For the syntax of the resourcedef file, consult the PBSPro documentation. Adding this entry allows "cpuset=X" to be specified as a resource with "-l" at job submission.

/var/spool/pbs/server_priv/resourcedef
cpuset type=long flag=q

3.4 Moab

Once the job is submitted with the generic resource above, the true processor count of any job can be accessed through the PBS API. Moab can then track the true processor utilization via the generic resource, cpuset=X. The only remaining information Moab needs is the maximum number of cpuset resources per machine. This information is provided to Moab through the native interface.

/usr/local/moab/node.dat
co-compute1 STATE=idle CRES=cpuset:512 ARES=cpuset:512
co-compute2 STATE=idle CRES=cpuset:512 ARES=cpuset:512
co-login1 STATE=idle CRES=cpuset:24 ARES=cpuset:24
...

moab.cfg
RMCFG[cpuset]  TYPE=NATIVE FLAGS=slave
RMCFG[cpuset]  CLUSTERQUERYURL=file:///usr/local/moab/node.dat

Moab now knows the maximum number of cpusets per machine and will enforce it.
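
On a site with many hosts, the node.dat file could be generated rather than written by hand. The following Python sketch renders lines in the same format as the node.dat example above. The function names and the NODES list are hypothetical, and the sketch assumes, as in that example, that ARES is set to the same value as CRES and that every node is reported as idle.

```python
# Host names and cpuset counts copied from the node.dat example in this section.
NODES = [
    ("co-compute1", 512),
    ("co-compute2", 512),
    ("co-login1", 24),
]


def node_line(host, cpusets, state="idle"):
    """Format one node.dat line; ARES mirrors CRES as in the example."""
    return "%s STATE=%s CRES=cpuset:%d ARES=cpuset:%d" % (host, state, cpusets, cpusets)


def render_node_dat(nodes):
    """Return the full file contents, one line per (host, cpusets) pair."""
    return "\n".join(node_line(host, count) for host, count in nodes) + "\n"


if __name__ == "__main__":
    print(render_node_dat(NODES), end="")
```

The output can be written to the path named in CLUSTERQUERYURL, which is /usr/local/moab/node.dat in the configuration above.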
