[Moabusers] Invalid job running in SR

Matthew Britt msbritt at umich.edu
Fri Aug 25 12:33:13 MDT 2006


As an additional note, mdiag does pick up the user restrictions from
torque:

[root@cac-admin02 log]# mdiag -v -c violi
Class/Queue Status

ClassID        Priority Flags        QDef              QOSList*  PartitionList        Target Limits

violi                 0 ---          ---                   ---   ---                   0.00     ---
   REQUIREDUSERLIST=choesh,ckcumaa,fiedler
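
For what it's worth, here is a rough sketch of how one might cross-check a
job's owner against that list when auditing jobs found inside the SR. It
only parses text that has already been captured from mdiag, and it assumes
the owner string has the user@host form that qstat -f prints as Job_Owner.
The function names and the "someuser" account are made up for illustration;
the actual owner of the job in question is not shown in this thread.

```python
import re


def required_users(mdiag_class_output):
    """Return the set of users named in REQUIREDUSERLIST, or an empty set."""
    match = re.search(r"REQUIREDUSERLIST=(\S+)", mdiag_class_output)
    if match is None:
        return set()
    return set(match.group(1).split(","))


def owner_allowed(job_owner, allowed):
    """Compare the user part of a user@host owner string against the list."""
    user = job_owner.split("@", 1)[0]
    return user in allowed


# Text captured from `mdiag -v -c violi` (copied from the output above).
sample = """\
violi                 0 ---          ---                   ---   ---                   0.00     ---
   REQUIREDUSERLIST=choesh,ckcumaa,fiedler
"""

users = required_users(sample)
print(sorted(users))
# "someuser" is a hypothetical account, not one taken from the logs.
print(owner_allowed("fiedler@nyx.engin.umich.edu", users))
print(owner_allowed("someuser@nyx.engin.umich.edu", users))
```

A job whose owner fails this check but is still running on the reserved
nodes would be one of the jobs worth chasing in the moab logs.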

Thanks,
   - matt

On Aug 25, 2006, at 10:09 AM, Matthew Britt wrote:

> We're trying to figure out why certain jobs are running in SRs when
> they aren't supposed to be.  We've defined queue-based ACLs in
> torque to limit which users can submit jobs, then defined SRs which
> require that class.  In this case, the queue/class is called violi.
>
> [root@cac-admin02 log]# mdiag -r violi.2254
> Diagnosing Reservations
> RsvID                      Type Par   StartTime     EndTime      Duration Node Task Proc
> -----                      ---- ---   ---------     -------      -------- ---- ---- ----
> violi.2254                 User nyx -1:17:47:48  1:14:18:55    3:08:06:43   16   16   32
>     Flags: STANDINGRSV,SPACEFLEX,DEDICATEDRESOURCE,ISACTIVE
>     ACL:   RSV==violi.2254= CLASS==violi+
>     CL:    RSV==violi.2254
>     FLIST=cac
>     Task Resources: PROCS: 2
>     Active PH: 1996.66/4126.19 (48.39%)
>     SRAttributes (TaskCount: 16  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
>     Rsv-Group: violi
>
> SR definition in moab:
> SRCFG[violi]    CLASSLIST=violi
> SRCFG[violi]    NODEFEATURES=cac
> SRCFG[violi]    RESOURCES=PROCS:2
> SRCFG[violi]    TASKCOUNT=16
> SRCFG[violi]    PERIOD=WEEK
> SRCFG[violi]    DEPTH=2
> SRCFG[violi]    ACCESS=DEDICATED
> SRCFG[violi]    FLAGS=SPACEFLEX,DEDICATEDRESOURCE
>
>
> The job in question was submitted after the SR started, so it  
> couldn't have had a reservation prior to the SR being created.
> Job Id: 14978.nyx.engin.umich.edu
>     Job_Name = wh8x90y30
>     [snip]
>     qtime = Thu Aug 24 18:37:29 2006
>
> This problem only seems to happen on SRs that are free-floating.
> We use the cac feature to mark nodes we own (as opposed to
> privately owned ones), so reservations such as these can slide around
> and maintain their task counts if a node goes down.  Our
> node-locked reservations (using either a HOSTLIST or a feature
> limited to a set of privately held nodes) are not violated.
>
> Any ideas on how to debug why certain jobs are allowed to use these  
> reserved nodes or why this would be happening?
>
> Thanks,
>  - matt


