[Moabusers] moab not sleeping

Brock Palen brockp at umich.edu
Wed Aug 16 10:40:21 MDT 2006


Here are the results of mdiag -R -v:

[root at cac-admin02 log]# mdiag -R  -v
diagnosing resource managers

RM[nyx]  State: Active
   Type:               PBS  ResourceType: COMPUTE
   Server:             nyx.engin.umich.edu
   Version:            '2.1.2'
   Nodes Reported:     346 (762 procs)
   Flags:              executionServer,noTaskOrdering
   Partition:          nyx
   Event Management:   EPORT=15004
   NOTE:  SSS protocol enabled
   DefaultClass:       route
   Total Jobs Started: 38
   RM Performance:     AvgTime=0.00s  MaxTime=38.18s  (3109443 samples)
   RM Languages:       PBS

RM[nyx] Failures:
   jobstart         (6 of 274 failed)
       -20:30:43  'cannot start job 11259 - RM failure, rc: 15044,  
msg: 'Resource temporarily unavailable REJHOST=nyx306 MSG=cannot  
allocate node 'nyx306' to job - node not currently available (nps  
needed/free: 2/0,  joblist: 11257.nyx.engin.umich.edu: 
0,11257.nyx.engin.umich.edu:1)''
       -19:55:49  'cannot start job 11276 - RM failure, rc: 15044,  
msg: 'Resource temporarily unavailable REJHOST=nyx306 MSG=cannot  
allocate node 'nyx306' to job - node not currently available (nps  
needed/free: 2/0,  joblist: 11274.nyx.engin.umich.edu: 
0,11274.nyx.engin.umich.edu:1)''
        -7:26:23  'cannot start job 11244 - RM failure, rc: 15041,  
msg: 'Execution server rejected request MSG=connection to mom timed  
out''
        -7:26:23  'cannot start job 11305 - RM failure, rc: 15041,  
msg: 'Execution server rejected request MSG=connection to mom timed  
out''
        -7:18:01  'cannot start job 9133 - RM failure, rc: 15041,  
msg: 'Execution server rejected request MSG=connection to mom timed  
out''
        -7:18:01  'cannot start job 9134 - RM failure, rc: 15044,  
msg: 'Resource temporarily unavailable REJHOST=nyx050 MSG=cannot  
allocate node 'nyx050' to job - node not currently available (state:  
down)''
   clusterquery     (1 of 810 failed)
        -3:55:39  'cannot load cluster info - pbs_errno=0'
   queuequery       (1 of 810 failed)
        -3:55:39  'cannot get queue info - no data available'


NOTE:  use 'mrmctl -f messages <RMID>' to clear stats/failures
AM[bank]  type: 'GOLD'  state: 'Active'
   ALERT:  no security algorithm specified (see moab-private.cfg)
   ALERT:  no secret key specified (see moab-private.cfg)
   socketprotocol: 'HTTP'  wireprotocol: 'SSS2'
   Version: '1'
   AM Performance:  Avg Time: 8.41s  Max Time:  15.00s  (68 samples)

AM[bank] Failures:
   Wed Aug 16 11:01:39  joballocreserve  'cannot read message header'
   Wed Aug 16 11:01:39  joballocreserve  'cannot read message header'


Most of these messages look old.
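
One way to check how old each failure is would be to parse the leading time field of every entry. The following is only a minimal sketch: it assumes that field is a relative offset of the form -H:MM:SS (time before now), and the function names and the one-hour cutoff are made up for illustration. The AM failures, which carry absolute dates, are not handled.

```python
import re

# Matches the first line of an entry such as:
#   -20:30:43  'cannot start job 11259 - RM failure, rc: 15044,
# Assumption: the leading field is a relative offset (hours:minutes:seconds ago).
FAILURE_RE = re.compile(r"^\s*-(\d+):(\d{2}):(\d{2})\s+'(.*)$")


def failure_ages(mdiag_text):
    """Return (age_in_seconds, first_line_of_message) for each failure entry."""
    entries = []
    for line in mdiag_text.splitlines():
        m = FAILURE_RE.match(line)
        if m:
            hours, minutes, seconds = (int(g) for g in m.groups()[:3])
            age = hours * 3600 + minutes * 60 + seconds
            entries.append((age, m.group(4).strip().strip("'")))
    return entries


def recent_failures(mdiag_text, max_age_seconds=3600):
    """Keep only failures newer than the (hypothetical) cutoff."""
    return [e for e in failure_ages(mdiag_text) if e[0] <= max_age_seconds]


sample = """\
       -20:30:43  'cannot start job 11259 - RM failure, rc: 15044,
        -7:26:23  'cannot start job 11244 - RM failure, rc: 15041,
        -3:55:39  'cannot load cluster info - pbs_errno=0'
"""

for age, message in failure_ages(sample):
    print(f"{age / 3600:5.1f} h ago  {message}")

# With the default one-hour cutoff none of the sample entries count as recent.
print(recent_failures(sample))
```

With the entries above, the newest RM failure is almost four hours old, so none of them line up with the 10:13 sprint in the log.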

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


On Aug 16, 2006, at 11:53 AM, wightman wrote:

> I would start by running the command
>
> mdiag -R -v
>
> And checking for any errors on the Resource Manager interface.  Is Moab
> having problems with TORQUE?
>
> Also, you should be able to grep out ALERT and WARNING from the moab log
> files to check for anything out of the ordinary.
>
> Let us know what you find.
>
> Thanks,
>
>
> - Douglas
>
> On Wed, 2006-08-16 at 11:32 -0400, Brock Palen wrote:
>> Is there a way to find out why moab decided to keep going?  It
>> happens so often and lasts so long that most of the moab commands (mdiag,
>> showres, etc.) don't work.  If there is a problem with a job starting, I
>> want to know which job and why it can't start.
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>> On Aug 16, 2006, at 11:06 AM, wightman wrote:
>>
>>> Moab does not always take a break between iterations.  For instance, if
>>> there is a job that is failing to start, Moab may push the next
>>> iteration to start early, and try starting the job again.  If TORQUE
>>> tells Moab that a new job has entered the queue Moab will normally
>>> schedule it immediately.
>>>
>>> If Moab is continually skipping its sleep cycle, and you can see Moab
>>> chewing up lots of CPU, then there may be an issue.
>>>
>>> - Douglas
>>>
>>> On Wed, 2006-08-16 at 10:26 -0400, Brock Palen wrote:
>>>> Every so often we see moab go on a sprint and not sleep between
>>>> iterations. Below is a snippet of the log:
>>>>
>>>> 08/16 10:13:25 INFO:     total jobs selected in partition ALL:  28/28
>>>> 08/16 10:13:25 INFO:     iteration:  114   scheduling time:  1.046 seconds
>>>> 08/16 10:13:25 INFO:     current util[114]:  179/339 (52.80%)  PH: 40.59%  active jobs: 182 of 210 (completed: 20563)
>>>> 08/16 10:13:25 ALERT:    node 'nyx180' sync from expected state 'Idle' to state 'Running' at Wed Aug 16 10:13:24
>>>> 08/16 10:13:25 INFO:     scheduling complete.  sleeping 90 seconds
>>>> 08/16 10:13:25 INFO:     starting iteration 115 (loglevel=2)
>>>> 08/16 10:13:25 INFO:     PBS data updated for iteration 115
>>>> 08/16 10:13:25 INFO:     346 PBS resources detected on RM nyx
>>>> 08/16 10:13:25 INFO:     resources detected: 346
>>>> 08/16 10:13:25 INFO:     0 PBS classes/queues detected on RM nyx
>>>> 08/16 10:13:25 INFO:     queues detected: 0
>>>>
>>>> Notice that no time passes between the "sleeping 90 seconds" message
>>>> and the start of iteration 115.
>>>>
>>>>
>>>> Brock Palen
>>>> Center for Advanced Computing
>>>> brockp at umich.edu
>>>> (734)936-1985
>>>>
>>>>
>>>> _______________________________________________
>>>> moabusers mailing list
>>>> moabusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/moabusers
>>>
>>>
>>>
>>
>
>
>
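
The symptom in the log excerpt quoted above (a "sleeping 90 seconds" line followed right away by "starting iteration 115") can be searched for mechanically. The following is only a rough sketch: it assumes each log record sits on one line and begins with an "MM/DD HH:MM:SS LEVEL:" prefix as in the excerpt, and the function name is made up for illustration.

```python
import re
from datetime import datetime

# Assumed record layout, taken from the excerpt: "08/16 10:13:25 INFO:     text"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+):\s+(.*)$")
SLEEP_RE = re.compile(r"sleeping (\d+) seconds")
START_RE = re.compile(r"starting iteration (\d+)")


def skipped_sleeps(log_lines):
    """Return (iteration, promised_sleep, actual_gap) for iterations that
    started sooner than the preceding "sleeping N seconds" line announced."""
    findings = []
    pending = None  # (timestamp, promised seconds) of the last sleep message
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        # The log carries no year, so strptime falls back to 1900; gaps that
        # span New Year (or a Feb 29 date) are not handled by this sketch.
        stamp = datetime.strptime(m.group(1), "%m/%d %H:%M:%S")
        text = m.group(3)
        sleep = SLEEP_RE.search(text)
        start = START_RE.search(text)
        if sleep:
            pending = (stamp, int(sleep.group(1)))
        elif start and pending:
            gap = (stamp - pending[0]).total_seconds()
            if gap < pending[1]:
                findings.append((int(start.group(1)), pending[1], gap))
            pending = None
    return findings


excerpt = [
    "08/16 10:13:25 INFO:     scheduling complete.  sleeping 90 seconds",
    "08/16 10:13:25 INFO:     starting iteration 115 (loglevel=2)",
]
print(skipped_sleeps(excerpt))
```

Running this over a whole log file (for example with `skipped_sleeps(open(path))`, where `path` is wherever the Moab log lives on your system) gives the iteration numbers to look at, which in turn narrows down which ALERT or job start messages sit right before each early wakeup.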
