[Moabusers] moab not sleeping
Brock Palen
brockp at umich.edu
Wed Aug 16 10:40:21 MDT 2006
Here are the results of mdiag -R -v:
[root at cac-admin02 log]# mdiag -R -v
diagnosing resource managers
RM[nyx] State: Active
Type: PBS ResourceType: COMPUTE
Server: nyx.engin.umich.edu
Version: '2.1.2'
Nodes Reported: 346 (762 procs)
Flags: executionServer,noTaskOrdering
Partition: nyx
Event Management: EPORT=15004
NOTE: SSS protocol enabled
DefaultClass: route
Total Jobs Started: 38
RM Performance: AvgTime=0.00s MaxTime=38.18s (3109443 samples)
RM Languages: PBS
RM[nyx] Failures:
jobstart (6 of 274 failed)
-20:30:43 'cannot start job 11259 - RM failure, rc: 15044,
msg: 'Resource temporarily unavailable REJHOST=nyx306 MSG=cannot
allocate node 'nyx306' to job - node not currently available (nps
needed/free: 2/0, joblist: 11257.nyx.engin.umich.edu:
0,11257.nyx.engin.umich.edu:1)''
-19:55:49 'cannot start job 11276 - RM failure, rc: 15044,
msg: 'Resource temporarily unavailable REJHOST=nyx306 MSG=cannot
allocate node 'nyx306' to job - node not currently available (nps
needed/free: 2/0, joblist: 11274.nyx.engin.umich.edu:
0,11274.nyx.engin.umich.edu:1)''
-7:26:23 'cannot start job 11244 - RM failure, rc: 15041,
msg: 'Execution server rejected request MSG=connection to mom timed
out''
-7:26:23 'cannot start job 11305 - RM failure, rc: 15041,
msg: 'Execution server rejected request MSG=connection to mom timed
out''
-7:18:01 'cannot start job 9133 - RM failure, rc: 15041,
msg: 'Execution server rejected request MSG=connection to mom timed
out''
-7:18:01 'cannot start job 9134 - RM failure, rc: 15044,
msg: 'Resource temporarily unavailable REJHOST=nyx050 MSG=cannot
allocate node 'nyx050' to job - node not currently available (state:
down)''
clusterquery (1 of 810 failed)
-3:55:39 'cannot load cluster info - pbs_errno=0'
queuequery (1 of 810 failed)
-3:55:39 'cannot get queue info - no data available'
NOTE: use 'mrmctl -f messages <RMID>' to clear stats/failures
AM[bank] type: 'GOLD' state: 'Active'
ALERT: no security algorithm specified (see moab-private.cfg)
ALERT: no secret key specified (see moab-private.cfg)
socketprotocol: 'HTTP' wireprotocol: 'SSS2'
Version: '1'
AM Performance: Avg Time: 8.41s Max Time: 15.00s (68 samples)
AM[bank] Failures:
Wed Aug 16 11:01:39 joballocreserve 'cannot read message header'
Wed Aug 16 11:01:39 joballocreserve 'cannot read message header'
Most of these messages look old.
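
To check which failures are actually recent, the entries can be sorted by age. This is only a minimal sketch, assuming the leading field of each failure line (e.g. -7:26:23) is an hours:minutes:seconds offset before the time mdiag was run. That reading of the output, the function names, and the shortened sample text are assumptions for illustration, not documented Moab behavior.

```python
import re
from datetime import timedelta

# Matches entries such as: -7:26:23 'cannot start job 11244 - RM failure, ...
# Assumption: the leading field is an H:MM:SS offset before "now".
FAILURE_RE = re.compile(r"^\s*-(\d+):(\d{2}):(\d{2})\s+'(.*)$")


def parse_failures(text):
    """Return a list of (age, message) tuples, one per failure entry.

    Only the first line of a wrapped message is kept.
    """
    failures = []
    for line in text.splitlines():
        match = FAILURE_RE.match(line)
        if match:
            hours, minutes, seconds, message = match.groups()
            age = timedelta(hours=int(hours), minutes=int(minutes),
                            seconds=int(seconds))
            failures.append((age, message.strip().rstrip("'")))
    return failures


def recent_failures(text, max_age):
    """Keep only failures no older than max_age, most recent first."""
    recent = [f for f in parse_failures(text) if f[0] <= max_age]
    return sorted(recent, key=lambda f: f[0])


# Shortened copy of a few lines from the mdiag output above.
SAMPLE = """\
-20:30:43 'cannot start job 11259 - RM failure, rc: 15044,
msg: 'Resource temporarily unavailable REJHOST=nyx306 MSG=cannot
-7:26:23 'cannot start job 11244 - RM failure, rc: 15041,
-3:55:39 'cannot load cluster info - pbs_errno=0'
"""

if __name__ == "__main__":
    for age, message in recent_failures(SAMPLE, timedelta(hours=8)):
        print(age, message)
```

In practice the text would come from saving the mdiag -R -v output to a file and passing its contents to recent_failures() with whatever cutoff is of interest.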
Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Aug 16, 2006, at 11:53 AM, wightman wrote:
> I would start by running the command
>
> mdiag -R -v
>
> And checking for any errors on the Resource Manager interface. Is Moab
> having problems with TORQUE?
>
> Also, you should be able to grep out ALERT and WARNING from the moab
> log files to check for anything out of the ordinary.
>
> Let us know what you find.
>
> Thanks,
>
>
> - Douglas
>
> On Wed, 2006-08-16 at 11:32 -0400, Brock Palen wrote:
>> Is there a way to find out why moab decided to keep going? It
>> happens so often, and lasts so long, that most of the moab commands
>> (mdiag, showres, etc.) don't work. If there is a problem with a job
>> starting, I want to know which job and why it can't start.
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>> On Aug 16, 2006, at 11:06 AM, wightman wrote:
>>
>>> Moab does not always take a break between iterations. For instance,
>>> if there is a job that is failing to start, Moab may push the next
>>> iteration to start early, and try starting the job again. If TORQUE
>>> tells Moab that a new job has entered the queue, Moab will normally
>>> schedule it immediately.
>>>
>>> If Moab is continually skipping its sleep cycle, and you can see Moab
>>> chewing up lots of CPU, then there may be an issue.
>>>
>>> - Douglas
>>>
>>> On Wed, 2006-08-16 at 10:26 -0400, Brock Palen wrote:
>>>> Every so often we see moab go on a sprint and not sleep between
>>>> iterations. Below is a snippet of a log:
>>>>
>>>> 08/16 10:13:25 INFO: total jobs selected in partition ALL: 28/28
>>>> 08/16 10:13:25 INFO: iteration: 114 scheduling time: 1.046 seconds
>>>> 08/16 10:13:25 INFO: current util[114]: 179/339 (52.80%) PH: 40.59% active jobs: 182 of 210 (completed: 20563)
>>>> 08/16 10:13:25 ALERT: node 'nyx180' sync from expected state 'Idle' to state 'Running' at Wed Aug 16 10:13:24
>>>> 08/16 10:13:25 INFO: scheduling complete. sleeping 90 seconds
>>>> 08/16 10:13:25 INFO: starting iteration 115 (loglevel=2)
>>>> 08/16 10:13:25 INFO: PBS data updated for iteration 115
>>>> 08/16 10:13:25 INFO: 346 PBS resources detected on RM nyx
>>>> 08/16 10:13:25 INFO: resources detected: 346
>>>> 08/16 10:13:25 INFO: 0 PBS classes/queues detected on RM nyx
>>>> 08/16 10:13:25 INFO: queues detected: 0
>>>>
>>>> Notice there is no time between "sleeping 90 seconds" and "starting
>>>> iteration 115".
>>>>
>>>>
>>>> Brock Palen
>>>> Center for Advanced Computing
>>>> brockp at umich.edu
>>>> (734)936-1985
>>>>
>>>>
>>>> _______________________________________________
>>>> moabusers mailing list
>>>> moabusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/moabusers
>>>
>>>
>>>
>>
>
>
>
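
The sprint described in the original message quoted above (a "sleeping 90 seconds" line followed at the same timestamp by "starting iteration 115") can be picked out of a log mechanically. The following is a rough sketch, assuming the log lines look like the quoted excerpt; the regular expressions, the function names, and the year argument are assumptions made for illustration, since the log itself carries no year and other Moab versions or log levels may format these lines differently.

```python
import re
from datetime import datetime

# Patterns modeled on the log excerpt quoted in this thread (assumption:
# other Moab versions or log levels may word these lines differently).
STAMP = r"(\d{2}/\d{2} \d{2}:\d{2}:\d{2})"
SLEEP_RE = re.compile(
    STAMP + r"\s+INFO:\s+scheduling complete\.\s+sleeping (\d+) seconds")
START_RE = re.compile(STAMP + r"\s+INFO:\s+starting iteration (\d+)")


def parse_stamp(stamp, year):
    # The log has no year field, so one is supplied by the caller.
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


def skipped_sleeps(lines, year=2006):
    """Yield (iteration, planned_seconds, actual_seconds) for short sleeps."""
    pending = None  # (timestamp, planned sleep) from the last "sleeping" line
    for line in lines:
        sleep = SLEEP_RE.search(line)
        if sleep:
            pending = (parse_stamp(sleep.group(1), year), int(sleep.group(2)))
            continue
        start = START_RE.search(line)
        if start and pending:
            slept_at, planned = pending
            started_at = parse_stamp(start.group(1), year)
            actual = (started_at - slept_at).total_seconds()
            if actual < planned:
                yield int(start.group(2)), planned, actual
            pending = None


# The first three lines come from the excerpt in this thread. The last two
# are invented for contrast (a full 90 second sleep); they are not real data.
SAMPLE = """\
08/16 10:13:25 INFO: scheduling complete. sleeping 90 seconds
08/16 10:13:25 INFO: starting iteration 115 (loglevel=2)
08/16 10:13:25 INFO: PBS data updated for iteration 115
08/16 10:13:27 INFO: scheduling complete. sleeping 90 seconds
08/16 10:14:57 INFO: starting iteration 116 (loglevel=2)
""".splitlines()

if __name__ == "__main__":
    for iteration, planned, actual in skipped_sleeps(SAMPLE):
        print(f"iteration {iteration}: planned {planned}s, slept {actual:.0f}s")
```

Run against a real log, the iteration numbers it reports give a starting point for looking at what happened just before each early wakeup, for example a failed job start or a new job arriving from TORQUE.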