pbs_mom caches last healthcheck script error ? (Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)
Garrick Staples
garrick at clusterresources.com
Mon Dec 4 16:06:07 MST 2006
On Tue, Dec 05, 2006 at 09:42:34AM +1100, Chris Samuel alleged:
> On Tuesday 05 December 2006 05:12, wightman wrote:
>
> > When the message no longer returns an ERROR, then Moab correctly places
> > the node back into the scheduling queue.
> >
> > What are you seeing on your cluster?
>
> It's actually looking like a Torque problem, sorry!
>
> The script returns nothing (as it should) but the MOM seems to be remembering
> the last error it saw, and it even returns if we clear it by hand.
>
> For instance, a node had a Myrinet card replaced and they forgot to set the
> switch on the card to 64-bit mode. Our script picked it up correctly and
> placed the node offline. Then we fixed the card, brought the node back up
> and the script saw everything was fine but the MOM was still down in Moab
> with the old error.
>
> We noticed the mom still had the message and assumed we'd have to clear it by
> hand, thus:
>
> # momctl -d 1
>
> [...]
> Server Update Interval: 45 seconds
> MOM Message: ERROR myrinet card is not in 64bit mode
> (use 'momctl -q clearmsg' to clear)
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> [...]
> # momctl -q clearmsg
> localhost: clearmsg = 'messages cleared'
>
> Message went away..
>
> # momctl -d 1
> [...]
> Server Update Interval: 45 seconds
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> [...]
>
> Then, within a minute it's back, even though the script isn't triggering it:
>
> # momctl -d 1
> [...]
> Server Update Interval: 45 seconds
> MOM Message: ERROR myrinet card is not in 64bit mode
> (use 'momctl -q clearmsg' to clear)
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
>
> But the script says everything is OK!
>
> # /usr/local/sbin/moab-check-health.sh
> #
>
> Brett has this cluster running Torque 2.2.0-snap.200610191709.
Use 'momctl -C' instead, though the message will probably have cleared on
its own by the time you read this.
The "error message" in MOM can come from multiple places, and is sent to
pbs_server every update interval (45 seconds from your output).
The health check script is one possible way to trigger an error message,
but since it only runs every "node_check_interval" intervals, the
script's output is cached. Every interval, the cached copy of the error
message is copied into the error message buffer unless it is time to
rerun the script.
'momctl -q clearmsg' just clears the error message, not the cached status
of the health check script, which is why the message comes back at the
next update interval.
'momctl -C' clears the counter for the health check and triggers a new
interval.
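
To make the difference between the two commands concrete, here is a toy
Python model of the behaviour described above. It is only a sketch and not
Torque's actual code: the class name, the method names, and the
node_check_interval value of 10 are all made up for illustration, and it
assumes the health check is the only source of MOM messages.

```python
class MomModel:
    """Toy model of the health check caching; not Torque's real logic."""

    def __init__(self, check_script, node_check_interval):
        self.check_script = check_script        # callable returning "" or "ERROR ..."
        self.node_check_interval = node_check_interval
        self.counter = 0         # update intervals since the script last ran
        self.cached_result = ""  # last output of the health check script
        self.message = ""        # message reported to pbs_server

    def update_interval(self):
        """One server update interval (45 seconds in the output above)."""
        if self.counter == 0:
            # Time to rerun the script and refresh the cache.
            self.cached_result = self.check_script()
        self.counter = (self.counter + 1) % self.node_check_interval
        # Assumption: the message buffer simply mirrors the cached result.
        self.message = self.cached_result
        return self.message

    def clear_message(self):
        """Like 'momctl -q clearmsg': clears the message, not the cache."""
        self.message = ""

    def cycle(self):
        """Like 'momctl -C': reset the counter and start a new interval."""
        self.counter = 0
        return self.update_interval()


# Hypothetical stand-in for the real health check script.
node = {"broken": True}

def health_check():
    return "ERROR myrinet card is not in 64bit mode" if node["broken"] else ""

mom = MomModel(health_check, node_check_interval=10)
mom.update_interval()   # script runs, the error is cached and reported
node["broken"] = False  # the card gets fixed
mom.clear_message()     # message goes away...
mom.update_interval()   # ...and comes back from the cache
mom.cycle()             # script reruns, finds nothing, message is gone
```

In this model the stale message only disappears once the script actually
reruns, either because the counter wraps around or because cycle() forces
it.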