[torquedev] Problems starting job with torque 2.1.6
Åke Sandgren
ake.sandgren at hpc2n.umu.se
Fri Jan 12 11:32:09 MST 2007
On Fri, 2007-01-12 at 11:02 -0500, Troy Baer wrote:
> On Fri, 2007-01-12 at 16:15 +0100, Åke Sandgren wrote:
> > I just had a semi-large job (90 nodes) fail to start due to masternode
> > not sending out the JOIN_JOB to all sisters or sister not receiving it
> > at least.
> >
> > Anyone seen anything like this?
>
> We've seen that a lot in OpenPBS, but not (yet) in TORQUE. The failure
> mode in OpenPBS seems to be that the sister node has some degree of load
> on it and drops the JOIN JOB message, and then the mother superior never
> tries to send another one.
I've never seen this before either, but we seldom have jobs as large as
this one, so I got curious as to why it didn't start.
It shouldn't matter how much load the sister node has; it simply
shouldn't drop such a message.
Garrick? Any ideas?
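To make the failure mode above concrete, here is a rough sketch of the
difference between sending the join request once and retrying until each
sister acknowledges. This is not TORQUE or OpenPBS code: the names
send_join_job and join_all_sisters, the drop probability, and the retry
limit are all made up for illustration, and the "network" is just a
random number generator.

```python
import random


def send_join_job(sister, drop_probability, rng):
    """Hypothetical lossy send: True means the sister acknowledged the join."""
    # A real mom would talk to the sister over the network; here the
    # message is simply "dropped" with the given probability.
    return rng.random() >= drop_probability


def join_all_sisters(sisters, drop_probability=0.3, max_attempts=5, seed=0):
    """Resend the join request to unacknowledged sisters, up to max_attempts.

    Returns (attempts_used, sisters_still_not_joined).
    """
    rng = random.Random(seed)
    pending = list(sisters)
    attempts_used = 0
    for attempt in range(1, max_attempts + 1):
        attempts_used = attempt
        # Keep only the sisters that did not acknowledge this round.
        pending = [s for s in pending
                   if not send_join_job(s, drop_probability, rng)]
        if not pending:
            break
    return attempts_used, pending


if __name__ == "__main__":
    nodes = ["node%02d" % i for i in range(90)]
    # max_attempts=1 models a sender that never retries.
    print("single shot:", join_all_sisters(nodes, max_attempts=1))
    print("with retry: ", join_all_sisters(nodes, max_attempts=5))
```

With max_attempts=1 any dropped message leaves that sister out of the
job for good, which matches the behaviour described above; with retries
the sender only gives up on a sister after several consecutive losses.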
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se