[torquedev] RE: pbs_mom suddenly throws floating-point exception on execute
Moody, Tristan
tmoody at ku.edu
Fri Sep 14 14:01:13 MDT 2007
I did recompile with -g, and the backtrace is still useless:
#0 0x0000003c08f088a8 in ?? ()
#1 0x0000000000000000 in ?? ()
I should point out that the head node (the compiling node) is running kernel 2.6.22.4-65.fc7, while the compute nodes are running 2.6.11-1.1369_FC4smp. I'm not sure this is the cause, though, since torque worked for about two months and then suddenly quit. Recompiling and reinstalling the software does not fix the problem.
Tristan
-----Original Message-----
From: torquedev-bounces at supercluster.org on behalf of torquedev-request at supercluster.org
Sent: Thu 9/13/2007 1:00 PM
To: torquedev at supercluster.org
Subject: torquedev Digest, Vol 22, Issue 7
Today's Topics:
1. RE: pbs_mom suddenly throws floating-point exception on
execute (Moody, Tristan)
2. Re: RE: pbs_mom suddenly throws floating-point exception on
execute (Garrick Staples)
----------------------------------------------------------------------
Message: 1
Date: Wed, 12 Sep 2007 16:23:36 -0500
From: "Moody, Tristan" <tmoody at ku.edu>
Subject: [torquedev] RE: pbs_mom suddenly throws floating-point
exception on execute
To: <torquedev at supercluster.org>
Message-ID:
<515CE52A7472B549AFE9F121A619F9889408A8 at MAILBOXTHREE.home.ku.edu>
Content-Type: text/plain; charset="iso-8859-1"
This seems unlikely to me, as this has apparently happened to some thirty different machines in a very short timeframe. Recompiling and reinstalling does not help either.
Tristan
Message: 2
Date: Tue, 11 Sep 2007 21:15:38 +1000
From: Chris Samuel <csamuel at vpac.org>
Subject: Re: [torquedev] pbs_mom suddenly throws floating-point
exception on execute
To: torquedev at supercluster.org
Message-ID: <200709112115.40787.csamuel at vpac.org>
Content-Type: text/plain; charset="iso-8859-1"
On Tuesday 11 September 2007 06:15:03 Moody, Tristan wrote:
> Any ideas on what exactly is going wrong? This had been running fine until
> last Thursday, and there have been no changes to the system since July.
> yum.log and the up2date logs are both empty. It seems odd that the
> software would just suddenly stop working. Is there anything I'm missing?
My bet would be some form of filesystem corruption has broken the pbs_mom
binary. Is it installed from an RPM? If so, you should be able to
use rpm -V $packagename to verify the MD5 checksums for it.
Otherwise, compare the MD5 of the installed binary against the MD5 of the
binary in the source tree that you compiled.
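For the non-RPM case, a minimal sketch of that comparison in Python might look like the following. The function names are made up for illustration, and the two paths are whatever your install prefix and build tree actually use; nothing here is part of torque itself. It also assumes the install step copies the binary unchanged (no stripping or relinking), otherwise a mismatch would not by itself indicate corruption.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binaries_match(installed_path, built_path):
    """Return True if both files have the same MD5 digest.

    Both arguments are hypothetical placeholders: pass the installed
    pbs_mom and the pbs_mom left in the source tree after compiling.
    """
    return md5_of_file(installed_path) == md5_of_file(built_path)
```

Calling binaries_match() with the installed pbs_mom and the freshly built one returns False if the two files differ. Running md5_of_file() on the same path on each compute node would likewise show whether every node holds an identical copy.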
Good luck!
Chris (on leave in the UK, so random access to email at the moment)
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
------------------------------
_______________________________________________
torquedev mailing list
torquedev at supercluster.org
http://www.supercluster.org/mailman/listinfo/torquedev
End of torquedev Digest, Vol 22, Issue 5
****************************************
------------------------------
Message: 2
Date: Wed, 12 Sep 2007 14:20:12 -0700
From: Garrick Staples <garrick at usc.edu>
Subject: Re: [torquedev] RE: pbs_mom suddenly throws floating-point
exception on execute
To: torquedev at supercluster.org
Message-ID: <20070912212012.GY19043 at polop.usc.edu>
Content-Type: text/plain; charset="us-ascii"
On Wed, Sep 12, 2007 at 04:23:36PM -0500, Moody, Tristan alleged:
> This seems unlikely to me, as this has apparently happened to some thirty different machines in a very short timeframe. Recompiling and reinstalling does not help either.
Did you recompile with -g so you could get a usable backtrace?
------------------------------
End of torquedev Digest, Vol 22, Issue 7
****************************************