[torquedev] Mixing Torque 2.1.10 and 2.3.0 on a cluster ?
Craig West
cwest at astro.umass.edu
Mon Mar 24 07:09:26 MDT 2008
Chris,
I've upgraded the two clusters here and it went fairly smoothly. I have
both an x86 and an x86_64 cluster that run from a single torque server.
I was running 2.1.9. I upgraded the server first. A few hours later I
upgraded some x86 nodes and finished them all the next day. A day or so
after upgrading the server I had the chance to upgrade all the x86_64
nodes. I only upgraded the nodes when there were no running jobs on them.
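
For anyone wanting to script the "only upgrade idle nodes" step, here is a
minimal sketch. It assumes the usual `pbsnodes -a` text layout: an unindented
node name followed by indented `key = value` lines, with a `jobs = ...` line
present only when jobs are assigned. The function names are made up for
illustration and are not part of Torque.

```python
import subprocess


def idle_nodes(pbsnodes_text):
    """Return names of nodes whose record carries no non-empty 'jobs' attribute.

    Assumes `pbsnodes -a` style output: an unindented node name,
    then indented 'key = value' attribute lines.
    """
    idle = []
    name, has_jobs = None, False
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            # An unindented line starts a new node record;
            # flush the previous record first.
            if name is not None and not has_jobs:
                idle.append(name)
            name, has_jobs = line.strip(), False
        else:
            key, _, value = line.partition("=")
            if key.strip() == "jobs" and value.strip():
                has_jobs = True
    # Flush the last record.
    if name is not None and not has_jobs:
        idle.append(name)
    return idle


def read_pbsnodes():
    """Run `pbsnodes -a` and return its output (needs Torque client tools)."""
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `idle_nodes(read_pbsnodes())` on a host with the client commands
installed would list candidate nodes. This only looks at assigned jobs, so a
node that is down or offline also shows up as "idle".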
I noticed that upon restarting the server with the new version that it
did indeed convert the jobs that were both queued and running to a new
format. I had disabled all the queues to get them to drain, so I can't
say if the 2.3.0 server was able to launch jobs on the 2.1.x nodes. I
also noticed that the 2.3.0 server was able to see the 2.1.x nodes. I
had jobs still running on the x86 and x86_64 nodes when the server
upgrade took place, and all those jobs were stopped cleanly when they
completed, or reached the wall time limits.
I have OpenMPI (with TM enabled) installed and didn't need to recompile
it for either cluster. However, I will note that I also have JobMonarch
and PbsPython installed and I needed to rebuild PbsPython to get
JobMonarch running again. It looks like the torque library is now called
libtorque.so.2.0.0.
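
A quick way to see whether something like PbsPython needs a rebuild is to
check which libtorque its shared object resolves to. The sketch below only
parses text in the usual `ldd` format (`name => path (address)` or
`name => not found`). The helper name and the sample library names in the
usage note are illustrative assumptions, not taken from a real system.

```python
def torque_libs(ldd_text):
    """Map each libtorque dependency in ldd output to its resolved path.

    A dependency that ldd reports as 'not found' maps to None.
    """
    found = {}
    for line in ldd_text.splitlines():
        parts = line.split()
        if len(parts) < 3 or not parts[0].startswith("libtorque.so"):
            continue
        if parts[1] != "=>":
            continue
        if parts[2:4] == ["not", "found"]:
            found[parts[0]] = None  # linked against a library that is gone
        else:
            found[parts[0]] = parts[2]
    return found
```

Feeding it the output of `ldd` on the extension module would return, for
example, `{"libtorque.so.0": None}` if the module still points at an old
library name that is no longer installed, which suggests a rebuild is needed.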
I'm not running with $enablemomrestart and I was never in a situation
where I was trying to run jobs over both the 2.1.x and 2.3.0 moms.
Craig.
On 03/22/2008 06:25 PM, Chris Samuel wrote:
> We're looking at upgrading our Opteron cluster to 2.3.0
> as we *really* want the cpusets support so I'm wondering
> if anyone has tried running a mix of 2.1.10 and 2.3.0 on
> a single cluster ?
>
> My main concerns are:
>
> 1) Will the 2.1 and 2.3 mom's talk to each other correctly ?
>
> 2) Can 2.1.10 mom's talk to a 2.3.0 server (and vice versa) ?
>
> 3) We have heaps of mpiexec's, OpenMPI, etc, built against
> 2.1.10, I'm hoping that the fact that it's all now shared libs
> should make it painless, but has anyone actually tested this ? :-)
>
> We have $enablemomrestart set to 1, so the mom's should
> notice when they are swapped to the new version (assuming
> it follows symlinks correctly to the new binary).
>