[torquedev] Mixing Torque 2.1.10 and 2.3.0 on a cluster ?

Craig West cwest at astro.umass.edu
Mon Mar 24 07:09:26 MDT 2008


Chris,

I've upgraded the two clusters here and it went fairly smoothly. I have 
both an x86 and an x86_64 cluster that run from a single torque server. 
I was running 2.1.9. I upgraded the server first. A few hours later I 
upgraded some x86 nodes and finished them all the next day. A day or so 
after upgrading the server I had the chance to upgrade all the x86_64 
nodes. I only upgraded the nodes when there were no running jobs on them.
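
For anyone wanting to script the "only upgrade idle nodes" step, here is a minimal sketch in Python that picks out the nodes with no running jobs from the text printed by `pbsnodes -a`. It assumes the usual layout, where each node record starts with an unindented node name and a busy node carries an indented `jobs = ...` line. The function name `idle_nodes`, the node names, and the sample text are made up for illustration and are not output from the clusters described here.

```python
def idle_nodes(pbsnodes_output):
    """Return the names of nodes whose record has no 'jobs = ...' line."""
    idle = []
    name, busy = None, False
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            # An unindented line starts a new node record, so close
            # out the previous record first.
            if name is not None and not busy:
                idle.append(name)
            name, busy = line.strip(), False
        elif line.strip().startswith("jobs ="):
            # Assumption: a node with running jobs lists them here.
            busy = True
    if name is not None and not busy:
        idle.append(name)
    return idle


# Hypothetical sample in the assumed `pbsnodes -a` layout.
SAMPLE = """\
node01
     state = free
     np = 2

node02
     state = job-exclusive
     np = 2
     jobs = 0/101.server, 1/101.server

node03
     state = free
     np = 2
"""

print(idle_nodes(SAMPLE))
```

In practice the text would come from running `pbsnodes -a` on the server, and it is worth checking a node again just before upgrading it, since a job may have started in the meantime.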

I noticed that when the server restarted with the new version it did 
indeed convert the queued and running jobs to a new format. I had 
disabled all the queues to get them to drain, so I can't say whether 
the 2.3.0 server was able to launch jobs on the 2.1.x nodes. I did 
notice that the 2.3.0 server was able to see the 2.1.x nodes. I had 
jobs still running on the x86 and x86_64 nodes when the server upgrade 
took place, and all of those jobs ended cleanly when they completed or 
reached their walltime limits.

I have OpenMPI (with TM enabled) installed and didn't need to recompile 
it for either cluster. However, I will note that I also have JobMonarch 
and PbsPython installed and I needed to rebuild PbsPython to get 
JobMonarch running again. It looks like the torque library is now called 
libtorque.so.2.0.0.

I'm not running with $enablemomrestart and I was never in a situation 
where I was trying to run jobs over both the 2.1.x and 2.3.0 moms.

Craig.

On 03/22/2008 06:25 PM, Chris Samuel wrote:
> We're looking at upgrading our Opteron cluster to 2.3.0
> as we *really* want the cpusets support so I'm wondering
> if anyone has tried running a mix of 2.1.10 and 2.3.0 on
> a single cluster ?
>
> My main concerns are:
>
> 1) Will the 2.1 and 2.3 moms talk to each other correctly ?
>
> 2) Can 2.1.10 moms talk to a 2.3.0 server (and vice versa) ?
>
> 3) We have heaps of mpiexec's, OpenMPI, etc, built against
> 2.1.10, I'm hoping that the fact that it's all now shared libs
> should make it painless, but has anyone actually tested this ? :-)
>
> We have $enablemomrestart set to 1, so the moms should
> notice when they are swapped to the new version (assuming
> it follows symlinks correctly to the new binary).
>   

