[torqueusers] PBS Error: Execution server rejected request

garrick garrick at usc.edu
Sat Nov 5 01:16:11 MST 2005


On Sat, Nov 05, 2005 at 03:16:39AM +0700, notinh notien alleged:
> Thank you, Mr. Staples.  Here is the config for the mom.  Originally, there
> were no restricted directives.  The weird thing is that the other three cloned
> nodes have the exact same config file, and they are working right now.

I'm at a loss here.  But that old code had a lot of problems with node
states.  You might just manually set the state in qmgr, 'set node node14
state=free', and see if they start talking again.
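
For reference, here is a rough sketch of running that qmgr command
non-interactively. The free_node helper and the RUN switch are made up for
illustration; only the 'set node ... state=free' directive comes from the
advice above. By default it just prints the command so you can check it
before touching the server.

```shell
# Hypothetical helper (not part of TORQUE): print, or with RUN=1 execute,
# the qmgr command that sets a node's state to free.
free_node() {
    node="$1"
    cmd="set node ${node} state=free"
    if [ "${RUN:-0}" = "1" ]; then
        # Really talk to pbs_server; needs manager privileges.
        qmgr -c "$cmd"
    else
        # Dry run: show what would be executed.
        echo "qmgr -c \"$cmd\""
    fi
}
```

Calling 'free_node node14' prints the command; 'RUN=1 free_node node14'
runs it. Afterwards 'pbsnodes -a' should show whether the node's state
actually changed.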

 
> I actually have a newer version in place, but the cluster is quite busy and I
> don't have much experience migrating currently running jobs to a new server.
> I found some docs at the site regarding running 2 servers at the same time,
> but I have not located docs that show how to migrate running jobs to a new
> server and how to replace the old server with the new one with little impact
> on the jobs.  Please help me with these things.

It's pretty much painless.  Just install the new daemons and restart
them.  Don't restart MOMs on hosts that have running jobs. 

I generally do something like this:
  kill the scheduler
  wait a few minutes for all new jobs to complete startup
  restart pbs_server
  wait a minute, make sure node and job states are updating correctly
  restart MOMs on all idle nodes
  wait a minute, make sure node and job states are updating correctly
  mark busy nodes offline
  start the scheduler
  restart MOMs on offline nodes after their jobs exit.
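
The order of those steps could be sketched as a dry-run script along these
lines. It only prints commands; nothing is executed. Everything specific in
it is an assumption and not part of the steps above: that the scheduler is
pbs_sched, that the nodes are reachable over ssh, that your version ships
momctl, the wait times, and the node names. Fill in IDLE_NODES and
BUSY_NODES by hand after looking at 'pbsnodes -a'.

```shell
# Dry-run sketch of the upgrade order. plan_upgrade only echoes commands.
# Node names below are hypothetical placeholders.
IDLE_NODES="${IDLE_NODES:-node01 node02}"
BUSY_NODES="${BUSY_NODES:-node14}"

plan_upgrade() {
    echo "pkill pbs_sched"                  # kill the scheduler (assumes pbs_sched)
    echo "sleep 180"                        # let newly started jobs finish startup
    echo "qterm -t quick"                   # stop pbs_server, leave jobs running
    echo "pbs_server"                       # start the newly installed pbs_server
    echo "sleep 60; pbsnodes -a; qstat -a"  # check node and job states by eye
    for n in $IDLE_NODES; do
        echo "ssh $n 'momctl -s; pbs_mom'"  # restart the MOM on an idle node
    done
    echo "sleep 60; pbsnodes -a; qstat -a"  # check states again
    for n in $BUSY_NODES; do
        echo "pbsnodes -o $n"               # mark the busy node offline
    done
    echo "pbs_sched"                        # start the scheduler again
    for n in $BUSY_NODES; do
        # Left as a comment on purpose: do this by hand once the node drains.
        echo "# after jobs on $n exit: ssh $n 'momctl -s; pbs_mom'; pbsnodes -c $n"
    done
}
```

Run 'plan_upgrade' and read the output before doing anything for real. The
last lines are printed as comments because they have to wait for the jobs on
each busy node to finish.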
  
If you are using maui (or any other software that links to PBS libs), be
sure it is built against the _new_ TORQUE libs and not the ones from
your old install.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California