procs=N doesn't work with 2.1.10, rejected by pbs_mom (was Re: [torquedev] Re: [Mauiusers] Cluster Size Detection)

csamuel at vpac.org csamuel at vpac.org
Mon Mar 10 09:15:09 MDT 2008


----- "Martin Siegert" <siegert at sfu.ca> wrote:

> I was wondering about this recently: the new parameter appears to
> exist already: torque accepts "qsub -l procs=6 ...",

Interesting, not spotted that before!

> but it appears that at least moab does not understand this.
> The job never starts and "checkjob -v <jobid>" displays
[...]
> Can somebody comment on this, i.e., does support for procs already
> exist in maui and/or moab?

Odd, seems to work here, we have:

moab client version 5.2.0 (revision 9106)

When I specify procs=12 it just allocates me 3 x 4-CPU nodes.

But a simple job won't run - it fails with the following pbs_mom logs:

03/11/2008 02:06:07;0002;   pbs_mom;Svr;Log;Log opened
03/11/2008 02:06:07;0080;   pbs_mom;Req;req_reject;Reject reply code=15035(Unknown resource type  REJHOST=tango089.vpac.org MSG=cannot set limits), aux=0, type=ModifyJob, from PBS_Server at tango-m.vpac.org
03/11/2008 02:09:16;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 172.17.1.83:15003
03/11/2008 02:09:16;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 164148.tango-m.vpac.org, job_start_error from node tango083 in job_start_error
03/11/2008 02:09:16;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/11/2008 02:09:16;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/11/2008 02:09:16;0001;   pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: received KILL/ABORT request for job 164148.tango-m.vpac.org from node tango083
03/11/2008 02:09:16;0001;   pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: received KILL/ABORT request for job 164148.tango-m.vpac.org from node tango083
03/11/2008 02:09:16;0001;   pbs_mom;Svr;pbs_mom;im_request, event 9762 taskid 0 not found
03/11/2008 02:09:16;0001;   pbs_mom;Svr;pbs_mom;im_request, job 164148.tango-m.vpac.org: command 99
03/11/2008 02:09:16;0002;   pbs_mom;Svr;im_eof;No error from addr 172.17.1.83:15003
03/11/2008 02:09:16;0080;   pbs_mom;Job;164148.tango-m.vpac.org;removing transient job directory /tmp/164148.tango-m.vpac.org
03/11/2008 02:09:18;0080;   pbs_mom;Req;req_reject;Reject reply code=15035(Unknown resource type  REJHOST=tango089.vpac.org MSG=cannot set limits), aux=0, type=ModifyJob, from PBS_Server at tango-m.vpac.org
03/11/2008 02:09:18;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
03/11/2008 02:09:18;0001;   pbs_mom;Job;164148.tango-m.vpac.org;ALERT:  job failed phase 3 start
03/11/2008 02:09:18;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/11/2008 02:09:18;0080;   pbs_mom;Job;164148.tango-m.vpac.org;removing transient job directory /tmp/164148.tango-m.vpac.org

These are all Torque 2.1.10.

An interactive job fails with the same error too, when
I just ask for procs=1.
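
In case anyone wants to check their own mom logs for the same rejection, here is a rough sketch that pulls the req_reject entries out of log lines like the ones above. The semicolon-separated field layout and the regular expression are assumptions inferred only from the lines pasted in this message, not from any Torque documentation, and the function names are made up for illustration.

```python
import re

# Assumed layout, inferred from the log lines in this message:
# timestamp;event code;daemon;object class;object name;message
def parse_mom_line(line):
    """Split one pbs_mom log line into its six fields, or return None."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    timestamp, event, daemon, obj_class, obj_name, message = parts
    return {
        "timestamp": timestamp,
        "event": event,
        "daemon": daemon.strip(),  # the daemon field is padded with spaces
        "class": obj_class,
        "name": obj_name,
        "message": message,
    }

# Assumed shape of the reject message, copied from the lines above.
REJECT_RE = re.compile(
    r"Reject reply code=(\d+)\((.*?)\s+REJHOST=(\S+) MSG=(.*?)\)"
)

def find_rejects(lines):
    """Return (timestamp, code, reason, host, msg) for each req_reject line."""
    found = []
    for line in lines:
        rec = parse_mom_line(line)
        if rec is None or rec["name"] != "req_reject":
            continue
        match = REJECT_RE.search(rec["message"])
        if match:
            found.append((
                rec["timestamp"],
                int(match.group(1)),
                match.group(2),
                match.group(3),
                match.group(4),
            ))
    return found
```

Feeding it the lines of a mom log file (for example via `find_rejects(open(path))`, with `path` being wherever your mom logs live) should list each rejection with its code and the rejecting host. Lines that were wrapped by a mail client would need rejoining first.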

Any clues, folks?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

