procs=N doesn't work with 2.1.10, rejected by pbs_mom (was Re:
[torquedev] Re: [Mauiusers] Cluster Size Detection)
csamuel at vpac.org
Mon Mar 10 09:15:09 MDT 2008
----- "Martin Siegert" <siegert at sfu.ca> wrote:
> I was wondering about this recently: the new parameter appears to
> exist already: torque accepts "qsub -l procs=6 ...",
Interesting, not spotted that before!
> but it appears that at least moab does not understand this.
> The job never starts and "checkjob -v <jobid>" displays
[...]
> Can somebody comment on this, i.e., does support for procs already
> exist in maui and/or moab?
Odd, seems to work here, we have:
moab client version 5.2.0 (revision 9106)
When I specified procs=12 it just allocated me three 4-CPU nodes.
But a simple job won't run - it fails with the following pbs_mom logs:
03/11/2008 02:06:07;0002; pbs_mom;Svr;Log;Log opened
03/11/2008 02:06:07;0080; pbs_mom;Req;req_reject;Reject reply code=15035(Unknown resource type REJHOST=tango089.vpac.org MSG=cannot set limits), aux=0, type=ModifyJob, from PBS_Server at tango-m.vpac.org
03/11/2008 02:09:16;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 172.17.1.83:15003
03/11/2008 02:09:16;0001; pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 164148.tango-m.vpac.org, job_start_error from node tango083 in job_start_error
03/11/2008 02:09:16;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
03/11/2008 02:09:16;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
03/11/2008 02:09:16;0001; pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: received KILL/ABORT request for job 164148.tango-m.vpac.org from node tango083
03/11/2008 02:09:16;0001; pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: received KILL/ABORT request for job 164148.tango-m.vpac.org from node tango083
03/11/2008 02:09:16;0001; pbs_mom;Svr;pbs_mom;im_request, event 9762 taskid 0 not found
03/11/2008 02:09:16;0001; pbs_mom;Svr;pbs_mom;im_request, job 164148.tango-m.vpac.org: command 99
03/11/2008 02:09:16;0002; pbs_mom;Svr;im_eof;No error from addr 172.17.1.83:15003
03/11/2008 02:09:16;0080; pbs_mom;Job;164148.tango-m.vpac.org;removing transient job directory /tmp/164148.tango-m.vpac.org
03/11/2008 02:09:18;0080; pbs_mom;Req;req_reject;Reject reply code=15035(Unknown resource type REJHOST=tango089.vpac.org MSG=cannot set limits), aux=0, type=ModifyJob, from PBS_Server at tango-m.vpac.org
03/11/2008 02:09:18;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
03/11/2008 02:09:18;0001; pbs_mom;Job;164148.tango-m.vpac.org;ALERT: job failed phase 3 start
03/11/2008 02:09:18;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
03/11/2008 02:09:18;0080; pbs_mom;Job;164148.tango-m.vpac.org;removing transient job directory /tmp/164148.tango-m.vpac.org
These are all Torque 2.1.10.
An interactive job fails with the same error too, when
I just ask for procs=1.
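In case it helps anyone compare against their own mom logs, here is a rough Python sketch for pulling the req_reject entries out of a log like the one above. It assumes each entry is a single line of six semicolon-separated fields (timestamp, event code, daemon, class, name, message), which is what the lines above look like. The names MomLogEntry, parse_line and find_rejects are made up for this example and are not part of Torque.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class MomLogEntry(NamedTuple):
    """One log line, split into the six fields seen in the log above."""
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


# Matches the "Reject reply code=NNNNN(text REJHOST=host MSG=msg)" message
# layout that appears in the req_reject lines above (assumed, not a spec).
REJECT_RE = re.compile(
    r"Reject reply code=(\d+)\((.*?) REJHOST=(\S+) MSG=(.*?)\)"
)


def parse_line(line: str) -> Optional[MomLogEntry]:
    """Split one log line into fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogEntry(*(p.strip() for p in parts))


def find_rejects(lines: Iterable[str]) -> List[dict]:
    """Collect code, text, host and message for every reject entry."""
    rejects = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = REJECT_RE.search(entry.message)
        if match:
            rejects.append({
                "timestamp": entry.timestamp,
                "code": int(match.group(1)),
                "text": match.group(2),
                "host": match.group(3),
                "msg": match.group(4),
            })
    return rejects


# Two lines copied from the log above.
sample = [
    "03/11/2008 02:06:07;0002; pbs_mom;Svr;Log;Log opened",
    "03/11/2008 02:09:18;0080; pbs_mom;Req;req_reject;Reject reply "
    "code=15035(Unknown resource type REJHOST=tango089.vpac.org "
    "MSG=cannot set limits), aux=0, type=ModifyJob, "
    "from PBS_Server at tango-m.vpac.org",
]

for reject in find_rejects(sample):
    print(reject["timestamp"], reject["code"], reject["host"], reject["msg"])
```

To run it over a real log, pass an open file object to find_rejects() instead of the sample list.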
Any clues, folks?
cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency