I am having a problem with dependencies and job arrays. I've seen several messages on the list about this, but no resolution. We are using Torque 2.5.8.<br><br>I first submit several jobs in an array: qsub -l -t 1-3<br>This returns jobid1[]<br>
<br>If I then submit a second job with qsub -l depend:afteranyarray:jobid1[], it doesn't wait until the job array (jobid1[]) has completed. It starts about 20 seconds after the jobs in the array have started and of course fails, since the results of jobid1[] aren't ready yet. I've also tried depend:afterokarray, depend:afterok and depend:afterany.<br>
<br>I've also tried submitting the second job with: qsub -W depend=afteranyarray:jobid1[] (as well as the same permutations as above). In this case the second job does hold... forever.<br><br>When I run checkjob on each job in the array, I find they have all completed successfully with an exit status of 0.<br>
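For reference, here is a minimal sketch of how the -W dependency argument described above is put together, using the array id and script path that appear in the Submit Args of the checkjob output. The helper name depend_arg is hypothetical (it is not part of Torque), and the command is only printed, not submitted, so the sketch runs without a Torque server.<br>

```shell
#!/bin/sh
# Sketch only: assemble the -W dependency argument for a job that should
# wait on a whole job array. depend_arg is a hypothetical helper name.
depend_arg() {
    # $1 = dependency type (e.g. afteranyarray)
    # $2 = array job id as returned by qsub (e.g. 1183766[])
    printf 'depend=%s:%s' "$1" "$2"
}

# Array id and script path taken from the Submit Args in this post.
array_id='1183766[]'
dep=$(depend_arg afteranyarray "$array_id")

# Dry run: print the command instead of executing it.
echo "qsub -W $dep /home/xxx/tempcmd20332"
```

The same helper covers the other variants tried here (afterokarray, afterok, afterany) by changing the first argument.<br>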
<br>When I run checkjob on the held job, I get:<br><br>[xxx@quser04 ~]$ checkjob -vvv 1183767<br>job 1183767 (RM job '1183767.qsched01')<br><br>AName: xxx.defragment<br>State: Hold <br>Creds: user:xxx group:xxx account:t20213 class:short<br>
WallTime: 00:00:00 of 3:58:20<br>SubmitTime: Tue Sep 27 11:33:33<br> (Time Queued Total: 1:45:28 Eligible: 00:00:05)<br><br>NodeMatchPolicy: EXACTNODE<br>Total Requested Tasks: 1<br>Total Requested Nodes: 1<br><br>Req[0] TaskCount: 1 Partition: ALL <br>
NodeCount: 1<br><br>IWD: /home/xxx<br>UMask: 0000 <br>OutputFile: quser04:/home/xxx/./xxx_logs/xxx.defragment.o1183767<br>ErrorFile: quser04:/home/xxx/./xxx_logs/xxx.defragment.e1183767<br>Partition List: quest1,quest2,questgpu1,SHARED<br>
SrcRM: torque DstRM: torque DstRMJID: 1183767.qsched01<br>Submit Args: -V -d . -r y -q short -M d-xxx@xxx.edu -N xxx.defragment -m abe -o ./xxx_logs/ -e ./xxx_logs/ -l walltime=14300 -W depend=afteranyarray:1183766[] /home/xxx/tempcmd20332<br>
Flags: RESTARTABLE<br>Attr: checkpoint<br>StartPriority: 256<br>PE: 1.00<br> NOTE: job cannot run (job has hold in place)<br>NOTE: job violates constraints for partition hyperthread (non-idle state 'Hold')<br>
<br>NOTE: job violates constraints for partition quest1 (non-idle state 'Hold')<br><br>NOTE: job violates constraints for partition quest2 (non-idle state 'Hold')<br><br>NOTE: job violates constraints for partition questgpu1 (non-idle state 'Hold')<br>
<br>NOTE: job violates constraints for partition pim (non-idle state 'Hold')<br><br>BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)<br><br>I don't really understand what constraints the job is violating and why the dependency isn't working with either -l or -W.<br>
<br>Thanks<br>Darren<br>