[Mauiusers] Exiting job hung?
Paul Van Allsburg
vanallsburg at hope.edu
Wed Jul 5 11:41:14 MDT 2006
I found the pbs cleanup on node12 hung on:
32316 ? S 0:00 /usr/local/sbin/pbs_mom
32317 ? S 0:00 /usr/local/sbin/pbs_rcp -r /var/spool/PBS/spool/4351.curie..OU hinkle curie04 /home/hinkle/DNA/AmberDNA/amberDNA_md11.o4351
kill -9 32317 did the trick, and the job was cleared from the queue.
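
For anyone hitting the same thing, here is a rough sketch of how the stuck copy could be located before killing it. It is not part of Torque: the function name find_hung_rcp and the 600-second threshold are made up for illustration, and the suggested ps invocation assumes a procps version that supports the etimes (elapsed seconds) field, which older installs may lack. The function only prints PIDs; the kill is left as a separate manual step.

```shell
#!/bin/sh
# find_hung_rcp (hypothetical helper): reads lines of the form
#   PID ELAPSED_SECONDS COMMAND [ARGS...]
# on stdin and prints the PID of every pbs_rcp process that has been
# running for at least MIN_AGE seconds (default 600, an arbitrary choice).
find_hung_rcp() {
    min_age="${1:-600}"
    awk -v min="$min_age" '
        # $3 is the executable, with or without a leading path.
        $2 + 0 >= min + 0 && $3 ~ "(^|/)pbs_rcp$" { print $1 }
    '
}

# Possible use on the compute node, as root (assumes ps supports etimes):
#   ps -eo pid=,etimes=,args= | find_hung_rcp 600
# Review the PIDs it prints, then kill each one by hand:
#   kill -9 <pid>
```

Listing first and killing by hand keeps a legitimate, still-running copy from being taken out by accident.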
Thanks for the 2 cents!
Paul
Rouben Tchakhmakhtchian wrote:
> I'm no expert on Torque/Maui (just learning myself), but here are my 2
> cents:
>
> In the worst-case scenario, there's always kill -9. You can always log
> on to the node in question as root and blow away the offending process.
> Sooner or later the MOM on that node will realize that the process has
> died, and that should eventually cause the job to be purged from the
> queue, either by pbs_server on curie or by maui.
>
> If you want to be nice (and depending on how exactly the output file is
> to be copied off to node4), you may want to try some good old-fashioned
> DNS poisoning. Just use the /etc/hosts file on node12 to temporarily
> make it think that node4 is some other node (perhaps even itself). If
> the job in question retries writing out the file at regular intervals
> after a failure, the next attempt to copy the output should succeed.
> If the output file is a non-issue, then, like I mentioned above, there's
> always kill -9...
>
> Cheers,
>
> Rouben Tchakhmakhtchian
> rouben at utsc.utoronto.ca
> UTSC Computing & Networking Services
> 416-208-4732
>
>
> Paul Van Allsburg wrote:
>
>> I have a job that was started from node4, and node4 has gone offline
>> with disk errors. The job ran on node12 and wants to write the final
>> output file via node4, but that node is unavailable and the job sits
>> in the exiting state. The cluster is running torque-1.2.0p2 and
>> maui-3.2.6p11.
>>
>> Job id Name User Time Use S Queue
>> ---------------- ---------------- ---------------- -------- - -----
>> 4351.curie amberDNA_md11 hinkle 54:03:16 E long
>>
>> qdel fails, and the -p option is not available in this release.
>>
>> [root@curie ~]# qdel 4351
>> qdel: Request invalid for state of job 4351.curie.chem.hope.edu
>> [root@curie ~]# qdel -p 4351
>> qdel: invalid option -- p
>> usage: qdel [-W delay] job_identifier...
>>
>> I tried canceljob ...
>>
>> [root@curie ~]# canceljob 4351
>> ERROR: cannot cancel job '4351'
>>
>>
>> How can I force this job out of the queues?
>>
>> Thanks!
>> Paul