[Mauiusers] Exiting job hung?

Rouben Tchakhmakhtchian rouben at utsc.utoronto.ca
Wed Jul 5 10:19:02 MDT 2006


I'm no expert on Torque/Maui (just learning myself), but here are my 2 
cents:

In the worst-case scenario, there's always kill -9. You can log on to 
the node in question as root and blow away the offending process. 
Sooner or later the MOM on that node should notice that the process has 
died, and the job should eventually be purged from the queue, either by 
pbs_server on curie or by maui.
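
A rough sketch of the kill -9 route, for what it's worth. It assumes a 
POSIX shell and a root login on node12, and the user name (hinkle) is 
taken from the qstat output quoted below. The function names list_procs 
and force_kill are made up for illustration and are not Torque or Maui 
commands.

```shell
# Sketch only: run as root on the compute node (node12 in this thread).

# Show every process owned by a user, with PID, parent PID, elapsed
# time and full command line, so the stuck one can be picked out by eye.
list_procs() {   # usage: list_procs USER
    ps -u "$1" -o pid,ppid,etime,args
}

# Send SIGKILL (signal 9) to a single PID.
force_kill() {   # usage: force_kill PID
    kill -9 "$1"
}
```

You would run "list_procs hinkle" first, find the PID of the process 
that belongs to the stuck job, and then pass that PID to force_kill. 
Killing by PID rather than by name or user avoids taking out anything 
else the user has running on the node.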

If you want to be nice (and depending on how exactly the output file is 
to be copied off to node4), you may want to try some good old-fashioned 
DNS poisoning, or more precisely a hosts-file override. Just use the 
/etc/hosts file on node12 to temporarily make it think that node4 is 
some other node (perhaps even itself). If the job in question retries 
the write at regular intervals after a failure, the next attempt to 
copy the output should succeed. If the output file is a non-issue, 
then, as I mentioned above, there's always kill -9...
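
A minimal sketch of the /etc/hosts trick, again assuming a POSIX shell 
and root on node12. The function names, the HOSTS_FILE variable and the 
.bak suffix are my own illustrative choices. Whether the override takes 
effect also depends on the node consulting /etc/hosts before DNS and on 
the job referring to the host by the exact name you put in the file, 
neither of which I can confirm for this cluster.

```shell
# Sketch only: run as root on node12. HOSTS_FILE defaults to /etc/hosts.
HOSTS_FILE=${HOSTS_FILE:-/etc/hosts}

# Put "IP<TAB>HOSTNAME" on the first line of the hosts file so that it
# is found before any existing entry for the same name. The original
# file is kept as $HOSTS_FILE.bak.
add_override() {   # usage: add_override IP HOSTNAME
    cp -p "$HOSTS_FILE" "$HOSTS_FILE.bak" &&
    { printf '%s\t%s\n' "$1" "$2"; cat "$HOSTS_FILE.bak"; } > "$HOSTS_FILE"
}

# Put the original hosts file back once the job has left the queue.
remove_override() {
    mv "$HOSTS_FILE.bak" "$HOSTS_FILE"
}
```

For example, "add_override 127.0.0.1 node4" would make node12 resolve 
node4 to itself. Run remove_override afterwards so the node goes back 
to the real address once node4 is repaired.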

Cheers,

Rouben Tchakhmakhtchian
rouben at utsc.utoronto.ca
UTSC Computing & Networking Services
416-208-4732


Paul Van Allsburg wrote:
> I have a job that was started from  node4, and node4 has gone off line 
> with disk errors.  The job ran on node12 and wants to write the final 
> output file via node4, but that node is unavailable and the job sits in 
> exiting state.  The cluster is running  torque-1.2.0p2 and maui-3.2.6p11.
> 
> Job id           Name             User             Time Use S Queue
> ---------------- ---------------- ---------------- -------- - -----
> 4351.curie       amberDNA_md11    hinkle           54:03:16 E long
> 
> qdel fails, and the -p option is not available in this release.
> 
> [root at curie ~]# qdel 4351
> qdel: Request invalid for state of job 4351.curie.chem.hope.edu
> [root at curie ~]# qdel -p 4351
> qdel: invalid option -- p
> usage: qdel [-W delay] job_identifier...
> 
> I tried canceljob ...
> 
> [root at curie ~]# canceljob 4351
> ERROR:  cannot cancel job '4351'
> 
> 
> How can I force this job out of the queues?
> 
> Thanks!
> Paul
> 
> 