[Mauiusers] Exiting job hung?
Rouben Tchakhmakhtchian
rouben at utsc.utoronto.ca
Wed Jul 5 10:19:02 MDT 2006
I'm no expert on Torque/Maui (just learning myself), but here are my 2
cents:
In the worst case scenario, there's always kill -9. You can always log
on to the node in question as root and blow away the offending process.
Sooner or later the MOM on that node will notice that the process has
died, and that should eventually get the job purged from the queue,
either by pbs_server on curie or by maui.
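A minimal sketch of the kill -9 route, to be run as root on node12. The
helper name force_kill is made up for illustration, and the ps options
assume the procps version of ps found on most Linux nodes. Which process
actually belongs to the job is not stated in this thread, so check the
ps listing before killing anything.

```shell
# Hypothetical helper: show a process, then send it SIGKILL.
force_kill() {
    pid="$1"
    # Print what is about to be killed so the PID can be double-checked.
    # ps exits non-zero if no such PID exists, so nothing is killed then.
    ps -o pid,user,etime,args -p "$pid" || return 1
    kill -9 "$pid"
}

# Example (12345 is a placeholder PID, hinkle is the job owner from qstat):
#   ps -u hinkle -o pid,etime,args    # find the job's processes first
#   force_kill 12345
```

SIGKILL cannot be caught, so the process gets no chance to flush or copy
its output; this only makes sense if the output file is expendable.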
If you want to be nice (and depending on how exactly the output file is
to be copied off to node4), you may want to try some good old-fashioned
"DNS poisoning" (really just a local hosts-file override). Use the
/etc/hosts file on node12 to temporarily make it think that node4 is
some other node (perhaps even itself). If the job in question retries
writing the file at regular intervals after a failure, the next copy
attempt should then succeed.
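A rough sketch of the /etc/hosts trick, again as root on node12. The
helper name add_hosts_override and the address 192.0.2.12 are
placeholders, not values from this cluster; substitute node12's real
address and whatever name the job uses for node4. It assumes
/etc/nsswitch.conf consults "files" before DNS for host lookups. The new
entry is written at the top of the file because the first matching line
wins.

```shell
# Hypothetical helper: back up a hosts file and put an override entry first.
add_hosts_override() {
    hosts_file="$1"   # normally /etc/hosts
    ip="$2"           # address the name should resolve to for now
    name="$3"         # host name to redirect, e.g. node4
    # Keep a copy so the change can be undone afterwards.
    cp -p "$hosts_file" "$hosts_file.bak" || return 1
    # Rewrite the file with the override line ahead of the old contents.
    { printf '%s\t%s\n' "$ip" "$name"; cat "$hosts_file.bak"; } > "$hosts_file"
}

# Example (placeholder address):
#   add_hosts_override /etc/hosts 192.0.2.12 node4
# Undo once the job has left the queue:
#   cp -p /etc/hosts.bak /etc/hosts
```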
If the output file is a non-issue, then, like I mentioned above, there's
always kill -9...
Cheers,
Rouben Tchakhmakhtchian
rouben at utsc.utoronto.ca
UTSC Computing & Networking Services
416-208-4732
Paul Van Allsburg wrote:
> I have a job that was started from node4, and node4 has gone off line
> with disk errors. The job ran on node12 and wants to write the final
> output file via node4, but that node is unavailable and the job sits in
> exiting state. The cluster is running torque-1.2.0p2 and maui-3.2.6p11.
>
> Job id           Name             User             Time Use S Queue
> ---------------- ---------------- ---------------- -------- - -----
> 4351.curie       amberDNA_md11    hinkle           54:03:16 E long
>
> qdel fails, and the -p option is not available in this release.
>
> [root@curie ~]# qdel 4351
> qdel: Request invalid for state of job 4351.curie.chem.hope.edu
> [root@curie ~]# qdel -p 4351
> qdel: invalid option -- p
> usage: qdel [-W delay] job_identifier...
>
> I tried canceljob ...
>
> [root@curie ~]# canceljob 4351
> ERROR: cannot cancel job '4351'
>
>
> How can I force this job out of the queues?
>
> Thanks!
> Paul
>
>