<HTML dir=ltr><HEAD><TITLE>Re: [torqueusers] qdel will not delete</TITLE>
<META http-equiv=Content-Type content="text/html; charset=unicode">
<META content="MSHTML 6.00.2900.3429" name=GENERATOR></HEAD>
<BODY>
<DIV id=idOWAReplyText58623 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>I've only seen this problem when some of the nodes allocated to the job are unresponsive, either because they've crashed or because they're so overloaded that they're effectively crippled. </FONT><FONT face=Arial size=2>The job will not be able to exit until the mom can communicate with the unresponsive node again (unless you force it, as Steve mentions below).</FONT></DIV>
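<DIV dir=ltr><FONT face=Arial size=2>One way to spot the affected mom might be to scan its mom_log for the "job does not exist locally" error that Rahul quotes below. What follows is only a minimal Python sketch: the function name is hypothetical, and the semicolon-separated field layout is inferred from the quoted log lines (with the mail wrapping undone), not taken from the TORQUE documentation.</FONT></DIV>
<PRE>
```python
def find_orphan_kill_requests(lines):
    """Return (timestamp, job_id) pairs for requests the mom rejected
    because it no longer knows the job."""
    hits = []
    for line in lines:
        # Assumed layout, inferred from the quoted snippet:
        # timestamp;event code;daemon;object type;object id;message
        fields = [f.strip() for f in line.split(";", 5)]
        if len(fields) != 6:
            continue  # not a record in the assumed layout
        timestamp, _code, _daemon, obj_type, obj_id, message = fields
        if obj_type == "Job" and "job does not exist locally" in message:
            hits.append((timestamp, obj_id))
    return hits


# Sample records rebuilt from the log lines quoted in the thread.
sample = [
    "12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;connect from 11.0.1.79:1023",
    "12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;"
    "received request 'KILL_JOB' from 11.0.1.79:1023",
    "12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;"
    "ERROR: received request 'KILL_JOB' from 11.0.1.79:1023 for job "
    "'233139.supernova.che.wisc.edu' (job does not exist locally)",
]

print(find_orphan_kill_requests(sample))
```
</PRE>
<DIV dir=ltr><FONT face=Arial size=2>On a real system you would pass in the lines of the mom_log file instead of the sample list. A hit only tells you which job the mom has lost track of; it does not say why.</FONT></DIV>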
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>--Joe</FONT></DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> torqueusers-bounces@supercluster.org on behalf of Steve Young<BR><B>Sent:</B> Thu 12/11/2008 2:02 PM<BR><B>To:</B> Rahul Nabar<BR><B>Cc:</B> torqueusers@supercluster.org<BR><B>Subject:</B> Re: [torqueusers] qdel will not delete<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Usually when this happens qdel -p &lt;job id&gt; will remove the job from <BR>the queue if a normal qdel won't do it. From the qdel man page:<BR><BR> -p Forcibly purge the job from the server. This <BR>should only be used if a running job will not exit because its <BR>allocated nodes are unreachable. The admin<BR> should make every attempt at resolving the <BR>problem on the nodes. If a job’s mother superior recovers after <BR>purging the job, any epilogue scripts may still<BR> run. This option is only available to a batch <BR>operator or the batch administrator.<BR><BR>Hope this helps,<BR><BR>-Steve<BR><BR>On Dec 11, 2008, at 1:47 PM, Rahul Nabar wrote:<BR><BR>> I've had jobs that won't respond to qdel once every so often. Their<BR>> "REMAINING-time" on MAUI then becomes negative which was initially<BR>> confusing since I thought it was a MAUI bug.<BR>><BR>> But the root-cause seems to be that PBS will not obey the qdel on this<BR>> job. Irrespective of whether I issue it as root or MAUI issues it.<BR>><BR>> I had one such job today and I debugged it more: All the sub-nodes<BR>> seemed to be up. the mom daemon on each one of these nodes seemed to<BR>> be up and running.<BR>><BR>> The mom_log on the master node though was interesting; It had this <BR>> snippet:<BR>><BR>> 12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;connect from <BR>> 11.0.1.79:1023<BR>> 12/11/2008 11:47:38;0008;<BR>> pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB'<BR>> from 11.0.1.79:1023<BR>> 12/11/2008 11:47:38;0008;<BR>> pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR: received request<BR>> 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu'<BR>> (job does not exist locally)<BR>><BR>> The only way I could get this job to delete was to restart the pbs_mom<BR>> on that node.<BR>><BR>> Anyone else who has encountered these symptoms? 
For me the first clue<BR>> was a negative "REMAINING-time" on MAUI and users who complained that<BR>> they could not qdel a job. In the past I've achieved the same effect<BR>> by removing the relevant foo.supe.JB and foo.supe.SC files from the<BR>> /var/spool/torque/server_priv/jobs on the master node.<BR>> But I don't think that is the best way out. I'd appreciate any other<BR>> debug suggestions as well.<BR>><BR>> --<BR>> Rahul<BR>> _______________________________________________<BR>> torqueusers mailing list<BR>> torqueusers@supercluster.org<BR>> <A href="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</A><BR></FONT></P></DIV></BODY></HTML>