[Moabusers] Re: moabusers Digest, Vol 20, Issue 6 (Out of Office Response)

Jonathan Ryskamp jryskamp at clusterresources.com
Fri Jul 28 10:57:55 MDT 2006


I will be in Japan from July 28th to August 5th and will have no e-mail
access. I will be back in the office and checking and responding to
e-mails on Monday August 7th.

If you need more immediate assistance please contact:

Technical Support: 
Nick Ihli
+1 (801) 717-3736
nick.ihli at clusterresources.com

Sales Support:
Michael Jackson
+1 (801) 717-3722
michael at clusterresources.com
And
Jess Arrington
+1 (801) 717-3716
jess at clusterresources.com

Thanks,
Jonathan

>>> "moabusers at supercluster.org" 07/28/06 12:00 >>>

Send moabusers mailing list submissions to
	moabusers at supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.supercluster.org/mailman/listinfo/moabusers
or, via email, send a message with subject or body 'help' to
	moabusers-request at supercluster.org

You can reach the person managing the list at
	moabusers-owner at supercluster.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of moabusers digest..."


Today's Topics:

   1. Re: Force recalculation of job hostlist. (Matthew Britt)
   2. Re: Force recalculation of job hostlist. (Justin Bronder)


----------------------------------------------------------------------

Message: 1
Date: Thu, 27 Jul 2006 14:02:33 -0400
From: Matthew Britt <msbritt at umich.edu>
Subject: Re: [Moabusers] Force recalculation of job hostlist.
To: Justin Bronder <jsbronder at gmail.com>
Cc: moabusers at supercluster.org
Message-ID: <C1AE52C4-8764-4831-B000-81258E9817BE at umich.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

Have you looked at torque's health check ability?  We're using this  
to verify a variety of things on each node.   The node can set itself  
offline before the job is ever assigned to the node.

http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
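
As a rough illustration of the kind of check described above, here is a
minimal sketch of a health check script. It assumes the convention from the
TORQUE health check documentation that the script reports a problem by
printing a line beginning with ERROR to stdout. The device path and the
function names are hypothetical placeholders, not part of TORQUE or of any
Myrinet tooling; substitute whatever test is meaningful at your site.

```python
#!/usr/bin/env python3
"""Sketch of a node health check script (assumptions noted inline)."""
import os

# Assumption: a device file whose presence suggests the Myrinet driver is
# loaded. This path is a placeholder for illustration only.
MYRINET_DEVICE = "/dev/gm0"


def check_device(path):
    """Return an error message if the device file is missing, else None."""
    if not os.path.exists(path):
        return "Myrinet device %s not found" % path
    return None


def health_report(problems):
    """Build the script output: one line starting with ERROR, or ''."""
    found = [p for p in problems if p]
    if not found:
        return ""
    return "ERROR " + "; ".join(found)


def main():
    # Print nothing when the node looks healthy; otherwise print the
    # ERROR line so the resource manager can pick it up.
    report = health_report([check_device(MYRINET_DEVICE)])
    if report:
        print(report)


if __name__ == "__main__":
    main()
```

A script like this would be registered in the pbs_mom configuration through
the $node_check_script and $node_check_interval parameters; see the
documentation linked above for the exact behaviour in your TORQUE version.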

- matt

On Jul 26, 2006, at 3:45 PM, Justin Bronder wrote:

> At our site we have a large number of jobs that are submitted with the
> intent of utilizing our Myrinet cards.  While Torque does a great job
> of detecting nodes that are having ethernet issues (obviously quite
> rare), to my knowledge there is not an equivalent for Myrinet.
>
> At the suggestion of Chris Vaughan and Douglas Wightman, I've been
> using the Class specific JOBPROLOG to do some preliminary checking
> on the health of the required nodes. This is working great, but I'm
> having trouble figuring out how to deal with reallocating the hostlist
> for the job if a problem is found.  I'd like to keep the job at the
> top of the queue if at all possible, and just mark the problem node
> offline.  Then I'd like to signal Moab to find a replacement for this
> node.
>
> Job Preemption and re-queuing is not working as Torque still sees
> the job as queued at this point and the nodes as free whereas Moab
> sees the job as running and the nodes reserved.
>
> Any suggestions?
>
>
> Thanks in advance,
>
> Justin Bronder.
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers



------------------------------

Message: 2
Date: Thu, 27 Jul 2006 14:10:20 -0400
From: "Justin Bronder" <jsbronder at gmail.com>
Subject: Re: [Moabusers] Force recalculation of job hostlist.
To: "Matthew Britt" <msbritt at umich.edu>
Cc: moabusers at supercluster.org
Message-ID:
	<8d39cca0607271110w6355a8d7u473fd7d940ef033 at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Excellent, that's exactly what I was looking for.

Thanks for your help,

Justin.

On 7/27/06, Matthew Britt <msbritt at umich.edu> wrote:
>
> Have you looked at torque's health check ability?  We're using this
> to verify a variety of things on each node.   The node can set itself
> offline before the job is ever assigned to the node.
>
> http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
>
> - matt
>
> On Jul 26, 2006, at 3:45 PM, Justin Bronder wrote:
>
> > At our site we have a large number of jobs that are submitted with
> > the intent of utilizing our Myrinet cards.  While Torque does a great
> > job of detecting nodes that are having ethernet issues (obviously
> > quite rare), to my knowledge there is not an equivalent for Myrinet.
> >
> > At the suggestion of Chris Vaughan and Douglas Wightman, I've been
> > using the Class specific JOBPROLOG to do some preliminary checking
> > on the health of the required nodes. This is working great, but I'm
> > having trouble figuring out how to deal with reallocating the
> > hostlist for the job if a problem is found.  I'd like to keep the job
> > at the top of the queue if at all possible, and just mark the problem
> > node offline.  Then I'd like to signal Moab to find a replacement for
> > this node.
> >
> > Job Preemption and re-queuing is not working as Torque still sees
> > the job as queued at this point and the nodes as free whereas Moab
> > sees the job as running and the nodes reserved.
> >
> > Any suggestions?
> >
> >
> > Thanks in advance,
> >
> > Justin Bronder.
> > _______________________________________________
> > moabusers mailing list
> > moabusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/moabusers
>
>

------------------------------

_______________________________________________
moabusers mailing list
moabusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/moabusers


End of moabusers Digest, Vol 20, Issue 6
****************************************