[Moabusers] Re: moabusers Digest, Vol 20,
Issue 6 (Out of Office Response)
Jonathan Ryskamp
jryskamp at clusterresources.com
Fri Jul 28 10:57:55 MDT 2006
I will be in Japan from July 28th to August 5th and will have no e-mail
access. I will be back in the office and checking and responding to
e-mails on Monday August 7th.
If you need more immediate assistance please contact:
Technical Support:
Nick Ihli
+1 (801) 717-3736
nick.ihli at clusterresources.com
Sales Support:
Michael Jackson
+1 (801) 717-3722
michael at clusterresources.com
And
Jess Arrington
+1 (801) 717-3716
jess at clusterresources.com
Thanks,
Jonathan
>>> "moabusers at supercluster.org" 07/28/06 12:00 >>>
Send moabusers mailing list submissions to
moabusers at supercluster.org
To subscribe or unsubscribe via the World Wide Web, visit
http://www.supercluster.org/mailman/listinfo/moabusers
or, via email, send a message with subject or body 'help' to
moabusers-request at supercluster.org
You can reach the person managing the list at
moabusers-owner at supercluster.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of moabusers digest..."
Today's Topics:
1. Re: Force recalculation of job hostlist. (Matthew Britt)
2. Re: Force recalculation of job hostlist. (Justin Bronder)
----------------------------------------------------------------------
Message: 1
Date: Thu, 27 Jul 2006 14:02:33 -0400
From: Matthew Britt <msbritt at umich.edu>
Subject: Re: [Moabusers] Force recalculation of job hostlist.
To: Justin Bronder <jsbronder at gmail.com>
Cc: moabusers at supercluster.org
Message-ID: <C1AE52C4-8764-4831-B000-81258E9817BE at umich.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Have you looked at torque's health check ability? We're using this
to verify a variety of things on each node. The node can set itself
offline before the job is ever assigned to the node.
http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
- matt
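A minimal sketch of what such a health check script might look like, assuming the convention described in the TORQUE health check documentation linked above: pbs_mom runs the script periodically, and a line on stdout beginning with ERROR is reported as the node's message. The device path /dev/gm0 and the helper names are hypothetical placeholders, not part of the original thread; substitute whatever test actually exercises your Myrinet hardware.

```python
#!/usr/bin/env python3
"""Sketch of a node health check for pbs_mom (paths are hypothetical)."""
import os

# Hypothetical device node for the Myrinet driver; adjust for your site.
REQUIRED_PATHS = ["/dev/gm0"]


def find_problems(paths, exists=os.path.exists):
    """Return one message for every required path that is missing."""
    return ["missing " + p for p in paths if not exists(p)]


def format_report(problems):
    """Build the single line the script writes to stdout.

    An empty string means healthy; otherwise the line starts with ERROR.
    """
    if not problems:
        return ""
    return "ERROR " + "; ".join(problems)


if __name__ == "__main__":
    report = format_report(find_problems(REQUIRED_PATHS))
    if report:
        print(report)
```

The script would be registered in the pbs_mom config file with the $node_check_script and $node_check_interval parameters. Whether a reported ERROR actually takes the node out of scheduling depends on how the server and Moab are configured, so treat this as a starting point rather than a complete solution.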
On Jul 26, 2006, at 3:45 PM, Justin Bronder wrote:
> At our site we have a large number of jobs that are submitted with the
> intent of utilizing our Myrinet cards. While Torque does a great job
> of detecting nodes that are having ethernet issues (obviously quite
> rare), to my knowledge there is not an equivalent for Myrinet.
>
> At the suggestion of Chris Vaughan and Douglas Wightman, I've been
> using the Class specific JOBPROLOG to do some preliminary checking
> on the health of the required nodes. This is working great, but I'm
> having trouble figuring out how to deal with reallocating the hostlist
> for the job if a problem is found. I'd like to keep the job at the
> top of the queue if at all possible, and just mark the problem node offline.
> Then I'd like to signal Moab to find a replacement for this node.
>
> Job Preemption and re-queuing is not working as Torque still sees
> the job as queued at this point and the nodes as free whereas Moab
> sees the job as running and the nodes reserved.
>
> Any suggestions?
>
>
> Thanks in advance,
>
> Justin Bronder.
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers
------------------------------
Message: 2
Date: Thu, 27 Jul 2006 14:10:20 -0400
From: "Justin Bronder" <jsbronder at gmail.com>
Subject: Re: [Moabusers] Force recalculation of job hostlist.
To: "Matthew Britt" <msbritt at umich.edu>
Cc: moabusers at supercluster.org
Message-ID:
<8d39cca0607271110w6355a8d7u473fd7d940ef033 at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Excellent, that's exactly what I was looking for.
Thanks for your help,
Justin.
On 7/27/06, Matthew Britt <msbritt at umich.edu> wrote:
>
> Have you looked at torque's health check ability? We're using this
> to verify a variety of things on each node. The node can set itself
> offline before the job is ever assigned to the node.
>
> http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
>
> - matt
>
> On Jul 26, 2006, at 3:45 PM, Justin Bronder wrote:
>
> > At our site we have a large number of jobs that are submitted with the
> > intent of utilizing our Myrinet cards. While Torque does a great job
> > of detecting nodes that are having ethernet issues (obviously quite
> > rare), to my knowledge there is not an equivalent for Myrinet.
> >
> > At the suggestion of Chris Vaughan and Douglas Wightman, I've been
> > using the Class specific JOBPROLOG to do some preliminary checking
> > on the health of the required nodes. This is working great, but I'm
> > having trouble figuring out how to deal with reallocating the hostlist
> > for the job if a problem is found. I'd like to keep the job at the
> > top of the queue if at all possible, and just mark the problem node
> > offline. Then I'd like to signal Moab to find a replacement for this node.
> >
> > Job Preemption and re-queuing is not working as Torque still sees
> > the job as queued at this point and the nodes as free whereas Moab
> > sees the job as running and the nodes reserved.
> >
> > Any suggestions?
> >
> >
> > Thanks in advance,
> >
> > Justin Bronder.
>
>
------------------------------
End of moabusers Digest, Vol 20, Issue 6
****************************************