Is ASR (automatic system recovery) enabled?<br><br><div class="gmail_quote">On Mon, Aug 16, 2010 at 10:10 PM, Brad Cavanagh <span dir="ltr">&lt;<a href="mailto:brad.cavanagh@gmail.com">brad.cavanagh@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi Jan,<br>
<br>
Random problems like this usually point to bad hardware, more than<br>
likely RAM. Do you see the same problems when you run the same job on<br>
the node manually (i.e. log in to the node and run it, instead of<br>
sending it through your queue scheduler)?<br>
<br>
Brad.<br>
<div><div></div><div class="h5"><br>
On Mon, Aug 16, 2010 at 9:39 AM, Jan Dettmer &lt;<a href="mailto:jand@uvic.ca">jand@uvic.ca</a>&gt; wrote:<br>
> Hi all,<br>
><br>
> This may be the wrong place to post this problem but I am not sure where to<br>
> start.<br>
><br>
> I have a cluster of several 8-core nodes running TORQUE, Open MPI, and<br>
> Maui on Debian. The cluster has been running flawlessly for several months, and<br>
> I usually run parallel jobs across the whole cluster. Late last week, I<br>
> started having problems with one of the nodes rebooting at seemingly<br>
> random times. This only happens when I am running a job on it. If it sits idle, it<br>
> stays up without rebooting. The reboots also come completely out of the blue,<br>
> with no trace in the Debian logs.<br>
><br>
> The reboots happen after a job is started. The same code runs on the other<br>
> nodes without problems for days.<br>
><br>
> Has anyone experienced this before, and can you point me towards possible<br>
> causes?<br>
><br>
> Thanks, Jan<br>
><br>
><br>
</div></div>> _______________________________________________<br>
> torqueusers mailing list<br>
> <a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>
> <a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
><br>
><br>
</blockquote></div><br>