Hi Folks,<br><br>Thanks for all your replies. I had suspected that mixing versions was a little unsafe. However, I am confused about why the two versions can work together for a period of time and then segfault when the server pings the moms. To find an explanation I made a debug build of the segfaulting moms (torque-2.3.6-2cri.x86_64), and debugging with it has moved me a little closer to the problem.<br>
<br>Program received signal SIGSEGV, Segmentation fault.<br>mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450<br>450 ipaddr = ntohl(addr->sin_addr.s_addr);<br>(gdb) where<br>#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450<br>
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022<br>#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffcb2774d8) at mom_server.c:2125<br>#3 0x0000000000416997 in do_rpp (stream=0) at mom_main.c:5351<br>
#4 0x0000000000416a52 in rpp_request (fd=<value optimized out>) at mom_main.c:5408<br>#5 0x00002ae6c4678bc8 in wait_request (waittime=<value optimized out>, SState=0x0) at ../Libnet/net_server.c:469<br>#6 0x0000000000416c1d in main_loop () at mom_main.c:8046<br>
#7 0x0000000000416ee1 in main (argc=1, argv=0x7fffcb277bc8) at mom_main.c:8148<br>(gdb) print ipaddr<br>No symbol "ipaddr" in current context.<br>(gdb) print addr<br>$1 = <value optimized out><br>(gdb) print addr->sin_addr.s_addr<br>
Cannot access memory at address 0x4<br>(gdb) print * addr <br>Cannot access memory at address 0x0<br>(gdb) frame 1<br>#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022<br>2022 if ((pms = mom_server_find_by_ip(ipaddr)))<br>
(gdb) print ipaddr<br>No symbol "ipaddr" in current context.<br>(gdb) <br><br>It appears that addr is NULL in mom_server_find_by_ip, so the dereference at mom_server.c:450 faults, but I do not yet see why it would be NULL. Does anyone know the source well enough to comment on this?<br>
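<br>To make the gdb output easier to read, here is a small standalone sketch. It is not taken from the TORQUE source, and the helper names (sin_addr_offset, format_host_order_ip, addr_matches) are hypothetical. It shows why a NULL addr produces "Cannot access memory at address 0x4", decodes search_ipaddr assuming it is in host byte order (as the ntohl() on line 450 suggests), and shows the kind of NULL guard that would avoid the fault.<br>

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of sin_addr inside struct sockaddr_in. With addr == NULL,
 * addr->sin_addr.s_addr is read from address 0 plus this offset. */
static size_t sin_addr_offset(void)
{
    return offsetof(struct sockaddr_in, sin_addr);
}

/* Format an IPv4 address given in HOST byte order as a dotted quad.
 * Assumption: search_ipaddr in the backtrace is in host byte order.
 * Returns out on success, NULL on failure. */
static const char *format_host_order_ip(uint32_t host_ip, char *out, size_t len)
{
    struct in_addr in;

    assert(out != NULL);
    memset(&in, 0, sizeof in);
    in.s_addr = htonl(host_ip); /* inet_ntop expects network byte order */
    return inet_ntop(AF_INET, &in, out, (socklen_t)len);
}

/* Hypothetical guarded comparison: returns 1 when addr holds
 * search_ipaddr, and 0 otherwise, including when addr is NULL. */
static int addr_matches(const struct sockaddr_in *addr, uint32_t search_ipaddr)
{
    if (addr == NULL)
        return 0; /* skip the entry instead of dereferencing NULL */
    return ntohl(addr->sin_addr.s_addr) == search_ipaddr;
}
```

On Linux x86_64, sin_addr_offset() is 4, which matches the faulting address 0x4, and format_host_order_ip(177078032, buf, sizeof buf) gives 10.141.255.16. A guard like the one in addr_matches would only hide the symptom; it does not explain why the stream has no address in the first place.<br>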
<br>Cheers,<br><br>Dug<br><br><div class="gmail_quote">2009/11/5 Ken Nielson <span dir="ltr"><<a href="mailto:knielson@adaptivecomputing.com">knielson@adaptivecomputing.com</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="im">Dug,<br>
<br>
I currently use both 32 bit and 64 bit machines in my cluster running 2.3.x and 2.4.x. I have not had any problems except when using high availability because serverdb is created directly from memory so the 32 bit and 64 bit machines do not create compatible images.<br>
<br>
Did something change between 2.1.x and 2.3.x in the protocol?<br>
<br>
I believe this may be a version incompatibility problem and not an architecture problem.<br>
<br>
Ken Nielson<br>
Adaptive Computing<br>
<br>
<br>
<br>
Garrick Staples wrote:<br>
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="h5">
I used a mix of 32bit and 64bit pbs_moms for years. It was never a problem.<br>
<br>
This is just another bug in the 2.3.x line. The 2.1.x line is stable.<br>
<br>
On Thu, Nov 05, 2009 at 11:24:02AM -0500, Tom Pierce alleged:<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Dear Douglas,<br>
<br>
I had mixed 32 bit moms and 64 bit moms and it did not work well. I<br>
recovered by switching to a full 32 bit setup for Torque both pbs and<br>
moms. Later when the full architecture was 64 bit I moved up to 64<br>
bit everywhere.<br>
<br>
my two cents.<br>
<br>
Tom<br>
<br>
On Wed, Nov 4, 2009 at 4:50 AM, Douglas McNab <<a href="mailto:d.mcnab@physics.gla.ac.uk" target="_blank">d.mcnab@physics.gla.ac.uk</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi,<br>
<br>
I have an issue with segfaulting moms that seems correlated with the<br>
server trying to ping its moms.<br>
The server version is torque-2.3.6-2cri.x86_64.<br>
We are currently supporting two OSes through the same batch system using<br>
a submit filter and node properties. Therefore, we have two different<br>
versions of moms.<br>
Nodes 1->295 have moms torque-2.3.6-2cri.x86_64 and 296->309 have moms<br>
torque-2.1.9-4cri.slc4.i386<br>
<br>
When the moms segfault we see that the torque-2.1.9 moms stay up and only<br>
the torque-2.3.6 moms all die.<br>
<br>
I ran one of them through GDB and can see the call stack:<br>
<br>
Program received signal SIGSEGV, Segmentation fault.<br>
0x000000000041813f in ?? ()<br>
(gdb) where<br>
#0 0x000000000041813f in ?? ()<br>
#1 0x000000000041985e in ?? ()<br>
#2 0x0000000000419a70 in ?? ()<br>
#3 0x0000000000416b97 in close_conn ()<br>
#4 0x0000000000416c52 in close_conn ()<br>
#5 0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2<br>
#6 0x0000000000416e1d in close_conn ()<br>
#7 0x00000000004170e1 in close_conn ()<br>
#8 0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6<br>
#9 0x0000000000405eb9 in close_conn ()<br>
#10 0x00007fff7565e368 in ?? ()<br>
#11 0x0000000000000000 in ?? ()<br>
<br>
Unfortunately this doesn't really give me any clues.<br>
Does anyone have any other ideas?<br>
<br>
Cheers,<br>
<br>
Dug<br>
<br>
--<br>
ScotGrid, Room 481, Kelvin Building, University of Glasgow<br>
<br>
<br>
_______________________________________________<br>
torqueusers mailing list<br>
<a href="mailto:torqueusers@supercluster.org" target="_blank">torqueusers@supercluster.org</a><br>
<a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
<br>
<br>
<br>
</blockquote>
<br>
-- <br>
-----------------------<br>
Thanks<br>
<br>
Tom<br>
<br>
</blockquote>
<br></div></div>
------------------------------------------------------------------------<div class="im"><br>
<br>
<br>
</div></blockquote>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>ScotGrid, Room 481, Kelvin Building, University of Glasgow<br>tel: +44(0)141 330 6439<br>