The code is a standard example I found somewhere. Here it is:<br><br>#include <mpi.h><br> #include <stdio.h><br> #include <string.h><br><br> #define BUFSIZE 128<br> #define TAG 0<br><br> int main(int argc, char *argv[])<br>
{<br> char idstr[BUFSIZE]; /* 32 bytes was too small for the "reporting for duty" message */<br> char buff[BUFSIZE];<br> char processor_name[MPI_MAX_PROCESSOR_NAME];<br> int namelen;<br> int numprocs;<br> int myid;<br> int i;<br> MPI_Status stat;<br><br> MPI_Init(&argc,&argv); /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */<br>
 MPI_Comm_size(MPI_COMM_WORLD,&numprocs); /* find out how big the SPMD world is */<br> MPI_Comm_rank(MPI_COMM_WORLD,&myid); /* and what this process's rank is */<br> MPI_Get_processor_name(processor_name, &namelen);<br>
<br> /* At this point, all the programs are running equivalently, the rank is used to<br> distinguish the roles of the programs in the SPMD model, with rank 0 often used<br> specially... */<br> if(myid == 0)<br>
{<br> printf("%d(%s): We have %d processors\n", myid, processor_name, numprocs);<br> for(i=1;i<numprocs;i++)<br> {<br> snprintf(buff, BUFSIZE, "Hello %d! ", i);<br> MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);<br>
}<br> for(i=1;i<numprocs;i++)<br> {<br> MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);<br> printf("%d(%s): %s\n", myid, processor_name, buff);<br> }<br> }<br>
else<br> {<br> /* receive from rank 0: */<br> MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);<br> snprintf(idstr, sizeof(idstr), "Processor %d (%s) reporting for duty\n", myid, processor_name);<br>
 /* append without writing past the end of buff: */<br> strncat(buff, idstr, BUFSIZE - strlen(buff) - 1);<br> /* send to rank 0: */<br> MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);<br>
}<br><br> MPI_Finalize(); /* MPI programs end with MPI_Finalize; this is a weak synchronization point */<br> return 0;<br> }<br><br>It's really strange and I don't understand it. The issue is that my colleague managed to get it to<br>
work, and I followed the same manual (<a href="http://debianclusters.cs.uni.edu/index.php/Using_a_Scheduler_and_Queue">http://debianclusters.cs.uni.edu/index.php/Using_a_Scheduler_and_Queue</a>).<br>Open MPI was configured with these options:<br>
<br>./configure --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr/local --enable-shared --with-system-zlib --libexecdir=/usr/local/lib --without-included-gettext --enable-threads=posix --enable-nls --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --enable-checking=release x86_64-linux-gnu --with-tm=/usr/local<br>
<br>Torque worked fine without Open MPI. I use Maui as the scheduler, and I have had no problems with it either.<br>Maybe I should ask the Open MPI developers for help.<br><br>If anybody knows of anything that could help, I'm listening.<br>
Thank you.<br><br><br><br><div class="gmail_quote">On Thu, Feb 21, 2008 at 3:03 PM, Craig West <<a href="mailto:cwest@astro.umass.edu">cwest@astro.umass.edu</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>
Hi Jozef,<br>
<br>
Not sure that I can really help any more. It's not something I've seen.<br>
At a guess, it could be related to your network, the code you are<br>
running, or perhaps faulty RAM.<br>
<br>
The code is the first thing I am suspicious of. Does this code run<br>
correctly if you manually run it (without the queue) on the same nodes?<br>
I noticed that the last snippet you sent was for 8 processors, and again<br>
processors 1-7 were listed, and it appears that the first node in the<br>
list was the one that crashed. It looks like processor 0 could be the<br>
problem, hence my concern with the code. If you want to send me the<br>
source code I'll try it here and see what happens.<br>
<br>
<br>
The other thing I noticed is that your computers are not time<br>
synchronized. I would suggest setting up NTP; it's not a must but can<br>
make life easier, especially when tracking faults between computers, and<br>
building code.<br>
<br>
Cheers,<br>
<font color="#888888">Craig.<br>
</font></blockquote></div><br>