<div dir="ltr"><br><br><div class="gmail_quote">On Wed, Sep 17, 2008 at 10:06 PM, Zhiliang Hu <span dir="ltr"><<a href="mailto:zhu@iastate.edu">zhu@iastate.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Sorry for cross posting -- I didn't get the problem solved on other lists:<br>
<br>
We are running an 8-node Linux CentOS cluster. When submitting an mpiblast job with "qsub", I ran into this dilemma: what is the correct way to supply the node information? To "qsub" (-l nodes=6:ppn=2), to "mpirun" (-np 12 -machinefile /path/to/mpimachines), or to both? They all failed in my trials (details below).<br>
<br>
Any advice is appreciated.<br>
<br>
Zhiliang<br>
<br>
<br>
P.S. My trials (each command is a single line; I have broken them up for readability):<br>
<br>
(1)<br>
The following mpiblast runs fine on our CentOS cluster:<br>
------------------------------------------------------<br>
/path/to/bin/mpirun -np 12 -machinefile /path/to/mpimachines<br>
/path/to/mpiblast<br>
-p blastn<br>
-d seq.db<br>
-i /path/to/input.seq<br>
-o /path/to/output.txt<br>
------------------------------------------------------<br>
<br>
(2)<br>
When I try to send the job with 'qsub', it has problems:<br>
--------------------------------------<br>
qsub -l nodes=6:ppn=2<br>
-e /path/to/locationA<br>
-o /path/to/locationA<br>
/path/to/program<br>
<br>
where "program" is:<br>
<br>
/path/to/bin/mpirun<br>
/path/to/mpiblast<br>
-p blastn<br>
-d seq.db<br>
-i /path/to/input.seq<br>
-o /path/to/output.txt<br>
--------------------------------------<br>
Torque's "..ER" file says: "Sorry, mpiBLAST must be run on 3 or more nodes". (The same error also shows up in the node's /undelivered/ directory.)<br>
<br>
A SIDE NOTE: This worked before on this machine but for some weird reason it is failing now.<br>
<br>
<br>
(3)<br>
But if I also pass the node information to mpirun, as in:<br>
--------------------------------------<br>
qsub -l nodes=6:ppn=2<br>
-e /path/to/locationA<br>
-o /path/to/locationA<br>
/path/to/program<br>
<br>
where "program" is:<br>
<br>
/path/to/bin/mpirun -np 12 -machinefile /path/to/mpimachines<br>
/path/to/mpiblast<br>
-p blastn<br>
-d seq.db<br>
-i /path/to/input.seq<br>
-o /path/to/output.txt<br>
--------------------------------------<br>
It fails with the error: "pls:tm: failed to poll for a spawned proc, return status = 17002".<br>
<br>
-- What's the proper way to queue mpiblast jobs?</blockquote></div><br><br><br>(3) should work. Which MPI implementation are you using? I would check the mom logs on all of your nodes for an error associated with that job; if you can trace the failure to a specific node, you may be able to diagnose it from there.<br>
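<br>For what it's worth, here is a minimal sketch of what the "program" script from (3) could look like if it took its node list from Torque rather than from a hand-maintained machine file. It assumes Torque's PBS_NODEFILE variable (one line per allocated slot, so nodes=6:ppn=2 should give 12 lines). The helper names count_slots and build_mpirun_cmd are made up for illustration, and the command is only printed, not launched. If your MPI turns out to be an Open MPI build with Torque (tm) support, which the "pls:tm" prefix in the error suggests, -np and -machinefile can usually be dropped altogether, since mpirun then reads the allocation from Torque itself.<br>

```shell
#!/bin/sh
# Sketch of a Torque job script body (hypothetical helper names).
# Under Torque, PBS_NODEFILE points to a file with one line per allocated slot.

# Count the slots in a node file, ignoring blank lines.
count_slots() {
    grep -c . "$1"
}

# Print the mpirun command line that would be used (dry run; nothing is launched).
build_mpirun_cmd() {
    nodefile=$1
    np=$(count_slots "$nodefile")
    echo "/path/to/bin/mpirun -np $np -machinefile $nodefile /path/to/mpiblast -p blastn -d seq.db -i /path/to/input.seq -o /path/to/output.txt"
}

# Only act when running inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    build_mpirun_cmd "$PBS_NODEFILE"
fi
```

Once the printed line looks right, replace the echo with the command itself so the job actually launches. Deriving -np from the node file keeps the process count in step with whatever -l nodes=...:ppn=... was requested at qsub time.<br>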
</div>