To elaborate on my earlier comments about the scp mechanism for file transfer, here's a simple test that breaks it on our (fairly small) test cluster:<br><br># Run a lot of small, quickly exiting jobs on the cluster<br>
echo "sleep 0.1; echo hello world" | qsub -t 1-100<br><br>After completion:<br><br>ls -l STDIN.* | wc -l<br>118<br><br>That is 118 files where 200 were expected (100 stdout plus 100 stderr). Random stdout/stderr files are not returned, and there doesn't seem to be a pattern. Ranges go missing, presumably when too many results were coming back simultaneously. The full set is listed below; for example, stdout files 21-36 are missing. The count varies from run to run.<br>
The missing files all ended up in the undelivered directories across the nodes. Is there any way to get Torque to retry delivery a few times in the event of failure?<br>
<br>$ ls STD*<br>STDIN.e5510-1 STDIN.e5510-18 STDIN.e5510-39 STDIN.e5510-51 STDIN.e5510-72 STDIN.e5510-84 STDIN.e5510-93 STDIN.o5510-15 STDIN.o5510-39 STDIN.o5510-50 STDIN.o5510-72 STDIN.o5510-83<br>STDIN.e5510-10 STDIN.e5510-19 STDIN.e5510-4 STDIN.e5510-52 STDIN.e5510-73 STDIN.e5510-85 STDIN.e5510-94 STDIN.o5510-16 STDIN.o5510-4 STDIN.o5510-51 STDIN.o5510-73 STDIN.o5510-84<br>
STDIN.e5510-100 STDIN.e5510-2 STDIN.e5510-40 STDIN.e5510-54 STDIN.e5510-74 STDIN.e5510-86 STDIN.e5510-97 STDIN.o5510-17 STDIN.o5510-40 STDIN.o5510-53 STDIN.o5510-74 STDIN.o5510-85<br>STDIN.e5510-11 STDIN.e5510-20 STDIN.e5510-41 STDIN.e5510-55 STDIN.e5510-75 STDIN.e5510-87 STDIN.o5510-1 STDIN.o5510-18 STDIN.o5510-41 STDIN.o5510-56 STDIN.o5510-75 STDIN.o5510-87<br>
STDIN.e5510-12 STDIN.e5510-22 STDIN.e5510-42 STDIN.e5510-6 STDIN.e5510-76 STDIN.e5510-88 STDIN.o5510-10 STDIN.o5510-19 STDIN.o5510-42 STDIN.o5510-6 STDIN.o5510-76 STDIN.o5510-88<br>STDIN.e5510-13 STDIN.e5510-26 STDIN.e5510-43 STDIN.e5510-63 STDIN.e5510-79 STDIN.e5510-89 STDIN.o5510-100 STDIN.o5510-2 STDIN.o5510-44 STDIN.o5510-63 STDIN.o5510-79 STDIN.o5510-9<br>
STDIN.e5510-14 STDIN.e5510-3 STDIN.e5510-45 STDIN.e5510-65 STDIN.e5510-8 STDIN.e5510-9 STDIN.o5510-11 STDIN.o5510-20 STDIN.o5510-45 STDIN.o5510-65 STDIN.o5510-8 STDIN.o5510-91<br>STDIN.e5510-15 STDIN.e5510-30 STDIN.e5510-49 STDIN.e5510-7 STDIN.e5510-80 STDIN.e5510-90 STDIN.o5510-12 STDIN.o5510-3 STDIN.o5510-46 STDIN.o5510-7 STDIN.o5510-80 STDIN.o5510-97<br>
STDIN.e5510-16 STDIN.e5510-34 STDIN.e5510-5 STDIN.e5510-70 STDIN.e5510-81 STDIN.e5510-91 STDIN.o5510-13 STDIN.o5510-37 STDIN.o5510-49 STDIN.o5510-70 STDIN.o5510-81<br>STDIN.e5510-17 STDIN.e5510-37 STDIN.e5510-50 STDIN.e5510-71 STDIN.e5510-83 STDIN.e5510-92 STDIN.o5510-14 STDIN.o5510-38 STDIN.o5510-5 STDIN.o5510-71 STDIN.o5510-82<br>
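To see exactly which array indices are affected, rather than eyeballing the listing, something like the following could be run in the submission directory. It is only a minimal sketch: the function name `missing_indices` is made up for illustration, and it assumes the output files follow the `PREFIX-INDEX` naming seen in the listing above (for example `STDIN.o5510-21`).

```shell
#!/bin/sh
# Print every array index in FIRST..LAST that has no output file.
# Usage: missing_indices PREFIX FIRST LAST
#   PREFIX is e.g. "STDIN.o5510" (stdout) or "STDIN.e5510" (stderr).
# Hypothetical helper; assumes files are named PREFIX-INDEX.
missing_indices() {
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        # Report the index only when the expected file is absent
        [ -e "${prefix}-${i}" ] || printf '%s\n' "$i"
        i=$((i + 1))
    done
}
```

For the run above, `missing_indices STDIN.o5510 1 100` should include 21 through 36 in its output, and `missing_indices STDIN.e5510 1 100` would give the corresponding stderr gaps.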
<br><br clear="all"><br>-- <br>Darren Platt<br>Senior Director, Research<br>23andMe, inc