[torqueusers] Interaction with NFS caused by high job count

Brock Palen brockp at umich.edu
Thu Mar 17 11:15:02 MDT 2011


Yes,

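Check the number of privileged ports (below 1024) in use. TORQUE and the NFS client both bind them, so a busy pbs_server can leave none free for mount.
You can solve this a few ways. We tune TCP settings to avoid it: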

#http://www.clusterresources.com/pipermail/torqueusers/2009-February/008715.html
# release sockets faster because we use a lot of them
net.ipv4.tcp_fin_timeout = 20
# Reuse sockets as fast as possible
net.ipv4.tcp_tw_reuse = 1
# tcp_tw_recycle can break clients behind NAT and was removed in Linux 4.12
net.ipv4.tcp_tw_recycle = 1

You can also build TORQUE to not use privileged ports.
Lastly, you can increase job_stat_rate.

Note that the number of connections is proportional to the number of jobs, not nodes.
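
To check privileged port usage on the pbs_server host, something like the following may help. It is only a rough sketch: it assumes a Linux host with /proc/net/tcp, and the function name count_privileged is made up for illustration. It counts TCP sockets whose local port is below 1024, grouped by state, so you can see how many are sitting in TIME_WAIT.

```python
import os
from collections import Counter

# Subset of the TCP state codes used in /proc/net/tcp; any other
# state is reported by its raw hex code.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_privileged(proc_text):
    """Count sockets with a local port below 1024, grouped by TCP state."""
    counts = Counter()
    for line in proc_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:03FF" (hex address, hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port < 1024:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__" and os.path.exists("/proc/net/tcp"):
    with open("/proc/net/tcp") as f:
        print(dict(count_privileged(f.read())))
```

If most of the privileged ports show up as TIME_WAIT, the sysctl settings above are the likely fix. The same parsing should work for /proc/net/tcp6, since the port is still the last colon-separated part of the address field.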

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Mar 17, 2011, at 12:51 PM, Kevin Van Workum wrote:

> Has anybody ever noticed any problems with mounting NFS shares on the machine running pbs_server?
> 
> We've seen some issues when the pbs server machine tries to mount NFS shares if we have a large number of running jobs (700-1000 jobs). The error is:
> 
> mount.nfs: input/output error
> 
> The error is inconsistent. Sometimes it works, other times not. I'm guessing I have too many TCP connections open, but it seems like 1000 jobs shouldn't cause a problem. Any ideas?
> 
> -- 
> Kevin Van Workum, PhD
> Sabalcore Computing Inc.
> Run your code on 500 processors.
> Sign up for a free trial account.
> www.sabalcore.com
> 877-492-8027 ext. 11
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


