[torquedev] Re: [torqueusers] Torque 2.3.1-snap.200805191843 out of
filedescriptor errors
Chad Vizino
vizino at psc.edu
Wed May 28 06:36:24 MDT 2008
Hi Chris,
I haven't looked at the source recently, but there was a file descriptor
leak in cpuset cleanup. I'm not sure whether a fix has been applied yet
(see my note below, from a couple of months ago, which includes a fix),
but this could be the problem.
Regards,
-Chad
> Subject: [torquedev] 2.3.0 cpuset fixes
> Date: Thu, 27 Mar 2008 19:45:14 -0400
> From: Chad Vizino <vizino at psc.edu>
> To: torquedev at supercluster.org
>
> I misdirected the message below to torquedev-request. Sorry about that.
>
> Perhaps the fixes below could be considered for the next 2.3.0 snap if
> they haven't been repaired already.
>
> -Chad
>
> -------- Original Message --------
> Subject: Re: [torqueusers] ncpus=? 0
> Date: Thu, 27 Mar 2008 08:20:52 -0400
> From: Chad Vizino <vizino at psc.edu>
> To: siri <didier.siri at univ-provence.fr>
> CC: torquedev-request at supercluster.org, torqueusers at supercluster.org
> References: <F020B8F2-B338-45A4-8555-AD3E4FBF705A at univ-provence.fr>
>
> Greetings,
>
> We have been using 2.3.0 (the release, not a snap) for a week or so on
> our Altix systems (one with 144 cpus, the other with 768) with cpusets
> enabled. There are a few bugs that you should be aware of:
>
> 1) ncpus not obtained and showing ncpus=? in pbsnodes output.
>
> Fix: src/resmom/linux/mom_mach.c
> at lines 3333-3338, delete if condition around fscanf.
>
> 2) cpuset handling has a logic error in how it reports cleanup messages.
> There is also a file descriptor leak in cpuset cleanup; left unchecked,
> pbs_mom will hit its file descriptor limit and stop working. In addition,
> cpuset cleanups are slow (N seconds, where N is the number of cpus in
> the cpuset) due to an unnecessary sleep. Finally, depending on how big
> your machine is, you may need to enlarge an array in the cpuset
> creation routine.
>
> Fix: src/resmom/linux/cpuset.c
> at line 67, remove "!" before cpuset_delete(childpath)
> at line 84, add "fclose(fd);"
> at line 85, delete "sleep(1);"
> at line 92, add "closedir(dir);"
> at line 222, the array size for cpusbuf[] may not be big enough; it
> depends on how many cpus you have (allow about 4 chars per cpu to be safe)
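
As a rough check on the "4 chars per cpu" estimate, the sketch below counts
the bytes needed for a list of cpu ids. It assumes the buffer holds a
comma-separated decimal list such as "0,1,2,...", which may not match the
real layout of cpusbuf[]; cpu_list_bytes() is a hypothetical name.

```c
#include <stddef.h>

/* Bytes needed to hold "0,1,...,ncpus-1" plus the terminating NUL.
 * Assumption: a comma-separated decimal list; the actual cpusbuf[]
 * format in cpuset.c may differ. */
size_t cpu_list_bytes(unsigned int ncpus)
{
    size_t total = 0;
    unsigned int cpu;

    for (cpu = 0; cpu < ncpus; cpu++) {
        unsigned int v = cpu;

        do {            /* count the decimal digits of this cpu id */
            total++;
            v /= 10;
        } while (v != 0);

        total++;        /* comma separator, or the NUL after the last id */
    }

    return ncpus == 0 ? 1 : total;  /* an empty list still needs the NUL */
}
```

Under that assumption a 768-cpu list needs 2962 bytes and a 144-cpu list 466,
against 3072 and 576 from the 4-per-cpu rule. The rule stays safe for any
machine with fewer than 1000 cpus, since each id is then at most three digits
plus one separator.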
>
> Not a bug, but depending on the size of your machine, you may need to
> increase the number of file descriptors per process in
> setup_program_environment() in src/resmom/mom_mach.c at line 6271. We
> added the following and raised the limit in /etc/security/limits.conf:
>
> /* temporary hack to work around 1024 limit */
> getrlimit(RLIMIT_NOFILE, &rlimit);
> if (rlimit.rlim_cur < 4096 || rlimit.rlim_max < 4096) {
>     rlimit.rlim_cur = 4096;
>     rlimit.rlim_max = 4096;
>     setrlimit(RLIMIT_NOFILE, &rlimit);
> }
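
A slightly more defensive variant of the snippet above might look like the
sketch below. ensure_nofile_limit() is a hypothetical name, not something in
Torque. It checks both return values and, unlike the hack above, never lowers
a hard limit that is already higher than the requested value.

```c
#include <sys/resource.h>

/* Hypothetical helper: make sure at least `wanted` descriptors are
 * allowed for this process. Returns 0 on success, -1 if getrlimit()
 * or setrlimit() fails (errno is left set by the failing call). */
int ensure_nofile_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_cur >= wanted)
        return 0;                   /* soft limit is already high enough */

    if (rl.rlim_max < wanted)
        rl.rlim_max = wanted;       /* raising the hard limit needs privilege */
    rl.rlim_cur = wanted;

    return setrlimit(RLIMIT_NOFILE, &rl) == 0 ? 0 : -1;
}
```

A caller would use ensure_nofile_limit(4096) and log a warning on -1. Raising
the hard limit only works for a privileged process, so an unprivileged caller
should expect the call to fail when the hard limit is below the request.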
>
>
> In our server "nodes" file we have:
>
> host np=N
>
> (note no ":ts" after host). We choose to make N 4 cpus smaller than the
> physical number of cpus on the system since our boot cpuset is 4 cpus
> and 1 memory node.
>
> When submitting a job, use "-l nodes=1:ppn=4" for example. cpusets are
> not constructed when using "-l ncpus=...".
>
> Hope this helps a little. We're still playing with
> exclusive/non-exclusive cpuset settings and limiting the job to specific
> memory nodes to see how things work. I'd be interested in your experiences.
>
> Regards,
>
> Chad Vizino
Chris Samuel wrote:
> Never quite sure whether these should go to the users
> or the dev list, so this is going to both. :-)
>
> With Torque 2.3.1-snap.200805191843 I'm suddenly seeing
> pbs_moms dying with:
>
> 05/28/2008 16:23:22;0001; pbs_mom;Svr;pbs_mom;Too many open files (24) in mom_get_sample, 31772: get_proc_stat
>
> This is odd because the maximum number of open files
> according to ulimit is 1024:
>
> open files (-n) 1024
>
> This is with cpusets enabled and with two simple patches
> applied to extend inter-mom TCP timeouts and to put
> tm_spawn tasks in the jobset rather than the per-vnode
> cpusets (as we're using OpenMPI).
>
> Looking at the output of lsof they are almost all open file
> descriptors on various deleted cpuset files for past jobs,
> and I've attached a sample output for one that hasn't (yet)
> keeled over.
>
> Will attempt to see if I can spot why they're not getting closed.
>
> cheers!
> Chris
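
For watching a leak like this from inside a process rather than through lsof,
a small counter such as the sketch below can help. It is an illustration
only: count_open_fds() is a hypothetical helper, and the 65536 scan cap is an
arbitrary choice to keep the loop short when the soft limit is very large.

```c
#include <fcntl.h>
#include <sys/resource.h>

/* Count the descriptors currently open in this process by probing each
 * slot with fcntl(F_GETFD). Returns -1 if the limit cannot be read. */
long count_open_fds(void)
{
    struct rlimit rl;
    long max, fd, count = 0;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    /* Cap the scan so an "unlimited" soft limit does not loop for ages. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > 65536)
        max = 65536;
    else
        max = (long)rl.rlim_cur;

    for (fd = 0; fd < max; fd++) {
        if (fcntl((int)fd, F_GETFD) != -1)
            count++;        /* slot is in use */
    }

    return count;
}
```

Logging this value before and after each cleanup would show whether the count
keeps climbing toward the limit, which is when "Too many open files" (errno
24, EMFILE) starts to appear.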