[torqueusers] torque not scaling well
Ronny T. Lampert
telecaadmin at gmail.com
Thu Aug 2 01:49:50 MDT 2007
> Running max debug shows that a scheduling
> pass goes from well under 1 second below
> 1500 jobs or so up to 10 or 15 minutes
> as the queue length increases, and it's
> pretty much all spent waiting on pbs_server
> to respond to maui. We're not yet sure
> what that means, but it does really make
> us wonder about torque.
Let me chime in: I observed a similar thing quite a while ago.
Back in the 1.2.X days the server used to return information for only
*1* job per request.
Each request was handled via select() (or poll()), plus some overhead
from stat()ing et al., in the server and took rather long to complete.
I was running pbs_sched back then, and transferring each job from
pbs_server to pbs_sched took more than 1 second (because of this
select()ing on both sides, plus timeouts)!
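To put rough numbers on why one job per request hurts, here is a
back-of-the-envelope sketch. It is only a toy model: the function name,
the flat per-request cost, and the idea of batching several jobs into one
request are my assumptions for illustration, not a description of how
pbs_server actually behaves.

```python
def pass_time_seconds(n_jobs, per_request_s, jobs_per_request=1):
    """Estimate the duration of one scheduling pass.

    Toy model (assumption): every request costs a flat per_request_s
    seconds, no matter how many jobs it carries.
    """
    if n_jobs < 0 or jobs_per_request < 1:
        raise ValueError("n_jobs must be >= 0 and jobs_per_request >= 1")
    # Ceiling division: number of round trips needed to fetch all jobs.
    n_requests = -(-n_jobs // jobs_per_request)
    return n_requests * per_request_s


if __name__ == "__main__":
    # One job per request at 1 second each, 1500 queued jobs.
    print(pass_time_seconds(1500, 1.0) / 60, "minutes")
    # Same per-request cost, hypothetical batches of 100 jobs.
    print(pass_time_seconds(1500, 1.0, jobs_per_request=100), "seconds")
```

With 1500 jobs and 1 second per request this model gives 25 minutes for
a pass, while the hypothetical batch of 100 jobs per request would bring
it down to 15 seconds. The point is just that the per-request overhead,
not the amount of data, dominates.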
strace on Linux can do wonders for finding out what's going on
system-call-wise. Maybe you should capture a trace and send it to someone
who wants to analyze it.
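If you want a first look yourself, something like
"strace -f -tt -T -o trace.out -p <pid of pbs_server>" writes one line
per system call, with the time spent in the call in angle brackets at the
end of the line. The sketch below sums that time per system call. The
file name trace.out and the sample lines are made up for illustration,
and the regular expression is my guess at the common line shapes; it
simply skips anything it does not recognize, including
"<unfinished ...>" and "resumed" lines.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   01:49:50.000001 select(5, [3 4], NULL, NULL, {1, 0}) = 0 (Timeout) <1.000000>
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"  # optional pid prefix (from -f)
    r"(?:[\d:.]+\s+)?"                 # optional timestamp (from -t/-tt)
    r"(\w+)\(.*<(\d+\.\d+)>\s*$"       # syscall name ... <elapsed seconds>
)


def summarize(lines):
    """Return (syscall, call count, total seconds), slowest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # unfinished/resumed calls, signals, exit lines
        name, elapsed = m.group(1), float(m.group(2))
        totals[name][0] += 1
        totals[name][1] += elapsed
    rows = [(name, count, secs) for name, (count, secs) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample lines; replace with open("trace.out") for a real trace.
    sample = [
        "01:49:50.000001 select(5, [3 4], NULL, NULL, {1, 0}) = 0 (Timeout) <1.000000>",
        '01:49:51.000100 stat("/tmp/x", {st_mode=S_IFREG|0644, ...}) = 0 <0.000020>',
        "01:49:51.000200 select(5, [3 4], NULL, NULL, {1, 0}) = 1 (in [3]) <0.500000>",
    ]
    for name, count, secs in summarize(sample):
        print(f"{name:10s} calls={count:5d} total={secs:.6f}s")
```

If select() or poll() timeouts end up at the top of such a list, that
would point in the same direction as what I saw back then.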
Cheers,
Ronny