Materials Studio Bugs with PBS (was Re: [torqueusers] Node free in pbsnodes but queues won't use it)

Chris Samuel csamuel at vpac.org
Mon Jul 4 21:42:24 MDT 2005


On Wed, 11 May 2005 07:09 am, Munson,Jennifer N. wrote:

> I have a Rocks cluster which was running Maui and torque but in order to
> troubleshoot and get jobs to run from MS Modeling we stopped Maui and
> enabled the pbs_sched.
[...]
> And the related question would be... Has anyone out there enabled the
> Accelrys application MS Modeling 3.2 to work correctly with Maui and
> Torque using hpmpi? Alternatively, is there anyone out there who is
> itching to know how and would like to help me? ;>

On Thu, 12 May 2005 04:40 am, Roy Dragseth wrote:

> I'm the maintainer of the PBS/Maui roll in Rocks and we have also been 
> struggling to make MS Modeling work on our cluster.


Hi Jennifer, Roy,

Ahh, so we're not the only ones struggling with that "interesting" piece of 
software to get it to work with PBS.

They make a lot of *VERY* bad assumptions about the layout of a PBS cluster 
and I am forever having to hand patch dsd_pbs.pm to *FIX* their software when 
our users need it upgrading. :-(

For instance their way of working out if you've got a queueing system is to 
look for a pbs_server process!  This of course fails if it's not running on 
your management node (which we don't allow).   The fix is to go and edit:

 ${where MS is}/Gateway/root_default/dsd/commands/queues/PBS/dsd_pbs.pm

search for pbs_serv and change their code that "detects" PBS to always return 
1 and then restart the gatekeeper.  You should then be able to edit the 
"Gateway Data" and select Open PBS as the queueing system.

A more portable solution would be to check that "qstat -q" returns 0 (to show 
it's all OK).
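
A minimal sketch of that check, written in shell rather than the Perl used in 
dsd_pbs.pm. The function name pbs_available and the QSTAT override variable 
are hypothetical, introduced only for illustration; the real change would be 
an equivalent test inside the Accelrys detection code.

```shell
# Hypothetical helper: report whether a PBS server answers queries,
# instead of looking for a local pbs_server process.
# QSTAT is an assumed override so a different qstat path can be used.
pbs_available() {
    # "qstat -q" lists the queues; exit status 0 means the server replied.
    "${QSTAT:-qstat}" -q >/dev/null 2>&1
}
```

This works wherever the client commands can reach the server, regardless of 
which host pbs_server runs on.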

You will probably also want to stop its very anti-social pounding of your 
pbs_server: it defaults to sleeping for 1 second between qstat calls for the 
first 30 seconds of the job's life, and then backs off to hitting it every 5 
seconds!

To fix this edit:

 ${where MS is}/Gateway/root_default/dsd/commands/DSD_serverutils.pm

and search for the Perl variable $delay and fix the couple of lines there.
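
As an illustration only, a gentler polling schedule might look like the 
following shell sketch. The function name poll_delay and the values (30 
seconds at first, 60 seconds after five minutes) are assumptions chosen for 
the example, not the settings used at any particular site; the actual edit is 
to the Perl $delay lines in DSD_serverutils.pm.

```shell
# Hypothetical schedule: seconds to sleep before the next qstat,
# given how many seconds the job has been tracked so far.
# The thresholds and intervals below are assumed example values.
poll_delay() {
    elapsed=$1
    if [ "$elapsed" -lt 300 ]; then
        echo 30     # first five minutes: poll every 30 seconds
    else
        echo 60     # afterwards: poll once a minute
    fi
}
```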

It's even worse when it comes to checking for the status of jobs it has queued 
to find out if they've finished yet - we were forever having the Gateway 
declare that MS jobs had finished or crashed when they were still running on 
the system, or in some cases were still queued waiting to run!

This is because they wrongly assume that if qstat returns *any* error then the 
job has died, which is of course simply not true.

The *only* time you can assume a job is finished is when qstat of the job ID 
returns with the exit code 153, i.e. in dsd_pbs.pm the qstat check should do:

 if (($?>>8) == 153)

Any other error code indicates a problem somewhere else that is unlikely to 
have affected the running of a job, and once fixed by the admins you will 
either find that the job is still there *or* get the 153 code to show it's 
finished.
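
The same rule can be sketched in shell (the real fix belongs in the Perl of 
dsd_pbs.pm). The function name job_state, the QSTAT override variable and the 
three state labels are hypothetical names made up for this example; only the 
meaning of exit code 153 comes from the behaviour described above.

```shell
# Hypothetical helper: classify a job from the exit status of qstat.
# Only 153 (unknown job ID) is treated as "the job has gone";
# any other non-zero status means "cannot tell, ask again later".
job_state() {
    status=0
    "${QSTAT:-qstat}" "$1" >/dev/null 2>&1 || status=$?
    case "$status" in
        0)   echo present ;;    # job is still queued or running
        153) echo finished ;;   # server no longer knows the job ID
        *)   echo unknown ;;    # some other problem; do not assume failure
    esac
}
```

A caller would keep polling on "present" or "unknown" and only collect 
results once it sees "finished".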

I've just done the upgrade to 3.2 here and I'm in the process of patching 
dsd_pbs.pm yet *AGAIN* to fix it.

It's trivial to prove to yourself by doing:

 qstat 001
 echo $?

and comparing it with:

 qstat  nobody
 echo $?

The former tells you "qstat: Unknown Job Id 001.${PBSSERVER}" and returns 153 
whilst the latter tells you "qstat: Unknown queue destination nobody" and 
returns 170.

In September *2003* I provided this particular fix to Accelrys and it's still 
not appeared in their code, so I'm BCC'ing this to the people at Accelrys I 
had contact with then in the hope that I may get some response to this, and 
so that they can see that this is a real problem that lots of sites have with 
their software.

When I brought this up on the old ScalablePBS Users list back in 2003 I got a 
response from some folks at a very large company who were having exactly 
these problems with Materials Studio, and this fix (amongst others) helped 
them too.

At least the broken parts are all written in Perl, so I can fix their broken 
code for them!

cheers,
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
