From childss at cs.tcd.ie Wed Jul 1 07:34:39 2009
From: childss at cs.tcd.ie (Stephen Childs)
Date: Wed, 01 Jul 2009 14:34:39 +0100
Subject: [torqueusers] PBS web monitor release 0.7
In-Reply-To: <4A4B57B4.9040209@ibcp.fr>
References: <4988290D.4040601@cs.tcd.ie> <4A4B57B4.9040209@ibcp.fr>
Message-ID: <4A4B65EF.5030901@cs.tcd.ie>

Alexis Michon wrote:
> Hello Stephen,
>
> Can you explain the column efficiency in the table users, please?
> example : 138 <- what does this mean ?

Efficiency is CPU time divided by walltime, expressed as a percentage.
It can be >100 for multithreaded code or multi-node (e.g. MPI) jobs,
where several cores accumulate CPU time during the same wall-clock
interval.

Stephen
--
Dr. Stephen Childs,
Research Fellow, EGEE Project,    phone: +353-1-8961797
Computer Architecture Group,      email: Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland   web: http://www.cs.tcd.ie/Stephen.Childs

From loc157 at yahoo.com Tue Jul 7 03:12:48 2009
From: loc157 at yahoo.com (Loc Tran)
Date: Tue, 7 Jul 2009 02:12:48 -0700 (PDT)
Subject: [torqueusers] Help: about fifo scheduler source !!
Message-ID: <626679.39837.qm@web38907.mail.mud.yahoo.com>

Hi all!

I want to adjust the fifo scheduler source for my cluster system, but I
haven't found a PBS function that runs a parallel job on multiple nodes.
Does pbs_runjob(int connect, char *job_id, char *location, char *extend)
really run a job on only one node?

Can anybody help me, please?

P.S. Sorry, my English is not good.

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090707/7fa130f6/attachment.html

From ben at dayborogeo.com Mon Jul 6 18:27:21 2009
From: ben at dayborogeo.com (Ben Turner)
Date: Tue, 07 Jul 2009 10:27:21 +1000
Subject: [torqueusers] Problem setting up headnode to be compute node.
Message-ID: <4A529669.2080704@dayborogeo.com>

Hi,

Wondering if anybody can help.

I am trying to set up torque to run on a single machine with four cores.
I just want to be able to queue jobs up overnight on this machine.

I have installed torque and configured pbs_mom and pbs_sched, listing
only the head node in the "nodes" file and specifying np=4.
I run pbs_server and then pbsnodes -a reports that the node is "down".

When I look in the logs, the server log says that it can ping
"itself" ok but fails to make a connection and sets the state as down.
The mom_log shows that mom is trying to say hello but the connection to
the server times out.

I am sure this is a simple problem that many people have faced. Does
anybody have any ideas?

Also, when I run pbs_sched I get the error message "pbs_sched: Cannot
assign requested address (99) in main, bind".

Cheers
Ben

From dvadell at linuxclusters.com.ar Tue Jul 7 09:32:21 2009
From: dvadell at linuxclusters.com.ar (Diego M. Vadell)
Date: Tue, 07 Jul 2009 12:32:21 -0300
Subject: [torqueusers] Problem setting up headnode to be compute node.
In-Reply-To: <4A529669.2080704@dayborogeo.com>
References: <4A529669.2080704@dayborogeo.com>
Message-ID: <4A536A85.9020802@linuxclusters.com.ar>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Ben,

Do you have a firewall configured? What does pbs_mom's configuration
look like? What IP address corresponds to $clienthost in pbs_mom's
configuration?

HTH,
 -- Diego.

Ben Turner wrote:
> Hi,
>
> Wondering if anybody can help.
>
> I am trying to set up torque to run on a single machine with four cores.
> I just want to be able to queue jobs up overnight on this
> machine.
> > I have installed torque and configured pbs_mom and pbs_sched listing > only the head node in the "nodes" file and specifying np=4. > I run pbs_server and then pbsnodes -a reports that the node is "down". > > When I look in the logs the server log says that that it can ping > "itself" ok but fails to make a connection and sets the state as down. > The mom_log shows that mom is trying to say hello but the connection to > the server times out. > > I am sure this is a simple problem that many people have faced, does > anybody have any ideas? > > Also when I run pbs_sched I get an error message "pbs_sched: Cannot > assign requested address (99) in main, bind" > > Cheers > Ben > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFKU2qFn3NZ34IbincRAvsYAKCBz/LxJMNN/cZ1+VT7/H1XU3JbrgCfXMc6 WvKU+xmicw16YE/7MPGD0cY= =NYEh -----END PGP SIGNATURE----- From gus at ldeo.columbia.edu Tue Jul 7 10:04:30 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 07 Jul 2009 12:04:30 -0400 Subject: [torqueusers] Problem setting up headnode to be compute node. In-Reply-To: <4A536A85.9020802@linuxclusters.com.ar> References: <4A529669.2080704@dayborogeo.com> <4A536A85.9020802@linuxclusters.com.ar> Message-ID: <4A53720E.1030106@ldeo.columbia.edu> Hi Ben, list Here is a recent thread about a very similar problem. It may have the answers you need: http://www.supercluster.org/pipermail/torqueusers/2009-May/009101.html My $0.02 Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Diego M. 
Vadell wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Ben, > > Do you have a firewall configured? How does pbs_mom's configuration > looks like? What IP address corresponds to $clienthost in pbs_mom's > configuration? > > HTH, > -- Diego. > > Ben Turner wrote: >> Hi, >> >> Wondering if anybody can help. >> >> I am trying to set up torque to run on a single machine with four cores. >> I just want to be able to queue jobs up overnight on this >> machine. >> >> I have installed torque and configured pbs_mom and pbs_sched listing >> only the head node in the "nodes" file and specifying np=4. >> I run pbs_server and then pbsnodes -a reports that the node is "down". >> >> When I look in the logs the server log says that that it can ping >> "itself" ok but fails to make a connection and sets the state as down. >> The mom_log shows that mom is trying to say hello but the connection to >> the server times out. >> >> I am sure this is a simple problem that many people have faced, does >> anybody have any ideas? 
>> >> Also when I run pbs_sched I get an error message "pbs_sched: Cannot >> assign requested address (99) in main, bind" >> >> Cheers >> Ben >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFKU2qFn3NZ34IbincRAvsYAKCBz/LxJMNN/cZ1+VT7/H1XU3JbrgCfXMc6 > WvKU+xmicw16YE/7MPGD0cY= > =NYEh > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From chrisjob.fr at gmail.com Wed Jul 8 07:37:00 2009 From: chrisjob.fr at gmail.com (ChrisJob.fr) Date: Wed, 08 Jul 2009 15:37:00 +0200 Subject: [torqueusers] List of discussion or documentation on infiniband Message-ID: <4A54A0FC.5040608@gmail.com> Hi We have an infiniband HPC cluster. Sometimes we have problem with MPI programs and we must restart the infiniband. After everything is OK for 2 weeks. Do you know where I can find a discussion list about infiniband ? Or documention on the subject ? Thank you for yout help Chris From jasonw at jhu.edu Wed Jul 8 07:43:51 2009 From: jasonw at jhu.edu (Jason Williams) Date: Wed, 08 Jul 2009 09:43:51 -0400 Subject: [torqueusers] [OFFTOPIC] List of discussion or documentation on infiniband In-Reply-To: <4A54A0FC.5040608@gmail.com> References: <4A54A0FC.5040608@gmail.com> Message-ID: <4A54A297.2000502@jhu.edu> Hey Chris, One of the major players out there in the Infiniband world is the Open Fabrics Alliance. (http://www.openfabrics.org). There should be some docs and mailing lists on the site that you could check out. Also, you might want to figure out what MPI libraries you are using and check the website for them. 
One last suggestion is to find out who your IB Card and Switch provider is and maybe get them in on a service call. To me, it sounds like you are having a problem with your IB Fabric Subnet Manager. I know some switches out there have this sort of problem, but I don't want to get too deep into because this is technically off topic for this list. -- Jason Williams ChrisJob.fr wrote: > Hi > > We have an infiniband HPC cluster. Sometimes we have problem with MPI > programs and we must restart the infiniband. After everything is OK for > 2 weeks. > Do you know where I can find a discussion list about infiniband ? Or > documention on the subject ? > > Thank you for yout help > Chris > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Joseph.Greenseid at ngc.com Wed Jul 8 07:48:28 2009 From: Joseph.Greenseid at ngc.com (Greenseid, Joseph M (IS)) Date: Wed, 8 Jul 2009 08:48:28 -0500 Subject: [torqueusers] [OFFTOPIC] List of discussion or documentation on infiniband References: <4A54A0FC.5040608@gmail.com> <4A54A297.2000502@jhu.edu> Message-ID: While giving Jason's response a "yeah, I think that, too," I would also suggest checking to see if you got your IB stack from your vendor. Some vendors distribute a specialized software/driver set that they tweak to tune specifically to their gear. It's usually based on the OFED stack from Open Fabrics in my experience, but if they've made changes, then you could/should hit them up for support. --Joe ________________________________ From: torqueusers-bounces at supercluster.org on behalf of Jason Williams Sent: Wed 7/8/2009 9:43 AM To: ChrisJob.fr at gmail.com Cc: torqueusers at supercluster.org Subject: Re: [torqueusers] [OFFTOPIC] List of discussion or documentation on infiniband Hey Chris, One of the major players out there in the Infiniband world is the Open Fabrics Alliance. (http://www.openfabrics.org ). 
There should be some docs and mailing lists on the site that you could check out. Also, you might want to figure out what MPI libraries you are using and check the website for them. One last suggestion is to find out who your IB Card and Switch provider is and maybe get them in on a service call. To me, it sounds like you are having a problem with your IB Fabric Subnet Manager. I know some switches out there have this sort of problem, but I don't want to get too deep into because this is technically off topic for this list. -- Jason Williams ChrisJob.fr wrote: > Hi > > We have an infiniband HPC cluster. Sometimes we have problem with MPI > programs and we must restart the infiniband. After everything is OK for > 2 weeks. > Do you know where I can find a discussion list about infiniband ? Or > documention on the subject ? > > Thank you for yout help > Chris > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090708/141736ce/attachment.html From josh at clusterresources.com Wed Jul 8 13:26:33 2009 From: josh at clusterresources.com (Josh Butikofer) Date: Wed, 08 Jul 2009 13:26:33 -0600 Subject: [torqueusers] TORQUE 2.3.7 Released Message-ID: <4A54F2E9.10706@clusterresources.com> Everyone, TORQUE 2.3.7 has been released. You can download it at: http://www.clusterresources.com/downloads/torque/torque-2.3.7.tar.gz This build has a number of bug fixes and enhancements. 
Here is a list of changes from the CHANGELOG: c - crash b - bug fix e - enhancement f - new feature e - pbs_mom sisters can now tolerate an explicit group ID instead of only a valid group name. This helps TORQUE be more robust to group lookup failures. b - fixed a bug where UNIX domain socket communication was failing when "--disable-privports" was used. e - add job exit status as 10th argument to the epilogue script c - check filename for NULL to prevent crash e - merged in more logging and NOSIGCHLDMOM capability from Yahoo branch e - merged in new log_ext() function to allow more fine grained syslog events, you can now specify severity level. Also added more logging statements e - added code to allow compilers to override CLONE_BATCH_SIZE at configure time (allows for finer grained control on how arrays are created) e - added code which prefixes the severity tag on all log_ext() and log_err() messages e - added qmgr option accounting_keep_days, specifies how long to keep accounting files. e - changed mom config varattr so invoked script returns the varattr name and value(s) e - improved the performance of pbs_server when submitting large numbers of jobs with dependencies defined e - added new parameter "log_keep_days" to both pbs_server and pbs_mom. specifies how long to keep log files before they are automatically removed e - added qmgr server attribute lock_file, specifies where server lock file is located e - modified to allow retention of completed jobs across server shutdown e - added job_must_report qmgr configuration which says the job must be reported to scheduler. Added job attribute "reported". Added PURGECOMP functionality which allows scheduler to confirm jobs are reported. Also added -c option to qdel. Used to clean up unreported jobs. 
b - fix so interactive jobs run when using $job_output_file_umask userdefault b - changes to improve the qstat -x XML output and documentation b - fix truncated output in qmgr (peter h IPSec+jan n NANCO) b - fix so find_resc_entry still works after setting server extra_resc b - change so set_jobexid() gets called if JOB_ATR_egroup is not set b - fixed memory issue (underallocated array for a string) Thanks for everyone who contributed to this release! Regards, -- Josh Butikofer Cluster Resources, Inc. ############################# From gus at ldeo.columbia.edu Wed Jul 8 14:19:30 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 08 Jul 2009 16:19:30 -0400 Subject: [torqueusers] List of discussion or documentation on infiniband In-Reply-To: <4A54A0FC.5040608@gmail.com> References: <4A54A0FC.5040608@gmail.com> Message-ID: <4A54FF52.4090704@ldeo.columbia.edu> Hi Chris, list Like you, I never found any substantial, clear, and easy to follow Infiniband documentation, or a tutorial on how to setup, configure, monitor, and maintain Infiniband networks (IB). AFAIK, weak documentation is a general problem with IB. 

There is the Open Fabrics site, which was already pointed out to you:

http://www.openfabrics.org/

There is also the Open Fabrics general list, which IMHO is not really
so general, but rather technical, aimed mostly at developers, and not
"user-friendly":

http://lists.openfabrics.org/pipermail/general/
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

On a more user-oriented note, I was pointed to this
"Getting Started with Infiniband":

http://people.redhat.com/dledford/infiniband_get_started.html

I dug up this "Introduction to Infiniband Architecture":

http://www.oreillynet.com/pub/a/network/2002/02/04/windows.html

and of course the Wikipedia article:

http://en.wikipedia.org/wiki/InfiniBand

More generic info about IB is in Jeff Layton's Jan/2008 article
on ClusterMonkey:

http://www.clustermonkey.net//content/view/222/2/

Cisco had a reasonable IB technology guide for beginners,
but it seems to have removed it from its web site
(at least I can't find it there anymore).
I found a copy here:

direkt.jacob-computer.de/content/datenblatt/168758_1.pdf

If you downloaded OFED or have it in your cluster,
there are text documents in the directory:

/wherever/you/untarred/it/OFED-1.4/docs

Start with the README file therein.

In /etc/infiniband/ there is some info about how OFED was
configured in your system (if you use OFED).
The startup scripts are /etc/init.d/openibd (all nodes)
and /etc/init.d/opensmd (subnet manager, most likely your head node).

Some useful IB commands (try their man pages first):

  sminfo
  ibv_devinfo
  ibnodes
  ibhosts
  ibstat
  ibstatus
  ibdiagnet
  ibchecknet
  qperf

I realize this is a bit off topic, but not so much,
since it is about clusters, and Torque is a cluster tool.
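
To see quickly which of the commands listed above are installed on a
node, and to collect their output in one place, something along these
lines might do. It is only a rough sketch: the helper names have_cmd
and ib_report are made up for this illustration, and which commands
actually exist depends on the OFED packages installed on your system.

```shell
#!/bin/sh
# Sketch only: report which IB diagnostic commands are installed and run
# the ones that are. have_cmd and ib_report are hypothetical helper names.

# have_cmd NAME: print "yes" if NAME is found in PATH, "no" otherwise.
have_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

# ib_report CMD...: for each command, print a header followed by its
# output, or a note that it is not installed. Each command is run
# without arguments.
ib_report() {
    for cmd in "$@"; do
        if [ "$(have_cmd "$cmd")" = yes ]; then
            echo "=== $cmd ==="
            "$cmd" 2>&1
        else
            echo "=== $cmd: not installed ==="
        fi
    done
}
```

A typical call would be "ib_report sminfo ibv_devinfo ibstat ibstatus
ibhosts". I would leave qperf out of such a one-shot report, because it
starts a server when run without arguments, and run the fabric-wide
checks (ibdiagnet, ibchecknet) separately after reading their man pages.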
Anyway, you may find more information if you dig the on cluster-specific lists, particularly the Beowulf and Rocks Clusters list archives: http://www.beowulf.org/archive/index.html http://marc.info/?l=npaci-rocks-discussion I hope this helps, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- ChrisJob.fr wrote: > Hi > > We have an infiniband HPC cluster. Sometimes we have problem with MPI > programs and we must restart the infiniband. After everything is OK for > 2 weeks. > Do you know where I can find a discussion list about infiniband ? Or > documention on the subject ? > > Thank you for yout help > Chris > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Erming.Pei at ihep.ac.cn Wed Jul 8 19:39:57 2009 From: Erming.Pei at ihep.ac.cn (Erming.Pei) Date: Thu, 9 Jul 2009 09:39:57 +0800 Subject: [torqueusers] TORQUE 2.3.7 Released References: <4A54F2E9.10706@clusterresources.com> Message-ID: <003601ca0036$2aac4f50$43c2a8c0@Erming> Hi Josh, What's main difference between torque-2.3.7 and torque-2.4? Are the changes in torque-2.3.7 also supported in torque-2.4.x? PS. I deployed the latest 2.4.1b1 in our local cluster. Thanks, Erming PEI --------------------- IHEP, CAS, Beijing ----- Original Message ----- From: "Josh Butikofer" To: Sent: Thursday, July 09, 2009 3:26 AM Subject: [torqueusers] TORQUE 2.3.7 Released > Everyone, > > TORQUE 2.3.7 has been released. You can download it at: > > http://www.clusterresources.com/downloads/torque/torque-2.3.7.tar.gz > > This build has a number of bug fixes and enhancements. 
Here is a list of changes > from the CHANGELOG: > > c - crash b - bug fix e - enhancement f - new feature > > e - pbs_mom sisters can now tolerate an explicit group ID instead of only a > valid group name. This helps TORQUE be more robust to group lookup > failures. > b - fixed a bug where UNIX domain socket communication was failing when > "--disable-privports" was used. > e - add job exit status as 10th argument to the epilogue script > c - check filename for NULL to prevent crash > e - merged in more logging and NOSIGCHLDMOM capability from Yahoo branch > e - merged in new log_ext() function to allow more fine grained syslog events, > you can now specify severity level. Also added more logging statements > e - added code to allow compilers to override CLONE_BATCH_SIZE at configure > time (allows for finer grained control on how arrays are created) > e - added code which prefixes the severity tag on all log_ext() and log_err() > messages > e - added qmgr option accounting_keep_days, specifies how long to keep > accounting files. > e - changed mom config varattr so invoked script returns the varattr name > and value(s) > e - improved the performance of pbs_server when submitting large numbers of > jobs with dependencies defined > e - added new parameter "log_keep_days" to both pbs_server and pbs_mom. > specifies how long to keep log files before they are automatically removed > e - added qmgr server attribute lock_file, specifies where server lock file > is located > e - modified to allow retention of completed jobs across server shutdown > e - added job_must_report qmgr configuration which says the job must be > reported to scheduler. Added job attribute "reported". Added PURGECOMP > functionality which allows scheduler to confirm jobs are reported. Also > added -c option to qdel. Used to clean up unreported jobs. 
> b - fix so interactive jobs run when using $job_output_file_umask userdefault
> b - changes to improve the qstat -x XML output and documentation
> b - fix truncated output in qmgr (peter h IPSec+jan n NANCO)
> b - fix so find_resc_entry still works after setting server extra_resc
> b - change so set_jobexid() gets called if JOB_ATR_egroup is not set
> b - fixed memory issue (underallocated array for a string)
>
> Thanks for everyone who contributed to this release!
>
> Regards,
>
> --
> Josh Butikofer
> Cluster Resources, Inc.
> #############################
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From josh at clusterresources.com Thu Jul 9 06:36:38 2009
From: josh at clusterresources.com (Josh Butikofer)
Date: Thu, 9 Jul 2009 06:36:38 -0600 (MDT)
Subject: [torqueusers] TORQUE 2.3.7 Released
In-Reply-To: <4A55A6B2.1090207@dartmouth.edu>
Message-ID: <2060134705.64961247142998261.JavaMail.root@mail>

You have to do this "manually" right now, using something like:

./configure CFLAGS="-DCLONE_BATCH_SIZE=5"

Josh Butikofer
Cluster Resources, Inc.
#############################

----- "Pete Schmitt" wrote:
> Looked at configure --help and can't find anything to do with
> CLONE_BATCH_SIZE. How is overriding CLONE_BATCH_SIZE accomplished?
>
> -pete
>
> Josh Butikofer wrote:
>
> Everyone,
>
> TORQUE 2.3.7 has been released. You can download it at:
> http://www.clusterresources.com/downloads/torque/torque-2.3.7.tar.gz
> This build has a number of bug fixes and enhancements. Here is a list of changes
> from the CHANGELOG:
>
> c - crash b - bug fix e - enhancement f - new feature
>
> e - pbs_mom sisters can now tolerate an explicit group ID instead of only a
>     valid group name. This helps TORQUE be more robust to group lookup
>     failures.
> b - fixed a bug where UNIX domain socket communication was failing > when > "--disable-privports" was used. > e - add job exit status as 10th argument to the epilogue script > c - check filename for NULL to prevent crash > e - merged in more logging and NOSIGCHLDMOM capability from Yahoo > branch > e - merged in new log_ext() function to allow more fine grained > syslog events, > you can now specify severity level. Also added more logging > statements > e - added code to allow compilers to override CLONE_BATCH_SIZE at > configure > time (allows for finer grained control on how arrays are > created) > e - added code which prefixes the severity tag on all log_ext() and > log_err() > messages > e - added qmgr option accounting_keep_days, specifies how long to > keep > accounting files. > e - changed mom config varattr so invoked script returns the > varattr name > and value(s) > e - improved the performance of pbs_server when submitting large > numbers of > jobs with dependencies defined > e - added new parameter "log_keep_days" to both pbs_server and > pbs_mom. > specifies how long to keep log files before they are > automatically removed > e - added qmgr server attribute lock_file, specifies where server > lock file > is located > e - modified to allow retention of completed jobs across server > shutdown > e - added job_must_report qmgr configuration which says the job > must be > reported to scheduler. Added job attribute "reported". Added > PURGECOMP > functionality which allows scheduler to confirm jobs are > reported. Also > added -c option to qdel. Used to clean up unreported jobs. 
> b - fix so interactive jobs run when using $job_output_file_umask > userdefault > b - changes to improve the qstat -x XML output and documentation > b - fix truncated output in qmgr (peter h IPSec+jan n NANCO) > b - fix so find_resc_entry still works after setting server > extra_resc > b - change so set_jobexid() gets called if JOB_ATR_egroup is not > set > b - fixed memory issue (underallocated array for a string) > > Thanks for everyone who contributed to this release! > > Regards, > > -- > > Pete Schmitt > Technical Director: Discovery Cluster > 179B Berry Library, HB 6224 > Dartmouth College > Hanover, NH 03755 > > Dartmouth: 603-646-8109 Mon,Tue,Thu > DHMC/Cell: 603-252-2452 Wed,Fri > Fax: 603-646-1042 > AIM: CongressSt > Computational Genetics Lab From josh at clusterresources.com Thu Jul 9 06:41:19 2009 From: josh at clusterresources.com (Josh Butikofer) Date: Thu, 9 Jul 2009 06:41:19 -0600 (MDT) Subject: [torqueusers] TORQUE 2.3.7 Released In-Reply-To: <003601ca0036$2aac4f50$43c2a8c0@Erming> Message-ID: <1998842979.65021247143279324.JavaMail.root@mail> TORQUE 2.3.7 is the fully released, up-to-date version of TORQUE. All the fixes/enhancements that go into 2.3.x are also being rolled into TORQUE 2.4. TORQUE 2.4 has additional features like much improved support for BLCR among other things (see the CHANGELOG). Technically TORQUE 2.4 is still in beta. We hope to release it over the next few months. If it is working fine for you, I'd say stick with 2.4 to help us meet that goal by shaking out any bugs (or discovering that there are no bugs). ;-) Josh Butikofer Cluster Resources, Inc. ############################# ----- "Erming.Pei" wrote: > Hi Josh, > > What's main difference between torque-2.3.7 and torque-2.4? > > Are the changes in torque-2.3.7 also supported in torque-2.4.x? > > PS. I deployed the latest 2.4.1b1 in our local cluster. 
> > Thanks, > > Erming PEI > --------------------- > IHEP, CAS, Beijing > > > > ----- Original Message ----- > From: "Josh Butikofer" > To: > Sent: Thursday, July 09, 2009 3:26 AM > Subject: [torqueusers] TORQUE 2.3.7 Released > > > > Everyone, > > > > TORQUE 2.3.7 has been released. You can download it at: > > > > > http://www.clusterresources.com/downloads/torque/torque-2.3.7.tar.gz > > > > This build has a number of bug fixes and enhancements. Here is a > list of changes > > from the CHANGELOG: > > > > c - crash b - bug fix e - enhancement f - new feature > > > > e - pbs_mom sisters can now tolerate an explicit group ID instead > of only a > > valid group name. This helps TORQUE be more robust to group > lookup > > failures. > > b - fixed a bug where UNIX domain socket communication was failing > when > > "--disable-privports" was used. > > e - add job exit status as 10th argument to the epilogue script > > c - check filename for NULL to prevent crash > > e - merged in more logging and NOSIGCHLDMOM capability from Yahoo > branch > > e - merged in new log_ext() function to allow more fine grained > syslog events, > > you can now specify severity level. Also added more logging > statements > > e - added code to allow compilers to override CLONE_BATCH_SIZE at > configure > > time (allows for finer grained control on how arrays are > created) > > e - added code which prefixes the severity tag on all log_ext() > and log_err() > > messages > > e - added qmgr option accounting_keep_days, specifies how long to > keep > > accounting files. > > e - changed mom config varattr so invoked script returns the > varattr name > > and value(s) > > e - improved the performance of pbs_server when submitting large > numbers of > > jobs with dependencies defined > > e - added new parameter "log_keep_days" to both pbs_server and > pbs_mom. 
> > specifies how long to keep log files before they are > automatically removed > > e - added qmgr server attribute lock_file, specifies where server > lock file > > is located > > e - modified to allow retention of completed jobs across server > shutdown > > e - added job_must_report qmgr configuration which says the job > must be > > reported to scheduler. Added job attribute "reported". Added > PURGECOMP > > functionality which allows scheduler to confirm jobs are > reported. Also > > added -c option to qdel. Used to clean up unreported jobs. > > b - fix so interactive jobs run when using $job_output_file_umask > userdefault > > b - changes to improve the qstat -x XML output and documentation > > b - fix truncated output in qmgr (peter h IPSec+jan n NANCO) > > b - fix so find_resc_entry still works after setting server > extra_resc > > b - change so set_jobexid() gets called if JOB_ATR_egroup is not > set > > b - fixed memory issue (underallocated array for a string) > > > > Thanks for everyone who contributed to this release! > > > > Regards, > > > > -- > > Josh Butikofer > > Cluster Resources, Inc. > > ############################# > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > From simont at mail.muni.cz Wed Jul 8 07:22:25 2009 From: simont at mail.muni.cz (Simon Toth) Date: Wed, 08 Jul 2009 15:22:25 +0200 Subject: [torqueusers] Configuration for pbs_sched on different machine? Message-ID: <4A549D91.6000703@mail.muni.cz> How can I configure the server/scheduler to work together when they are on different machines? I did run the server with both -S ip and -S hostname, but none seems to work. 
Simon Toth

From peter.schmitt at dartmouth.edu Thu Jul 9 02:13:38 2009
From: peter.schmitt at dartmouth.edu (Pete Schmitt)
Date: Thu, 09 Jul 2009 04:13:38 -0400
Subject: [torqueusers] TORQUE 2.3.7 Released
In-Reply-To: <4A54F2E9.10706@clusterresources.com>
References: <4A54F2E9.10706@clusterresources.com>
Message-ID: <4A55A6B2.1090207@dartmouth.edu>

I looked at configure --help and can't find anything to do with
CLONE_BATCH_SIZE. How is overriding CLONE_BATCH_SIZE accomplished?

-pete

Josh Butikofer wrote:
> Everyone,
>
> TORQUE 2.3.7 has been released. You can download it at:
>
> http://www.clusterresources.com/downloads/torque/torque-2.3.7.tar.gz
>
> This build has a number of bug fixes and enhancements. Here is a list of changes
> from the CHANGELOG:
>
> c - crash b - bug fix e - enhancement f - new feature
>
> e - pbs_mom sisters can now tolerate an explicit group ID instead of only a
>     valid group name. This helps TORQUE be more robust to group lookup
>     failures.
> b - fixed a bug where UNIX domain socket communication was failing when
>     "--disable-privports" was used.
> e - add job exit status as 10th argument to the epilogue script
> c - check filename for NULL to prevent crash
> e - merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
> e - merged in new log_ext() function to allow more fine grained syslog events,
>     you can now specify severity level. Also added more logging statements
> e - added code to allow compilers to override CLONE_BATCH_SIZE at configure
>     time (allows for finer grained control on how arrays are created)
> e - added code which prefixes the severity tag on all log_ext() and log_err()
>     messages
> e - added qmgr option accounting_keep_days, specifies how long to keep
>     accounting files.
> e - changed mom config varattr so invoked script returns the varattr name
>     and value(s)
> e - improved the performance of pbs_server when submitting large numbers of
>     jobs with dependencies defined
> e - added new parameter "log_keep_days" to both pbs_server and pbs_mom.
>     specifies how long to keep log files before they are automatically removed
> e - added qmgr server attribute lock_file, specifies where server lock file
>     is located
> e - modified to allow retention of completed jobs across server shutdown
> e - added job_must_report qmgr configuration which says the job must be
>     reported to scheduler. Added job attribute "reported". Added PURGECOMP
>     functionality which allows scheduler to confirm jobs are reported. Also
>     added -c option to qdel. Used to clean up unreported jobs.
> b - fix so interactive jobs run when using $job_output_file_umask userdefault
> b - changes to improve the qstat -x XML output and documentation
> b - fix truncated output in qmgr (peter h IPSec+jan n NANCO)
> b - fix so find_resc_entry still works after setting server extra_resc
> b - change so set_jobexid() gets called if JOB_ATR_egroup is not set
> b - fixed memory issue (underallocated array for a string)
>
> Thanks for everyone who contributed to this release!
>
> Regards,
>

--
Pete Schmitt
Technical Director: Discovery Cluster
179B Berry Library, HB 6224
Dartmouth College
Hanover, NH 03755

Dartmouth: 603-646-8109 Mon,Tue,Thu
DHMC/Cell: 603-252-2452 Wed,Fri
Fax: 603-646-1042
AIM: CongressSt
Computational Genetics Lab

From anilth at hi.is Thu Jul 9 10:27:37 2009
From: anilth at hi.is (Anil Thapa)
Date: Thu, 09 Jul 2009 16:27:37 +0000
Subject: [torqueusers] Torque 2.3.7 and connection Refused
Message-ID: <4A561A79.4050608@hi.is>

Hi all,

I just built the new version and installed it.
I am having a small problem. Everything looks well however when job submitted it always stays in R mode and rest other jobs are in Q mode when i do qstat. I looked at the node mom_logs it has loads of this error: 07/09/2009 16:18:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in scan_for_exiting, cannot bind to port 1023 in client_to_svr - connection refused Note: submitted job is very simple script. and for the testing purpose firewall has temporarily turned off. Has anyone came up with this error. Any tips, help and suggestion would be appreciated. Regards, \A From tbaer at utk.edu Thu Jul 9 15:52:54 2009 From: tbaer at utk.edu (Troy Baer) Date: Thu, 09 Jul 2009 17:52:54 -0400 Subject: [torqueusers] Job dependency problem Message-ID: <1247176374.6386.119.camel@browncoat.jics.utk.edu> Hello all, A summer intern and I have been working on a tool to automate generating graph-based job dependency chains (a la condor_submit_dag), and we've run into an interesting problem where setting a dependency sometimes doesn't seem to have any effect: ----- $ qstat -r verne.nics.utk.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 2879.verne.nics. troy analysis postproc_hr6 13058 1 -- -- 01:00 R -- 2880.verne.nics. troy hpss archive 14980 -- -- -- 01:00 R -- $ qstat -f 2879 2880 Job Id: 2879.verne.nics.utk.edu ... depend = afterok:2874.verne.nics.utk.edu at verne.nics.utk.edu, beforeok:2880.verne.nics.utk.edu at verne.nics.utk.edu ... submit_args = -N postproc_hr6 -W depend=afterok:2874.verne.nics.utk.edu -v day=20090531,hr=0600 postproc.pbs ... Job Id: 2880.verne.nics.utk.edu ... depend = afterok:2875.verne.nics.utk.edu at verne.nics.utk.edu:2876.verne.nic s.utk.edu at verne.nics.utk.edu:2877.verne.nics.utk.edu at verne.nics.utk.ed u:2878.verne.nics.utk.edu at verne.nics.utk.edu ... 
submit_args = -N archive -W depend=afterok:2875.verne.nics.utk.edu:2876.ve rne.nics.utk.edu:2877.verne.nics.utk.edu:2878.verne.nics.utk.edu:2879. verne.nics.utk.edu -v day=20090531 archive.pbs ... ----- Jobid 2880 was submitted with an afterok dependency on 2879 (among other things), but that has somehow been translated into 2879 having a beforeok dependency on 2880 that wasn't in the original submission and doesn't seem to have any effect. Any ideas on what might be causing this? Thanks, --Troy -- Troy Baer, HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From anilth at hi.is Thu Jul 9 18:31:18 2009 From: anilth at hi.is (Anil Thapa) Date: Fri, 10 Jul 2009 00:31:18 +0000 Subject: [torqueusers] Torque 2.3.7 and connection Refused In-Reply-To: <4A561A79.4050608@hi.is> References: <4A561A79.4050608@hi.is> Message-ID: <4A568BD6.1020500@hi.is> Hello again, I been into number of problem today while testing 2.3.7 version. 1. After re-building the all package and clean installation now pbs_mom is reporting to the server with all the firewall turned off. 2. Jobs can be submitted to the frontnode - when i submit job it gives you jobid and etc as usual. For example : I tested with echo "sleep 30" | qsub ---this submits the job and watch qsub can be seen the job activities. - that okay and normal. 3. then i wrote very simple script and submit the job it does submitted but it does not download the output. It looks job has been submitted- finished but don't know what happened?
mom_log shows this - pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server pbs_mom;Req;;Type CopyFiles request received from PBS_Server at p34.test.local, sock=12 pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=12 pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file 'bhairab.rhi.hi.is:/test/anil/test.sh.o62' pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501 gid: 501 homedir: '/test/anil' pbs_mom;n/a;mom_close_poll;entered pbs_mom;Svr;mom_get_sample;proc_array load started pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126 pbs_mom;Svr;mom_get_sample;proc_array load started pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126 pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0 pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server I am not sure if this is issue with this version or just the configuration error. In the past I did the similar way with 2.1.7 and have no problem. Can someone point me to right direction to fix this installation or configuration. 4. I joined headnode into our ldap server so users can login to headnode with their own username passowrd. users home directory are in different servers which are mounted at the time they login so users get their own home drive space. But when the users submit thier jobs does users home dirve and accounts also be created in all the nodes ? or how is the general practice ? Finally, last time I build the packages it also created the torque pam package but i do not see in this version by default. 
I would be grateful if someone point me to torque pam pakacge link. Some input - help, suggesstion would be great apologies for long Regards, A Anil Thapa wrote: > Hi all, > > I just build the new version and installed. I am having a small > problem. Everything looks well however when job submitted it always > stays in R mode and rest other jobs are in Q mode when i do qstat. I > looked at the node mom_logs it has loads of this error: > > 07/09/2009 16:18:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection > refused (111) in scan_for_exiting, cannot bind to port 1023 in > client_to_svr - connection refused > > Note: submitted job is very simple script. and for the testing purpose > firewall has temporarily turned off. > > Has anyone came up with this error. Any tips, help and suggestion would > be appreciated. > > Regards, > \A > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From gus at ldeo.columbia.edu Thu Jul 9 18:51:13 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 09 Jul 2009 20:51:13 -0400 Subject: [torqueusers] Does "qsub -p" set the job priority? Message-ID: <4A569081.7010501@ldeo.columbia.edu> Dear Torque and Maui experts Can anybody clarify if "qsub -p" is effective? I.e., does it set the job priority when one is using Torque AND Maui? In this case, how the Torque "qsub -p" priority range of -1024 to +1023 translates into a Maui job priority number? Or is the user supposed to change the job priority with "setspri", after the job is queued with "qsub"? I couldn't find any clarification of this matter on the qsub man pages, or on the Maui and Torque Administrator Guides. My simple experiments with a bunch of jobs suggested that "qsub -p" doesn't work, doesn't really set the job priority. (They ran in the order they were submitted, not according to the priority). 
However, I may be missing some fundamental and undocumented feature that only the savvy insiders know about. :) I found this 2006 thread on the very same problem, and I wonder if the issue was resolved: http://www.supercluster.org/pipermail/mauiusers/2006-November/002446.html I appreciate any help. Many thanks, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From prakash.velayutham at cchmc.org Thu Jul 9 20:04:25 2009 From: prakash.velayutham at cchmc.org (Prakash Velayutham) Date: Thu, 9 Jul 2009 22:04:25 -0400 Subject: [torqueusers] Torque 2.3.7 and connection Refused In-Reply-To: <4A568BD6.1020500@hi.is> References: <4A561A79.4050608@hi.is> <4A568BD6.1020500@hi.is> Message-ID: <66F84E25-850B-4EAF-BDA8-CC42E8D1B6F5@cchmc.org> Hi, On Jul 9, 2009, at 8:31 PM, Anil Thapa wrote: > Hello again, > > I been into number of problem today while testing 2.3.7 version. > > 1. After re-building the all package and clean installation now > pbs_mom > is reporting to the server with all the firewall turned off. > 2. Jobs can be submitted to the frontnode - when i submit job it gives > you jobid and etc as usual. For example : I tested with echo "sleep > 30" > | qsub ---this submits the job and watch qsub can be seen the job > activities. - that okay and normal. > 3. then i wrote very simple script and submit the job it does > submitted > but it does not download the output. It looks job has been submitted- > finished but don't know what happened?
mom_log shows this - > > pbs_mom;Req;dis_request_read;decoding command CopyFiles from > PBS_Server > pbs_mom;Req;;Type CopyFiles request received from > PBS_Server at p34.test.local, sock=12 > pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=12 > pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file > 'bhairab.rhi.hi.is:/test/anil/test.sh.o62' > pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501 gid: 501 > homedir: '/test/anil' > pbs_mom;n/a;mom_close_poll;entered > pbs_mom;Svr;mom_get_sample;proc_array load started > pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126 > pbs_mom;Svr;mom_get_sample;proc_array load started > pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126 > pbs_mom;n/a;cput_sum;proc_array loop start - jobid = > 62.bhairab.rhi.hi.is > pbs_mom;n/a;mem_sum;proc_array loop start - jobid = > 62.bhairab.rhi.hi.is > pbs_mom;n/a;resi_sum;proc_array loop start - jobid = > 62.bhairab.rhi.hi.is > pbs_mom;n/a;cput_sum;proc_array loop start - jobid = > 61.bhairab.rhi.hi.is > pbs_mom;n/a;mem_sum;proc_array loop start - jobid = > 61.bhairab.rhi.hi.is > pbs_mom;n/a;resi_sum;proc_array loop start - jobid = > 61.bhairab.rhi.hi.is > pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0 > pbs_mom;Req;dis_request_read;decoding command DeleteJob from > PBS_Server I think this means your job completed successfully. Not sure what the job is and what it is supposed to do. > > I am not sure if this is issue with this version or just the > configuration error. In the past I did the similar way with 2.1.7 and > have no problem. Can someone point me to right direction to fix this > installation or configuration. > > 4. I joined headnode into our ldap server so users can login to > headnode > with their own username passowrd. users home directory are in > different > servers which are mounted at the time they login so users get their > own > home drive space. 
But when the users submit thier jobs does users home > dirve and accounts also be created in all the nodes ? or how is the > general practice ? I guess you mean LDAP for name services and authentication and automounted home directories, right? If yes, is LDAP and automounter configured correctly on the head node and all the compute nodes and is automounter daemon running on the head node and all the compute nodes? If LDAP is not configured correctly on the compute nodes, user jobs won't even start as the compute nodes won't know anything about the user. And if automounter is not running, compute nodes will not run a job as a shell cannot be started. > > Finally, last time I build the packages it also created the torque pam > package but i do not see in this version by default. I would be > grateful > if someone point me to torque pam pakacge link. Not sure about 2.3.7, but 2.3.6 has it. > > Some input - help, suggesstion would be great > > apologies for long > > Regards, > A Good luck, Prakash > Anil Thapa wrote: >> Hi all, >> >> I just build the new version and installed. I am having a small >> problem. Everything looks well however when job submitted it always >> stays in R mode and rest other jobs are in Q mode when i do qstat. I >> looked at the node mom_logs it has loads of this error: >> >> 07/09/2009 16:18:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection >> refused (111) in scan_for_exiting, cannot bind to port 1023 in >> client_to_svr - connection refused >> >> Note: submitted job is very simple script. and for the testing >> purpose >> firewall has temporarily turned off. >> >> Has anyone came up with this error. Any tips, help and suggestion >> would >> be appreciated. 
> >> >> Regards, >> \A >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From prakash.velayutham at cchmc.org Fri Jul 10 09:38:35 2009 From: prakash.velayutham at cchmc.org (Prakash Velayutham) Date: Fri, 10 Jul 2009 11:38:35 -0400 Subject: [torqueusers] Torque 2.3.7 and connection Refused In-Reply-To: <4A575CAD.7020505@hi.is> References: <4A561A79.4050608@hi.is> <4A568BD6.1020500@hi.is> <66F84E25-850B-4EAF-BDA8-CC42E8D1B6F5@cchmc.org> <4A570324.5040809@hi.is> <5273CC80-545E-4EBB-9626-D0EC68F44EE7@cchmc.org> <4A575CAD.7020505@hi.is> Message-ID: Hi, Please copy the list in your emails as this might help others looking for the same answers. On Jul 10, 2009, at 11:22 AM, Anil Thapa wrote: > Hello > > Thanks for your help. I actually removed all of torque from the server > and client, then rebuilt with ./configure --with-scp for both head > node and client. Job submission was working fine but the jobs' output was > always in the /var/spool/torque/undelivered directory. Then I followed > 6.1.5 - Enabling Bi-Directional SCP Access from cluster resources "http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml > ". I created identical users locally on both the head node and > the compute node with the same uid. Then it worked as it is supposed to. Jobs > are sent to the compute node and results come back to the user's /home/user > directory. At least this looks like it is working, but this is not an ideal way. > > It would be ideal if I don't have to create a user and its home directory > for every user. I was thinking of adding every compute node to the LDAP > server and exporting the home directories as on my head node. What is your > thought on this?
It is supposed to work that way if LDAP and automounter are configured correctly and your home directory server's export list allows head node and compute nodes to mount the relevant directory with the required permissions. > > However still user have to ssh-keygen -t rsa bi-directionally in > order to compute node could send the result back (or are there any > better option) ? Not needed. Please see usecp directive in PBS Mom's configuration file. As the home directories are uniform and automounted across the whole cluster, MOM needs to just cp the output and error files instead of using any kind of remote copy. > > Thanks and have a good weekend. > > A Prakash > Prakash Velayutham wrote: >> Hi, >> >> On Jul 10, 2009, at 5:00 AM, Anil Thapa wrote: >> >>> Hello ! >>> >>> Thanks for your response >>> >>> Prakash Velayutham wrote: >>>> Hi, >>>> >>>> On Jul 9, 2009, at 8:31 PM, Anil Thapa wrote: >>>> >>>>> Hello again, >>>>> >>>>> I been into number of problem today while testing 2.3.7 version. >>>>> >>>>> 1. After re-building the all package and clean installation now >>>>> pbs_mom >>>>> is reporting to the server with all the firewall turned off. >>>>> 2. Jobs can be submitted to the frontnode - when i submit job it >>>>> gives >>>>> you jobid and etc as usual. For example : I tested with echo >>>>> "sleep 30" >>>>> | qsub ---this submits the job and watch qsub can be seen the job >>>>> activities. - that okay and normal. >>>>> 3. then i wrote very simple script and submit the job it does >>>>> submitted >>>>> but it does not download the output. It looks job has been >>>>> submitted- >>>>> finished but don't know what happened?
mom_log shows this - >>>>> >>>>> pbs_mom;Req;dis_request_read;decoding command CopyFiles from >>>>> PBS_Server >>>>> pbs_mom;Req;;Type CopyFiles request received from >>>>> PBS_Server at p34.test.local, sock=12 >>>>> pbs_mom;Job;dispatch_request;dispatching request CopyFiles on >>>>> sd=12 >>>>> pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file >>>>> 'bhairab.rhi.hi.is:/test/anil/test.sh.o62' >>>>> pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501 gid: >>>>> 501 >>>>> homedir: '/test/anil' >>>>> pbs_mom;n/a;mom_close_poll;entered >>>>> pbs_mom;Svr;mom_get_sample;proc_array load started >>>>> pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126 >>>>> pbs_mom;Svr;mom_get_sample;proc_array load started >>>>> pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126 >>>>> pbs_mom;n/a;cput_sum;proc_array loop start - jobid = >>>>> 62.bhairab.rhi.hi.is >>>>> pbs_mom;n/a;mem_sum;proc_array loop start - jobid = >>>>> 62.bhairab.rhi.hi.is >>>>> pbs_mom;n/a;resi_sum;proc_array loop start - jobid = >>>>> 62.bhairab.rhi.hi.is >>>>> pbs_mom;n/a;cput_sum;proc_array loop start - jobid = >>>>> 61.bhairab.rhi.hi.is >>>>> pbs_mom;n/a;mem_sum;proc_array loop start - jobid = >>>>> 61.bhairab.rhi.hi.is >>>>> pbs_mom;n/a;resi_sum;proc_array loop start - jobid = >>>>> 61.bhairab.rhi.hi.is >>>>> pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0 >>>>> pbs_mom;Req;dis_request_read;decoding command DeleteJob from >>>>> PBS_Server >>>> >>>> I think this means your job completed successfully. Not sure what >>>> the job is and what it is supposed to do. >>>> >>> Okay!, the job was just to output the hostname. Normally, when job >>> finished it dowload the output in file with joid.e01 or similar in >>> users home directory. >>>>> >>>>> I am not sure if this is issue with this version or just the >>>>> configuration error. In the past I did the similar way with >>>>> 2.1.7 and >>>>> have no problem. 
Can someone point me to right direction to fix >>>>> this >>>>> installation or configuration. >>>>> >>>>> 4. I joined headnode into our ldap server so users can login to >>>>> headnode >>>>> with their own username passowrd. users home directory are in >>>>> different >>>>> servers which are mounted at the time they login so users get >>>>> their own >>>>> home drive space. But when the users submit thier jobs does >>>>> users home >>>>> dirve and accounts also be created in all the nodes ? or how is >>>>> the >>>>> general practice ? >>>> >>>> I guess you mean LDAP for name services and authentication and >>>> automounted home directories, right? If yes, is LDAP and >>>> automounter configured correctly on the head node and all the >>>> compute nodes and is automounter daemon running on the head node >>>> and all the compute nodes? >>>> >>>> If LDAP is not configured correctly on the compute nodes, user >>>> jobs won't even start as the compute nodes won't know anything >>>> about the user. And if automounter is not running, compute nodes >>>> will not run a job as a shell cannot be started. >>>> >>> Yes this what I am talking about. In the head node home >>> directories are auto mounted and users can get their home >>> directories in head node. Isn't a similar configuration LDAP >>> configuration for nodes ? >> >> That depends. There are 2 things going on here. >> Name services are configured in /etc/nsswitch.conf. If LDAP is told >> as an option in that file for authentication (shadow and passwd) >> and for automounts (automount), then LDAP will be used. >> Next, LDAP configuration (for system-level authentication) >> generally is in /etc/ldap.conf (could be somewhere else in your >> distribution). So, if that file is common in the head node and the >> compute nodes, sure they should all behave the same way. >> >>>>> >>>>> Finally, last time I build the packages it also created the >>>>> torque pam >>>>> package but i do not see in this version by default.
I would be >>>>> grateful >>>>> if someone point me to torque pam pakacge link. >>>> >>>> Not sure about 2.3.7, but 2.3.6 has it. >>>> >>>>> >>> what this pam pakage do actually ! >> >> It is supposed to let you control who can gain access to the >> compute nodes through services other than Torque (like SSH). You >> would not want users to be able to SSH into compute nodes and start >> any process outside of Torque. But you would want to enable users >> to be able to SSH to a compute node where they already have a job >> running. That is what pam_pbssimpleauth module lets you configure. >> >>> >>> Thanks >>> A >>>> >>>>> Anil Thapa wrote: >>>>>> Hi all, >>>>>> >>>>>> I just build the new version and installed. I am having a small >>>>>> problem. Everything looks well however when job submitted it >>>>>> always >>>>>> stays in R mode and rest other jobs are in Q mode when i do >>>>>> qstat. I >>>>>> looked at the node mom_logs it has loads of this error: >>>>>> >>>>>> 07/09/2009 16:18:10;0001; >>>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection >>>>>> refused (111) in scan_for_exiting, cannot bind to port 1023 in >>>>>> client_to_svr - connection refused >>>>>> >>>>>> Note: submitted job is very simple script. and for the testing >>>>>> purpose >>>>>> firewall has temporarily turned off. >>>>>> >>>>>> Has anyone came up with this error. Any tips, help and >>>>>> suggestion would >>>>>> be appreciated. >>>>>> >>>>>> Regards, >>>>>> \A >> >> Hope that helps, >> Prakash > From abhig at Princeton.EDU Fri Jul 10 10:36:29 2009 From: abhig at Princeton.EDU (Abhishek Gupta) Date: Fri, 10 Jul 2009 12:36:29 -0400 Subject: [torqueusers] rcp to cp Message-ID: <4A576E0D.7050404@princeton.edu> Hi all, I was wondering if someone could tell me where to create/modify "config" file to add "usecp"? On server or nodes? Also in some of the previous posts, I noticed some people were talking about $pbsserver, $logevent. I found such entries in config file on nodes. 
Does that mean, I need to add usecp on nodes? If there is anything else I need to add other that "usecp", please let me know that too. Thanks, Abhi. From jdsmit at sandia.gov Fri Jul 10 10:41:28 2009 From: jdsmit at sandia.gov (Jerry Smith) Date: Fri, 10 Jul 2009 10:41:28 -0600 Subject: [torqueusers] rcp to cp In-Reply-To: <4A576E0D.7050404@princeton.edu> References: <4A576E0D.7050404@princeton.edu> Message-ID: <4A576F38.6060202@sandia.gov> Abhi, Here is an example from /var/spool/pbs/mom_priv/config (residing on all compute nodes) [root at an1 ~]# cat /var/spool/pbs/mom_priv/config $logevent 0x1ff $pbsserver tbird-admin2 ##pbs server name $remote_reconfig 1 ## allow momctl -r $node_check_script /var/spool/pbs/mom_priv/node-health ## health check scripts $node_check_interval 30 $status_update_time 90 $down_on_error 1 ## if healthcheck fails mark the node down, and let the scheduler know $usecp *:/home /home ## treat filesystem as a local copy no need for rcp/scp $usecp *:/projects /projects ## treat filesystem as a local copy no need for rcp/scp $usecp *:/apps /apps ## treat filesystem as a local copy no need for rcp/scp --Jerry Abhishek Gupta wrote: > Hi all, > I was wondering if someone could tell me where to create/modify "config" > file to add "usecp"? On server or nodes? > Also in some of the previous posts, I noticed some people were talking > about $pbsserver, $logevent. I found such entries in config file on > nodes. Does that mean, I need to add usecp on nodes? If there is > anything else I need to add other that "usecp", please let me know that too. > Thanks, > Abhi. 
> _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > From prakash.velayutham at cchmc.org Fri Jul 10 10:53:23 2009 From: prakash.velayutham at cchmc.org (Prakash Velayutham) Date: Fri, 10 Jul 2009 12:53:23 -0400 Subject: [torqueusers] rcp to cp In-Reply-To: <4A576E0D.7050404@princeton.edu> References: <4A576E0D.7050404@princeton.edu> Message-ID: <929730FE-B1C5-49B3-A0AC-9D7664E24DB5@cchmc.org> On Jul 10, 2009, at 12:36 PM, Abhishek Gupta wrote: > Hi all, > I was wondering if someone could tell me where to create/modify > "config" > file to add "usecp"? On server or nodes? > Also in some of the previous posts, I noticed some people were talking > about $pbsserver, $logevent. I found such entries in config file on > nodes. Does that mean, I need to add usecp on nodes? If there is > anything else I need to add other that "usecp", please let me know > that too. > Thanks, > Abhi. Hi, $usecp goes in the MOM config. What else can go in that config file depends on what else you need. Like, $loglevel, $restricted, $remote_reconfig etc. Prakash From jbernstein at penguincomputing.com Fri Jul 10 12:07:08 2009 From: jbernstein at penguincomputing.com (Joshua Bernstein) Date: Fri, 10 Jul 2009 11:07:08 -0700 Subject: [torqueusers] accounting In-Reply-To: <1245864625.13550.302.camel@aeolis> References: <4A412858.4010900@nada.kth.se> <1245784949.20566.10.camel@browncoat.jics.utk.edu> <4A4176FE.1040604@ldeo.columbia.edu> <1245864625.13550.302.camel@aeolis> Message-ID: <4A57834C.5060101@penguincomputing.com> Naveed Near-Ansari wrote: > I think if you specify in torque.cfg to use a qsub filter it will use it > even when submitting through qsub. 
> > torque.cfg: > > SUBMITFILTER /opt/torque/bin/qsub_filter > > I have not attempted exiting on certain conditions using this since i > only use it to add in default wallclock times when not set by the user, > but presumably you could do your checking here. > If you want to just set a default wallclock time isn't it easier to just add it as a parameter in qmgr rather then having an entire wrapper just for that? For example, to set the default wallclock time for the batch queue, use: # qmgr -c "set queue batch resources_default.walltime=3600" Note the time here is in seconds. -Joshua Bernstein Software Engineer Penguin Computing From charles.johnson at accre.vanderbilt.edu Fri Jul 10 12:38:33 2009 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Fri, 10 Jul 2009 13:38:33 -0500 Subject: [torqueusers] From deferred to idle and back Message-ID: <7A526994-6835-4D22-A348-97F6B876C3BF@accre.vanderbilt.edu> We have several multi-processor jobs that will not start. showq -b shows them as deferred; later showq -i will show them as idle; then they will be deferred and so forth. Checkjob -v shows messages similar to these: Message[0] 9 nodes unavailable to start reserved job after 63 seconds (reserved node vmp089 is in state 'Running' - check node) Message[1] 9 nodes unavailable to start reserved job after 63 seconds (reserved node vmp090 is in state 'Running' - check node) Message[2] 9 nodes unavailable to start reserved job after 63 seconds (reserved node vmp066 is in state 'Running' - check node) Message[3] 10 nodes unavailable to start reserved job after 63 seconds (reserved node vmp069 is in state 'Running' - check node) I haven't found anything revealing in the log files, but I am not exactly sure what to look for. The identified nodes have jobs running on them, but there are free processors. We use torque 2.3.6, and moab 5.3.2 (revision 12709) I would appreciate any suggestions. 
Cheers-- Charles --- Charles Johnson Advanced Computing Center for Research and Education Vanderbilt University Office: 615-343-2776 Cell: 615-478-8799 From Gareth.Williams at csiro.au Sun Jul 12 01:16:03 2009 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Sun, 12 Jul 2009 17:16:03 +1000 Subject: [torqueusers] Does "qsub -p" set the job priority? In-Reply-To: <4A569081.7010501@ldeo.columbia.edu> References: <4A569081.7010501@ldeo.columbia.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C625C13A7BCDD@exvic-mbx04.nexus.csiro.au> > From: Gus Correa [mailto:gus at ldeo.columbia.edu] -snip- > Can anybody clarify if "qsub -p" is effective? > > I.e., does it set the job priority when one is using Torque AND Maui? > > In this case, how the Torque "qsub -p" priority range of -1024 to +1023 > translates into a Maui job priority number? > > Or is the user supposed to change the job priority with "setspri", > after the job is queued with "qsub"? Hi Gus, My understanding is that at least with moab it does work - but only for negative values. So you can lower your own priority but not raise it. Makes a certain perverse sense. You would need to couple this with maiu/moab settings to activate it and set a weighting. Found the doco. SERV USERPRIO http://www.clusterresources.com/products/mwm/docs/5.1.2priorityfactors.shtml#userprio but it looks like moab only - might work but be undocumented in maui. I/we decided it wasn't useful enough to turn on... but still see a need for users to be able to set priority at least within their own jobs. I would appreciate any pointers if you make progress. Cheers, Gareth From brockp at umich.edu Mon Jul 13 09:18:53 2009 From: brockp at umich.edu (Brock Palen) Date: Mon, 13 Jul 2009 11:18:53 -0400 Subject: [torqueusers] Limiting number of running jobs user selectable Message-ID: Is there an easy way to limit the number of running job for a user, where the user can choose? 
For example user wants to submit 100 jobs, but never have more than 20 running at a time (politics) but has the need some times to go to all the available nodes. I was looking at the depend flags, and while you could make the normal graph using after or before, is there a way to say "job only if running less than count" ? Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 From akohlmey at cmm.chem.upenn.edu Mon Jul 13 09:37:02 2009 From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer) Date: Mon, 13 Jul 2009 11:37:02 -0400 Subject: [torqueusers] Limiting number of running jobs user selectable In-Reply-To: References: Message-ID: <1247499422.2840.73.camel@zero> On Mon, 2009-07-13 at 11:18 -0400, Brock Palen wrote: > Is there an easy way to limit the number of running job for a user, > where the user can choose? not that i know of. i usually recommend to "package" jobs. i.e. have multiple backgrounded mpirun/mpiexec calls on subsets of the machine. with libtorque this should distribute the jobs automatically, IIRC. we mostly use this trick on cray XT machines, though, where this helps to get large projects into the proper job scheduling class... and then put a "wait" into the submit script to wait for completion of the backgrounded chunk and then just have a regular job dependency chain on those jobs. cheers, axel. > For example user wants to submit 100 jobs, but never have more than 20 > running at a time (politics) but has the need some times to go to all > the available nodes. > > I was looking at the depend flags, and while you could make the normal > graph using after or before, is there a way to say "job only if > running less than count" ? 
> > Brock Palen > www.umich.edu/~brockp > Center for Advanced Computing > brockp at umich.edu > (734)936-1985 > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- ======================================================================= Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu Center for Molecular Modeling -- University of Pennsylvania Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323 tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425 ======================================================================= If you make something idiot-proof, the universe creates a better idiot. From abhig at Princeton.EDU Mon Jul 13 13:55:59 2009 From: abhig at Princeton.EDU (abhig at Princeton.EDU) Date: Mon, 13 Jul 2009 15:55:59 -0400 Subject: [torqueusers] pbs job running slow Message-ID: <2452.1247514959@princeton.edu> Hi All, The jobs with PBS are running very slowly. Earlier I was thinking that it might be because of rcp so I tried configuring a node with usecp. The log file shows that it read the usecp parameter but the speed is still slow. Can anyone suggest anything? The same job running interactively takes almost half the time. Thanks, Abhi. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090713/f0ab57b2/attachment.html From prakash.velayutham at cchmc.org Mon Jul 13 14:06:22 2009 From: prakash.velayutham at cchmc.org (Prakash Velayutham) Date: Mon, 13 Jul 2009 16:06:22 -0400 Subject: [torqueusers] pbs job running slow In-Reply-To: <2452.1247514959@princeton.edu> References: <2452.1247514959@princeton.edu> Message-ID: <84FAAFBD-9C22-41B4-A9FF-CB608B8ADA7D@cchmc.org> What is the job? And what does your batch file look like?
The best way to debug would be to use "qsub -I" and see if interactive PBS job also runs slow compared to a direct SSH-based job (job being the same). Then go from there. Prakash On Jul 13, 2009, at 3:55 PM, abhig at Princeton.EDU wrote: > Hi All, > The jobs with PBS are running very slow. Earlier I was thinking that > it might be because of rcp so I tried configuring a node with usecp. > The log file shows that it read the usecp parameter but the speed is > still slow. Can someone suggest me anything? The same job running > interactively take almost half the time. > Thanks, > Abhi. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090713/2f97d364/attachment.html From abhig at Princeton.EDU Mon Jul 13 14:42:19 2009 From: abhig at Princeton.EDU (abhig at Princeton.EDU) Date: Mon, 13 Jul 2009 16:42:19 -0400 Subject: [torqueusers] pbs job running slow Message-ID: <3399.1247517739@princeton.edu> Hi Prakash, Thanks for your reply. I tried running job interactively and it seems that it is running normal. But ssh based job is running slow. Any further suggestion? Abhi. On Mon 13/07/09 4:06 PM , Prakash Velayutham prakash.velayutham at cchmc.org sent: What is the job? And what does your batch file look like? The best way to debug would be to use "qsub -I" and see if interactive PBS job also runs slow compared to a direct SSH-based job (job being the same). Then go from there. Prakash On Jul 13, 2009, at 3:55 PM, wrote: Hi All, The jobs with PBS are running very slow. Earlier I was thinking that it might be because of rcp so I tried configuring a node with usecp. The log file shows that it read the usecp parameter but the speed is still slow. Can someone suggest me anything? 
The same job running interactively take almost half the time. Thanks, Abhi. _______________________________________________ torqueusers mailing list http://www.supercluster.org/mailman/listinfo/torqueusers [3] Links: ------ [3] http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090713/3ccae08f/attachment-0001.html From naveed at caltech.edu Mon Jul 13 17:16:28 2009 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Mon, 13 Jul 2009 16:16:28 -0700 Subject: [torqueusers] accounting In-Reply-To: <4A57834C.5060101@penguincomputing.com> References: <4A412858.4010900@nada.kth.se> <1245784949.20566.10.camel@browncoat.jics.utk.edu> <4A4176FE.1040604@ldeo.columbia.edu> <1245864625.13550.302.camel@aeolis> <4A57834C.5060101@penguincomputing.com> Message-ID: <1247526988.29557.7.camel@aeolis> On Fri, 2009-07-10 at 11:07 -0700, Joshua Bernstein wrote: > > Naveed Near-Ansari wrote: > > I think if you specify in torque.cfg to use a qsub filter it will use it > > even when submitting through qsub. > > > > torque.cfg: > > > > SUBMITFILTER /opt/torque/bin/qsub_filter > > > > I have not attempted exiting on certain conditions using this since i > > only use it to add in default wallclock times when not set by the user, > > but presumably you could do your checking here. > > > > If you want to just set a default wallclock time isn't it easier to just add it > as a parameter in qmgr rather then having an entire wrapper just for that? For > example, to set the default wallclock time for the batch queue, use: > > # qmgr -c "set queue batch resources_default.walltime=3600" > > Note the time here is in seconds. > > -Joshua Bernstein > Software Engineer > Penguin Computing > Yes, that would make alot more sense ;). 
I originally was going to do some other things there, so may have been blinded using this method for unnecessary additions also. Naveed From Gareth.Williams at csiro.au Mon Jul 13 17:16:45 2009 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Tue, 14 Jul 2009 09:16:45 +1000 Subject: [torqueusers] Limiting number of running jobs user selectable In-Reply-To: References: Message-ID: <007DECE986B47F4EABF823C1FBB19C625C13A7BCE5@exvic-mbx04.nexus.csiro.au> > Is there an easy way to limit the number of running job for a user, > where the user can choose? > > For example user wants to submit 100 jobs, but never have more than 20 > running at a time (politics) but has the need some times to go to all > the available nodes. > > I was looking at the depend flags, and while you could make the normal > graph using after or before, is there a way to say "job only if > running less than count" ? > > Brock Palen Hi Brock, I have a script for 'topping up' the number of queued/running jobs. I'll send it off-list. I'd be happy for it to go in the contrib section of the torque distribution. Who would I talk to about that? -- Gareth From prakash.velayutham at cchmc.org Mon Jul 13 20:29:55 2009 From: prakash.velayutham at cchmc.org (Prakash Velayutham) Date: Mon, 13 Jul 2009 22:29:55 -0400 Subject: [torqueusers] pbs job running slow In-Reply-To: <3399.1247517739@princeton.edu> References: <3399.1247517739@princeton.edu> Message-ID: <4241CB28-8311-49E4-9636-B863DA13FA83@cchmc.org> Send your job commands when you run them interactively and also send the contents of your batch file. Prakash On Jul 13, 2009, at 4:42 PM, abhig at Princeton.EDU wrote: > Hi Prakash, > Thanks for your reply. I tried running job interactively and it > seems that it is running normal. But ssh based job is running slow. > Any further suggestion? > Abhi. > > > > On Mon 13/07/09 4:06 PM , Prakash Velayutham prakash.velayutham at cchmc.org > sent: > What is the job? 
And what does your batch file look like?
>
> The best way to debug would be to use "qsub -I" and see if
> interactive PBS job also runs slow compared to a direct SSH-based
> job (job being the same). Then go from there.
>
> Prakash
>
> On Jul 13, 2009, at 3:55 PM, abhig at Princeton.EDU wrote:
>
>> Hi All,
>> The jobs with PBS are running very slow. Earlier I was thinking
>> that it might be because of rcp so I tried configuring a node with
>> usecp. The log file shows that it read the usecp parameter but the
>> speed is still slow. Can someone suggest me anything? The same job
>> running interactively take almost half the time.
>> Thanks,
>> Abhi.
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090713/2558345a/attachment.html

From abhig at Princeton.EDU Mon Jul 13 21:15:05 2009
From: abhig at Princeton.EDU (Abhishek Gupta)
Date: Mon, 13 Jul 2009 23:15:05 -0400
Subject: [torqueusers] pbs job running slow
In-Reply-To: <4241CB28-8311-49E4-9636-B863DA13FA83@cchmc.org>
References: <3399.1247517739@princeton.edu> <4241CB28-8311-49E4-9636-B863DA13FA83@cchmc.org>
Message-ID: <4A5BF839.70302@princeton.edu>

Here is the script we are running. This is basically a test script of
the actual script to test the performance.

#!/bin/bash
#PBS -j oe
#PBS -o qsub1.out
#PBS -m ae
###PBS -M abhig at princeton.edu
#PBS -l cput=36:0:0
###PBS -l mem=1GB
#PBS -l nodes=node022

date
echo "start"
export PATH=.:$PATH

source /mnt/dayabay/trunk-source
cd /mnt/dayabay/project/speedtest

time nuwa.py -l4 -n100 -H 345788 -o test1.root -m FullChain > test1.log

echo "finish"
date


The command I run during interactively were
# qsub -I run.sh

This took us to node022 and then
# run.sh
run.sh is the name of the script I wrote above.
When job was finished, I run exit. Abhi. Prakash Velayutham wrote: > Send your job commands when you run them interactively and also send > the contents of your batch file. > > Prakash > > On Jul 13, 2009, at 4:42 PM, abhig at Princeton.EDU > wrote: > >> Hi Prakash, >> Thanks for your reply. I tried running job interactively and it seems >> that it is running normal. But ssh based job is running slow. Any >> further suggestion? >> Abhi. >> >> >> >> On Mon 13/07/09 4:06 PM , Prakash Velayutham >> prakash.velayutham at cchmc.org sent: >> >> What is the job? And what does your batch file look like? >> >> The best way to debug would be to use "qsub -I" and see if >> interactive PBS job also runs slow compared to a direct SSH-based >> job (job being the same). Then go from there. >> >> Prakash >> >> On Jul 13, 2009, at 3:55 PM, abhig at Princeton.EDU >> wrote: >> >>> Hi All, >>> The jobs with PBS are running very slow. Earlier I was thinking >>> that it might be because of rcp so I tried configuring a node >>> with usecp. The log file shows that it read the usecp parameter >>> but the speed is still slow. Can someone suggest me anything? >>> The same job running interactively take almost half the time. >>> Thanks, >>> Abhi. >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090713/0970ae81/attachment.html From prakash.velayutham at cchmc.org Mon Jul 13 21:35:57 2009 From: prakash.velayutham at cchmc.org (Prakash Velayutham) Date: Mon, 13 Jul 2009 23:35:57 -0400 Subject: [torqueusers] pbs job running slow In-Reply-To: <4A5BF839.70302@princeton.edu> References: <3399.1247517739@princeton.edu> <4241CB28-8311-49E4-9636-B863DA13FA83@cchmc.org> <4A5BF839.70302@princeton.edu> Message-ID: Hi, I am not sure why you are seeing this difference. Without the actual code (nuwa.py) I can't think of a way to reproduce or debug it. I can confirm that the Torque scheduling happens as expected, so nothing wrong with the PBS directives themselves. Sorry. Prakash On Jul 13, 2009, at 11:15 PM, Abhishek Gupta wrote: > Here is the script we are running. This is basically a test script > of the actual script to test the performance. > > #!/bin/bash > #PBS -j oe > #PBS -o qsub1.out > #PBS -m ae > ###PBS -M abhig at princeton.edu > #PBS -l cput=36:0:0 > ###PBS -l mem=1GB > #PBS -l nodes=node022 > > date > echo "start" > export PATH=.:$PATH > > source /mnt/dayabay/trunk-source > cd /mnt/dayabay/project/speedtest > > time nuwa.py -l4 -n100 -H 345788 -o test1.root -m FullChain > > test1.log > > echo "finish" > date > > > The command I run during interactively were > # qsub -I run.sh > > This took us to node022 and then > # run.sh > run.sh is the name of the script I wrote above. > > When job was finished, I run exit. > > Abhi. > > Prakash Velayutham wrote: >> >> Send your job commands when you run them interactively and also >> send the contents of your batch file. >> >> Prakash >> >> On Jul 13, 2009, at 4:42 PM, abhig at Princeton.EDU wrote: >> >>> Hi Prakash, >>> Thanks for your reply. I tried running job interactively and it >>> seems that it is running normal. But ssh based job is running >>> slow. Any further suggestion? >>> Abhi. 
>>> >>> >>> >>> On Mon 13/07/09 4:06 PM , Prakash Velayutham prakash.velayutham at cchmc.org >>> sent: >>> What is the job? And what does your batch file look like? >>> >>> The best way to debug would be to use "qsub -I" and see if >>> interactive PBS job also runs slow compared to a direct SSH-based >>> job (job being the same). Then go from there. >>> >>> Prakash >>> >>> On Jul 13, 2009, at 3:55 PM, abhig at Princeton.EDU wrote: >>> >>>> Hi All, >>>> The jobs with PBS are running very slow. Earlier I was thinking >>>> that it might be because of rcp so I tried configuring a node >>>> with usecp. The log file shows that it read the usecp parameter >>>> but the speed is still slow. Can someone suggest me anything? The >>>> same job running interactively take almost half the time. >>>> Thanks, >>>> Abhi. >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090713/2f01912d/attachment-0001.html From brockp at umich.edu Tue Jul 14 07:42:52 2009 From: brockp at umich.edu (Brock Palen) Date: Tue, 14 Jul 2009 09:42:52 -0400 Subject: [torqueusers] Limiting number of running jobs user selectable In-Reply-To: <007DECE986B47F4EABF823C1FBB19C625C13A7BCE5@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C625C13A7BCE5@exvic-mbx04.nexus.csiro.au> Message-ID: <25E6472F-B5DE-4E53-B6EC-C14CAF451FCE@umich.edu> >> > > Hi Brock, > > I have a script for 'topping up' the number of queued/running jobs. > I'll send it off-list. > > I'd be happy for it to go in the contrib section of the torque > distribution. Who would I talk to about that? I am interested, Josh or Glen would be a good start, they both should be lurking here some place. 
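
Gareth's script was sent off-list and is not reproduced in this thread. Purely as an illustration of what "topping up" could look like, here is a minimal shell sketch. It assumes that `qstat -u USER` prints one line per job beginning with a numeric job id, and that the pending job scripts are listed one per line in a plain text file. The function names `count_active` and `top_up` and the list file are hypothetical; they are not taken from Gareth's script or from TORQUE.

```shell
# Sketch only: keep at most MAX of a user's jobs in the system by
# submitting more scripts from a pending list whenever there is room.

# Count the jobs qstat currently lists for user $1.
# Assumption: job lines start with a numeric job id; header lines do not.
count_active() {
    qstat -u "$1" 2>/dev/null | grep -c '^[0-9]'
}

# top_up USER LISTFILE MAX
# Submit scripts from LISTFILE until USER has MAX jobs listed, then
# remove the submitted entries so the next pass continues after them.
# Note: a script whose qsub fails is still dropped from the list.
top_up() {
    local user=$1
    local list=$2
    local max=$3
    local active free

    [ -s "$list" ] || return 0              # nothing left to submit
    active=$(count_active "$user")
    free=$(( max - active ))
    [ "$free" -gt 0 ] || return 0           # already at the limit

    head -n "$free" "$list" | while IFS= read -r script; do
        qsub "$script"
    done
    tail -n +"$(( free + 1 ))" "$list" > "$list.tmp" && mv "$list.tmp" "$list"
}
```

Run periodically (for example from cron) as `top_up "$USER" "$HOME/pending.txt" 20`, this would keep roughly 20 of the user's jobs queued or running; raising the last argument lets the user spread over more nodes when that is acceptable. Because it counts every job qstat lists, it limits jobs in the system rather than running jobs only.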
>
>
> -- Gareth
>
>

From dmitri.chubarov at gmail.com Tue Jul 14 08:11:09 2009
From: dmitri.chubarov at gmail.com (Dmitri Chubarov)
Date: Tue, 14 Jul 2009 21:11:09 +0700
Subject: [torqueusers] Limiting number of running jobs user selectable
In-Reply-To: <25E6472F-B5DE-4E53-B6EC-C14CAF451FCE@umich.edu>
References: <007DECE986B47F4EABF823C1FBB19C625C13A7BCE5@exvic-mbx04.nexus.csiro.au> <25E6472F-B5DE-4E53-B6EC-C14CAF451FCE@umich.edu>
Message-ID:

Hello,

I just had a user who had to queue all the jobs she could submit
before taking a leave. Being a considerate user she asked to limit the
number of jobs running at a time.

What we did was to set up a dedicated queue for her with a limit on
the number of running jobs.

create queue irina
set queue irina queue_type = Execution
set queue irina max_running = 10
set queue irina resources_default.nodes = 1
set queue irina enabled = True
set queue irina started = True

I believe that could answer the problem that Brock put above:

> For example user wants to submit 100 jobs, but never have more than 20
> running at a time (politics) but has the need some times to go to all
> the available nodes.

Best regards,
Dima

From abhig at Princeton.EDU Tue Jul 14 14:10:26 2009
From: abhig at Princeton.EDU (Abhishek Gupta)
Date: Tue, 14 Jul 2009 16:10:26 -0400
Subject: [torqueusers] pbs job running slow
In-Reply-To:
References: <3399.1247517739@princeton.edu> <4241CB28-8311-49E4-9636-B863DA13FA83@cchmc.org> <4A5BF839.70302@princeton.edu>
Message-ID: <4A5CE632.5000205@princeton.edu>

Hi Prakash,
The nuwa.py script looks like this:

#!/usr/bin/env python
if '__main__' == __name__:
    from DybPython.Control import main
    nuwa = main()
    nuwa.run()
    nuwa.finalize()
    pass
#end

I believe it calls some other script too(FullChain.py i believe). I am
not sure since I am testing this script on someone's behalf.
Thanks,
Abhi.

Prakash Velayutham wrote:
> Hi,
>
> I am not sure why you are seeing this difference.
Without the actual > code (nuwa.py) I can't think of a way to reproduce or debug it. I can > confirm that the Torque scheduling happens as expected, so nothing > wrong with the PBS directives themselves. Sorry. > > Prakash > > On Jul 13, 2009, at 11:15 PM, Abhishek Gupta wrote: > >> Here is the script we are running. This is basically a test script of >> the actual script to test the performance. >> >> #!/bin/bash >> #PBS -j oe >> #PBS -o qsub1.out >> #PBS -m ae >> ###PBS -M abhig at princeton.edu >> #PBS -l cput=36:0:0 >> ###PBS -l mem=1GB >> #PBS -l nodes=node022 >> >> date >> echo "start" >> export PATH=.:$PATH >> >> source /mnt/dayabay/trunk-source >> cd /mnt/dayabay/project/speedtest >> >> time nuwa.py -l4 -n100 -H 345788 -o test1.root -m FullChain > test1.log >> >> echo "finish" >> date >> >> >> The command I run during interactively were >> # qsub -I run.sh >> >> This took us to node022 and then >> # run.sh >> run.sh is the name of the script I wrote above. >> >> When job was finished, I run exit. >> >> Abhi. >> >> Prakash Velayutham wrote: >>> Send your job commands when you run them interactively and also send >>> the contents of your batch file. >>> >>> Prakash >>> >>> On Jul 13, 2009, at 4:42 PM, abhig at Princeton.EDU >>> wrote: >>> >>>> Hi Prakash, >>>> Thanks for your reply. I tried running job interactively and it >>>> seems that it is running normal. But ssh based job is running slow. >>>> Any further suggestion? >>>> Abhi. >>>> >>>> >>>> >>>> On Mon 13/07/09 4:06 PM , Prakash Velayutham >>>> prakash.velayutham at cchmc.org >>>> sent: >>>> >>>> What is the job? And what does your batch file look like? >>>> >>>> The best way to debug would be to use "qsub -I" and see if >>>> interactive PBS job also runs slow compared to a direct >>>> SSH-based job (job being the same). Then go from there. >>>> >>>> Prakash >>>> >>>> On Jul 13, 2009, at 3:55 PM, abhig at Princeton.EDU >>>> wrote: >>>> >>>>> Hi All, >>>>> The jobs with PBS are running very slow. 
Earlier I was
>>>>> thinking that it might be because of rcp so I tried
>>>>> configuring a node with usecp. The log file shows that it read
>>>>> the usecp parameter but the speed is still slow. Can someone
>>>>> suggest me anything? The same job running interactively take
>>>>> almost half the time.
>>>>> Thanks,
>>>>> Abhi.
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>>
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090714/9a42fc20/attachment.html

From glen.beane at gmail.com Tue Jul 14 18:54:44 2009
From: glen.beane at gmail.com (Glen Beane)
Date: Tue, 14 Jul 2009 20:54:44 -0400
Subject: [torqueusers] Limiting number of running jobs user selectable
In-Reply-To: <25E6472F-B5DE-4E53-B6EC-C14CAF451FCE@umich.edu>
References: <007DECE986B47F4EABF823C1FBB19C625C13A7BCE5@exvic-mbx04.nexus.csiro.au> <25E6472F-B5DE-4E53-B6EC-C14CAF451FCE@umich.edu>
Message-ID: <5caae8c60907141754j618e1020sb177542f30617ee3@mail.gmail.com>

On Tue, Jul 14, 2009 at 9:42 AM, Brock Palen wrote:
>>>
>>
>> Hi Brock,
>>
>> I have a script for 'topping up' the number of queued/running jobs.
>> I'll send it off-list.
>>
>> I'd be happy for it to go in the contrib section of the torque
>> distribution. Who would I talk to about that?
>
> I am interested,
> Josh or Glen would be a good start, they both should be lurking here
> some place.
Gareth, send me the script and a README and I can probably add it to
the contrib directory in TORQUE

From anilth at hi.is Wed Jul 15 05:45:23 2009
From: anilth at hi.is (Anil Thapa)
Date: Wed, 15 Jul 2009 11:45:23 +0000
Subject: [torqueusers] Torque 2.3.7 and connection Refused
In-Reply-To: <4A5DA556.9060309@hi.is>
References: <4A561A79.4050608@hi.is> <4A568BD6.1020500@hi.is> <66F84E25-850B-4EAF-BDA8-CC42E8D1B6F5@cchmc.org> <4A570324.5040809@hi.is> <5273CC80-545E-4EBB-9626-D0EC68F44EE7@cchmc.org> <4A575CAD.7020505@hi.is> <4A5DA556.9060309@hi.is>
Message-ID: <4A5DC153.1040904@hi.is>

Anil Thapa wrote:
> Hello Prakash,
>
> I am running into some LDAP problem. I added both headnode and
> computer node into LDAP and home directories are also exported to
> both. When i login to headnode with username i get the right home
> directory (thats good and it is normal). When I submit the job with
> the LDAP user it does submitted and distributed to the compute node
> but nothing happens. Then mom_logs shows this :
>
>
> 10:38:54;0008; pbs_mom;Job;130.bhairab.rhi.hi.is;attempting to copy
> file 'bhairab.rhi.hi.is:/users/annad/peter/example.sh.o130'
> 07/14/2009 10:38:54;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0)
> in fork_to_user, cannot find user 'peter' in password file
> 07/14/2009 10:38:54;0080; pbs_mom;Req;req_reject;Reject reply
> code=15023(Bad UID for job execution REJHOST=305.rhi.hi.is MSG=cannot
> find user 'peter' in password file), aux=0, type=CopyFiles, from
> PBS_Server at bhairab.rhi.hi.is
> 07/14/2009 10:38:54;0001;
> pbs_mom;Svr;pbs_mom;LOG_ERROR::Inappropriate ioctl for device (25) in
> req_cpyfile, fork_to_user failed with rc=-15023 'cannot find user
> 'peter' in password file' - returning failure
> 07/14/2009 10:38:54;0080; pbs_mom;Req;dis_request_read;decoding
> command DeleteJob from PBS_Server
>
> Then one of the LDAP user submits the job but keep piling on
> /var/spool/torque/undelivered. Server logs doesn't say much.
My ldap
> conf looks like this:
>
> headnode: /etc/ldap.conf
> uri ldap://neptune.hi.is/ ldaps://satrun.rhi.hi.is/
> ssl on
> tls_cacertdir /etc/openldap/cacerts
>
> Headnode: /etc/openldap/ldap.conf
> uri ldap://neptune.hi.is/ ldaps://satrun.rhi.hi.is/
> ssl on
> tls_cacertdir /etc/openldap/cacerts
>
> Headnode: /etc/nsswitch.conf
> passwd: files ldap
> shadow: files ldap
> group: files ldap
>
> ethers: files
> netmasks: files
> networks: files
> protocols: files
> rpc: files
> services: files
> netgroup: files ldap
> publickey: nisplus
> automount: files ldap
> aliases: files nisplus
>
>
> These configuration are identical to compute node. Any hint or input
> would be great.
>
> Anil
>
>>> Hello
>>>
>>> Thanks for you help. I actually remove all the torque from server
>>> and client then rebuild with ./configure --with-scp for both head
>>> node and client. jobs submission was working fine but jobs were
>>> always in /var/spool/torque/undelivered directory. then I followed
>>> 6.1.5 - Enabling Bi-Directional SCP Access from cluster resources
>>> "http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml".
>>> I created the identical users locally in both head node and computer
>>> node with same uid. Then it worked as it supposed to. jobs are sent
>>> to compute node and result back to users /home/user directory. At
>>> least this looks working but this is not an ideal way.
>>>
>>> It would be ideal I don't have to create user and its home directory
>>> for every user. I was thinking adding every compute node to LDAP
>>> server and export their home directory as my head node. What is your
>>> thought in this.
>>
>> It is supposed to work that way if LDAP and automounter are
>> configured correctly and your home directory server's export list
>> allows head node and compute nodes to mount the relevant directory
>> with the required permissions.
>>
>>>
>>> However still user have to ssh-keygen -t rsa bi-directionally in
>>> order to compute node could send the result back (or are there any
>>> better option) ?
>>
>> Not needed. Please see
>> usecp
>> directive in PBS Mom's configuration file. As the home directories
>> are uniform and automounted across the whole cluster, MOM needs to
>> just cp the output and error files instead of using any kind of
>> remote copy.
>>
>>>
>>> Thanks and have a good weekend.
>>>
>>> A
>>
>> Prakash
>
>

From cholam20 at yahoo.co.in Thu Jul 16 03:39:11 2009
From: cholam20 at yahoo.co.in (revathi ganesh)
Date: Thu, 16 Jul 2009 15:09:11 +0530 (IST)
Subject: [torqueusers] Creation of home directories and automounting
Message-ID: <252640.60741.qm@web8401.mail.in.yahoo.com>

Hi all

I am a total newbie to clusters.

I have now successfully installed Rocks 5.1 and PBS roll for 5.1.

Now when I am submitting testpbs jobs I am getting the error pasted below.

The partitions in rocks front node is
[root at bithorax server_logs]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             15872604   4027148  11026144  27% /
/dev/sda5           1396245596   2986536 1321189868   1% /state/partition1
/dev/sda2              3968124    215056   3548240   6% /var
tmpfs                  2019372         0   2019372   0% /dev/shm
tmpfs                   986020      3456    982564   1% /var/lib/ganglia/rrds

My questions are:

1) where to create home directories?

2) how to automount and what to automount?

my /state/partition1 directory is 1.3 TB and want to use it for home
directories and applications and storing data..

please suggest and give the procedure for mounting..

regds
revathi

front-end shows:
07/16/2009 10:02:27;0100;PBS_Server;Job;50.bithorax.ccmb.res.in;enqueuing into default, state 1 hop 1
07/16/2009 10:02:27;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20090716 opened
07/16/2009 10:02:27;0008;PBS_Server;Job;50.bithorax.ccmb.res.in;Job Queued at request of hersh at bithorax.ccmb.res.in, owner = hersh at bithorax.ccmb.res.in, job name = testpbs, queue = default
07/16/2009 10:02:27;0040;PBS_Server;Svr;bithorax.ccmb.res.in;Scheduler sent command new
07/16/2009 10:02:29;0008;PBS_Server;Job;50.bithorax.ccmb.res.in;Job Modified at request of maui at bithorax.ccmb.res.in
07/16/2009 10:02:29;0008;PBS_Server;Job;50.bithorax.ccmb.res.in;Job Run at request of maui at bithorax.ccmb.res.in
07/16/2009 10:02:29;0010;PBS_Server;Job;50.bithorax.ccmb.res.in;Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
07/16/2009 10:02:29;000d;PBS_Server;Job;50.bithorax.ccmb.res.in;Post job file processing error; job 50.bithorax.ccmb.res.in on host compute-0-3/0
07/16/2009 10:02:29;0100;PBS_Server;Job;50.bithorax.ccmb.res.in;dequeuing from default, state COMPLETE
07/16/2009 10:02:29;0040;PBS_Server;Svr;bithorax.ccmb.res.in;Scheduler sent command term
07/16/2009 10:03:04;0100;PBS_Server;Job;51.bithorax.ccmb.res.in;enqueuing into default, state 1 hop 1
07/16/2009 10:03:04;0008;PBS_Server;Job;51.bithorax.ccmb.res.in;Job Queued at request of hersh at bithorax.ccmb.res.in, owner = hersh at bithorax.ccmb.res.in, job name = testpbs, queue = default
07/16/2009 10:03:04;0040;PBS_Server;Svr;bithorax.ccmb.res.in;Scheduler sent command new
07/16/2009 10:03:05;0008;PBS_Server;Job;51.bithorax.ccmb.res.in;Job Modified at request of maui at bithorax.ccmb.res.in
07/16/2009 10:03:05;0008;PBS_Server;Job;51.bithorax.ccmb.res.in;Job Run at request of maui at bithorax.ccmb.res.in
07/16/2009 10:03:05;0010;PBS_Server;Job;51.bithorax.ccmb.res.in;Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
07/16/2009 10:03:05;000d;PBS_Server;Job;51.bithorax.ccmb.res.in;Post job file processing error; job 51.bithorax.ccmb.res.in on host compute-0-3/0
07/16/2009 10:03:05;0100;PBS_Server;Job;51.bithorax.ccmb.res.in;dequeuing from default, state COMPLETE
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090716/28f4bdb3/attachment-0001.html

From garrick at usc.edu Thu Jul 16 10:51:24 2009
From: garrick at usc.edu (Garrick Staples)
Date: Thu, 16 Jul 2009 09:51:24 -0700
Subject: [torqueusers] Creation of home directories and automounting
In-Reply-To: <252640.60741.qm@web8401.mail.in.yahoo.com>
References: <252640.60741.qm@web8401.mail.in.yahoo.com>
Message-ID: <20090716165124.GD10246@polop.usc.edu>

On Thu, Jul 16, 2009 at 03:09:11PM +0530, revathi ganesh alleged:
> Hi all
>
> I am a total newbie to clusters.
>
> I have now successfully installed Rocks 5.1 and PBS roll for 5.1.
>
> Now when I am submitting testpbs jobs I am getting the error pasted below.
>
> The partitions in rocks front node is
> [root at bithorax server_logs]# df -k
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sda1             15872604   4027148  11026144  27% /
> /dev/sda5           1396245596   2986536 1321189868   1% /state/partition1
> /dev/sda2              3968124    215056   3548240   6% /var
> tmpfs                  2019372         0   2019372   0% /dev/shm
> tmpfs                   986020      3456    982564   1% /var/lib/ganglia/rrds
>
> My questions are:
>
> 1) where to create home directories?
>
> 2) how to automount and what to automount?

TORQUE doesn't really care how or where you create users' home
directories, but I'm sure Rocks has their own defined conventions and
procedures. I suggest asking this question on a Rocks list to get their
best practices.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20090716/dce5538c/attachment.bin

From stijn.deweirdt at ugent.be Fri Jul 17 04:07:13 2009
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Fri, 17 Jul 2009 12:07:13 +0200
Subject: [torqueusers] attempting connect to host
Message-ID: <1247825233.2454.4.camel@spike.ugent.be>

hi all,

had a system crash this week, and we still have a non-stable torque/maui

i'm in process of resolving the final issue (i think) that leads to
delays in eg qstat and that seems to lock up maui.

the problem is linked with following message in the server log file
svr_connect;attempting connect to host 177471491 port

using strace, i found out that this host is an ip that is not a node at
all, and that is halting torque waiting for timeout i guess.

my question is the following: based on what information could torque
think that it needs to contact this machine. i already grepped the spool
directory, found nothing related to this ip.

ideas are more than welcome!

thanks a lot,

stijn
--
The system will shutdown in 5 minutes.

From SAngelovich at lgc.com Thu Jul 16 13:41:32 2009
From: SAngelovich at lgc.com (Steve Angelovich)
Date: Thu, 16 Jul 2009 14:41:32 -0500
Subject: [torqueusers] queue properties
Message-ID:

All,

Is it possible to set up a queue where it will only release jobs to a small subset of the nodes during business hours but release jobs to a much larger set of nodes in the evenings?

We are currently using maui as our scheduler.
Thanks, Steve ---------------------------------------------------------------------- This e-mail, including any attached files, may contain confidential and privileged information for the sole use of the intended recipient. Any review, use, distribution, or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive information for the intended recipient), please contact the sender by reply e-mail and delete all copies of this message. From gabe at msi.umn.edu Fri Jul 17 08:32:16 2009 From: gabe at msi.umn.edu (Gabe Turner) Date: Fri, 17 Jul 2009 09:32:16 -0500 Subject: [torqueusers] queue properties In-Reply-To: References: Message-ID: <20090717143216.GA16602@blackice.msi.umn.edu> On Thu, Jul 16, 2009 at 02:41:32PM -0500, Steve Angelovich wrote: > All, > > Is it possible to setup a queue where it will only release jobs to a small subset of the nodes during business hours but release jobs to a much larger set of nodes in the evenings. > > We are currently using maui as our scheduler. You'll want to look into Standing Reservations in Maui: http://www.clusterresources.com/products/maui/docs/7.1.3standingreservations.shtml -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From roman at chem.ubc.ca Fri Jul 17 17:34:41 2009 From: roman at chem.ubc.ca (Roman Baranowski) Date: Fri, 17 Jul 2009 16:34:41 -0700 Subject: [torqueusers] torque-2.3.7 QSUBHOST parameter Message-ID: <1247873681.2109.498.camel@obsidian.westgrid.ca> Dear All, Running Torque 2.3.7 on a management node (no user access) and enabling the job submission from 3 headnodes via SUBMITHOSTS = head-node1,head-node2,head-node3 in torque.cfg. All is working fine till the moment I want to submit the interactive job via qsub -I The PBS_O_HOST is set to the name of PBS_SERVER and the interactive job fails since it is submitted from the head. 
After I added the parameter QSUBHOST head1 to torque.cfg, all submissions failed with the usual: Bad UID for job execution MSG=ruserok failed ****** This was fixed by adding head1, head2 and head3 to /etc/hosts.equiv on the PBS_SERVER host. However, the issue I have is the following: in this setup I can submit interactive jobs only from head1; setting QSUBHOST head1,head2,head3 does not solve the problem. What am I missing here? How can I set PBS_O_HOST to the name of the host on which qsub is issued? The attempt qsub -I -v PBS_O_HOST=head3 job fails too. I would appreciate your comments and suggestions. All the best Roman Baranowski UBC WestGrid From davidmcgivenn at gmail.com Mon Jul 20 06:51:18 2009 From: davidmcgivenn at gmail.com (David McGiven) Date: Mon, 20 Jul 2009 14:51:18 +0200 Subject: [torqueusers] PBS_MOM kills running jobs when restarted Message-ID: <1CCC8E2A-BA0E-4602-AEBA-ABE88C2C3A42@gmail.com> Dear TORQUE users, I am using pbs_mom torque 2.3.6 on a computing node with 24 processors. From time to time, it unexpectedly crashes, dumping the following error to syslog : Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]: segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4 Then, when I restart pbs_mom on that computing node, all the jobs that were still running are killed. I don't understand why. I've looked for an option to tell mom not to kill the jobs when restarting but I didn't find anything. Does anybody know what is causing the problem or how to solve it ? Thanks in advance. Best Regards, David McGiven P.S : Those are the messages pbs_mom dumps when started after a crash.
07/20/2009 13:26:25;0001; pbs_mom;Svr;pbs_mom;No such file or directory (2) in read_config, fstat: config
07/20/2009 13:26:25;0002; pbs_mom;Svr;setpbsserver;bender
07/20/2009 13:26:25;0002; pbs_mom;Svr;mom_server_add;server bender added
07/20/2009 13:26:25;0002; pbs_mom;n/a;initialize;independent
07/20/2009 13:26:25;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
07/20/2009 13:26:25;0002; pbs_mom;Svr;pbs_mom;Is up
07/20/2009 13:26:25;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/soft/torque-2.3.6/sbin/pbs_mom 1244734930
07/20/2009 13:26:25;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
07/20/2009 13:26:26;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3866 task 1 gracefully with sig 15
07/20/2009 13:26:31;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3866 task 1 with sig 9
07/20/2009 13:26:31;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3877 task 1 gracefully with sig 15
07/20/2009 13:26:32;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3884 task 1 gracefully with sig 15
07/20/2009 13:26:34;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31771 task 1 gracefully with sig 15
07/20/2009 13:26:39;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31771 task 1 with sig 9
07/20/2009 13:26:39;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31782 task 1 gracefully with sig 15
07/20/2009 13:26:39;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31789 task 1 gracefully with sig 15
07/20/2009 13:26:41;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
07/20/2009 13:26:41;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
07/20/2009 13:26:41;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/20/2009 13:26:42;0008; pbs_mom;Job;27571.bender ;checking job post-processing routine
07/20/2009 13:26:42;0080; pbs_mom;Job;27571.bender ;obit sent to server
07/20/2009 13:26:42;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4918 task 1 gracefully with sig 15
07/20/2009 13:26:47;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4918 task 1 with sig 9
07/20/2009 13:26:47;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4929 task 1 gracefully with sig 15
07/20/2009 13:26:47;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4936 task 1 gracefully with sig 15
07/20/2009 13:26:51;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4023 task 1 gracefully with sig 15
07/20/2009 13:26:56;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4023 task 1 with sig 9
07/20/2009 13:26:56;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4032 task 1 gracefully with sig 15
07/20/2009 13:26:56;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4039 task 1 gracefully with sig 15
07/20/2009 13:27:01;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4039 task 1 with sig 9
07/20/2009 13:27:01;0002; pbs_mom;Svr;im_eof;End of File from addr 158.109.211.63:15001
07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/20/2009 13:27:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
07/20/2009 13:27:02;0008; pbs_mom;Job;27565.bender ;checking job post-processing routine
-- David McGiven Associate Researcher Universitat Autonoma de Barcelona davidmcgivenn at gmail.com
From turnerg at indiana.edu Mon Jul 20 08:48:17 2009 From: turnerg at indiana.edu (George Wm Turner) Date: Mon, 20 Jul 2009 10:48:17 -0400
Subject: [torqueusers] PBS_MOM kills running jobs when restarted In-Reply-To: <6889_1248100824_n6KEeL8g008652_1CCC8E2A-BA0E-4602-AEBA-ABE88C2C3A42@gmail.com> References: <6889_1248100824_n6KEeL8g008652_1CCC8E2A-BA0E-4602-AEBA-ABE88C2C3A42@gmail.com> Message-ID: <6795CC31-558A-473A-B80E-15BD19CB132D@indiana.edu> use the -p option when you restart pbs_mom; from pbs_mom man page -p Specifies the impact on jobs which were in execution when the mini-server shut down. On any restart of MOM, the new mini-server will not be the parent of any running jobs, MOM has lost control of her offspring (not a new situation for a mother). With the -p option, Mom will allow the jobs to continue to run and monitor them indirectly via polling. The -p option is mutually exclusive with the -r option. george wm turner high performance systems 812 855 5156 On Jul 20, 2009, at 8:51 AM, David McGiven wrote: > > Dear TORQUE users, > > I am using pbs_mom torque 2.3.6 in a computing node with 24 > processors. > > From time to time, it unexpectedly crashes dumping the following > error to syslog : > > Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]: > segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4 > > Then, when I restart pbs_mom on that computing node, all the jobs that > were still running are killed. > > I don't understand why. I've looked for an option to tell mom to not > kill the jobs when restarting but I didn't find anything. > > Does anybody know what is causing the problem or how to solve it ? > > Thanks in advance. > > Best Regards, > David McGiven > > P.S : Those are the messages pbs_mom dumps when started after a crash.
> > [...] > > > -- > David McGiven > Associate Researcher > Universitat Autonoma de Barcelona >
davidmcgivenn at gmail.com > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From davidmcgivenn at gmail.com Mon Jul 20 08:55:39 2009 From: davidmcgivenn at gmail.com (David McGiven) Date: Mon, 20 Jul 2009 16:55:39 +0200 Subject: [torqueusers] PBS_MOM kills running jobs when restarted In-Reply-To: <6795CC31-558A-473A-B80E-15BD19CB132D@indiana.edu> References: <6889_1248100824_n6KEeL8g008652_1CCC8E2A-BA0E-4602-AEBA-ABE88C2C3A42@gmail.com> <6795CC31-558A-473A-B80E-15BD19CB132D@indiana.edu> Message-ID: George, Too bad I couldn't see that on the man page when I checked it. Thank you very much. David On Jul 20, 2009, at 4:48 PM, George Wm Turner wrote: > use the -p option when you restart pbs_mom; from pbs_mom man page > > > -p Specifies the impact on jobs which were in execution when > the > mini-server shut down. On any restart of MOM, the new > mini- > server will not be the parent of any running jobs, MOM has > lost > control of her offspring (not a new situation for a > mother). > With the -p option, Mom will allow the jobs to continue to > run > and monitor them indirectly via polling. The -p option is > mutu- > ally exclusive with the -r option. > > > george wm turner > high performance systems > 812 855 5156 > > > > On Jul 20, 2009, at 8:51 AM, David McGiven wrote: > >> >> Dear TORQUE users, >> >> I am using pbs_mom torque 2.3.6 in a computing node with 24 >> processors. >> >> From time to time, it unexpectedly crashes dumping the following >> error to syslog : >> >> Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]: >> segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4 >> >> Then, when I restart pbs_mom on that computing node, all the jobs >> that >> were still running are killed. >> >> I don't understand why. I've looked for an option to tell mom to not >> kill the jobs when restarting but I didn't find anything. 
>> >> Does anybody know what is causing the problem or how to solve it ? >> >> Thanks in advance. >> >> Best Regards, >> David McGiven >> >> P.S : Those are the messages pbs_mom dumps when started after a >> crash. >> >> [...] >> >> >> -- >> David McGiven >> Associate Researcher >> Universitat Autonoma de Barcelona >> davidmcgivenn at gmail.com >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > -- David McGiven Associate Researcher Universitat Autonoma de Barcelona davidmcgivenn at gmail.com From ripke.mgh at gmail.com Mon Jul 20 13:28:20 2009 From: ripke.mgh at gmail.com (Stephan Ripke) Date: Mon, 20 Jul 2009 15:28:20 -0400 Subject: [torqueusers] dependencies with hundreds of jobs. Message-ID: Hi there, I want a job that starts after a group of other jobs has finished. This works very nicely with "-W depend=afterok:jobid" for up to 50 jobs, but with a few hundred jobs the dependent job is rejected. Is there a workaround? E.g., can I start all the other jobs under a group name, and then start the next job dependent on this group? thanks, Stephan -- Stephan Ripke, M.D. Center for Human Genetic Research MGH Simches Research Center 185 Cambridge Street - CPZN6818 Boston, MA 02114 USA ripke at chgr.mgh.harvard.edu sripke at broadinstitute.org From SAngelovich at lgc.com Mon Jul 20 16:50:39 2009 From: SAngelovich at lgc.com (Steve Angelovich) Date: Mon, 20 Jul 2009 17:50:39 -0500 Subject: [torqueusers] node exclusive Message-ID: All, How can I specify a job to be node exclusive? I've been using ppn but we have nodes with varying number of processors (and memory).
I found a reference to a syntax that looks something like nodes=1:ppn=1#excl but I can't find anything in the documentation that looks promising. Thanks for any help. Steve From KSong at lbl.gov Wed Jul 22 13:52:48 2009 From: KSong at lbl.gov (Song, Kai Song) Date: Wed, 22 Jul 2009 12:52:48 -0700 Subject: [torqueusers] Torque 1.10p2 compatibility with myrinet ? Message-ID: Hi All, For our torque scheduler, if we only submit the job to 1 node, it works just fine. However, when we submit our job to 2 or more nodes, the nodes do not communicate with each other, so the job just hangs until it times out. We have tested our open-mpi program manually as follows: /home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./helloworld It works fine, so we rule out a problem with open-mpi or with the myrinet connection. The only thing left is the torque scheduler, because it is a very old version (1.1.0p2). So we are wondering whether the very old torque we have doesn't support myrinet, so that we need to build a newer version of torque. Does anyone know more detail about torque 1.1.0p2 and could help us with this issue?
Thanks in advance, Kai -------------------- Kai Song 1.510.486.4894 High Performance Computing Services (HPCS) Intern Lawrence Berkeley National Laboratory - http://scs.lbl.gov From naveed at caltech.edu Thu Jul 23 11:17:49 2009 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Thu, 23 Jul 2009 10:17:49 -0700 Subject: [torqueusers] segfault with many depencies Message-ID: <1248369469.19188.2658.camel@aeolis> I altered a submission script for FSL which is now being used by some of my users. FSL submissions generate a lot of dependencies. When the number of dependencies gets high, qsub segfaults. Is this a known problem? Is there a workaround? Decreasing the number of dependencies allows it to work.
$ qsub -l nodes=1 -l pmem=256mb -V -d /home/user/ -q short.q -M user at host.caltech.edu -N bpx_postproc -m ae -o /home/user/data/maq004/dish1.bedpostX/logs//home/user/data/maq004/dish1.bedpostX/logs.out -e /home/user/data/maq004/dish1.bedpostX/logs//home/user/data/maq004/dish1.bedpostX/logs.err -W depend=afterany:9080:9081:9082:9083:9084:9085:9086:9087:9088:9089:9090:9091:9092:9093:9094:9095:9096:9097:9098:9099:9100:9101:9102:9103:9104:9105:9106:9107:9108:9109:9110:9111:9112:9113:9114:9115:9116:9117:9118:9119:9120:9121:9122:9123:9124:9125:9126:9127:9128:9129:9130:9131:9132:9133:9134:9135:9136:9137:9138:9139:9140:9141:9142:9143:9144:9145:9146:9147:9148:9149:9150:9151:9152:9153:9154:9155:9156:9157:9158:9159:9160:9161:9162:9163:9164 /tmp/tmp.hZBQb27634
Segmentation fault
From josh at clusterresources.com Thu Jul 23 11:19:47 2009 From: josh at clusterresources.com (Josh Butikofer) Date: Thu, 23 Jul 2009 11:19:47 -0600 Subject: [torqueusers] segfault with many depencies In-Reply-To: <1248369469.19188.2658.camel@aeolis> References: <1248369469.19188.2658.camel@aeolis> Message-ID: <4A689BB3.4070405@clusterresources.com> What version of TORQUE are you using? Josh Butikofer Cluster Resources, Inc.
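One possible workaround for very long dependency lists, sketched here under stated assumptions rather than taken from this thread: fan the dependencies in through intermediate "barrier" jobs, so that no single qsub call carries more than a fixed number of job ids. The sketch assumes the failure is triggered by the length of one -W depend= list (not confirmed above), and the names submit_after, batches, qsub and barrier.sh are hypothetical, chosen only for illustration. The limit of 50 is taken from the earlier report that lists of up to 50 jobs were accepted.

```python
import subprocess


def batches(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def qsub(argv):
    """Run a qsub command line and return the job id it prints."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_after(job_ids, final_script, submit=qsub, kind="afterany",
                 batch_size=50, barrier_script="barrier.sh"):
    """Submit final_script so that it starts only after every job in job_ids.

    barrier_script is a hypothetical no-op job script (for example one that
    only runs `exit 0`). `submit` takes an argv list and returns a job id;
    it defaults to calling the real qsub.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    pending = [str(j) for j in job_ids]
    # Collapse the list level by level until one qsub call can hold it:
    # each barrier job waits for one batch and stands in for it afterwards.
    while len(pending) > batch_size:
        pending = [
            submit(["qsub", "-W", f"depend={kind}:" + ":".join(batch),
                    barrier_script])
            for batch in batches(pending, batch_size)
        ]
    argv = ["qsub"]
    if pending:
        argv += ["-W", f"depend={kind}:" + ":".join(pending)]
    return submit(argv + [final_script])
```

With the 85 job ids from the command above, a call such as submit_after(range(9080, 9165), "/tmp/tmp.hZBQb27634") would submit two barrier jobs (50 and 35 dependencies) and then the final job with two dependencies. Other qsub options (-q, -N, -o and so on) are left out of the sketch and would need to be added to the final command.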
############################# Naveed Near-Ansari wrote: > I altered a submission script for FSL which is now being used by some > of my users. FSL submissions generate a lot of dependencies. When the > number of dependencies gets high, qsub segfaults. Is this a known > problem? Is there a workaround? Decreasing the number of dependencies allows it to > work. > > [...] > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From naveed at caltech.edu Thu Jul 23 11:26:53 2009 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Thu, 23 Jul 2009 10:26:53 -0700 Subject: [torqueusers] segfault with many depencies In-Reply-To: <4A689BB3.4070405@clusterresources.com> References: <1248369469.19188.2658.camel@aeolis> <4A689BB3.4070405@clusterresources.com> Message-ID: <1248370013.19188.2659.camel@aeolis> 2.3.6 as packaged in the torque roll for rocks 5.2. Naveed On Thu, 2009-07-23 at 11:19 -0600, Josh Butikofer wrote: > What version of TORQUE are you using? > > Josh Butikofer > Cluster Resources, Inc.
> ############################# > > > Naveed Near-Ansari wrote: > > [...] > From arnaubria at pic.es Mon Jul 27 08:00:22 2009 From: arnaubria at pic.es (Arnau Bria) Date: Mon, 27 Jul 2009 16:00:22 +0200 Subject: [torqueusers] net_move, no server specified in $worker_node Message-ID: <20090727160022.44aff416@lx-arnau.pic.es> Hi all, I've started seeing the error message given in the subject on my torque server. I found no reference to it on Google. My client has a good torque configuration, so I don't know what the error means... Could anyone explain why this error appears in messages?
TIA, Arnau From knielson at clusterresources.com Mon Jul 27 09:04:54 2009 From: knielson at clusterresources.com (Ken Nielson) Date: Mon, 27 Jul 2009 09:04:54 -0600 Subject: [torqueusers] net_move, no server specified in $worker_node In-Reply-To: <20090727160022.44aff416@lx-arnau.pic.es> References: <20090727160022.44aff416@lx-arnau.pic.es> Message-ID: <4A6DC216.5010908@clusterresources.com> Arnau Bria wrote: > Hi all, > > I've started seeing the error messages specified in subject in my > torque server. > I see no reference in google about it. > > > My client has good torque conf, so I don't know what the error means... > > anyonw could explain why is this error in messages? > > TIA, > Arnau > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > Arnau, From the code it looks like the process is moving a job. It expects the destination to have an '@' symbol but one does not exist. The destination comes from the job structure. Does this help any? Ken Nielson Cluster Resources, Inc. From arnaubria at pic.es Mon Jul 27 10:20:47 2009 From: arnaubria at pic.es (Arnau Bria) Date: Mon, 27 Jul 2009 18:20:47 +0200 Subject: [torqueusers] net_move, no server specified in $worker_node In-Reply-To: <4A6DC216.5010908@clusterresources.com> References: <20090727160022.44aff416@lx-arnau.pic.es> <4A6DC216.5010908@clusterresources.com> Message-ID: <20090727182047.55bd1a86@amparo> On Mon, 27 Jul 2009 09:04:54 -0600 Ken Nielson wrote: > Arnau, Hi Ken, > From the code it looks like the process is moving a job. It expects > the destination to have an '@' symbol but one does not exist. > > The destination comes from the job structure. thanks for looking at it :-) > Does this help any? nop... maybe I should have to study with what kind of job it happens, but the error is always about same 4 or 5 nodes, and appears many times in logs... will do a little deeper study. 
> Ken Nielson > Cluster Resources, Inc. Thanks, Arnau From jbernstein at penguincomputing.com Mon Jul 27 11:59:54 2009 From: jbernstein at penguincomputing.com (Josh Bernstein) Date: Mon, 27 Jul 2009 10:59:54 -0700 Subject: [torqueusers] net_move, no server specified in $worker_node In-Reply-To: <20090727160022.44aff416@lx-arnau.pic.es> References: <20090727160022.44aff416@lx-arnau.pic.es> Message-ID: <9A4D30C7-8A15-40B9-93A9-620BBD8C6246@penguincomputing.com> Hi, The only case that I can think of when a job would be migrated to a queue with an @ sign would be a routing queue. Perhaps you have a misconfigured routing queue? -Joshua Bernstein Senior Software Engineer Penguin Computing On Jul 27, 2009, at 7:09 AM, "Arnau Bria" wrote: > Hi all, > > I've started seeing the error messages specified in subject in my > torque server. > I see no reference in google about it. > > > My client has good torque conf, so I don't know what the error > means... > > anyonw could explain why is this error in messages? > > TIA, > Arnau > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From arnaubria at pic.es Mon Jul 27 12:15:55 2009 From: arnaubria at pic.es (Arnau Bria) Date: Mon, 27 Jul 2009 20:15:55 +0200 Subject: [torqueusers] net_move, no server specified in $worker_node In-Reply-To: <9A4D30C7-8A15-40B9-93A9-620BBD8C6246@penguincomputing.com> References: <20090727160022.44aff416@lx-arnau.pic.es> <9A4D30C7-8A15-40B9-93A9-620BBD8C6246@penguincomputing.com> Message-ID: <20090727201555.5c6d01e3@amparo> On Mon, 27 Jul 2009 10:59:54 -0700 Josh Bernstein wrote: > Hi, Hi Josh, > The only case that I can think of when a job would be migrated to a > queue with an @ sign would be a routing queue. Perhaps you have a > misconfigured routing queue? no, we do not use routing queues... 
we had some network problems during the weekend; maybe they caused a break between the nodes and the server, and that in turn caused this strange error?
# grep "net_move, no server specified in" /var/log/messages.1|cut -d" " -f 4,5,6,7,8,9,10,11|sort|uniq -c
8 pbs02 PBS_Server: net_move, no server specified in td027.pic.es
8 pbs02 PBS_Server: net_move, no server specified in td078.pic.es
8 pbs02 PBS_Server: net_move, no server specified in td147.pic.es
8 pbs02 PBS_Server: net_move, no server specified in td153.pic.es
8 pbs02 PBS_Server: net_move, no server specified in td159.pic.es
8 pbs02 PBS_Server: net_move, no server specified in td172.pic.es
thanks for your reply. > -Joshua Bernstein > Senior Software Engineer > Penguin Computing
From abhig at Princeton.EDU Wed Jul 29 10:09:03 2009 From: abhig at Princeton.EDU (abhishek gupta) Date: Wed, 29 Jul 2009 12:09:03 -0400 Subject: [torqueusers] ncpus acting wierd Message-ID: <4A70741F.7050101@princeton.edu> Hi All, I am trying to use 'ncpus' but when I submit my job, the job runs only on one cpu. Is there any parameter that I need to set to make it work? I am running a dummy script which looks like this:
#!/bin/bash
#PBS -j oe
#PBS -m abe
#PBS -l cput=36:0:0
#PBS -l ncpus=2
/home/abhig/pbs_nfs_test/pbs-nfs.sh
The script pbs-nfs.sh is just a while loop that prints hello 20000 times to a file. Thanks, Abhi.
From aeszter at gwdg.de Wed Jul 29 10:48:29 2009 From: aeszter at gwdg.de (Ansgar Esztermann) Date: Wed, 29 Jul 2009 18:48:29 +0200 Subject: [torqueusers] ncpus acting wierd In-Reply-To: <4A70741F.7050101@princeton.edu> References: <4A70741F.7050101@princeton.edu> Message-ID: <2BF1B7B3-ED98-4878-9014-6B167B000FA9@gwdg.de> On Jul 29, 2009, at 18:09 , abhishek gupta wrote: > I am trying to use 'ncpus' but when I submit my job, the job runs only > on one cpu. Is there any parameter that I need to set to make it work?
I am not sure I understand what you mean, but torque does not make
your scripts (or binaries, for that matter) run on several CPUs. If you
submit a script with the command "hostname" to 16 nodes with 4 CPUs
each, the output will be the name of one of the hosts, printed once.
In order to use more than one CPU, you will have to set up your software
accordingly. Parallelizing existing software is not always trivial. But
then again, the program you wish to use might already support parallel
execution. In that case, you will have to run it in an appropriate way.
If it is MPI-based, you will probably want to start it with mpirun or
mpiexec, like so:

mpirun myprog

A.

--
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

From jkusznir at gmail.com Wed Jul 29 17:50:16 2009
From: jkusznir at gmail.com (Jim Kusznir)
Date: Wed, 29 Jul 2009 16:50:16 -0700
Subject: [torqueusers] Node assignment queue for shared memory computing
Message-ID:

Hi all:

I'm currently working on getting hadoop running in scheduler mode with
torque, and basically need a shared memory node allocation. By this,
I mean when the program requests -nodes=4, they mean 4 unique nodes
with all processors in those nodes allocated, AND ideally the
generated machine file only containing one entry for each node.

Unfortunately, I am not able to modify how it requests nodes (such as
make it use the :np=8 option), so when it requests --nodes=4, it needs
4 physically separate nodes. I tried a few ways to "outsmart" hadoop,
but all without success.

I also see this as required for running hybrid MPI/OpenMP jobs. When
I run such jobs, I want my MPI stack to only start one process per
physical node, but then have OpenMP run on "lightweight" threads to
use all the cores on that system.
I can do the -nodes=4:np=8 in this
case, but the generated machines file that OpenMPI gets tells it it
has 32 nodes in this case, so it would start 32 executables, 8 on each
node, when I actually only want 4 executables started.

Thanks!
--Jim

From nate.a.woody at gmail.com Wed Jul 29 10:35:41 2009
From: nate.a.woody at gmail.com (Nate Woody)
Date: Wed, 29 Jul 2009 12:35:41 -0400
Subject: [torqueusers] epilogue.precancel documentation and execution questions
In-Reply-To: <516eeec60907281238q4c8a574bn194b519798f5a1c1@mail.gmail.com>
References: <516eeec60907281238q4c8a574bn194b519798f5a1c1@mail.gmail.com>
Message-ID: <516eeec60907290935l6786955y37af9047b1294aa5@mail.gmail.com>

Hello,

Is it just me or is the epilogue.precancel documentation
(http://www.clusterresources.com/products/torque/docs/a.gprologueepilogue.shtml)
just wrong? It states that the epilogue.precancel script should have
mode 500, but if I do that I get a permissions error in the mom log.
Garrick answered a previous question saying it had to be 755, and that
seemed to work for me. The docs also appear to say that the script is
run as root, but it looks to me like it's executed by the owner of the
job (user privs), which makes sense with the 755 permissions, I suppose.
The docs also seem to say that script stdout is hooked up to the job
stdout, but that doesn't seem to be true either. I haven't actually
found where stdout from epilogue.precancel is going, but it's not the
job stdout. Lastly, it appears that this script is getting executed as
the regular epilogue script as well as the precancel script (in addition
to the regular epilogue script), so that it gets executed once for ALL
jobs and twice for canceled jobs.

Is anyone using the precancel out there that is seeing this double
execution? I piped stdout to a file and staged it back out, but has
anybody gotten this into the job stdout?
Thanks,
Nate

From Gareth.Williams at csiro.au Wed Jul 29 19:55:17 2009
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Thu, 30 Jul 2009 11:55:17 +1000
Subject: [torqueusers] Node assignment queue for shared memory computing
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C625C13A7BD57@exvic-mbx04.nexus.csiro.au>

> Hi all:
>
> I'm currently working on getting hadoop running in scheduler mode with
> torque, and basically need a shared memory node allocation. By this,
> I mean when the program requests -nodes=4, they mean 4 unique nodes
> with all processors in those nodes allocated, AND ideally the
> generated machine file only containing one entry for each node.
>
> Unfortunately, I am not able to modify how it requests nodes (such as
> make it use the :np=8 option), so when it requests --nodes=4, it needs
> 4 physically separate nodes. I tried a few ways to "outsmart" hadoop,
> but all without success.
>
> I also see this as required for running hybrid MPI/OpenMP jobs. When
> I run such jobs, I want my MPI stack to only start one process per
> physical node, but then have OpenMP run on "lightweight" threads to
> use all the cores on that system. I can do the -nodes=4:np=8 in this
> case, but the generated machines file that OpenMPI gets tells it it
> has 32 nodes in this case, so it would start 32 executables, 8 on each
> node, when I actually only want 4 executables started.
>
> Thanks!
> --Jim

Hi Jim,

I can't see how to get around the need to request ppn=8 (or whatever all
processors in a node is) but...

Open MPI's mpirun has options you can use. For -l nodes=4:ppn=8 you
might run

mpirun -bynode -np 4 ...

Or you could filter the $PBS_NODEFILE to a temporary file and use that
in an mpirun arg.

cat $PBS_NODEFILE | sort | uniq > uniq.nodes
mpirun -hostfile uniq.nodes ...
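The node-file filtering described above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name unique_hosts, the temporary demo file, and the host names node01/node02 are hypothetical, and the node file is assumed to list one host name per allocated core. sort -u is used here as shorthand for sort | uniq.

```shell
#!/bin/bash
# Sketch: collapse a node file that has one line per allocated core
# into one line per host. unique_hosts is a hypothetical helper name.
unique_hosts() {
    # sort -u sorts the lines and drops duplicates (same as sort | uniq)
    sort -u "$1"
}

# Demo with a made-up node file standing in for $PBS_NODEFILE.
demo_nodefile=$(mktemp)
printf '%s\n' node01 node01 node02 node02 node01 > "$demo_nodefile"

# Prints node01 and node02, one per line.
unique_hosts "$demo_nodefile"
```

Inside a job script the same helper would be called as unique_hosts "$PBS_NODEFILE" > uniq.nodes, and the resulting file passed to mpirun -hostfile as in the message.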
There may even be better options with mpirun.

cheers,

Gareth

From Jacques.Foury at math.u-bordeaux1.fr Thu Jul 30 02:27:46 2009
From: Jacques.Foury at math.u-bordeaux1.fr (Jacques Foury)
Date: Thu, 30 Jul 2009 10:27:46 +0200
Subject: [torqueusers] Node assignment queue for shared memory computing
In-Reply-To:
References:
Message-ID: <4A715982.7030403@math.u-bordeaux1.fr>

Jim Kusznir wrote:
> Hi all:
>
> I'm currently working on getting hadoop running in scheduler mode with
> torque, and basically need a shared memory node allocation. By this,
> I mean when the program requests -nodes=4, they mean 4 unique nodes
> with all processors in those nodes allocated, AND ideally the
> generated machine file only containing one entry for each node.
>

What I would do in your case is declare nodes to have only 1 CPU... and
then you deal with MPI on your own...

--
Jacques Foury
administrateur systemes, reseaux, clusters
Institut de Mathematiques de Bordeaux
http://www.math.u-bordeaux1.fr/maths/cellule

From Jacques.Foury at math.u-bordeaux1.fr Thu Jul 30 03:04:43 2009
From: Jacques.Foury at math.u-bordeaux1.fr (Jacques Foury)
Date: Thu, 30 Jul 2009 11:04:43 +0200
Subject: [torqueusers] node exclusive
In-Reply-To:
References:
Message-ID: <4A71622B.6070502@math.u-bordeaux1.fr>

Steve Angelovich wrote:
> All,
>
> How can I specify a job to be node exclusive?
>
> I've been using ppn but we have nodes with varying number of processors (and memory).
>
> I found a reference to a syntax that looks something like nodes=1:ppn=1#excl but I can't find anything in the documentation that looks promising.
>
> Thanks for any help.
>

If you were using MAUI, you could set:

NODEACCESSPOLICY SINGLEJOB

--
Jacques Foury
administrateur systemes, reseaux, clusters
Institut de Mathematiques de Bordeaux
http://www.math.u-bordeaux1.fr/maths/cellule

From Jacques.Foury at math.u-bordeaux1.fr Thu Jul 30 03:13:15 2009
From: Jacques.Foury at math.u-bordeaux1.fr (Jacques Foury)
Date: Thu, 30 Jul 2009 11:13:15 +0200
Subject: [torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
In-Reply-To: <6795CC31-558A-473A-B80E-15BD19CB132D@indiana.edu>
References: <6889_1248100824_n6KEeL8g008652_1CCC8E2A-BA0E-4602-AEBA-ABE88C2C3A42@gmail.com> <6795CC31-558A-473A-B80E-15BD19CB132D@indiana.edu>
Message-ID: <4A71642B.7000604@math.u-bordeaux1.fr>

George Wm Turner wrote:
> use the -p option when you restart pbs_mom; from pbs_mom man page
>
>
>     -p      Specifies the impact on jobs which were in execution when the
>             mini-server shut down. On any restart of MOM, the new
>             mini-server will not be the parent of any running jobs, MOM
>             has lost control of her offspring (not a new situation for a
>             mother). With the -p option, Mom will allow the jobs to
>             continue to run and monitor them indirectly via polling. The
>             -p option is mutually exclusive with the -r option.
>

I experienced serious trouble because this is not the default behaviour
for mom...

I always use the -p parameter, and I wonder why I would want anything
else! Why would someone want his job killed when restarting mom?

I urge the developers to make "-p" the default behaviour for mom... or
at least to put this option in the startup script for the different
packages (rpm or deb...)
-- Jacques Foury administrateur systemes, reseaux, clusters Institut de Mathematiques de Bordeaux http://www.math.u-bordeaux1.fr/maths/cellule From peter.schmitt at dartmouth.edu Thu Jul 30 04:16:30 2009 From: peter.schmitt at dartmouth.edu (Pete Schmitt) Date: Thu, 30 Jul 2009 06:16:30 -0400 Subject: [torqueusers] node exclusive In-Reply-To: <4A71622B.6070502@math.u-bordeaux1.fr> References: <4A71622B.6070502@math.u-bordeaux1.fr> Message-ID: <4A7172FE.4080008@dartmouth.edu> You can specify this in your torque pbs script: #PBS -l naccesspolicy=single Jacques Foury wrote: > Steve Angelovich a ?crit : > >> All, >> >> How can I specify a job to be node exclusive? >> >> I've been using ppn but we have nodes with varying number of processors (and memory). >> >> I found a references to a syntax that looks something like nodes=1:ppn=1#excl but I can't find anything in the documentation that looks promising. >> >> Thanks for any help. >> >> > > if you were using MAUI, you could set : > > NODEACCESSPOLICY SINGLEJOB > > -- *Pete Schmitt* *Technical Director: Discovery Cluster 179B Berry Library, HB 6224 Dartmouth College Hanover, NH 03755* *Dartmouth: 603-646-8109 Mon,Tue,Thu,Fri** DHMC/Cell: 603-252-2452 Wednesday Fax: 603-646-1042 AIM: CongressSt * Computational Genetics Lab -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090730/21800f12/attachment.html

From glen.beane at gmail.com Thu Jul 30 04:38:49 2009
From: glen.beane at gmail.com (Glen Beane)
Date: Thu, 30 Jul 2009 06:38:49 -0400
Subject: [torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
In-Reply-To: <4A71642B.7000604@math.u-bordeaux1.fr>
References: <6889_1248100824_n6KEeL8g008652_1CCC8E2A-BA0E-4602-AEBA-ABE88C2C3A42@gmail.com> <6795CC31-558A-473A-B80E-15BD19CB132D@indiana.edu> <4A71642B.7000604@math.u-bordeaux1.fr>
Message-ID: <5caae8c60907300338t3e741f92x509f26f03c4e4993@mail.gmail.com>

You do not want -p as the default behavior when you reboot a node.
pbs_mom could find pids that match its last known jobs and attempt to
take ownership of them when in fact the pids have no relation to the
previous jobs. -p should not be in any startup script, except for maybe
a "restart" option.

-p should only be used in rare cases when you need to (re)start pbs_mom
on a node already running jobs, _never_ at boot time, which is why it is
not the default and why it is not in most sites' startup scripts.

On Thu, Jul 30, 2009 at 5:13 AM, Jacques Foury wrote:
> George Wm Turner wrote:
>> use the -p option when you restart pbs_mom; from pbs_mom man page
>>
>>
>>     -p      Specifies the impact on jobs which were in execution when the
>>             mini-server shut down. On any restart of MOM, the new
>>             mini-server will not be the parent of any running jobs, MOM
>>             has lost control of her offspring (not a new situation for a
>>             mother). With the -p option, Mom will allow the jobs to
>>             continue to run and monitor them indirectly via polling. The
>>             -p option is mutually exclusive with the -r option.
>>
>
> I experienced serious trouble because this is not the default behaviour
> for mom...
>
> I always put the -p parameter, and I wonder why I may have something
> else !!!
Why could someone want his job killed when restarting mom ??? > > I urge the developers to put "-p" as the default behaviour for mom... or > at least to put this option in the startup script for the different > packages (rpm or deb...) > > -- > Jacques Foury > administrateur systemes, reseaux, clusters > Institut de Mathematiques de Bordeaux > http://www.math.u-bordeaux1.fr/maths/cellule > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From peter.schmitt at dartmouth.edu Thu Jul 30 04:42:28 2009 From: peter.schmitt at dartmouth.edu (Pete Schmitt) Date: Thu, 30 Jul 2009 06:42:28 -0400 Subject: [torqueusers] Node assignment queue for shared memory computing In-Reply-To: <4A715982.7030403@math.u-bordeaux1.fr> References: <4A715982.7030403@math.u-bordeaux1.fr> Message-ID: <4A717914.4020108@dartmouth.edu> Just add the following to get exclusive access to the node: #PBS -l naccesspolicy=single Jacques Foury wrote: > Jim Kusznir a ?crit : > >> Hi all: >> >> I'm currently working on getting hadoop running in scheduler mode with >> torque, and basically need a shared memory node allocation. By this, >> I mean when the program requests -nodes=4, they mean 4 unique nodes >> with all processors in those nodes allocated, AND ideally the >> generated machine file only containing one entry for each node. >> >> > > What I would do in your case is declare nodes to have only 1 CPU... and > then you deal with MPI on your own... > > -- *Pete Schmitt* *Technical Director: Discovery Cluster 179B Berry Library, HB 6224 Dartmouth College Hanover, NH 03755* *Dartmouth: 603-646-8109 Mon,Tue,Thu,Fri** DHMC/Cell: 603-252-2452 Wednesday Fax: 603-646-1042 AIM: CongressSt * Computational Genetics Lab -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090730/4a444ee3/attachment-0001.html

From sean.reilly at ersa.edu.au Wed Jul 29 20:54:41 2009
From: sean.reilly at ersa.edu.au (Sean Reilly)
Date: Thu, 30 Jul 2009 12:24:41 +0930
Subject: [torqueusers] Applications/programs ran by users
Message-ID: <4A710B71.8090902@ersa.edu.au>

Hi All,

Does anyone know if it is possible to see what applications or
programs were actually run by the users?
I can see that the /var/spool/pbs/server_priv/accounting
logs provide quite a bit of information,
but I cannot see what application or program was used.

Is there someplace else for this information?
Or is this something we have to set up in the logging itself, and not
done by default?

Thanks in advance
Sean

--
Sean Reilly

Applications Support Officer
eResearchSA

Phone : +61 8 8303 8352
Mobile: +61 450 840 246

From jdsmit at sandia.gov Thu Jul 30 08:17:19 2009
From: jdsmit at sandia.gov (Jerry Smith)
Date: Thu, 30 Jul 2009 08:17:19 -0600
Subject: [torqueusers] Applications/programs ran by users
In-Reply-To: <4A710B71.8090902@ersa.edu.au>
References: <4A710B71.8090902@ersa.edu.au>
Message-ID: <4A71AB6F.9010509@sandia.gov>

Sean,

The job files exist on the pbs_server at
$PBS_HOME/server_priv/jobs/*.SC, from submission until the job ends.

We had to do this ourselves, and ended up just copying off their script
( $PBS_HOME/mom_priv/jobs/$SCRIPT.SC ) in the prologue, and writing a
custom parser to look for mpiexec/orterun/mpirun etc. after job
execution.

----cut from prologue---
# ${PBS_JOBID/.*/} strips everything from the first dot, leaving the numeric job id
export SCRIPT_REGEX=$PBS_DIR/mom_priv/jobs/${PBS_JOBID/.*/}.*.SC
export SCRIPT=`ls $SCRIPT_REGEX 2> /dev/null`
# Copy the script only if the pattern resolved to a single existing file
if [ -f "$SCRIPT" ] ; then
    cp "$SCRIPT" /apps/jobscripts/
fi
---end cut ----

Obviously this does not work for interactive jobs, and you wouldn't
believe how many people run "a.out".

--Jerry

Sean Reilly wrote:
> Hi All,
>
> Does anyone know if it is possible to see what applications or
> programs were actually run by the users?
> I can see the /var/spool/pbs/server_priv/accounting > logs provide quite a bit of information > but cannot see what application or program was used > > is there someplace else for this information ? > or is this something we have to set up in the logging itself and not > done by default ? > > Thanks in advance > Sean > > From mailmaverick666 at gmail.com Thu Jul 30 23:43:51 2009 From: mailmaverick666 at gmail.com (rishi pathak) Date: Fri, 31 Jul 2009 11:13:51 +0530 Subject: [torqueusers] Applications/programs ran by users In-Reply-To: <4A710B71.8090902@ersa.edu.au> References: <4A710B71.8090902@ersa.edu.au> Message-ID: <180b672e0907302243u3ffce880p954811024ff02bd1@mail.gmail.com> Hi , We keep copying .SC from $PBS_HOME/server_priv/jobs/ directory. We use dnotify for triggering the copy process. On Thu, Jul 30, 2009 at 8:24 AM, Sean Reilly wrote: > Hi All, > > Does any one know if its is possible to see what applications or > programs where actually ran by the users ? > I can see the /var/spool/pbs/server_priv/accounting > logs provide quite a bit of information > but cannot see what application or program was used > > is there someplace else for this information ? > or is this something we have to set up in the logging itself and not > done by default ? > > Thanks in advance > Sean > > -- > Sean Reilly > > Applications Support Officer > eResearchSA > > Phone : +61 8 8303 8352 > Mobile: +61 450 840 246 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Regards-- Rishi Pathak Pune-Maharastra -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20090731/e524f048/attachment.html

From tbaer at utk.edu Fri Jul 31 12:31:05 2009
From: tbaer at utk.edu (Troy Baer)
Date: Fri, 31 Jul 2009 14:31:05 -0400
Subject: [torqueusers] Applications/programs ran by users
In-Reply-To: <180b672e0907302243u3ffce880p954811024ff02bd1@mail.gmail.com>
References: <4A710B71.8090902@ersa.edu.au> <180b672e0907302243u3ffce880p954811024ff02bd1@mail.gmail.com>
Message-ID: <1249065065.4379.13.camel@ashitaka.baer.lan>

On Fri, 2009-07-31 at 11:13 +0530, rishi pathak wrote:
> We keep copying .SC from $PBS_HOME/server_priv/jobs/
> directory. We use dnotify for triggering the copy process.

The DB back-end for the PHP/MySQL-based job accounting system that comes
with pbstools [1,2] has a column for storing scripts as well. It comes
with examples of using either dnotify or inotifywait to fire the DB
insert script when a new .SC file appears. The web front-end also has
several prefabbed reports for usage of particular software packages, and
they're pretty easily extensible -- you just need to add patterns to
look for in the job scripts that correspond to a particular application.

--Troy

[1] http://www.osc.edu/~troy/pbstools/pbstools.tar.gz
[2] http://svn.osc.edu/repos/pbstools/trunk/

From naveed at caltech.edu Fri Jul 31 17:59:00 2009
From: naveed at caltech.edu (Naveed Near-Ansari)
Date: Fri, 31 Jul 2009 16:59:00 -0700
Subject: [torqueusers] pbs_mom problem in rocks 5
Message-ID: <1249084740.19188.4268.camel@aeolis>

We have had some trouble with 2 installations of Torque on rocks 5 (both
5.1 and 5.2). Our installations on rocks 4 do not exhibit the same
problem.

Essentially what happens is that the pbs_moms become unresponsive every
couple of days. We get this response using momctl:

ERROR: query[0] 'diag3' failed on compute-1-31 (errno=0-Success: 5-Input/output error)

The node seems to be running pbs_mom, and restarting it resolves the
issue.
Naturally, submitted jobs get rejected:

Job: 17809.hostname.caltech.edu

07/31/2009 15:09:51  S    enqueuing into default, state 2 hop 1
07/31/2009 15:09:51  S    Job Queued at request of username at hostname.caltech.edu, owner = username at hostname.caltech.edu, job name = Job158Task1, queue = default
07/31/2009 15:09:51  S    Job Modified at request of username at hostname.caltech.edu
07/31/2009 15:09:51  A    queue=default
07/31/2009 15:09:52  S    Job Modified at request of maui at hostname.caltech.edu
07/31/2009 15:09:52  S    Job Run at request of maui at hostname.caltech.edu
07/31/2009 15:09:52  S    send of job to compute-1-31 failed error = 15008
07/31/2009 15:09:52  S    unable to run job, MOM rejected/rc=1
07/31/2009 15:09:52  S    Job Modified at request of maui at hostname.caltech.edu
07/31/2009 15:14:53  S    Job deleted at request of username at hostname.caltech.edu
07/31/2009 15:14:53  S    dequeuing from default, state EXITING
07/31/2009 15:14:53  A    requestor=username at hostname.caltech.edu

After restarting the mom, the output from momctl looks perfectly
healthy.

Have you seen this behavior before? Does anyone have a solution for it?

Naveed
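A check along the following lines could flag the unresponsive state described in this message. It is a sketch, not a tested fix: the function name mom_reports_error is hypothetical, and it assumes only what the message shows, namely that a failed momctl query prints a line starting with "ERROR:". The sample text is copied from the message; whether momctl writes that line to stdout or stderr is not known here, and how pbs_mom gets restarted is site-specific, so both are left as comments.

```shell
#!/bin/bash
# Sketch: decide from captured momctl output whether a MOM query failed.
# mom_reports_error is a hypothetical helper; it only looks for a line
# starting with "ERROR:", as in the failure quoted in the message.
mom_reports_error() {
    printf '%s\n' "$1" | grep -q '^ERROR:'
}

# Sample text copied from the message. On a real cluster this would be
# captured output, e.g. from a momctl diagnostic query with stderr merged.
sample="ERROR: query[0] 'diag3' failed on compute-1-31 (errno=0-Success: 5-Input/output error)"

if mom_reports_error "$sample"; then
    # A site-specific pbs_mom restart or an alert would go here.
    echo "MOM unresponsive"
else
    echo "MOM ok"
fi
```

Run from cron against each compute node, a check like this would at least shorten the window in which jobs are sent to a MOM that will reject them; it does not address why the MOMs hang.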