[torqueusers] jobs stuck in queue until I force execution with qrun
Gustavo Correa
gus at ldeo.columbia.edu
Thu Feb 16 13:55:29 MST 2012
Hi Christina
This is just a vague thought; I am not sure it points in the right direction.
I am a bit confused about the domain being admin.default.domain.
Is this the server name in $TORQUE/server_name on the head node?
Or is it something else, perhaps the head node's Internet FQDN?
How about this line in the compute nodes' $TORQUE/mom_priv/config file:
$pbsserver .....
What is the server name that appears there?
These items were a source of confusion for me long ago.
I no longer remember what the mistake was or how it was fixed,
but maybe there is something here.
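One way to make the server-name comparison above concrete is the sketch below. It is not an official Torque tool: the function names are made up for illustration, and it assumes server_name holds the name on its first non-empty line and that the MOM config has a line of the form "$pbsserver <name>".

```shell
#!/bin/sh
# Sketch only: compare the name in a server_name file with the
# $pbsserver entry of a mom_priv/config file. The function names are
# hypothetical; the file paths are passed in by the caller.

# Print the first non-empty line of a server_name file, without whitespace.
server_name_from() {
    sed -n '/[^[:space:]]/{p;q;}' "$1" | tr -d '[:space:]'
}

# Print the host named on the $pbsserver line of a MOM config file.
pbsserver_from() {
    awk '$1 == "$pbsserver" { print $2; exit }' "$1"
}

# Return 0 only if both files name the same, non-empty server.
names_match() {
    a=$(server_name_from "$1")
    b=$(pbsserver_from "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

With $TORQUE being /var/spool/torque (as in the pbs_server command line quoted below), and with a copy of the node's config at hand, "names_match /var/spool/torque/server_name /var/spool/torque/mom_priv/config && echo same" would print "same" only when the two names agree.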
Also, is there any hint of the problem in the $TORQUE/mom_logs files on the compute nodes?
How about /var/log/messages on the compute nodes, any smoking gun there?
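For the log questions above, a minimal sketch of how one might search a log directory for a given job ID. The function name is hypothetical, and the one-file-per-day layout (files named YYYYMMDD) is assumed from the tracejob paths quoted below.

```shell
#!/bin/sh
# Sketch only: print every line that mentions a job ID in the files of
# one log directory, each prefixed with the file it came from.
# job_lines is a hypothetical helper, not part of Torque.
job_lines() {
    job_id=$1
    log_dir=$2
    # -F: match the job ID as a fixed string, since it contains dots.
    # -H: always print the file name, even if only one file exists.
    grep -F -H -- "$job_id" "$log_dir"/* 2>/dev/null
}
```

On a compute node this could be run as "job_lines 22.admin.default.domain /var/spool/torque/mom_logs". For the system log, a plain "grep -F pbs_mom /var/log/messages" is a reasonable first look.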
Can the compute nodes resolve the Torque server name [the easy way is via /etc/hosts]?
Can the Torque server resolve the compute nodes' names [say, in /etc/hosts]?
Is there a firewall between the server and the compute nodes?
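The name-resolution and firewall questions above can be checked roughly as follows. This is only a sketch: the function names are made up, it relies on getent, timeout, and bash being installed, and the port numbers in the usage note assume the Torque defaults (15001 for pbs_server, 15002 for pbs_mom), which a site may have changed.

```shell
#!/bin/sh
# Sketch only: two small connectivity checks. Both helpers are
# hypothetical names, not Torque commands.

# Return 0 if the host name resolves through the system resolver
# (/etc/hosts, DNS, or whatever nsswitch.conf lists).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Return 0 if a TCP connection to host $1, port $2 succeeds within
# 3 seconds. Uses bash's /dev/tcp redirection, so bash must exist.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
```

From a compute node one might try "resolves admin.default.domain && port_open admin.default.domain 15001", and from the server "resolves n001.default.domain && port_open n001.default.domain 15002". A name that resolves but a port that never opens would be consistent with a firewall in between, although a daemon that is simply not running gives the same result.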
Maybe the Torque Admin Guide, Ch. 1 [overview/installation/configuration]
and Ch. 11 [troubleshooting], can help:
http://www.adaptivecomputing.com/resources/docs/
I hope this helps,
Gus Correa
On Feb 16, 2012, at 3:10 PM, Christina Salls wrote:
> Hi all,
>
> My situation has improved but I am still not there. I can submit a job successfully, but it will stay in the queue until I force execution with qrun.
>
> e.g.
>
> -bash-4.1$ qsub ./example_submit_script_1
> 22.admin.default.domain
> -bash-4.1$ qstat -a
>
> admin.default.domain:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 22.admin.default     salls    batch    ExampleJob           --     1   1     -- 00:01 Q    --
>
> [root at wings ~]# qrun 22
> [root at wings ~]# qstat -a
>
> admin.default.domain:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 R    --
>
> [root at wings ~]# qstat -a
>
> admin.default.domain:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 C 00:00
> [root at wings ~]#
>
>
> This is what tracejob output looks like:
>
> [root at wings ~]# tracejob 22
> /var/spool/torque/mom_logs/20120216: No such file or directory
> /var/spool/torque/sched_logs/20120216: No matching job records located
>
> Job: 22.admin.default.domain
>
> 02/16/2012 13:46:51 S enqueuing into batch, state 1 hop 1
> 02/16/2012 13:46:51 S Job Queued at request of salls at admin.default.domain, owner = salls at admin.default.domain,
> job name = ExampleJob, queue = batch
> 02/16/2012 13:46:51 A queue=batch
> 02/16/2012 13:53:53 S Job Run at request of root at admin.default.domain
> 02/16/2012 13:53:53 S Not sending email: User does not want mail of this type.
> 02/16/2012 13:53:53 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
> etime=1329421611 start=1329422033 owner=salls at admin.default.domain
> exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
> Resource_List.nodes=1 Resource_List.walltime=00:01:00
> 02/16/2012 13:54:03 S Not sending email: User does not want mail of this type.
> 02/16/2012 13:54:03 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:10
> 02/16/2012 13:54:03 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
> etime=1329421611 start=1329422033 owner=salls at admin.default.domain
> exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
> Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043
> Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:10
>
>
> This is what the output files look like:
>
> -bash-4.1$ more ExampleJob.o22
> Thu Feb 16 13:53:53 CST 2012
> Thu Feb 16 13:54:03 CST 2012
> -bash-4.1$ more ExampleJob.e22
> -bash-4.1$
>
> This is my basic server config:
>
> [root at wings ~]# qmgr
> Max open servers: 10239
> Qmgr: print server
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = admin.default.domain
> set server acl_hosts += wings.glerl.noaa.gov
> set server managers = root at wings.glerl.noaa.gov
> set server managers += salls at wings.glerl.noaa.gov
> set server operators = root at wings.glerl.noaa.gov
> set server operators += salls at wings.glerl.noaa.gov
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 23
>
> Processes running on server:
>
> root 32086 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
> root 32173 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque
>
>
> My sched_config file looks like this. I left the default values as is.
>
> [root at wings sched_priv]# more sched_config
>
>
> # This is the config file for the scheduling policy
> # FORMAT: option: value prime_option
> # option - the name of what we are changing defined in config.h
> # value - can be boolean/string/numeric depending on the option
> # prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS
>
> # Round Robin -
> # run a job from each queue before running second job from the
> # first queue.
>
> round_robin: False all
>
>
> # By Queue -
> # run jobs by queues.
> # If it is not set, the scheduler will look at all the jobs
> # on the server as one large queue, and ignore the queues set
> # by the administrator
> # PRIME OPTION
>
> by_queue: True prime
> by_queue: True non_prime
>
>
> # Strict Fifo -
> # run jobs in strict fifo order. If one job can not run
> # move onto the next queue and do not run any more jobs
> # out of that queue even if some jobs in the queue could
> # be run.
> # If it is not set, it could very easily starve the large
> # resource using jobs.
> # PRIME OPTION
>
> strict_fifo: false ALL
>
> #
> # fair_share - schedule jobs based on usage and share values
> # PRIME OPTION
> #
> fair_share: false ALL
>
> # Help Starving Jobs -
> # Jobs which have been waiting a long time will
> # be considered starving. Once a job is considered
> # starving, the scheduler will not run any jobs
> # until it can run all of the starving jobs.
> # PRIME OPTION
>
> help_starving_jobs true ALL
>
> #
> # sort_queues - sort queues by the priority attribute
> # PRIME OPTION
> #
> sort_queues true ALL
>
> #
> # load_balancing - load balance between timesharing nodes
> # PRIME OPTION
> #
> load_balancing: false ALL
>
> # sort_by:
> # key:
> # to sort the jobs on one key, specify it by sort_by
> # If multiple sorts are necessary, set sort_by to multi_sort
> # specify the keys in order of sorting
>
> # if round_robin or by_queue is set, the jobs will be sorted in their
> # respective queues. If not the entire server will be sorted.
>
> # different sorts - defined in globals.c
> # no_sort shortest_job_first longest_job_first smallest_memory_first
> # largest_memory_first high_priority_first low_priority_first multi_sort
> # fair_share large_walltime_first short_walltime_first
> #
> # PRIME OPTION
> sort_by: shortest_job_first ALL
>
> # filter out prolific debug messages
> # 256 are DEBUG2 messages
> # NO PRIME OPTION
> log_filter: 256
>
> # all queues starting with this value are dedicated time queues
> # i.e. dedtime or dedicatedtime would be dedtime queues
> # NO PRIME OPTION
> dedicated_prefix: ded
>
> # ignored queues
> # you can specify up to 16 queues to be ignored by the scheduler
> #ignore_queue: queue_name
>
> # this defines how long before a job is considered starving. If a job has
> # been queued for this long, it will be considered starving
> # NO PRIME OPTION
> max_starve: 24:00:00
>
> # The following three config values are meaningless with fair share turned off
>
> # half_life - the half life of usage for fair share
> # NO PRIME OPTION
> half_life: 24:00:00
>
> # unknown_shares - the number of shares for the "unknown" group
> # NO PRIME OPTION
> unknown_shares: 10
>
> # sync_time - the amount of time between syncing the usage information to disk
> # NO PRIME OPTION
> sync_time: 1:00:00
>
>
> Any idea what I need to do?
>
> Thanks,
>
> Christina
>
>
> --
> Christina A. Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446