10.1 Troubleshooting

There are a few general strategies that can be followed to diagnose unexpected behavior. The tools described below help determine where problems occur.
10.1.1 Host Resolution

The TORQUE server host must be able to perform both forward and reverse name lookup on itself and on all compute nodes. Likewise, each compute node must be able to perform both forward and reverse name lookup on itself, the TORQUE server host, and all other compute nodes. In many cases, name resolution is handled by configuring the node's /etc/hosts file, although DNS and NIS services may also be used. Commands such as nslookup or dig can be used to verify proper host resolution.

NOTE: Invalid host resolution may show up as compute nodes reported as down in the output of pbsnodes -a and as failure of the momctl -d 3 command.

10.1.2 Firewall Configuration

If firewalls are running on the server or node machines, be sure to allow connections on the appropriate ports for each machine. TORQUE pbs_mom daemons use UDP port 1023 and the pbs_server/pbs_mom daemons use ports 15001-15004 by default.

Firewall-based issues are often associated with server-to-mom communication failures and messages such as 'premature end of message' in the log files.

10.1.3 TORQUE Log Files

The pbs_server keeps a daily log of all activity in the "<TORQUE_HOME_DIR>/server_logs/" directory. The pbs_mom also keeps a daily log of all activity in the "<TORQUE_HOME_DIR>/mom_logs/" directory. These logs contain information on communication between server and mom as well as information on jobs as they enter the queue and as they are dispatched, run, and terminated. These logs can be very helpful in diagnosing general job failures.

For mom logs, the verbosity of the logging can be adjusted by setting the loglevel parameter in the mom_priv/config file. For server logs, the verbosity of the logging can be adjusted by setting the server log_level attribute in qmgr.

For both pbs_mom and pbs_server daemons, the log verbosity level can also be adjusted by setting the environment variable PBSLOGLEVEL to a value between 0 and 7.
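The verbosity settings described above can be sketched as follows. This is only an illustration: the level passed in is arbitrary, the helper function names are hypothetical, and the exact spelling of the qmgr command and of the mom_priv/config directive is an assumption based on common TORQUE usage that should be checked against the installed version.

```shell
# Sketch only: helper names are hypothetical, levels are example values.

# Server logs: set the log_level server attribute through qmgr.
set_server_loglevel() {
    qmgr -c "set server log_level = $1"
}

# Either daemon: put PBSLOGLEVEL (0-7) in the environment at startup.
start_mom_with_loglevel() {
    PBSLOGLEVEL="$1" pbs_mom
}

# MOM logs: the persistent equivalent is assumed to be a line such as
#   $loglevel 3
# in <TORQUE_HOME_DIR>/mom_priv/config.
```

For example, set_server_loglevel 3 would ask a running pbs_server to log at level 3.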
Further, to dynamically change the log level of a running daemon, use the SIGUSR1 and SIGUSR2 signals to increase and decrease the active loglevel by one. Signals are sent to a process using the kill command. For example, kill -USR1 `pgrep pbs_mom` would raise the log level by one. The current loglevel for pbs_mom can be displayed with the command momctl -d3.

10.1.4 Using tracejob to Locate Job Failures

Overview

The tracejob utility extracts job status and job events from accounting records, mom log files, server log files, and scheduler log files. Using it can help identify where, how, and why a job failed. This tool takes a job id as a parameter as well as arguments to specify which logs to search, how far into the past to search, and other conditions.

Syntax

tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [-n <DAYS>] [-f filter_type] <JOBID>
-p : path to PBS_SERVER_HOME
-w : number of columns of your terminal
-n : number of days in the past to look for job(s) [default 1]
-f : filter out types of log entries, multiple -f's can be specified
error, system, admin, job, job_usage, security, sched, debug,
debug2, or absolute numeric hex equivalent
-z : toggle filtering excessive messages
-c : what message count is considered excessive
-a : don't use accounting log files
-s : don't use server log files
-l : don't use scheduler log files
-m : don't use mom log files
-q : quiet mode - hide all error messages
-v : verbose mode - show more error messages
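As an illustration of the options listed above, the sketch below searches a chosen number of days of logs for one job while skipping the scheduler logs. The wrapper name trace_recent_job and the job id 1234 are hypothetical; only the -l and -n flags from the list above are used.

```shell
# Sketch only: trace_recent_job is a hypothetical wrapper around tracejob.
# $1 = job id, $2 = number of days to search (tracejob's own default is 1).
trace_recent_job() {
    jobid=$1
    days=${2:-1}
    # -l skips scheduler log files, -n sets how many days back to look
    tracejob -l -n "$days" "$jobid"
}

# Usage with a hypothetical job id:
#   trace_recent_job 1234 3
```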
Example

NOTE: The tracejob command operates by searching the pbs_server accounting records and the pbs_server, mom, and scheduler logs. To function properly, it must be run on a node and as a user that can access these files. By default, these files are all accessible by the user root and only available on the cluster management node. In particular, the files required by tracejob are located in the following directories:
tracejob may only be used on systems where these files are made available. Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.

10.1.5 Using GDB to Locate Failures

If either the pbs_mom or pbs_server fails unexpectedly (and the log files contain no information on the failure), gdb can be used to determine whether or not the program is crashing. To start pbs_mom or pbs_server under GDB, export the environment variable PBSDEBUG=yes and start the program (e.g., gdb pbs_mom, then issue the run subcommand at the gdb prompt). GDB may run for some time until a failure occurs, at which point a message will be printed to the screen and a gdb prompt will again be made available. If this occurs, use the gdb where subcommand to determine the exact location in the code. The information provided may be adequate to allow local diagnosis and correction. If not, this output may be sent to the mailing list or to help for further assistance. (For more information on submitting bugs or requests for help, please see the Mailing List Instructions.)

NOTE: See the PBSCOREDUMP parameter for enabling creation of core files.

10.1.6 Other Diagnostic Options
Some hard problems in TORQUE involve the amount of time spent in routines. For example, one currently open problem appears to be caused by the design of the code in linux/mom_mach.c, where the statistics are gathered for the node status. It appears that the /proc filesystem, which contains information about the kernel and the processes, is being accessed so often on some machines that the responses to some other message traffic are affected. The machine where this is happening has 128 processors.

To debug these kinds of problems, it can be useful to see where in the code time is being spent. This is called profiling, and the Linux utility gprof will output a listing of routines and the amount of time spent in each. This does require that the code be compiled with special options to instrument it and to produce a file, gmon.out, that is written at the end of program execution. The following listing shows how to build TORQUE with profiling enabled. Notice that the output file for pbs_mom will end up in the mom_priv directory because its startup code changes the default directory to this location.

Another way to see where a program is spending most of its time is with the valgrind program. The advantage of using valgrind is that the programs do not have to be specially compiled.

10.1.7 Frequently Asked Questions (FAQ)
Cannot connect to server: error=15034

This error occurs in TORQUE clients (or their APIs) because TORQUE cannot find the server_name file and/or the PBS_DEFAULT environment variable is not set. The server_name file or the PBS_DEFAULT variable indicates the pbs_server hostname that the client tools should communicate with. The server_name file is usually located in TORQUE's local state directory. Make sure the file exists, has proper permissions, and that the version of TORQUE you are running was built with the proper directory settings. Alternatively, you can set the PBS_DEFAULT environment variable. Restart TORQUE daemons if you make changes to these settings.
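The check described above can be sketched as a small shell function that reports which server the client tools would be pointed at. The function name find_pbs_server is hypothetical, the default directory /var/spool/torque is an assumption about where the local state directory lives, and the precedence of PBS_DEFAULT over the server_name file is likewise assumed here, not taken from this section.

```shell
# Sketch only: find_pbs_server is a hypothetical helper.
# $1 = TORQUE local state directory (assumed default: /var/spool/torque).
find_pbs_server() {
    home=${1:-/var/spool/torque}
    if [ -n "${PBS_DEFAULT:-}" ]; then
        # The environment variable is set, so report it
        echo "PBS_DEFAULT=$PBS_DEFAULT"
    elif [ -r "$home/server_name" ]; then
        # The file exists and is readable by the current user
        echo "server_name=$(cat "$home/server_name")"
    else
        echo "no server configured: set PBS_DEFAULT or fix $home/server_name" >&2
        return 1
    fi
}
```

Running find_pbs_server as the affected user is a quick way to tell a missing file apart from a permissions problem, since the readability test is made with that user's privileges.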
Deleting 'Stuck' Jobs

To manually delete a stale job which has no process, and for which the mother superior is still alive, send a sig 0 with qsig; this will often cause MOM to realize the job is stale and issue the proper JobObit notice. Failing that, use momctl -c to forcefully cause MOM to purge the job. The following process should never be necessary:
If the mother superior mom has been lost and cannot be recovered (e.g., because of hardware or disk failure), a job running on that node can be purged from the output of qstat using the qdel -p command or can be removed manually using the following steps:

To remove job X:
Which user must run TORQUE?

TORQUE (pbs_server & pbs_mom) must be started by a user with root privileges.
Scheduler cannot run jobs - rc: 15003

For a scheduler, such as Moab or Maui, to control jobs with TORQUE, the scheduler needs to be run by a user in the server operators / managers list (see qmgr (set server operators / managers)). The default for the server operators / managers list is root@localhost. For TORQUE to be used in a grid setting with Silver, the scheduler needs to be run as root.
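Adding the scheduler's account to both lists might look like the sketch below. The function name allow_scheduler_user and the account maui@headnode are hypothetical, and the += form of the qmgr set command is an assumption based on common qmgr usage.

```shell
# Sketch only: allow_scheduler_user is a hypothetical helper.
# $1 = user@host of the account the scheduler runs as.
allow_scheduler_user() {
    # += appends to the existing list instead of replacing it
    qmgr -c "set server managers += $1"
    qmgr -c "set server operators += $1"
}

# Usage with a hypothetical account:
#   allow_scheduler_user maui@headnode
```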
PBS_Server: pbsd_init, Unable to read server database

If this message is displayed upon starting pbs_server, it means that the local database cannot be read. This can happen for several reasons. The most likely is a version mismatch. Most versions of TORQUE can read each other's databases. However, there are a few incompatibilities between OpenPBS and TORQUE. Because of enhancements to TORQUE, it cannot read the job database of an OpenPBS server (job structure sizes have been altered to increase functionality).

To reconstruct a database (excluding the job database), first print out the old data with this command:
Copy this information somewhere. Restart pbs_server with the following command:
When it prompts to overwrite the previous database, enter 'y', then enter the data exported by the qmgr command with a command similar to the following:
Restart pbs_server without the flags:
This will reinitialize the database to the current version. Note that reinitializing the server database will reset the next jobid to 1.
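The reconstruction steps above can be sketched as a single shell function. The individual commands are not shown in this section, so the ones used here (qmgr -c "print server", pbs_server -t create, qterm) are assumptions based on common TORQUE usage; the function name and the dump file path are hypothetical.

```shell
# Sketch only: rebuild_server_db is a hypothetical helper.
# $1 = scratch file for the exported settings (hypothetical default path).
rebuild_server_db() {
    dump=${1:-/tmp/pbs_server_settings.txt}
    # 1. Print out the old data and keep a copy
    qmgr -c "print server" > "$dump" || return 1
    # 2. Stop any running server, then restart it with a fresh database;
    #    pbs_server asks for confirmation here, answer 'y'
    qterm 2>/dev/null
    pbs_server -t create
    # 3. Re-enter the exported data
    qmgr < "$dump"
    # 4. Restart pbs_server without the flags
    qterm
    pbs_server
}
```

Because the create step prompts for confirmation, the function is meant to be run interactively, not from an unattended script.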
To get around this issue, the server can be told it has an inflated number of nodes using the resources_available attribute. To take effect, this attribute should be set on both the server and the associated queue as in the example below. See resources_available for more information.
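A minimal sketch of setting the attribute on both the server and a queue follows. The nodect sub-resource name is an assumption drawn from common TORQUE usage and does not appear in this section; the function name, the count 2048, and the queue name batch are hypothetical.

```shell
# Sketch only: inflate_node_count is a hypothetical helper.
# $1 = node count to advertise, $2 = name of the associated queue.
inflate_node_count() {
    count=$1
    queue=$2
    # Set the same value on the server and on the queue
    qmgr -c "set server resources_available.nodect = $count"
    qmgr -c "set queue $queue resources_available.nodect = $count"
}

# Usage with hypothetical values:
#   inflate_node_count 2048 batch
```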
NOTE: The pbs_server daemon will need to be restarted before these changes will take effect.
Job submission hosts must be explicitly specified within TORQUE or enabled via RCmd security mechanisms in order to be trusted. In the example above, the host 'login2' is not configured to be trusted. This configuration is documented in Configuring Job Submission Hosts.
Also verify the following on all machines:
If using a scheduler such as Moab or Maui, use a scheduler tool such as checkjob to identify job start issues.
The problem is that this setup allows users to bypass the batch
system by writing a job script that uses rsh/ssh to launch processes on the
batch nodes. If there are relatively few users and they can more or less be
trusted, this setup can work.
© 2001-2008 Cluster Resources, Incorporated