TRANSCRIPT
Advanced TORQUE Administration
© Cluster Resources, Inc.
Nick Ihli, Sales Engineer
Josh Butikofer, Director of Grid Technologies
Scott Jackson, VP Software Engineering
TORQUE Resource Manager
• General Overview
• Routing Queues
• Job Arrays
• Node Health
• Handling Failures
• Checkpoint/Restart
• High Throughput
• Tuning for Scale
• New Capabilities
• On the Horizon
Node States
• States
– down (down)
– offline (drained)
– job-exclusive (busy)
– free (idle/running)
• Changing node state
– Offline
• pbsnodes -o <nodename>
– Online
• pbsnodes -c <nodename>
• Viewing nodes of a particular state
• pbsnodes -l
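Putting the commands above together, a drain-and-restore cycle might look like this (the node name node03 is hypothetical, and a running pbs_server is assumed):

```shell
# Drain node03 so no new jobs are scheduled on it (state becomes "offline")
pbsnodes -o node03

# List nodes that are currently down, offline, or otherwise unavailable
pbsnodes -l

# After maintenance, clear the offline state and return node03 to service
pbsnodes -c node03
```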
Example Job Script
#!/bin/sh
#PBS -N ds14FeedbackDefaults
#PBS -S /bin/sh
#PBS -l nodes=1:ppn=2,walltime=240:00:00
#PBS -M [email protected]
#PBS -m ae
source ~/.bashrc
cat $PBS_NODEFILE
echo $PBS_JOBID
Job Submission Options
Option Description
-d Working directory path to be used for the job
-e Path for standard error
-I Interactive job
-l Resources required by the job
-m Mail options
-N Job name
-o Path for standard output
-q Destination queue
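Several of these options can be combined on one command line; a hypothetical submission using the flags above might look like this (names, paths, and the mail address are examples only):

```shell
# Submit job.sh with a job name, resource request, destination queue,
# mail on abort/exit, and explicit stdout/stderr paths
qsub -N myjob \
     -l nodes=2:ppn=4,walltime=01:00:00 \
     -q batch \
     -m ae -M [email protected] \
     -o /home/user01/myjob.out -e /home/user01/myjob.err \
     job.sh
```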
Monitoring Jobs
• qstat
-f detailed job information
> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4807 scatter user01 12:56:34 R batch
Job Management
• qdel
-m sends a comment
$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4807             scatter          user01           12:56:34 R batch
...
$ qdel -m "hey! Stop abusing the NFS servers" 4807
$
• qalter
- modify job options
Routing Queues
qmgr -c "create queue route"
qmgr -c "set queue route queue_type=Route"
qmgr -c "set queue route route_destinations=short"
qmgr -c "set queue route route_destinations+=med"
qmgr -c "set queue route route_destinations+=long"
qmgr -c "set queue route started=true"
qmgr -c "set queue route enabled=true"

qmgr -c "set server default_queue=route"
qmgr -c "set server resources_default.ncpus=1"
qmgr -c "set server resources_default.walltime=12:00:00"

qmgr -c "create queue short"
qmgr -c "set queue short queue_type=execution"
qmgr -c "set queue short started=true"
qmgr -c "set queue short enabled=true"
qmgr -c "set queue short resources_max.walltime=1:00:00"
qmgr -c "set queue short priority=10000"

qmgr -c "create queue med"
qmgr -c "set queue med queue_type=execution"
qmgr -c "set queue med started=true"
qmgr -c "set queue med enabled=true"
qmgr -c "set queue med resources_min.walltime=1:00:00"
qmgr -c "set queue med resources_max.walltime=12:00:00"
qmgr -c "set queue med priority=1000"

qmgr -c "create queue long"
qmgr -c "set queue long queue_type=execution"
qmgr -c "set queue long started=true"
qmgr -c "set queue long enabled=true"
qmgr -c "set queue long resources_min.walltime=12:00:00"
qmgr -c "set queue long priority=1"
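After building the routing and execution queues, it is worth verifying the setup; one way to check, assuming a running pbs_server:

```shell
# Dump the full server and queue configuration for review
qmgr -c "print server"

# Summarize queue limits and states
qstat -q

# Submit without naming a queue; the job should land in "route" and be
# routed to short/med/long based on its walltime request
echo "sleep 30" | qsub -l walltime=0:30:00
```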
Job Arrays
• Creation of multiple jobs with one qsub command
• Reference entire set of jobs as one group
> qsub -t 0-100 job_script
1098.hostname

> qstat
1098-0.hostname ...
1098-1.hostname ...
1098-2.hostname ...
1098-3.hostname ...
1098-4.hostname ...
...
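Inside an array job, each sub-job can find its own index through an environment variable (PBS_ARRAYID in TORQUE releases of this era); a minimal illustrative script:

```shell
#!/bin/sh
# Hypothetical array job script: each sub-job processes one input file,
# selected by its array index
#PBS -N array-demo
#PBS -l nodes=1,walltime=0:10:00

echo "Sub-job $PBS_ARRAYID processing input.$PBS_ARRAYID"
```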
Node Health
Prologue/Epilogue Scripts
• Perform node health checks, clean up or prepare a system, etc.
• Must be available on all compute nodes
• Located in $PBS_HOME/mom_priv/
• Available arguments – on next slide
Prologue – Available Arguments
• job id
• job execution user name
• job execution group name
• job name
• list of requested resource limits
• job execution queue
• job account
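A minimal prologue sketch using the positional arguments listed above. The scratch-directory check and the exit-code handling are illustrative assumptions; consult the TORQUE documentation for the exact exit-code semantics on your version.

```shell
#!/bin/sh
# Prologue arguments (positional):
#   $1 = job id, $2 = execution user, $3 = execution group,
#   $4 = job name, $5 = resource limits, $6 = queue, $7 = account
jobid=$1
user=$2

# Illustrative health check: refuse the job if scratch space is missing
if [ ! -d /scratch ]; then
    echo "prologue: no /scratch for job $jobid (user $user)"
    exit 1    # nonzero exit prevents the job from starting
fi

exit 0        # 0 = node is healthy, run the job
```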
Epilogue – Available Arguments
• job id
• job execution user name
• job execution group name
• job name
• session id
• list of requested resource limits
• list of resources used by job
• job execution queue
• job account
Compute Node Health Check
• Configured via the pbs_mom config file using the parameters:
– $node_check_script
– $node_check_interval
• Example Health Check Script
#!/bin/sh
/bin/mount | grep global
if [ $? != "0" ]
then
  echo "ERROR cannot locate filesystem global"
fi
http://www.clusterresources.com/wiki/doku.php?id=torque:10.2_compute_node_health_check
Node Health Script and Moab
• Create triggers based on failures reported by the health script
• Trigger will perform an action
– Offline node, email admin, display message in Moab diagnostic commands
HA – High Availability
• Multiple server host machines
– One server locks the server.lock file
– Other server spins in a loop until lock clears
• pbs_server -ha
Handling Failures
Job Failures
• keep_completed option
– Specifies the number of seconds job information is kept after the job completes
– Set on the queue or server by qmgr
– If set on both levels, queue value is used
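For example, keeping finished jobs visible in qstat for five minutes (the value and the queue name are illustrative):

```shell
# Server-wide: retain completed job information for 300 seconds
qmgr -c "set server keep_completed=300"

# Or per queue; if both are set, the queue value wins
qmgr -c "set queue batch keep_completed=300"
```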
Job Failures – tracejob
• tracejob [-n <DAYS>] <JOBID>

05/28/2008 11:41:31 S enqueuing into route, state 1 hop 1
05/28/2008 11:41:31 S dequeuing from route, state QUEUED
05/28/2008 11:41:31 S enqueuing into long, state 1 hop 1
05/28/2008 11:41:31 S Job Queued at request of torque@mele, owner = torque@mele, job name = STDIN, queue = long
05/28/2008 11:41:31 A queue=route
05/28/2008 11:41:31 A queue=long
05/28/2008 11:42:20 S Job Modified at request of root@mele
05/28/2008 11:42:20 S Job Run at request of root@mele
05/28/2008 11:42:20 S Job Modified at request of root@mele
05/28/2008 11:42:20 A user=torque group=torque jobname=STDIN queue=long ctime=1211996491 qtime=1211996491 etime=1211996491 start=1211996540 owner=torque@mele exec_host=pala/1 Resource_List.ncpus=1 Resource_List.neednodes=pala Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.walltime=122:00:00
05/28/2008 11:44:00 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3104kb resources_used.vmem=11496kb resources_used.walltime=00:01:34
05/28/2008 11:44:00 A user=torque group=torque jobname=STDIN queue=long ctime=1211996491 qtime=1211996491 etime=1211996491 start=1211996540 owner=torque@mele exec_host=pala/1 Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.walltime=122:00:00 session=16069 end=1211996640 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3104kb resources_used.vmem=11496kb resources_used.walltime=00:01:34
05/28/2008 11:44:09 S Post job file processing error
Cleaning up
• qdel -p
– Purge jobs that cannot be properly deleted
• qrerun [-f]
– Requeue a specified job so it will be rerun; -f forces requeue even for completed jobs
Checkpoint/Restart
Berkeley Lab Checkpoint/Restart (BLCR)
• Kernel level package – no changes needed to application code
• Allows programs running on Linux to be "checkpointed"
— http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
TORQUE/BLCR Integration (Beta)
• BLCR must be installed into the kernel
• Provides 3 command line utilities:
– cr_run – runs a subprocess with checkpoint library loaded
– cr_checkpoint – causes a process, all processes within a process group, or all processes within a session, to be checkpointed
– cr_restart - restarts a process from a checkpoint file created with cr_checkpoint
TORQUE pbs_mom configuration
• checkpoint_interval
– How often periodic job checkpoints will be taken (minutes)
• checkpoint_script
– Path to BLCR checkpoint script
• restart_script
– Path to BLCR restart script
• checkpoint_run_exe
– Path to cr_run
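A sketch of the corresponding entries in $PBS_HOME/mom_priv/config. The parameter names come from the list above; the paths are illustrative, and the leading $ follows the convention used for other mom config options such as $node_check_script:

```shell
# mom_priv/config fragment for BLCR checkpointing (example paths)
$checkpoint_interval 30
$checkpoint_script   /usr/local/sbin/mom_checkpoint_script
$restart_script      /usr/local/sbin/mom_restart_script
$checkpoint_run_exe  /usr/local/bin/cr_run
```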
Starting a Checkpointable Job
• Use -c and other arguments to control checkpointing behavior
– enabled
• Checkpointing allowed but must be explicitly invoked by either qhold or qchkpt
– shutdown
• Checkpointing at pbs_mom shutdown
– periodic
• Enable periodic checkpointing
– interval=minutes
• Checkpoint interval in minutes
– depth=number
• Number of checkpoint images to be kept
– dir=path
• Checkpoint directory (default is /var/spool/torque/checkpoint)
Checkpointing and Restarting
• qsub -c [argument]
• qhold or qchkpt
• qrls
• Checkpoint and restart through Moab preemption policies
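A hedged end-to-end sequence, assuming the BLCR integration above is installed (the job id 1100 is hypothetical):

```shell
# Submit a job with checkpointing enabled (explicit checkpoints only)
qsub -c enabled job_script        # suppose this returns job id 1100

# Checkpoint and hold the running job...
qhold 1100
# ...or checkpoint it while leaving it running
qchkpt 1100

# Release the held job; it restarts from its checkpoint file
qrls 1100
```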
High Throughput
Asynchronous Job Start
• qrun -a
– Reply from pbs_server returns immediately
– Reply returns before node assignments
– Reply returns before job is started on pbs_mom
No “neednodes”
• A typical job submission goes through submit, start, and modify steps
• No “neednodes” removes the modify step
• EnableMsubQuickSubmit
Disable Authentication
• Faster submission as user authentication is turned off
• More practical in data centers and highly trusted environments
• Edit src/include/libpbs.h: #define ENABLE_TRUSTED_AUTH TRUE
• Save, recompile, reinstall pbs_server
Unix Domain Sockets
• ./configure --enable-unixsockets
– Enables the use of Unix domain sockets instead of internet sockets
– Faster communication for messages within the same machine
– Is now the default TORQUE behaviour – as of 2.3.0
Autorun (Experimental)
• TORQUE server finds first available node
• Runs job - bypasses the scheduler
• If failure happens, scheduler takes over the job
Tuning for Scale
• tcp_timeout
– Default=6
• >300 nodes - build TORQUE using TCP rather than the default of RPP
– --disable-rpp
• End user command caching
– Reduce server load caused by heavy user client command usage
Tuning for Scale Cont’d
• job_stat_rate
• poll_jobs
• pbs_tcp_timeout
• Moab specific
– JOBAGGREGATIONTIME
– RM TIMEOUT
• --disable-filesync
• Network ARP cache
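The Moab-side knobs above would go in moab.cfg; a sketch, with illustrative values, and the resource-manager name "torque" is an assumption:

```shell
# moab.cfg fragment: batch client traffic and widen the RM timeout
JOBAGGREGATIONTIME  00:00:04
RMCFG[torque]       TIMEOUT=60
```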
New Capabilities
• Cpusets
On the Horizon
• Scheduler synch
• Ensure a stable branch
• Improved Documentation
• Fulltime and Community Developers