TRANSCRIPT
Advanced TORQUE Administration
© Cluster Resources, Inc.
Nick Ihli, Sales Engineer
Josh Butikofer, Director of Grid Technologies
Scott Jackson, VP Software Engineering
TORQUE Resource Manager
• General Overview
• Routing Queues
• Job Arrays
• Node Health
• Handling Failures
• Checkpoint/Restart
• High Throughput
• Tuning for Scale
• New Capabilities
• On the Horizon
Node States
• States
– down (down)
– offline (drained)
– job-exclusive (busy)
– free (idle/running)
• Changing node state
– Offline
• pbsnodes -o <nodename>
– Online
• pbsnodes -c <nodename>
• Viewing nodes of a particular state
• pbsnodes -l
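Putting the commands above together, a drain-and-restore cycle might look like this (the node name node03 is hypothetical, and a running pbs_server is assumed):

```shell
# Drain node03 so no new jobs are scheduled on it (state becomes "offline")
pbsnodes -o node03

# List nodes that are currently down, offline, or otherwise unavailable
pbsnodes -l

# After maintenance, clear the offline state and return node03 to service
pbsnodes -c node03
```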
Example Job Script
#!/bin/sh
#PBS -N ds14FeedbackDefaults
#PBS -S /bin/sh
#PBS -l nodes=1:ppn=2,walltime=240:00:00
#PBS -M [email protected]
#PBS -m ae
source ~/.bashrc
cat $PBS_NODEFILE
echo $PBS_JOBID
Job Submission Options
Option Description
-d Working directory path to be used for the job
-e Path for standard error
-I Interactive job
-l Resources required by the job
-m Mail options
-N Job name
-o Path for standard output
-q Destination queue
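Several of these options can be combined on one command line; a hypothetical submission using the flags above might look like this (names, paths, and the mail address are examples only):

```shell
# Submit job.sh with a job name, resource request, destination queue,
# mail on abort/exit, and explicit stdout/stderr paths
qsub -N myjob \
     -l nodes=2:ppn=4,walltime=01:00:00 \
     -q batch \
     -m ae -M [email protected] \
     -o /home/user01/myjob.out -e /home/user01/myjob.err \
     job.sh
```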
Monitoring Jobs
• qstat
-f detailed job information
> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4807 scatter user01 12:56:34 R batch
Job Management
• qdel
-m sends a comment
$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
4807             scatter          user01           12:56:34 R batch
...
$ qdel -m "hey! Stop abusing the NFS servers" 4807
$
• qalter
- modify job options
Routing Queues
qmgr -c "create queue route"
qmgr -c "set queue route queue_type=Route"
qmgr -c "set queue route route_destinations=short"
qmgr -c "set queue route route_destinations+=med"
qmgr -c "set queue route route_destinations+=long"
qmgr -c "set queue route started=true"
qmgr -c "set queue route enabled=true"

qmgr -c "set server default_queue=route"
qmgr -c "set server resources_default.ncpus=1"
qmgr -c "set server resources_default.walltime=12:00:00"

qmgr -c "create queue short"
qmgr -c "set queue short queue_type=execution"
qmgr -c "set queue short started=true"
qmgr -c "set queue short enabled=true"
qmgr -c "set queue short resources_max.walltime=1:00:00"
qmgr -c "set queue short priority=10000"

qmgr -c "create queue med"
qmgr -c "set queue med queue_type=execution"
qmgr -c "set queue med started=true"
qmgr -c "set queue med enabled=true"
qmgr -c "set queue med resources_min.walltime=1:00:00"
qmgr -c "set queue med resources_max.walltime=12:00:00"
qmgr -c "set queue med priority=1000"

qmgr -c "create queue long"
qmgr -c "set queue long queue_type=execution"
qmgr -c "set queue long started=true"
qmgr -c "set queue long enabled=true"
qmgr -c "set queue long resources_min.walltime=12:00:00"
qmgr -c "set queue long priority=1"
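After building the routing and execution queues, it is worth verifying the setup; one way to check, assuming a running pbs_server:

```shell
# Dump the full server and queue configuration for review
qmgr -c "print server"

# Summarize queue limits and states
qstat -q

# Submit without naming a queue; the job should land in "route" and be
# routed to short/med/long based on its walltime request
echo "sleep 30" | qsub -l walltime=0:30:00
```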
Job Arrays
• Creation of multiple jobs with one qsub command
• Reference entire set of jobs as one group
> qsub -t 0-100 job_script
1098.hostname

> qstat
1098-0.hostname ...
1098-1.hostname ...
1098-2.hostname ...
1098-3.hostname ...
1098-4.hostname ...
...
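Inside an array job, each sub-job can find its own index through an environment variable (PBS_ARRAYID in TORQUE releases of this era); a minimal illustrative script:

```shell
#!/bin/sh
# Hypothetical array job script: each sub-job processes one input file,
# selected by its array index
#PBS -N array-demo
#PBS -l nodes=1,walltime=0:10:00

echo "Sub-job $PBS_ARRAYID processing input.$PBS_ARRAYID"
```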
Node Health
Prologue/Epilogue Scripts
• Perform node health checks, clean up or prepare a system, etc.
• Must be available on all compute nodes
• Located in $PBS_HOME/mom_priv/
• Available arguments – on next slide
Prologue – Available Arguments
• job id
• job execution user name
• job execution group name
• job name
• list of requested resource limits
• job execution queue
• job account
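A minimal prologue sketch using the positional arguments listed above. The scratch-directory check and the exit-code handling are illustrative assumptions; consult the TORQUE documentation for the exact exit-code semantics on your version.

```shell
#!/bin/sh
# Prologue arguments (positional):
#   $1 = job id, $2 = execution user, $3 = execution group,
#   $4 = job name, $5 = resource limits, $6 = queue, $7 = account
jobid=$1
user=$2

# Illustrative health check: refuse the job if scratch space is missing
if [ ! -d /scratch ]; then
    echo "prologue: no /scratch for job $jobid (user $user)"
    exit 1    # nonzero exit prevents the job from starting
fi

exit 0        # 0 = node is healthy, run the job
```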
Epilogue – Available Arguments
• job id
• job execution user name
• job execution group name
• job name
• session id
• list of requested resource limits
• list of resources used by job
• job execution queue
• job account
Compute Node Health Check
• Configured via the pbs_mom config file using the parameters:
– $node_check_script
– $node_check_interval
• Example Health Check Script
#!/bin/sh
/bin/mount | grep global
if [ $? != "0" ]
then
  echo "ERROR cannot locate filesystem global"
fi
http://www.clusterresources.com/wiki/doku.php?id=torque:10.2_compute_node_health_check
Node Health Script and Moab
• Create triggers based on failures reported by the health script
• Trigger will perform an action
– Offline node, email admin, display message in Moab diagnostic commands
HA – High Availability
• Multiple server host machines
– One server locks the server.lock file
– Other server spins in a loop until lock clears
• pbs_server -ha
Handling Failures
Job Failures
• keep_completed option
– Specifies the number of seconds job information is kept after the job completes
– Set on the queue or server by qmgr
– If set on both levels, queue value is used
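For example, keeping finished jobs visible in qstat for five minutes (the value and the queue name are illustrative):

```shell
# Server-wide: retain completed job information for 300 seconds
qmgr -c "set server keep_completed=300"

# Or per queue; if both are set, the queue value wins
qmgr -c "set queue batch keep_completed=300"
```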
Job Failures – tracejob
• tracejob [-n <DAYS>] <JOBID>

05/28/2008 11:41:31 S enqueuing into route, state 1 hop 1
05/28/2008 11:41:31 S dequeuing from route, state QUEUED
05/28/2008 11:41:31 S enqueuing into long, state 1 hop 1
05/28/2008 11:41:31 S Job Queued at request of torque@mele, owner = torque@mele, job name = STDIN, queue = long
05/28/2008 11:41:31 A queue=route
05/28/2008 11:41:31 A queue=long
05/28/2008 11:42:20 S Job Modified at request of root@mele
05/28/2008 11:42:20 S Job Run at request of root@mele
05/28/2008 11:42:20 S Job Modified at request of root@mele
05/28/2008 11:42:20 A user=torque group=torque jobname=STDIN queue=long ctime=1211996491 qtime=1211996491 etime=1211996491 start=1211996540 owner=torque@mele exec_host=pala/1 Resource_List.ncpus=1 Resource_List.neednodes=pala Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.walltime=122:00:00
05/28/2008 11:44:00 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3104kb resources_used.vmem=11496kb resources_used.walltime=00:01:34
05/28/2008 11:44:00 A user=torque group=torque jobname=STDIN queue=long ctime=1211996491 qtime=1211996491 etime=1211996491 start=1211996540 owner=torque@mele exec_host=pala/1 Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.walltime=122:00:00 session=16069 end=1211996640 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3104kb resources_used.vmem=11496kb resources_used.walltime=00:01:34
05/28/2008 11:44:09 S Post job file processing error
Cleaning up
• qdel -p
– Purge jobs that cannot be properly deleted
• qrerun [-f]
– Requeue a specified job so it will be rerun; -f forces requeue even for completed jobs
Checkpoint/Restart
Berkeley Lab Checkpoint/Restart (BLCR)
• Kernel level package – no changes needed to application code
• Allows programs running on Linux to be "checkpointed"
— http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
TORQUE/BLCR Integration (Beta)
• BLCR must be installed into the kernel
• Provides 3 command line utilities:
– cr_run – runs a subprocess with checkpoint library loaded
– cr_checkpoint – causes a process, all processes within a process group, or all processes within a session, to be checkpointed
– cr_restart - restarts a process from a checkpoint file created with cr_checkpoint
TORQUE pbs_mom configuration
• checkpoint_interval
– How often periodic job checkpoints will be taken (minutes)
• checkpoint_script
– Path to BLCR checkpoint script
• restart_script
– Path to BLCR restart script
• checkpoint_run_exe
– Path to cr_run
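A sketch of the corresponding entries in $PBS_HOME/mom_priv/config. The parameter names come from the list above; the paths are illustrative, and the leading $ follows the convention used for other mom config options such as $node_check_script:

```shell
# mom_priv/config fragment for BLCR checkpointing (example paths)
$checkpoint_interval 30
$checkpoint_script   /usr/local/sbin/mom_checkpoint_script
$restart_script      /usr/local/sbin/mom_restart_script
$checkpoint_run_exe  /usr/local/bin/cr_run
```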
Starting a Checkpointable Job
• Use -c and other arguments to control checkpointing behavior
– enabled
• Checkpointing allowed but must be explicitly invoked by either qhold or qchkpt
– shutdown
• Checkpointing at pbs_mom shutdown
– periodic
• Enable periodic checkpointing
– interval=minutes
• Checkpoint interval in minutes
– depth=number
• Number of checkpoint images to be kept
– dir=path
• Checkpoint directory (default is /var/spool/torque/checkpoint)
Checkpointing and Restarting
• qsub -c [argument]
• qhold or qchkpt
• qrls
• Checkpoint and restart through Moab preemption policies
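A hedged end-to-end sequence, assuming the BLCR integration above is installed (the job id 1100 is hypothetical):

```shell
# Submit a job with checkpointing enabled (explicit checkpoints only)
qsub -c enabled job_script        # suppose this returns job id 1100

# Checkpoint and hold the running job...
qhold 1100
# ...or checkpoint it while leaving it running
qchkpt 1100

# Release the held job; it restarts from its checkpoint file
qrls 1100
```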
High Throughput
Asynchronous Job Start
• qrun -a
– Reply from pbs_server returns immediately
– Reply returns before node assignments
– Reply returns before job is started on pbs_mom
No “neednodes”
• A typical job submission goes through submit, start, and modify steps
• No “neednodes” removes the modify step
• EnableMsubQuickSubmit
Disable Authentication
• Faster submission as user authentication is turned off
• More practical in data centers and highly trusted environments
• Edit src/include/libpbs.h: #define ENABLE_TRUSTED_AUTH TRUE
• Save, recompile, reinstall pbs_server
Unix Domain Sockets
• ./configure --enable-unixsockets
– Enables the use of Unix domain sockets instead of internet sockets
– Faster communication for messages within the same machine
– Is now the default TORQUE behaviour – as of 2.3.0
Autorun (Experimental)
• TORQUE server finds first available node
• Runs job - bypasses the scheduler
• If failure happens, scheduler takes over the job
Tuning for Scale
• tcp_timeout
– Default=6
• >300 nodes - build TORQUE using TCP rather than the default of RPP
– --disable-rpp
• End user command caching
– Reduce server load caused by heavy user client command usage
Tuning for Scale Cont’d
• job_stat_rate
• poll_jobs
• pbs_tcp_timeout
• Moab specific
– JOBAGGREGATIONTIME
– RM TIMEOUT
• --disable-filesync
• Network ARP cache
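The Moab-side knobs above would go in moab.cfg; a sketch, with illustrative values, and the resource-manager name "torque" is an assumption:

```shell
# moab.cfg fragment: batch client traffic and widen the RM timeout
JOBAGGREGATIONTIME  00:00:04
RMCFG[torque]       TIMEOUT=60
```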
New Capabilities
• Cpusets
On the Horizon
• Scheduler synch
• Ensure a stable branch
• Improved Documentation
• Fulltime and Community Developers