
  • Unix Clusters

    Guntis Barzdins, Girts Folkmanis, Leo Truksans

  • Homework #3. Task 1 (max 8 points): A. On a UNIX base, set up an environment for compiling and running MPI parallel programs (see the lecture slides; on MacOS X, note that MPI is partly built in). B. Write a small MPI program that performs CPU-intensive computations. For example, write a program that finds an ASCII text string with ...
  • Moore's Law - Density

  • Moore's Law and Performance
    The performance of computers is determined by architecture and clock speed.
    Clock speed doubles over a 3-year period due to the scaling laws on chip.
    Processors using identical or similar architectures gain performance directly as a function of Moore's Law.
    Improvements in internal architecture can yield gains beyond those of Moore's Law alone.

  • Moore's Law Data

  • Future of Moore's Law
    Short-term (1-5 years): will operate (prototypes already exist in the lab); fabrication cost will go up rapidly.

    Medium-term (5-15 years): exponential growth rate will likely slow; a trillion-dollar industry is motivated to keep it going.

    Long-term (>15 years): may need new technology (chemical or quantum); we can do better (e.g., the human brain); "I would not close the patent office."

  • Different Kinds of Clusters
    High Performance Computing (HPC) cluster
    Load Balancing (LB) cluster
    High Availability (HA) cluster

  • High Performance Computing (HPC) Cluster (Beowulf, Grid)
    Started in 1994, when Donald Becker of NASA assembled the world's first cluster from 16 DX4 PCs and 10 Mbit/s Ethernet.
    Also called a Beowulf cluster.
    Built from commodity off-the-shelf hardware.
    Applications: data mining, simulations, parallel processing, weather modelling, computer graphics rendering, etc.

  • Software
    Building clusters is straightforward, but managing their software can be complex.
    OSCAR (Open Source Cluster Application Resources)
    Scyld (scyld.com), from the scientists of NASA - commercial "true Beowulf in a box" Beowulf operating system
    Rocks
    OpenSCE
    Warewulf
    Clustermatic
    Condor -- grid
    UNICORE -- grid
    gLite -- grid

  • Examples of Beowulf Clusters
    Scyld Cluster O.S. by Donald Becker - http://www.scyld.com
    ROCKS from NPACI - http://www.rocksclusters.org
    OSCAR from the Open Cluster Group - http://oscar.sourceforge.net
    OpenSCE from Thailand - http://www.opensce.org

  • Cluster Sizing Rule of Thumb
    System software (Linux, MPI, filesystems, etc.) scales from 64 nodes to at most about 2048 nodes for most HPC applications, limited by:
    Max socket connections
    Direct-access message tag lists & buffers
    NFS / storage system clients
    Debugging
    Etc.
    It is probably hard to rewrite MPI and all Linux system software for O(100,000)-node clusters.

  • HPC Clusters

  • OSCAR
    http://oscar.openclustergroup.org/
    "Cluster on a CD" - automates the cluster install process
    Wizard driven
    Can be installed on any Linux system supporting RPMs
    Components: open source and BSD-style license

  • Rocks
    Award-winning open-source high-performance Linux cluster solution.
    The current release of NPACI Rocks is 3.3.0.
    Rocks is built on top of Red Hat Linux releases.
    Two types of nodes:
    Frontend - two Ethernet interfaces, lots of disk space
    Compute - disk drive for caching the base operating environment (OS and libraries)
    Rocks uses an SQL database for saving global variables.

  • Rocks physical structure

  • Rocks Frontend Installation
    3 CDs: Rocks Base CD, HPC Roll CD and Kernel Roll CD
    Bootable Base CD
    User-friendly wizard-mode installation: cluster information, local hardware, both Ethernet interfaces

  • Rocks frontend installation

  • Rocks Compute Nodes
    Installation: log in to the frontend as root and run insert-ethers.
    insert-ethers captures compute node DHCP requests and puts their information into the Rocks MySQL database.

  • Rocks Compute Nodes
    Use the install CD to boot the compute nodes.
    insert-ethers also authenticates compute nodes. When insert-ethers is running, the frontend will accept new nodes into the cluster based on their presence on the local network. insert-ethers must continue to run until a new node requests its kickstart file, and will ask you to wait until that event occurs. When insert-ethers is off, no unknown node may obtain a kickstart file.

  • Rocks Compute Nodes
    Monitor the installation; when finished, do the next node.
    Nodes are divided into cabinets.
    It is possible to install compute nodes of a different architecture than the frontend.

  • Rocks Computing
    mpirun on Rocks clusters is used to launch jobs that are linked with the Ethernet device for MPICH.
    mpirun is a shell script that attempts to hide from the user the differences in starting jobs for various devices. On workstation clusters, you must supply a file that lists the different machines that mpirun can use to run remote jobs.
    MPICH is an implementation of MPI, the standard for message-passing libraries.

  • Rocks Computing Example
    High-Performance Linpack (HPL): a software package that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory computers.
    Launch HPL on two processors:
    Create a file in your home directory named machines, and put two entries in it, such as:
    compute-0-0
    compute-0-1
    Download the two-processor HPL configuration file and save it as HPL.dat in your home directory. Now launch the job from the frontend:
    $ /opt/mpich/gnu/bin/mpirun -nolocal -np 2 -machinefile machines /opt/hpl/gnu/bin/xhpl

  • Rocks cluster-fork
    Runs the same standard Unix commands on different nodes.
    By default, cluster-fork uses a simple series of ssh connections to launch the task serially on every compute node in the cluster.
    My processes on all nodes:
    $ cluster-fork ps -U$USER
    Hostnames of all nodes:
    $ cluster-fork hostname

  • Rocks cluster-fork Again
    Often you wish to name the nodes your job is started on:
    $ cluster-fork --query="select name from nodes where name like 'compute-1-%'" [cmd]
    Or use --nodes=compute-0-%d:0-2

  • Monitoring Rocks
    A set of web pages to monitor activities and configuration.
    Apache web server, with access from the internal network only.
    From outside, viewing the web pages involves sending a web browser screen over a secure, encrypted SSH channel: ssh to the frontend, start Mozilla there (mozilla --no-remote), and open http://localhost.
    Access from the public network (not recommended): modify IPtables.

  • Rocks Monitoring Pages
    Should look like this:

  • More Monitoring Through the Web
    Access through web pages includes:
    PHPMyAdmin for the SQL server
    TOP command for the cluster
    Graphical monitoring of the cluster
    Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.

  • Ganglia looks like this:

  • Default Services
    411 Secure Information Service - distributes files (password changes, login files) to nodes; run from cron every hour
    DNS for local communication
    Postfix mail software

  • MPI
    MPI is a software system that allows you to write message-passing parallel programs that run on a cluster, in Fortran and C.
    MPI (Message Passing Interface) is a de facto standard for portable message-passing parallel programs, standardized by the MPI Forum and available on all massively parallel supercomputers.

  • Parallel Programming
    Mind that memory is distributed - each node has its own memory space.
    Decomposition - divide large problems into smaller ones.
    Use mpi.h for C programs.

  • Message Passing
    A message-passing program consists of multiple instances of a serial program that communicate by library calls. These calls may be roughly divided into four classes (see the sketch below):
    Calls used to initialise, manage and finally terminate communication
    Calls used to communicate between pairs of processes
    Calls that perform communication operations among groups of processes
    Calls used to create arbitrary data types
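    A minimal sketch (not from the slides) showing one call from each of the four classes; the token value, the pair-of-ints datatype and the overall program are illustrative assumptions:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, token = 0;
        MPI_Status status;
        MPI_Datatype pair_t;

        /* Class 1: initialise / manage / terminate communication */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Class 4: create an arbitrary (derived) data type: a pair of ints */
        MPI_Type_contiguous(2, MPI_INT, &pair_t);
        MPI_Type_commit(&pair_t);

        /* Class 2: point-to-point communication between a pair of processes */
        if (size > 1) {
            if (rank == 0) {
                token = 42;
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            }
        }

        /* Class 3: collective communication among a group of processes */
        MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Rank %d of %d saw token %d\n", rank, size, token);

        MPI_Type_free(&pair_t);
        MPI_Finalize();
        return 0;
    }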

  • Helloworld.c
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world! I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

  • Communication
    /*
     * The root node sends out a message to the next node in the ring and
     * each node then passes the message along to the next node. The root
     * node times how long it takes for the message to get back to it.
     */

    #include <stdio.h>          /* for input/output */
    #include <mpi.h>            /* for MPI routines */
    #define BUFSIZE 64          /* the size of the message being passed */

    int main(int argc, char** argv)
    {
        double start, finish;
        int my_rank;            /* the rank of this process */
        int n_processes;        /* the total number of processes */
        char buf[BUFSIZE];      /* a buffer for the message */
        int tag = 0;            /* not important here */
        MPI_Status status;      /* not important here */

        MPI_Init(&argc, &argv);                        /* initializing MPI */
        MPI_Comm_size(MPI_COMM_WORLD, &n_processes);   /* getting # of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);       /* getting my rank */

  • Communication Again
        /*
         * If this process is the root process, send a message to the next node
         * and wait to receive one from the last node. Time how long it takes
         * for the message to get around the ring. If this process is not the
         * root node, wait to receive a message from the previous node and
         * then send it to the next node.
         */
        start = MPI_Wtime();
        printf("Hello world! I am %d of %d\n", my_rank, n_processes);
        if (my_rank == 0) {
            /* send to the next node */
            MPI_Send(buf, BUFSIZE, MPI_CHAR, my_rank+1, tag, MPI_COMM_WORLD);
            /* receive from the last node */
            MPI_Recv(buf, BUFSIZE, MPI_CHAR, n_processes-1, tag, MPI_COMM_WORLD, &status);
        }

  • Even More Communication
        if (my_rank != 0) {
            /* receive from the previous node */
            MPI_Recv(buf, BUFSIZE, MPI_CHAR, my_rank-1, tag, MPI_COMM_WORLD, &status);
            /* send to the next node */
            MPI_Send(buf, BUFSIZE, MPI_CHAR, (my_rank+1)%n_processes, tag, MPI_COMM_WORLD);
        }
        finish = MPI_Wtime();
        MPI_Finalize();          /* I'm done with MPI stuff */
        /* print out the results */
        if (my_rank == 0) {
            printf("Total time used was %f seconds\n", finish-start);
        }
        return 0;
    }

  • Compiling
    Compile code using mpicc, the MPI C compiler:
    /u1/local/mpich-pgi/bin/mpicc -o helloworld2 helloworld2.c
    Run using mpirun.

  • How to Run MPI (Kā palaist MPI)
    [guntisb@zars mpi]$ ls -l
    total 392
    -rw-rw-r--  1 guntisb guntisb    122 Apr 28 07:08 Makefile
    -rw-rw-r--  1 guntisb guntisb     13 May 17 14:33 mfile
    -rw-rw-r--  1 guntisb guntisb    344 May 12 09:28 mpi.jdl
    -rw-rw-r--  1 guntisb guntisb   2508 Apr 28 07:08 mpi.sh
    -rwxrwxr-x  1 guntisb guntisb 331899 May 17 14:48 passtonext
    -rw-rw-r--  1 guntisb guntisb   3408 Apr 28 07:08 passtonext.c
    -rw-rw-r--  1 guntisb guntisb   2132 May 17 14:48 passtonext.o
    [guntisb@zars mpi]$ more mfile
    localhost:4

    [guntisb@zars mpi]$ make
    mpicc passtonext.c -o passtonext -lmpich -lm
    [guntisb@zars mpi]$ mpirun -np 2 -machinefile mfile passtonext
    guntisb@localhost's password:
    Nodename=zars.latnet.lv Rank=0 Size=2
    INFO: zars.latnet.lv (0 of 2) sent 73 value to 1 of 2
    INFO: zars.latnet.lv (0 of 2) received 74 value from 1 of 2
    Nodename=zars.latnet.lv Rank=1 Size=2
    INFO: zars.latnet.lv (1 of 2) received 73 value from 0 of 2
    INFO: zars.latnet.lv (1 of 2) sent 73+1=74 value to 0 of 2
    [guntisb@zars mpi]$

    Paldies Jānim Tragheimam! (Thanks to Jānis Tragheims!)

  • HPC Cluster and Parallel Computing Applications
    Message Passing Interface: MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/), LAM/MPI (http://lam-mpi.org)
    Mathematical: fftw (fast Fourier transform), pblas (parallel basic linear algebra software), atlas (a collection of mathematical libraries), sprng (scalable parallel random number generator), MPITB (MPI toolbox for MATLAB)
    Quantum chemistry software: gaussian, qchem
    Molecular dynamics solvers: NAMD, gromacs, gamess
    Weather modelling: MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)

  • The Success of MPI
    Applications: most recent Gordon Bell prize winners use MPI (26 TF climate simulation on the Earth Simulator, 16 TF DNS)
    Libraries: growing collection of powerful software components; MPI programs with no MPI calls (all in libraries)
    Tools: performance tracing (Vampir, Jumpshot, etc.), debugging (Totalview, etc.)
    Intel MPI: http://www.intel.com/cd/software/products/asmo-na/eng/308295.htm
    Results - papers: http://www.mcs.anl.gov/mpi/papers
    Beowulf: ubiquitous parallel computing
    Grids: MPICH-G2 (MPICH over Globus, http://www3.niu.edu/mpi)

  • POSIX Threads (pthreads)
    The POSIX thread libraries are a standards-based thread API for C/C++. They allow one to spawn a new concurrent process flow. They are most effective on multi-processor or multi-core systems, where the process flow can be scheduled to run on another processor, thus gaining speed through parallel or distributed processing. Threads require less overhead than forking or spawning a new process, because the system does not initialize a new virtual memory space and environment for the process.

  • Pthreads Example
    #include <stdio.h>
    #include <pthread.h>

    void *entry_point(void *arg)
    {
        printf("Hello world!\n");
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t thr;
        if (pthread_create(&thr, NULL, &entry_point, NULL)) {
            printf("Could not create thread\n");
            return -1;
        }
        if (pthread_join(thr, NULL)) {
            printf("Could not join thread\n");
            return -1;
        }
        return 0;
    }
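    Assuming GCC (not stated on the slide), such a program is typically built with the -pthread flag, e.g. gcc pthreads_example.c -o pthreads_example -pthread; the file name here is illustrative.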

  • OpenMP
    Shared-memory parallel programming

  • OpenMP Fragments
    omp_set_dynamic(0);
    omp_set_num_threads(16);
    #pragma omp parallel shared(x, npoints) private(iam, ipoints)
    {
        if (omp_get_num_threads() != 16)
            abort();
        iam = omp_get_thread_num();
        ipoints = npoints/16;
        do_by_16(x, iam, ipoints);
    }

  • A Complete OpenMP Program (OMP pilna programma)
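    The program from this slide is not in the transcript; below is a minimal sketch (an assumption, not the original slide code) of a complete OpenMP program in C that parallelizes a simple array sum with a reduction. The array size N and the fill values are illustrative:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000   /* illustrative array size */

    int main(void)
    {
        static double x[N];
        double sum = 0.0;
        int i;

        /* fill the array serially */
        for (i = 0; i < N; i++)
            x[i] = 0.001 * i;

        /* parallel region: loop iterations are divided among threads,
           partial sums are combined with the reduction clause */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += x[i];

        printf("Threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }

    With GCC this would typically be compiled with the -fopenmp flag.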

  • Multi-CPU Servers

  • AMD Opteron 800 HPC Processing Node
    HPC strengths:
    Flat SMP-like memory model: all four processors reside within the same 2^48 memory map; expandable to 8P NUMA

    Glue-less coherent multi-processing: low latency and high bandwidth, ~1600 MT/s (6.4 GB/s)

    32 GB of high-bandwidth external memory bus (>5.3 GB/s)

    Native high-bandwidth memory-mapped I/O (>25 Gbit/s)

  • Sufficiently Uniform Memory Organization (SUMO)
    Disadvantages:
    3P and 4P nodes work better if the OS is aware of the memory map
    >4P may require a NUMA (non-uniform memory architecture) aware OS if the cache hit rate is low
    The 2.6.9 kernel is needed to take full advantage of the NUMA architecture of Opterons

    Advantages:
    Software view of memory is SMP
    Latency difference between local & remote memory is a function of the number of processors in the node
    1P and 2P look like an SMP machine
    3P and 4P are NUMA-like but can still be viewed as a ccUMA or asymmetric SMP node
    >4P can be viewed as ccUMA and, depending on the cache hit rate, may or may not require a NUMA-aware OS
    Physical address space is flat and can be viewed as fully coherent or not (MOESI states)
    DRAM can be contiguous or interleaved
    Additional processor nodes bring true increased memory bandwidth
    Designed for lower overall system chip count (glue-less interface)

  • Future NUMA Systems - Scaling Beyond 8 Processors
    Scaling beyond 8P is enabled by an external coherent HyperTransport switch: coherent interconnect, snoop filter, data caching
    Up to 16 processors within the same 2^40 SMP memory space
    Interconnect fabric

  • Message-Passing Performance Comparison (NUMA SMP and MPI is a good match)
    SMP MPI performance vs. Gigabit Ethernet MPI performance

  • High-Density HPC Cluster (SprayCool Technology from ISR)
    16 cards
    16 GFLOPS/card
    256 GFLOPS peak throughput
    64 GB of memory per card
    1 TB of system memory
    240 cubic inches; 114 MFLOPS/cubic inch; 4.27 GB of memory per cubic inch
    ~6 kW; ~3 W/cubic inch

  • AMD Opteron Beowulf 4P SMP Processing Node
    [Block diagram: four AMD Opteron processors, each with 200-333 MHz 9-byte registered DDR and 8 GB DRAM, linked by 16x16 HyperTransport @ 1600 MT/s; AMD-8111 I/O hub and AMD-8131 PCI-X tunnels; management LAN (10/100 PHY), PCI graphics/VGA, USB, AC97, UDMA133, legacy PCI, flash/LPC]
    One 4P SMP node: 16 GFLOPS, 32 GB DRAM, 10 GB/s memory bandwidth

  • Symmetric Multi-Processing
    [Diagram: 4x DIMM, 1 GB DDR266 (Avent Techn.) per processor]

  • Sybase DBMS Performance

  • Unix Scheduling

  • Solaris Overview
    Multithreaded, symmetric multi-processing
    Preemptive kernel - protected data structures
    Interrupts handled using threads
    MP support - per-CPU dispatch queues, one global kernel preempt queue
    System threads
    Priority inheritance
    Turnstiles rather than wait queues

  • Linux
    Today Linux scales very well in SMP systems with up to 4 CPUs.
    Linux on 8 CPUs is still competitive, but between 4-way and 8-way systems the price per CPU increases significantly.
    For SMP systems with more than 8 CPUs, classic Unix systems are the best choice.
    With Oracle Real Application Clusters (RAC), small 4- or 8-way systems can be clustered to overcome today's Linux limitations.
    Commodity, inexpensive 4-way Intel boxes, clustered with Oracle 9i RAC, help to reduce TCO.

  • Load Balancing Cluster
    Commonly used with busy FTP and web servers with a large client base
    Large number of nodes to share the load
    DNS, load-balancing director (Linux Virtual Server), ...
    Other applications: web crawlers, search engines, databases

  • High Availability Cluster
    Avoid downtime of services
    Avoid single points of failure
    Always with redundancy
    Almost all load-balancing clusters have HA capability

  • Examples of Load Balancing and High Availability Clusters
    Red Hat HA cluster - http://ha.redhat.com
    Turbolinux Cluster Server - http://www.turbolinux.com/products/tcs
    Linux Virtual Server Project - http://www.linuxvirtualserver.org/

  • High Availability Approach: Redundancy + Failover
    Redundancy eliminates Single Points Of Failure (SPOF)
    Auto-detect failures (hardware, network, applications)
    Automatic recovery from failures (no human intervention)

  • Real-Time Disk Replication (DRBD) - Distributed Replicated Block Device

  • Heartbeat
    Linux-HA (Heartbeat) - open source project
    Multiple-platform solution for IBM eServers, Solaris, BSD
    Packaged with several Linux distributions
    Strong focus on ease of use, security, simplicity, low cost

    Features to avoid split-brain:
    STONITH (Shoot The Other Node In The Head) - remote power cycling
    IPAT (IP Address Takeover)
    GARP (Gratuitous ARP)
    IPFAIL (ping network)

  • SystemImager
    "Gold standard" image: after system re-installation, just change the IP address and a few essential parameters

  • Other
    Application-specific HA solutions (Oracle)
    rsync - periodically synchronize rarely changing files on two systems
    Linux Single System Image: Single System Image Clusters for Linux (SSI) aims at providing a full, highly available, single-system-image cluster environment for Linux, with the goals of availability, scalability, and manageability, built from standard servers.

  • Service Processor (SP)
    Dedicated on-board SP, PowerPC based
    Own IP name/address
    Front panel, command-line interface, web server

    Remote administration: system status, boot/reset/shutdown, flash the BIOS

  • MOSIX and openMosix
    MOSIX: a software package that enhances the Linux kernel with cluster capabilities. The enhanced kernel supports any size cluster of x86/Pentium-based boxes. MOSIX allows the automatic and transparent migration of processes to other nodes in the cluster, while standard Linux process control utilities, such as ps, show all processes as if they were running on the node the process originated from.
    openMosix: a spin-off of the original MOSIX. The first version of openMosix is fully compatible with the last version of MOSIX, but is going to go in its own direction.

  • MOSIX Architecture
    Preemptive process migration: any user's process, transparently and at any time, can migrate to any available node.

    The migrating process is divided into two contexts:
    system context (deputy), which may not be migrated from the home node (UHN);
    user context (remote), which can be migrated to a diskless node.

  • MOSIX Architecture
    Preemptive process migration [diagram: master node and diskless node]

  • MOSIX and MPI
    MOSIX is often used as a host system for an MPI cluster.
    How it works:
    Start 100 MPI nodes on a MOSIX cluster with only 10 master nodes
    During the night, add 90 MPI nodes (desktop PCs)
    During the day, remove the desktop PCs and reboot them into MS Windows

  • OpenMosixView

  • MPI SHMEM Example, Debian
    Development support:
    # apt-get install build-essential

    MPICH (an open MPI implementation), SHMEM version:
    # apt-get install mpich-shmem-bin libmpich-shmem1.0-dev

  • MPI SHMEM Example, Debian
    hellompi.c:
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world! I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

  • MPI SHMEM Example, Debian
    Compiling:
    $ mpicc -I /usr/lib/mpich-shmem/include hellompi.c -o hellompi

    Running:
    $ mpirun -np 3 hellompi
    Hello world! I am 1 of 3
    Hello world! I am 2 of 3
    Hello world! I am 0 of 3

  • MPI MPD Example, Ubuntu
    Static hosts/IPs in /etc/hosts:
    192.168.133.100 ub0
    192.168.133.101 ub1
    192.168.133.102 ub2
    192.168.133.103 ub3

    Installing NFS for file sharing:
    $ sudo apt-get install nfs-kernel-server

  • MPI MPD Example, Ubuntu
    Make the shared folder on all nodes:
    $ sudo mkdir /mirror

    Export the shared folder on ub0:
    $ sudo echo "/mirror *(rw,sync)" >> /etc/exports

    Mount it on all other nodes:
    $ sudo mount ub0:/mirror /mirror

  • MPI MPD Example, Ubuntu
    Define a user with the same name, user ID and home in /mirror on all nodes:
    $ sudo adduser --home /mirror --uid 1100 mpiu

    Make mpiu the owner:
    $ sudo chown mpiu /mirror

  • MPI MPD Example, Ubuntu
    Install the SSH server:
    $ sudo apt-get install openssh-server

    Set up password-less login:
    $ su - mpiu
    $ ssh-keygen -t dsa        [leave an empty passphrase]
    $ cd .ssh
    $ cat id_dsa.pub >> authorized_keys2

  • MPI MPD Example, Ubuntu
    Install GCC:
    $ sudo apt-get install build-essential

    Install MPICH2 on all nodes:
    $ sudo apt-get install mpich2

    Verify the installed binaries:
    $ which mpd mpiexec mpirun

  • MPI MPD Example, Ubuntu
    Create the /mirror/mpd.hosts file:
    ub3
    ub2
    ub1
    ub0

    Set up the password:
    $ echo secretword=something >> /mirror/.mpd.conf
    $ chmod 600 .mpd.conf

  • MPI MPD Example, Ubuntu
    Test MPD:
    $ mpd &
    $ mpdtrace
    $ mpdallexit     (exits mpd)

  • MPI MPD Example, Ubuntu
    Start MPD on 4 nodes:
    $ mpdboot -n 4
    $ mpdtrace

    Test a program (copy cpi from mpich2-1.0.5/examples):
    $ mpiexec -n 4 cpi

  • GPU Programming

  • With OpenCL You Can
    Leverage CPUs and GPUs to accelerate parallel computation
    Get dramatic speedups for computationally intensive applications
    Write accelerated, portable code across different devices and architectures

    IMPORTANT - OpenCL does both:
    GPUs accelerate data-parallel computation
    Multicore CPUs accelerate task-parallel computation
    Here we are interested only in the data-parallel (GPU) side.

  • OpenCL Platform Model
    A host connected to one or more OpenCL devices
    An OpenCL device is a collection of one or more compute units (arguably cores)
    A compute unit is composed of one or more processing elements
    Processing elements execute code as SIMD or SPMD

  • OpenCL Execution Model
    Kernel: basic unit of executable code - similar to a C function; data-parallel or task-parallel
    Program: a collection of kernels and other functions; analogous to a dynamic library
    Applications queue kernel execution instances: queued in-order, executed in-order or out-of-order
    (A host-side sketch of this model follows below.)
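    To make the execution model concrete, here is a minimal host-side sketch in C (not from the slides): it creates a context and command queue, builds a program, creates a kernel and queues one execution instance. The square kernel, the default device type, the header path and the complete lack of error checking are illustrative assumptions:

    #include <stdio.h>
    #include <CL/cl.h>          /* on MacOS X: #include <OpenCL/opencl.h> */

    static const char *src =
        "kernel void square(global float* in, global float* out) {\n"
        "    int i = get_global_id(0);\n"
        "    out[i] = in[i] * in[i];\n"
        "}\n";

    int main(void)
    {
        enum { N = 16 };
        float in[N], out[N];
        size_t global = N;
        int i;
        for (i = 0; i < N; i++) in[i] = (float)i;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* Program: a collection of kernels, compiled at runtime */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

        /* Kernel: the basic unit of executable code */
        cl_kernel k = clCreateKernel(prog, "square", NULL);

        cl_mem din  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(in), in, NULL);
        cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(din), &din);
        clSetKernelArg(k, 1, sizeof(dout), &dout);

        /* Queue one kernel execution instance over a 1-D global domain */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dout, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);

        for (i = 0; i < N; i++) printf("%g ", out[i]);
        printf("\n");

        clReleaseMemObject(din); clReleaseMemObject(dout);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }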

  • Expressing Data-Parallelism in OpenCL
    Define an N-dimensional computation domain (N = 1, 2 or 3)
    Each independent element of execution in the N-D domain is called a work-item
    The N-D domain defines the total number of work-items that execute in parallel
    Each work-item executes the same kernel in parallel
    E.g., to process a 1024 x 1024 image: global problem dimensions are 1024 x 1024 = 1 kernel execution per pixel: 1,048,576 total kernel executions
    Scalar (serial) version (a data-parallel kernel version is sketched below):
    void scalar_mul(int n, const float *a, const float *b, float *result)
    {
        int i;
        for (i = 0; i < n; i++)
            result[i] = a[i] * b[i];
    }
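    For comparison, a data-parallel OpenCL kernel corresponding to scalar_mul might look like the sketch below; the kernel name dp_mul is an assumption. Each work-item handles one element instead of the serial loop, and the loop bound n becomes the global work size chosen by the host:

    kernel void dp_mul(global const float *a, global const float *b, global float *result)
    {
        int id = get_global_id(0);     /* one work-item per element */
        result[id] = a[id] * b[id];
    }
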
  • 20 x 16 x 5 = 1600 processing elements

  • Global and Local Dimensions
    Global dimensions: 1024 x 1024 (the whole problem space)
    Work-group local dimensions: 128 x 128 (a work-group is executed together on the same compute unit)
    Choose the dimensions that are best for your algorithm

  • Expressing Data-Parallelism in OpenCL
    Kernels are executed across a global domain of work-items
    Global dimensions define the range of computation - one work-item per computation, executed in parallel
    Work-items are grouped into local workgroups
    Local dimensions define the size of the workgroups, executed together on one device, sharing local memory and synchronization
    Caveats: global work-items must be independent - there is no global synchronization; synchronization can only be done within a workgroup (see the sketch below)
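    A small kernel sketch (not from the slides) showing workgroup-local synchronization with a barrier; the partial_sum name, the local scratch buffer and the assumption that the local size is a power of two are all illustrative:

    /* Each workgroup sums its own chunk of 'in' into one element of 'partial';
     * barrier() synchronizes only the work-items of the same workgroup. */
    kernel void partial_sum(global const float *in, global float *partial,
                            local float *scratch)
    {
        int lid = get_local_id(0);
        int lsz = get_local_size(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);          /* workgroup-local sync */

        for (int stride = lsz / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }

    The host would pass the scratch buffer as a size-only argument (clSetKernelArg with a size and a NULL pointer), and a second pass or host-side loop would combine the per-workgroup results, since no global synchronization is available.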

  • OpenCL Memory Model
    [Diagram] On the compute device, each work-item has its own private memory, each workgroup shares local memory, and all workgroups share global/constant memory; the host has its own host memory.

  • Compilation Model
    OpenCL uses a dynamic/runtime compilation model (like OpenGL):
    1. The code is compiled to an Intermediate Representation (IR), usually an assembler or virtual-machine form; this is known as offline compilation.
    2. The IR is compiled to machine code for execution; this step is much shorter and is known as online compilation.

    In dynamic compilation, step 1 is usually done only once, and the IR is stored. The application loads the IR and performs step 2 during the application's runtime (hence the term). (A sketch follows below.)
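    A sketch (not from the slides) of how the built IR/binary can be retrieved and reused with the standard calls clGetProgramInfo and clCreateProgramWithBinary; the function name cache_and_reload and the single-device setup are assumptions:

    #include <stdlib.h>
    #include <CL/cl.h>

    /* Fetch the binary (IR) of an already-built program for one device and
     * recreate a program object from it, leaving only the short online step. */
    static cl_program cache_and_reload(cl_context ctx, cl_device_id device, cl_program prog)
    {
        size_t bin_size;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);

        unsigned char *bin = malloc(bin_size);
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
        /* ... a real application would write 'bin' to a file here and
           read it back on the next run ... */

        cl_program cached = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                      (const unsigned char **)&bin,
                                                      NULL, NULL);
        clBuildProgram(cached, 1, &device, NULL, NULL, NULL);  /* online compilation */
        free(bin);
        return cached;
    }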

  • OpenCL C Language
    Derived from ISO C99 - no standard C99 headers, function pointers, recursion, variable-length arrays, or bit fields
    Additions to the language for parallelism: work-items and workgroups, vector types, synchronization
    Address space qualifiers
    Optimized image access
    Built-in functions

  • Kernel
    A data-parallel function executed for each work-item:
    kernel void square(__global float* input, __global float* output)
    {
        int i = get_global_id(0);
        output[i] = input[i] * input[i];
    }
    [Figure: an input array and its element-wise squared output; the highlighted element corresponds to get_global_id(0) = 11]

  • Work-Items and Workgroup Functions
    [Figure: a 26-element input array split into two workgroups of 13]
    get_work_dim = 1
    get_global_size = 26
    get_num_groups = 2
    get_local_size = 13
    get_group_id = 0 or 1
    get_local_id = 8
    get_global_id = 21

  • Data Types
    Scalar data types: char, uchar, short, ushort, int, uint, long, ulong, bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
    Image types: image2d_t, image3d_t, sampler_t
    Vector data types: vector lengths of 2, 4, 8, and 16; aligned at the vector length; vector operations and built-ins
    int4 vi0 = (int4) -7;                      // vi0 = (-7, -7, -7, -7)
    int4 vi1 = (int4)(0, 1, 2, 3);             // vi1 = (0, 1, 2, 3)
    vi0.lo = vi1.hi;                           // vi0 = (2, 3, -7, -7)
    int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);   // v8 = (2, 3, -7, -7, 0, 1, 1, 3)

  • Address Spaces
    Kernel pointer arguments must use global, local, or constant:
    kernel void distance(global float8* stars, local float8* local_stars)
    kernel void sum(private int* p)   // illegal because it uses private

    The default address space for arguments and local variables is private:
    kernel void smooth(global float* io) { float temp; ... }

    image2d_t and image3d_t are always in the global address space:
    kernel void average(read_only global image2d_t in, write_only image2d_t out)

  • IEEE 754 compatible rounding behavior for single-precision floating point
    IEEE 754 compliant behavior for double-precision floating point
    Defines the maximum error of math functions as ULP values
    Handles ambiguous C99 library edge cases

    Commonly used single-precision math functions come in three flavors, e.g. for log(x): full precision (log), half precision (half_log), and native (native_log).

  • On Linux, network failures are dealt with by network routing software (typically BGP or OSPF), and network adapter failures are handled by Linux's built-in channel bonding software; these things don't require support from HA software.

    What kinds of things move over on failover? Lots of things: IP addresses (which reroutes clients), control of hardware, disk partitions, filesystems, processes / services, etc. In the picture, note that the IP address (129.42.19.99) has also moved over to the active side. The difference between this diagram and the previous one is the method of data sharing. DRBD is not the only option, but it's the best: it's supplied with SLES and is available for RHES. Linux-HA supports DRBD. TSA probably could support replication.

    Instead of requiring the customer to purchase shared storage ($10K-$500K), this kind of environment uses any kind of storage accessible to a server and replicates its contents in real time across the dedicated (gigabit) LAN connection. This environment has NO single points of failure (SPOFs).

    This allows HA clusters to be configured extremely cost-effectively, at roughly >= $2500 (USD) total cost. This makes HA available in more applications, permitting it to be used in environments involving hundreds of HA servers. This is a huge win in cost-sensitive bids.

    Surprisingly, the performance impact is small, typically negligible.

    DRBD can also be used in active->active failover configurations, with each partition replicating in the opposite direction.

    General (vague) guidelines for deciding which is more appropriate:

    Cost sensitivity -> Linux-HA
    Ease of use, installation -> Linux-HA
    Support for ServeRAID or DRBD -> Linux-HA
    Simpler HA configurations -> Linux-HA
    Customer desire for / comfort with an OSS solution -> Linux-HA
    All Linux distributions/platforms, or Solaris, or *BSD compatibility -> Linux-HA

    More sophisticated HA configurations -> TSA
    Customer desire for / comfort with a proprietary solution -> TSA
    AIX, OS/400 compatibility -> TSA

    NOTE: TSA licensing numbers are based on CPUs. A 2-node system with 4 CPUs on each side is 8 TSA licenses.

    speeds
    protocol
    endianness
    configuration
    interrupt schemes
    cost
