并行程序设计 (Programming for Parallel Computing), 张少强, [email protected], QQ: 249104218
TRANSCRIPT
http://bioinfo.uncc.edu/szhang
http://renren.com/kindkid
(Lecture 1: September 16, 2011, 博理楼 B204)
Parallel Computing
1. The use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer.
2. Offers opportunity to tackle problems that could not be solved in a reasonable time otherwise.
3. Can also tackle problems that require:
• Higher precision
• More memory.
1. Multiple interconnected computers
• Cluster Computing - A form of parallel computing in which the computing platform is a group of interconnected computers (a cluster).
For this course, we will use a small dedicated departmental cluster (59.67.76.156) consisting of 8 nodes:
– 8-core Xeon processors, all interconnected through a local Ethernet switch.
– Programming is normally done using the message-passing interface (MPI).
2. A computer system with multiple internal processors
Shared memory multiple processor system - Multiple processors connected internally to a common main memory.
Multi-core processor - a processor with multiple internal execution units on one chip (a form of shared memory multiprocessor).
For this course, we will use the cluster as it has both types. Programming uses a shared memory thread model.
Prerequisites
• Data Structures
• Basic skills in C
• What a computer consists of (processors, memory, and I/O).
Course Contents
• Parallel computers: architectural types, shared memory, message passing, interconnection networks, potential for increased speed
• Message passing: MPI message passing APIs, send, receive, collective operations. Running MPI programs on a cluster.
• Basic parallel programming techniques:
1. Embarrassingly parallel computations
2. Partitioning and divide and conquer
3. Pipelined computations
4. Synchronous computations
5. Load balancing and termination detection
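To make the MPI send, receive, and collective operations listed above concrete, here is a minimal sketch (our example, not from the slides; it assumes an MPI installation such as MPICH or Open MPI, compiled with mpicc and launched with mpirun). Each worker process sends its rank to process 0, which sums them; a collective reduction then computes the same total in one call.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id       */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* number of processes     */

    if (rank != 0) {
        /* Point-to-point: every worker sends its rank to process 0. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        int sum = 0, value;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += value;
        }
        printf("sum of ranks 1..%d = %d\n", nprocs - 1, sum);
    }

    /* Collective: the same result in a single call. */
    int total;
    MPI_Reduce(&rank, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Run with, for example, `mpirun -np 4 ./rank_sum`; with 4 processes both paths compute 1 + 2 + 3 = 6 (the reduction also adds process 0's rank, 0, so the totals agree).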
Course Contents (Continued)
Shared Memory Programming
• Shared memory architectures: Hyperthreaded, multi-core, many core.
• Programming with shared memory programming: Specifying parallelism, sharing data, critical sections, threads, OpenMP. Running threaded/OpenMP programs on multi-core system.
• CPU-GPU systems: Architecture, programming in CUDA, issues for achieving high performance.
Course Contents (Continued)
Algorithms and applications: Selection from:
• Sorting algorithms
• Searching algorithms
• Numerical algorithms
• Image processing algorithms
Types of Parallel Computers
Two principal approaches:
• Shared memory multiprocessor
• Distributed memory multicomputer
Conventional Computer
Consists of a processor executing a program stored in a (main) memory:
Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.
[Figure: a processor connected to main memory; instructions flow to the processor, data flows to or from it.]
Shared Memory Multiprocessor System
Natural way to extend single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module:
[Figure: processors connected through processor-memory interconnections to memory modules forming one address space.]
Simplistic view of a small shared memory multiprocessor
Examples:
• Dual Pentiums
• Quad Pentiums
[Figure: processors connected to shared memory over a bus.]
Real computer systems have cache memory between the main memory and the processors: Level 1 (L1) cache and Level 2 (L2) cache.
Example Quad Shared Memory Multiprocessor
[Figure: four processors, each with its own L1 cache, L2 cache, and bus interface, connected over a processor/memory bus to a memory controller and shared memory.]
Since the L1 cache is usually inside the package and the L2 cache outside the package, dual-/multi-core processors usually share the L2 cache.
Single quad core shared memory multiprocessor
[Figure: one chip containing four processors, each with a private L1 cache, sharing an L2 cache; a memory controller connects the chip to shared memory. Example: Intel Core i7.]
Multiple quad-core multiprocessors
[Figure: several quad-core chips, each with per-core L1 caches, a shared L2 cache, and a possible L3 cache, connected through a memory controller to shared memory.]
Programming Shared Memory Multiprocessors
1. Pthreads libraries: the programmer decomposes the program into individual parallel sequences (threads), each able to access shared variables declared outside the threads. Key calls: pthread_create(), pthread_join(), pthread_exit().
2. OpenMP: higher-level library functions and preprocessor compiler directives to declare shared variables and specify parallelism. OpenMP consists of a small set of compiler directives, a small extended function library, and the C/C++ and Fortran base language environments.
#pragma omp directive_name …
Programming Shared Memory Multiprocessors
3. Use a modified sequential programming language -- added syntax to declare shared variables and specify parallelism.
Example: UPC (Unified Parallel C) - needs a UPC compiler.
4. Use a specially designed parallel programming language -- with syntax to express parallelism. The compiler automatically creates executable code for each processor (not now common).
5. Use a regular sequential programming language such as C and ask a parallelizing compiler to convert it into parallel executable code. Also not now common.
Message-Passing Multicomputer
Complete computers connected through an interconnection network:
[Figure: computers, each with a processor and local memory, exchanging messages over an interconnection network.]
Networked Computers as a Computing Platform
• A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990s.
• Several early projects. Notable:
– Berkeley NOW (network of workstations) project.
– NASA Beowulf project.
Key advantages:
• Very high performance workstations and PCs readily available at low cost.
• The latest processors can easily be incorporated into the system as they become available.
• Existing software can be used or modified.
Beowulf Clusters
• A group of interconnected “commodity” computers achieving high performance with low cost.
• Typically using commodity interconnects - high speed (Gigabit) Ethernet, and Linux OS.
Dedicated cluster with a master node and compute nodes
[Figure: a user reaches the master node through an external network; the compute nodes attach to the master node's Ethernet interface through a switch on a local network.]
Software Tools for Clusters
• Each node has a copy of the OS (Linux).
• Applications are stored on the master node, which can be set up as a file server managing a network file system.
• MPI is installed on the master node.
• Based upon the message-passing programming model.
• User-level libraries are provided for explicitly specifying messages to be sent between executing processes on each computer.
• Used with regular programming languages (C, C++, ...).
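On a cluster set up as described above, a typical MPI session from the master node looks like the following (hypothetical program and host-file names; exact options depend on the MPI distribution installed):

```shell
# Compile on the master node with the MPI compiler wrapper
mpicc -O2 -o hello hello.c

# Run 8 processes, placed on the nodes listed in a machine file
mpirun -np 8 -machinefile hosts ./hello
```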