
Programming for Parallel Computing

张少强  [email protected]

http://bioinfo.uncc.edu/szhang   QQ: 249104218

http://renren.com/kindkid

(Lecture 1: September 16, 2011, 博理楼 B204)

References

Barry Wilkinson & M. Allen (main textbook; available on Dangdang at a 25% discount)

An MPI programming examples textbook

Parallel Computing

1. The use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer.

2. Offers opportunity to tackle problems that could not be solved in a reasonable time otherwise.

3. Also can tackle problems with:

• Higher precision
• More memory requirements

1. Multiple interconnected computers

• Cluster Computing - A form of parallel computing in which the computing platform is a group of interconnected computers (a cluster)

For this course, we will use a small dedicated departmental cluster (59.67.76.156) consisting of 8 nodes:

– 8-core Xeon processors, all interconnected through a local Ethernet switch (high-speed Ethernet connection)
– Programming is normally done using the message-passing interface (MPI)

2. A computer system with multiple internal processors

Shared memory multiple processor system - Multiple processors connected internally to a common main memory.

Multi-core processor - a processor with multiple internal execution units on one chip (a form of shared memory multiprocessor).

For this course, we will use the cluster as it has both types. Programming uses a shared memory thread model.

Prerequisites

• Data Structures
• Basic skills in C
• What a computer consists of (processors, memory, and I/O)

Course Contents

• Parallel computers: architectural types, shared memory, message passing, interconnection networks, potential for increased speed

• Message passing: MPI message passing APIs, send, receive, collective operations. Running MPI programs on a cluster.

• Basic parallel programming techniques:
1. Embarrassingly parallel computations
2. Partitioning and divide and conquer
3. Pipelined computations
4. Synchronous computations
5. Load balancing and termination detection

Course Contents (Continued)

Shared memory programming

• Shared memory architectures: Hyperthreaded, multi-core, many core.

• Programming with shared memory programming: Specifying parallelism, sharing data, critical sections, threads, OpenMP. Running threaded/OpenMP programs on multi-core system.

• CPU-GPU systems: Architecture, programming in CUDA, issues for achieving high performance.

Course Contents (Continued)

Algorithms and applications: Selection from:

• Sorting algorithms

• Searching algorithms

• Numerical algorithms

• Image processing algorithms

Types of Parallel Computers

Two principal approaches:

• Shared memory multiprocessor

• Distributed memory multicomputer

Conventional Computer

Consists of a processor executing a program stored in a (main) memory:

Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.

[Figure: a processor connected to main memory; instructions flow to the processor, data flows to and from the processor]

Shared Memory Multiprocessor System

Natural way to extend single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module:

[Figure: simplistic view of a small shared memory multiprocessor: processors connected through processor-memory interconnections to memory modules forming one address space]

Examples:
• Dual Pentiums
• Quad Pentiums

[Figure: processors connected to shared memory over a bus]

Real computer systems have cache memory between the main memory and processors: Level 1 (L1) cache and Level 2 (L2) cache.

Example Quad Shared Memory Multiprocessor

[Figure: four processors, each with its own L1 cache, L2 cache, and bus interface, connected over a processor/memory bus to a memory controller and shared memory]

Since the L1 cache is usually inside the package and the L2 cache outside it, dual-/multi-core processors usually share the L2 cache.

Single quad core shared memory multiprocessor

[Figure: a single chip containing four processors, each with its own L1 cache, sharing an on-chip L2 cache; a memory controller connects the chip to shared memory. Example: Intel Core i7]

Multiple quad-core multiprocessors

[Figure: eight processors in multiple quad-core packages, each processor with its own L1 cache; the cores share an L2 cache (and possibly an L3 cache) and connect through a memory controller to shared memory]

Programming Shared Memory Multiprocessors

1. Pthreads libraries: The programmer decomposes the program into individual parallel sequences (threads), each able to access shared variables declared outside the threads. Key routines: pthread_create(), pthread_join(), pthread_exit(). (A minimal sketch follows after this list.)

2. OpenMP: Higher-level library functions and preprocessor compiler directives to declare shared variables and specify parallelism. OpenMP consists of a small set of compiler directives, a small extended runtime library, and the base C/C++ and Fortran language environments. (A minimal sketch also follows after this list.)

#pragma omp directive_name ...

Programming Shared Memory Multiprocessors

3. Use a modified sequential programming language -- added syntax to declare shared variables and specify parallelism.

Example: UPC (Unified Parallel C) - needs a UPC compiler.

4. Use a specially designed parallel programming language -- with syntax to express parallelism. The compiler automatically creates executable code for each processor (not now common).

5. Use a regular sequential programming language such as C and ask a parallelizing compiler to convert it into parallel executable code. Also not now common.
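A minimal Pthreads sketch of approach 1 above (the worker function, thread count, and shared counter are illustrative assumptions, not taken from the course materials):

/* Minimal Pthreads sketch: each thread adds its id to a shared total.
   The mutex protects the shared variable (a critical section). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

int total = 0;                                   /* shared variable        */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    int id = *(int *)arg;
    pthread_mutex_lock(&lock);                   /* enter critical section */
    total += id;
    pthread_mutex_unlock(&lock);                 /* leave critical section */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);          /* wait for all threads   */

    printf("total = %d\n", total);               /* 0+1+2+3 = 6            */
    return 0;
}

Compile with the pthread library linked in (e.g. gcc -pthread).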
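A correspondingly minimal OpenMP sketch of approach 2 (the array, loop bound, and reduction are illustrative assumptions):

/* Minimal OpenMP sketch: a compiler directive parallelizes the loop;
   the reduction clause combines each thread's partial sum safely. */
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    #pragma omp parallel for reduction(+:sum)    /* work shared among threads */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}

Compile with an OpenMP-capable compiler (e.g. gcc -fopenmp).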

Message-Passing Multicomputer

Complete computers connected through an interconnection network:

[Figure: computers, each with a processor and local memory, exchanging messages over an interconnection network]

Networked Computers as a Computing Platform

• A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990s.

• Several early projects. Notable:

– Berkeley NOW (network of workstations) project.

– NASA Beowulf project.

Key advantages:

• Very high performance workstations and PCs readily available at low cost.

• The latest processors can easily be incorporated into the system as they become available.

• Existing software can be used or modified.

Beowulf Clusters

• A group of interconnected “commodity” computers achieving high performance with low cost.

• Typically using commodity interconnects - high speed (Gigabit) Ethernet, and Linux OS.

Dedicated cluster with a master node and compute nodes

[Figure: a dedicated cluster; the user reaches the master node over an external network, and the master node connects through an Ethernet interface and switch to the compute nodes on a local network]

Software Tools for Clusters

• Each node has a copy of the OS (Linux)
• Applications are stored on the master node, which can be set up as a file server managing a network file system
• MPI is installed on the master node
• Based upon the message-passing programming model
• User-level libraries are provided for explicitly specifying the messages to be sent between executing processes on each computer
• Used with regular programming languages (C, C++, ...)

MPI (Message-Passing Interface)

Next step: Learn the message passing programming model, some MPI routines, write a message-passing program and test on the cluster.
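As a preview, here is a minimal MPI sketch of the kind of message-passing program we will write (the message text and process ranks are illustrative assumptions, not from the course materials):

/* Minimal MPI sketch: process 1 sends a message to process 0,
   which receives and prints it. Run with at least two processes. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank, nprocs;
    char msg[64];

    MPI_Init(&argc, &argv);                   /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's rank (id)  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* total number of processes */

    if (rank == 1) {
        strcpy(msg, "hello from process 1");
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 0 of %d received: %s\n", nprocs, msg);
    }

    MPI_Finalize();                           /* shut down MPI             */
    return 0;
}

Such programs are typically compiled with mpicc and launched with mpirun -np <number of processes>; the exact setup on our cluster is part of what we will cover next.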

To be continued … …^_^