Introduction to Parallel Processing, Dr. Guy Tel-Zur


Lecture Slides No. 1

Held on Monday, 22/10/2001

Introduction to Parallel Processing

Course Number 36113621

Course website:

http://www.bgu.ac.il/~tel-zur/pp.html

21.10.01 (4 Cheshvan 5762)

To: Software Engineering students
From: Dr. Hugo Guterman

Subject: Cancellation of the course Introduction to Parallel Processing (36113621)

Dear Sir/Madam,

Due to budgetary problems beyond our control, we are forced to cancel your participation in the course Introduction to Parallel Processing. We apologize for the inconvenience this decision causes you.

For guidance, please contact the office of the Department of Software Engineering.

Sincerely, Dr. Hugo Guterman

Time: Mondays, 8:00-11:00

Location: Building 28, Room 301

Lecturer: Dr. Guy Tel-Zur

NRCN (Nuclear Research Center Negev)

[email protected]

Teaching assistant: Dr. Natalia Panov

Department of Electrical and Computer Engineering

[email protected]

Contact

• Office hours:

Guy Tel-Zur – immediately after the lecture, between 11:00 and 12:00. Location to be determined.

Natalia Panov – Wednesdays, between 14:00 and 18:00, Room 318 in the Electrical and Computer Engineering building.

•Email: [email protected]

•Newsgroup: [email protected]







Course Objectives:

The goal of this course is to provide in-depth understanding of modern parallel processing. The course will cover theoretical and practical aspects of parallel processing.

Task #1

Please send an email containing the following data:

• Your first and last name

• Your Email at BGU

• Phone Number

• Year

• Course of Study

to: [email protected]

• PLEASE WRITE EMAILS ONLY IN ENGLISH

Course Structure

• Introduction

• Parallelization techniques

• Applications

• Practice

• Other topics

Semester A, 5762 (2001/02)

• First day of classes: 21.10.01

• Hanukkah break: Sunday, 16.12.01

• Last day of the first semester: 25.1.02

• Final project submission deadline: 18.2.02

• First day of the second semester: 24.2.02

Weekly Course Schedule – 1/3 (subject to change!)

Week/Lecture  Date  Topics

1  22.10.01  Introduction to Parallel Processing

2  29.10.01  Hands-on Practice

3  5.11.01  Parallel Algorithm Design

4  12.11.01  Parallel Programming

5  19.11.01  The Message Passing Interface (MPI)

Weekly Course Schedule – 2/3

6  26.11.01  The Message Passing Interface (MPI)

7  3.12.01  Applications

10  24.12.01  Performance Evaluation, Shared Memory

Weekly Course Schedule – 3/3

11  31.12.01  Building a Personal Supercomputer

12  7.1.02  Condor, Grids and other topics

13  14.1.02  Student Presentations

14  21.1.02  Student Presentations

28.1.02

15  11.2.02  Submission of Final Projects

Assignments and Grades – 1/2

• Exercise 1: will be assigned next Monday and is due about two weeks later (lecture 4). It is ungraded but mandatory.

• Exercise 2: will be assigned in lecture 4 and is due in lecture 6. It counts for 15% of the total grade.

Assignments and Grades – 2/2

• The list of projects will be published in lecture 5.

• A midterm quiz will be held in lecture 7. It counts for 20% of the final grade.

• Exercise 3: will be assigned in lecture 8 and is due in lecture 10. It counts for 15% of the final grade.

• The final project is due by 18.2.02 and counts for about 50% of the final grade.

Prerequisites

• Proficiency in C or FORTRAN is required

• A background in the foundational mathematics and physics courses is required

• The binding wording is the one printed in the yearbook, as published by the university

Expectations

• Exercises must be submitted via the course email

• Exercises must be runnable (debugged)

• Attach running instructions and documentation outside the program body, in a WORD file

• Add extensive documentation inside the program files as well

• Lines of code should be written simply and in a way that makes them easy to understand

Expectations (continued)

• Deadlines must be met

– A late submission reduces the grade according to the formula: n^2 points for every n days!!!

Academic integrity rules must be strictly observed, in accordance with university regulations

References

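• Textbook: the course has no required textbook

• Most of the presentations will be available on the course website

• The course website has links to important sites that we will need

• There is a great deal of material on the Internet

Please visit the website often to stay up to date!!!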

Parallel Computer Architecture, David E. Culler et al.

Introduction to Parallel Computing, Vipin Kumar et al.

Using MPI, William Gropp et al.

Parallel Programming With MPI, Peter Pacheco

Parallel Programming, Barry Wilkinson and Michael Allen

Plan for the First Lecture

• An introduction to "Introduction to Parallel Computing"

• A brief description of the parallel cluster on which the exercises will be run

Let's begin…

What is parallel {computing, processing}?

• Parallel Computing

• Parallel Processing

• Cluster Computing

• Beowulf Clusters

• HPC – High Performance Computing

Oxford Dictionary of Science:

• A technique that allows more than one process – stream of activity – to be running at any given moment in a computer system, hence processes can be executed in parallel. This means that two or more processors are active among a group of processes at any instant.

Is a parallel computer the same as a supercomputer?

A Supercomputer

• An extremely high power computer that has a large amount of main memory and very fast processors… Often the processors run in parallel.

http://www.netlib.org/benchmark/top500/top500.list.html

Why Study Parallel Architecture?

Parallelism:

• Provides alternative to faster clock for performance

• Applies at all levels of system design (H/W – S/W Integration)

• Is a fascinating topic

• Is increasingly central in information processing, science and engineering

The Demand for Computational Speed

• Continual demand for greater computational speed from a computer system than is currently possible. Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems. Computations must be completed within a “reasonable” time period.

Large Memory Requirements

Use parallel computing for executing larger problems which require more memory than exists on a single computer.

Grand Challenge Problems

• A grand challenge problem is one that cannot be solved in a reasonable amount of time with today’s computers. Obviously, an execution time of 10 years is always unreasonable. Examples: modeling large DNA structures, global weather forecasting, modeling motion of astronomical bodies.

Scientific Computing Demand

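Exercise

• Suppose a galaxy contains 10^11 stars. Estimate the time required to compute 100 iterations with an O(N^2) algorithm on a computer with a computing power of 1 GFLOPS.

Solution

• For 10^11 stars there are 10^22 interactions.

• Total number of operations for 100 iterations: 10^24

• Therefore the computation time is:

$$t = \frac{10^{24}}{10^{9}} = 10^{15}\ \text{sec} = 31{,}709{,}791\ \text{years}$$

Solution (continued)

• With an N log(N) algorithm:

$$t = \frac{10^{14}}{10^{9}} = 10^{5}\ \text{sec}$$

Conclusion: improving the algorithm is usually far more important than adding processors!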
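The estimate in this exercise can be reproduced with a few lines of Python. This is a minimal sketch: the 365-day year and the base-10 logarithm are assumptions, chosen because they reproduce the slide's figures (the slide also rounds the N log N operation count down to 10^14).

```python
import math

def runtime_seconds(operations: float, flops: float) -> float:
    """Time needed to execute `operations` floating-point operations at `flops` per second."""
    return operations / flops

N = 1e11                             # stars in the galaxy (from the exercise)
ITERATIONS = 100                     # iterations (from the exercise)
FLOPS = 1e9                          # 1 GFLOPS machine (from the exercise)
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day year

# O(N^2): every star interacts with every other star in each iteration.
t_n2 = runtime_seconds(ITERATIONS * N**2, FLOPS)

# O(N log N): assumption: base-10 logarithm, so log(N) = 11.
t_nlogn = runtime_seconds(ITERATIONS * N * math.log10(N), FLOPS)

print(f"O(N^2):     {t_n2:.3g} s = {int(t_n2 / SECONDS_PER_YEAR):,} years")
print(f"O(N log N): {t_nlogn:.3g} s = {t_nlogn / 3600:.1f} hours")
```

The ratio between the two run times is about ten orders of magnitude, far more than any realistic number of added processors could recover.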

Technology Trends

[Figure: performance (logarithmic scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]

Clock Frequency Growth Rate

[Figure: clock rate in MHz (logarithmic scale, 0.1 to 1,000) versus year (1970 to 2005); labeled processors: i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000]

Parallelization is good, but it has a price!

• Not every problem can be parallelized

• Parallelizing software is not easy

• Hardware availability

• Development time versus other alternatives (future technology)

• Cost

Parallel Architecture Considerations

• Resource Allocation:

– how large a collection?

– how powerful are the elements?

– how much memory?

• Data access, Communication and Synchronization

– how do the elements cooperate and communicate?

– how are data transmitted between processors?

– what are the abstractions and primitives for cooperation?

• Performance and Scalability

– how does it all translate into performance?

– how does it scale?

Conventional Computer

Shared Memory System

Message-Passing Multi-computer

The Approach

• The problem must be divided into parts that can be run in parallel

• Each part of the problem is a process that will run on a single processor

• To transfer data/results between the processors, messages must be sent between them: Message Passing (other methods also exist)

Distributed Shared Memory

Flynn (1966) Taxonomy

• SISD - a single instruction stream-single data stream computer.

• SIMD - a single instruction stream-multiple data stream computer.

• MIMD - a multiple instruction stream-multiple data stream computer.

Multiple Program Multiple Data (MPMD)

Single Program Multiple Data (SPMD)

• A Single source program

• Each processor will execute its personal copy of this program

• Independently and not in synchronism

Message-Passing Multi-computers

The communication network plays a significant role in a parallel computer cluster!

In the following slides we survey the characteristic parameters of the communication network

Network Criteria – 1/6

• Bandwidth

• Network Latency

• Communication Latency (H/W+S/W)

• Message Latency (see next slide)


[Figure: time to send a message versus message size, a straight line annotated "Latency", "Not latency" and "1/slope = Bandwidth"]

Bandwidth is the inverse of the slope of the line

time = latency + (1/rate) × size_of_message

Latency is sometimes described as “time to send a message of zero bytes”. This is true only for the simple model. The number quoted is sometimes misleading.
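The simple linear model above can be written as a one-line function. This is a minimal sketch: the latency and bandwidth values are hypothetical, chosen only for illustration, and are not measurements of any network mentioned in these slides.

```python
def message_time(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Simple linear model: time = latency + (1/rate) * size_of_message."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical link parameters (assumptions for illustration only).
LATENCY = 100e-6      # 100 microseconds
BANDWIDTH = 12.5e6    # 12.5 MB/s, i.e. 100 Mbit/s

for size in (0, 1_000, 1_000_000):
    t = message_time(size, LATENCY, BANDWIDTH)
    print(f"{size:>9} bytes: {t * 1e3:.3f} ms")
```

In this model a zero-byte message costs exactly the latency, and for large messages the bandwidth term dominates.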


• Bisection Width - # links to be cut in order to divide the network into two equal parts (value shown on the slide: 2)

• Diameter – The max. distance between any two nodes (value shown on the slide: P/2)

• Connectivity – Multiplicity of paths between any two nodes (value shown on the slide: 2)

• Cost – Number of links (value shown on the slide: P)

Exercise: compute the properties of a Fully Connected network of P processors

Solution

Diameter = 1

Bisection=p^2/4

Connectivity=p-1

Cost=p(p-1)/2

Solution for the Bisection (continued)

• Number of links: p(p-1)/2

• Internal links in each half: (p/2)(p/2-1)/2

• Internal links in both halves: (p/2)(p/2-1)

• Number of links being cut:

p(p-1)/2 – (p/2)(p/2-1) = p^2/4
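The closed-form results for the fully connected network can be checked by brute force. This is a minimal sketch; the function names are hypothetical, and p is assumed to be even so that the network splits into two equal halves.

```python
from itertools import combinations

def fully_connected_links(p: int) -> list[tuple[int, int]]:
    """All links of a fully connected network: one link per unordered pair of nodes."""
    return list(combinations(range(p), 2))

def bisection_width(p: int) -> int:
    """Count the links crossing a split into nodes 0..p/2-1 and p/2..p-1 (p even).

    In a fully connected network every equal split cuts the same number of links,
    so checking one split is enough.
    """
    half = p // 2
    return sum(1 for a, b in fully_connected_links(p) if (a < half) != (b < half))

for p in (4, 8, 16):
    cost = len(fully_connected_links(p))
    print(f"p={p:2d}: cost={cost:3d} (p(p-1)/2={p * (p - 1) // 2}), "
          f"bisection={bisection_width(p):3d} (p^2/4={p * p // 4})")
```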

2D Mesh

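[Figure: Intel Paragon node: two i860 processors, each with an L1 cache, a network interface (NI) with DMA and driver, and a memory controller with 4-way interleaved DRAM, all on a 64-bit, 50 MHz memory bus; a 2D grid network (8 bits, 175 MHz, bidirectional) with a processing node attached to every switch]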

Sandia’s Intel Paragon XP/S-based supercomputer

Example: Intel Paragon

A Binary Tree – 1/2

A Binary Tree – 2/2

Fat tree: Thinking Machines CM-5, 1993

3D Hypercube Network

4D Hypercube Network

Embedding – 1/2

Embedding – 2/2

Deadlock

Ethernet

Ethernet Frame Format

Point-to-Point Communication

Performance

• Computation/Communication ratio

• Speedup Factor

• Overhead

• Efficiency

• Cost

• Scalability

• Gustafson’s Law

Computation/Communication Ratio

$$\text{Computation/Communication ratio} = \frac{\text{Computation time}}{\text{Communication time}} = \frac{t_{comp}}{t_{comm}}$$

Speedup Factor

$$S(n) = \frac{\text{Execution time using one processor}}{\text{Execution time using } n \text{ processors}} = \frac{t_s}{t_p}$$

The maximum speedup is n (linear speedup):

$$S(n) = \frac{t_s}{t_s/n} = n$$

Speedup and Comp/Comm Ratio

$$\text{Speedup} < \frac{\text{Sequential Work}}{\text{Max (Work + Synch Wait Time + Comm Cost)}}$$

Overhead

• Things that limit the speedup:

– Serial parts of the computation

– Some processors compute while others are idle

– Communication time for sending messages

– Extra computation in the parallel version not appearing in the serial version

Amdahl’s Law (1967)

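$$t_s = \text{Serial time} = \text{1 processor time}$$

$$f = \text{Serial code fraction}$$

$$(1-f)\,t_s = \text{Parallelizable sections time}$$

$$t_p = f\,t_s + \frac{(1-f)\,t_s}{n}$$

$$S(n) = \frac{t_s}{t_p} = \frac{t_s}{f\,t_s + (1-f)\,t_s/n} = \frac{n}{1+(n-1)f}$$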

Amdahl’s Law (continued)

$$\lim_{n \to \infty} S(n) = \frac{1}{f}$$

With only 5% of the computation being serial, the maximum speedup is 20
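Amdahl's Law is easy to tabulate. The following is a minimal sketch using the slide's symbols (n processors, serial fraction f); the function name is hypothetical.

```python
def amdahl_speedup(n: int, f: float) -> float:
    """Amdahl's Law: S(n) = n / (1 + (n - 1) * f), where f is the serial code fraction."""
    return n / (1 + (n - 1) * f)

F = 0.05  # 5% serial code, as in the example on the slide

for n in (1, 10, 100, 1_000, 1_000_000):
    print(f"n={n:>9,}: S(n) = {amdahl_speedup(n, F):.3f}")

print(f"limit for n -> infinity: 1/f = {1 / F:.0f}")
```

The speedup approaches, but never reaches, 1/f = 20 no matter how many processors are added.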

Speedup

$$S(n) = \frac{t_s}{t_p} = \frac{n}{1+(n-1)f}$$

Efficiency

$$E = \frac{\text{Execution time using one processor}}{\text{Execution time using a multiprocessor} \times \text{number of processors}} = \frac{t_s}{t_p \times n}$$

$$E\ (\text{in percents}) = \frac{S(n)}{n} \times 100$$

E is the fraction of time that the processors are being used.

If E=100% then S(n)=n.

Cost

$$\text{Cost} = (\text{execution time}) \times (\text{total number of processors used})$$

$$\text{Cost} = t_p \times n = \frac{t_s\, n}{S(n)} = \frac{t_s}{E}$$

A cost-optimal algorithm is one whose cost is proportional to the single-processor cost (i.e., execution time).
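Speedup, efficiency and cost are all derived from the same three numbers: the one-processor time t_s, the parallel time t_p and the processor count n. The following is a minimal sketch; the timings are hypothetical values chosen for illustration, not measurements.

```python
def speedup(t_s: float, t_p: float) -> float:
    """S(n) = t_s / t_p."""
    return t_s / t_p

def efficiency(t_s: float, t_p: float, n: int) -> float:
    """E = t_s / (t_p * n), a fraction between 0 and 1."""
    return t_s / (t_p * n)

def cost(t_p: float, n: int) -> float:
    """Cost = (execution time) * (total number of processors used)."""
    return t_p * n

# Hypothetical timings (assumptions for illustration only).
T_S = 100.0   # seconds on one processor
T_P = 12.5    # seconds on N_PROCS processors
N_PROCS = 10

S = speedup(T_S, T_P)
E = efficiency(T_S, T_P, N_PROCS)
C = cost(T_P, N_PROCS)
print(f"S(n) = {S:.1f}, E = {E * 100:.0f}%, Cost = {C:.1f} processor-seconds")
print(f"t_s / E = {T_S / E:.1f} (same as the cost)")
```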

Scalability

• An imprecise term

• Reflects H/W and S/W scalability

• How to get increased performance when the H/W is increased?

• What H/W is needed when problem size (e.g. # cells) is increased?

• Problem dependent!

Gustafson’s Law (1988) – 1/3

Gives an argument against the pessimistic Amdahl’s Law conclusion.

Rather than assume that the problem size is fixed, we should assume that the parallel execution time is fixed.

Define a Scaled Speedup for the case of increasing the number of processors as well as the problem size

Gustafson’s Law – 2/3

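$$s = \text{serial execution time on 1 processor}$$

$$p = \text{parallel part time on 1 processor}$$

$$\text{Normalize such that } s + p = 1$$

$$t_s = s + p, \qquad t_p = s + \frac{p}{n}$$

$$S(n) = \frac{s+p}{s+p/n} = \frac{1}{s+(1-s)/n}$$

Now assume for a parallel computer:

$$s + p = 1$$

$$S_{scaled}(n) = \frac{s+np}{s+p} = s + np = s + (1-s)\,n = n + (1-n)\,s$$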

Gustafson’s Law – 3/3

An Example:

Assume we have n=20 and a serial fraction of s=0.05

S(scaled)=0.05+0.95*20=19.05, while the Speedup according to Amdahl’s Law is:

S=20/(0.05(20-1)+1)=10.26
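The two numbers in this example can be reproduced directly from the formulas. This is a minimal sketch with hypothetical function names; s is the serial fraction and n the number of processors, as on the slide.

```python
def gustafson_scaled_speedup(n: int, s: float) -> float:
    """Gustafson's scaled speedup: S_scaled = s + (1 - s) * n."""
    return s + (1 - s) * n

def amdahl_speedup(n: int, s: float) -> float:
    """Amdahl's Law with serial fraction s: S = n / (1 + (n - 1) * s)."""
    return n / (1 + (n - 1) * s)

N_PROCS = 20
SERIAL = 0.05

print(f"Gustafson: {gustafson_scaled_speedup(N_PROCS, SERIAL):.2f}")
print(f"Amdahl:    {amdahl_speedup(N_PROCS, SERIAL):.2f}")
```

The gap between the two results comes from the assumption, not the arithmetic: Amdahl fixes the problem size, while Gustafson fixes the parallel execution time and lets the problem grow with n.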

Exercise

A computer cluster contains 10 processors, each with a computing power of 200 MFLOPS. What is the performance of the cluster, in MFLOPS, if 10% of the code is serial and 90% of the code is parallel?

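Solution

• If all of the code were parallel, the computing power would be: 10*200 = 2000 MFLOPS

In our case, 10% of the code is executed by a single computer and the remaining 90% by 10 computers. For a program of X operations (X ops), therefore:

$$t = \frac{0.1\,X\,\text{ops}}{200} + \frac{0.9\,X\,\text{ops}}{10 \times 200}$$

$$F = \frac{X\,\text{ops}}{t} = \frac{X\,\text{ops}}{\frac{0.1\,X\,\text{ops}}{200} + \frac{0.9\,X\,\text{ops}}{10 \times 200}} = \frac{2000}{1.9} \approx 1052.6\ \text{MFLOPS}$$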
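The effective rate in this exercise can be checked numerically. This is a minimal sketch with a hypothetical function name; it assumes, as the exercise does, that the serial part runs on one processor and the parallel part is spread evenly over all of them with no communication cost.

```python
def effective_rate(n_procs: int, rate_per_proc: float, serial_fraction: float) -> float:
    """Effective rate (same units as rate_per_proc) for one unit of work.

    The serial fraction runs on a single processor; the remainder runs on
    all n_procs processors at once. Communication overhead is ignored.
    """
    serial_time = serial_fraction / rate_per_proc
    parallel_time = (1 - serial_fraction) / (n_procs * rate_per_proc)
    return 1.0 / (serial_time + parallel_time)

print(f"10% serial:     {effective_rate(10, 200.0, 0.10):.1f} MFLOPS")
print(f"fully parallel: {effective_rate(10, 200.0, 0.00):.1f} MFLOPS")
```

A 10% serial part is enough to cut the usable performance to roughly half of the 2000 MFLOPS peak.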

Domain Decomposition

• Mapping the problem onto the topology of the parallel cluster

• Dividing the problem into separate units of computation in an optimal way:

• Load Balance

• Granularity

Load Balance – 1/2

• All processors must be kept busy!

• The parallel cluster may not be homogeneous (CPUs, memory, users/jobs, network…)

Load Balance – 2/2

Static versus Dynamic techniques

Static:

•Algorithmic assignment based on input; won’t change

•Low runtime overhead

•Computation must be predictable

•Preferable when applicable (except in multiprogrammed/heterogeneous environment)

Dynamic:

•Adapt at runtime to balance load

•Can increase communication and reduce locality

•Can increase task management overheads
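The difference between the two approaches can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions: task costs are known numbers, the static scheme hands each worker a contiguous block of tasks, the dynamic scheme gives the next task to whichever worker becomes free first, and the communication and task-management overheads listed above are ignored. The function names and task costs are hypothetical.

```python
import heapq

def static_makespan(costs: list[float], n_workers: int) -> float:
    """Static block assignment: worker i gets a fixed, contiguous chunk of tasks."""
    chunk = -(-len(costs) // n_workers)  # ceiling division
    loads = [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_workers)]
    return max(loads)

def dynamic_makespan(costs: list[float], n_workers: int) -> float:
    """Dynamic assignment: each task goes to the worker that becomes free first."""
    finish_times = [0.0] * n_workers
    heapq.heapify(finish_times)
    for c in costs:
        earliest = heapq.heappop(finish_times)   # worker that is free soonest
        heapq.heappush(finish_times, earliest + c)
    return max(finish_times)

# Hypothetical, deliberately unbalanced task costs.
TASKS = [8, 7, 6, 5, 1, 1, 1, 1]

print("static :", static_makespan(TASKS, 2))
print("dynamic:", dynamic_makespan(TASKS, 2))
```

With these costs the static split leaves one worker with almost all of the work, while the dynamic scheme reaches the ideal of half the total. On a real cluster the dynamic gain has to be weighed against the extra communication it causes.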

Determining Task Granularity

• Task granularity: amount of work associated with a task

• General rule:

– Coarse-grained => often less load balance

– Fine-grained => more overhead; often more comm., contention

Algorithms: Adding 8 Numbers
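The figure for this slide did not survive extraction. The following is a minimal sketch that assumes it showed the usual pairwise (tree) summation, in which the additions within one step are independent of each other and could run on different processors. The code runs serially and only counts the steps; the function name is hypothetical.

```python
def tree_sum(values):
    """Pairwise (tree) summation of a non-empty sequence.

    Returns (total, steps). All additions inside one step are independent,
    so with enough processors each step takes the time of a single addition.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Add neighbouring pairs; these additions do not depend on each other.
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd element is carried to the next step
            next_level.append(level[-1])
        level = next_level
        steps += 1
    return level[0], steps

total, steps = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(f"sum = {total}, parallel steps = {steps}")
```

For 8 numbers the tree needs 3 steps (log2 of 8) instead of the 7 sequential additions of a simple loop.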

Summary – Terms Defined – 1

• Flynn Taxonomy

• Message Passing

• Shared Memory

• Bandwidth

• Latency

• Bisection Width

• Diameter

• Connectivity

• Cost

• Meshes, Trees, Hypercubes…

• Deadlock

Summary – Terms Defined - 2

• Embedding

• Process

• Amdahl’s Law

• Speedup Factor

• Efficiency

• Cost

• Scalability

• Gustafson’s Law

• Load Balance

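Next Week's Class…

• The next class will be held in the computer lab, on the third floor of the Electrical and Computer Engineering building

• Don't forget to open an account on the parallel cluster and on the classroom computers (Email + Natalia)!!! A student who does not open a computer account will not be able to do the exercises!!!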

Task #2

• Go to: http://www.lam-mpi.org/tutorials/

Download and print the file:

“MPI quick reference sheet”

• Linux Tutorial:

Go to: “http://www.ctssn.com/” and study at least lessons 1, 2 and 3.

Cluster Computing

• COTS – Commodity Off The Shelf

• Free O/S, e.g. Linux

• LOBOS – Lots Of Boxes On the Shelf

• PCs connected by a fast network

The End of the Dinosaur Era

• Cray-J932

• 16 Processors

• 200 MFLOPS per CPU

• 3.2 GFLOPS

The Dwarves 1/5

• 12 PCs of several types

• Red Hat Linux 6.0-6.2

• Fast Ethernet – 100Mbps

• Myrinet Network 1.28+1.28Gbps, SAN

The Dwarves – 2/5

There are 12 computers with Linux operating system.

dwarf[1-12] or dwarf[1-12]m

dwarf1[m], dwarf3[m]-dwarf7[m] - Pentium II 300 MHz,

dwarf9[m]-dwarf12[m] - Pentium III 450 MHz (dual CPU),

dwarf2[m], dwarf8[m] - Pentium III 733 MHz (dual CPU).

The Dwarves – 3/5

• 6 PII at 300MHz processors

• 8 PIII at 450MHz processors

• 4 PIII at 733MHz processors

• Total: 18 processors, ~8GFlops

The Dwarves 4/5

• dwarf1 .. dwarf12 – node names for the Fast Ethernet link

• dwarf1m .. dwarf12m – node names for the Myrinet network

The Dwarves 5/5

• GNU FORTRAN / C Compilers

• PVM / MPI

Cluster Computing - 1

Linux

http://www.ee.bgu.ac.il/~tel-zur/linux.html

Linux

In Google:

Linux: 38,600,000

Microsoft: 21,500,000

Bible: 7,590,000
