performance. measure, report, and summarize make intelligent choices see through the marketing...

27
Performance

Upload: marcus-russell

Post on 01-Jan-2016

220 views

Category:

Documents


2 download

TRANSCRIPT

Performance

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organiza-

tional motivation

Questions Why is some hardware better than others for dif-

ferent programs?

What factors of system performance are hardware related?(e.g., Do we need a new machine, or a new oper-ating system?)

How does the machine's instruction set affect per-formance?

Performance

2

Which of these airplanes has the best perfor-mance?

How much faster is the Concorde compared to the 747?

How much bigger is the 747 than the Douglas DC-8?

Performance Comparison

Airplane Passengers Range (mi) Speed (mph)

Boeing 737-100 101 630 598Boeing 747 470 4150 610BAC/Sud Concorde 132 4000 1350Douglas DC-8-50 146 8720 544

3

TIME, TIME, TIME!

Response Time (latency) How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query?

Throughput How many jobs can the machine run at once? What is the average execution rate? How much work is getting done?

If we upgrade a machine with a new processor, what do we increase?

If we add a new machine to the lab, what do we in-crease?

Computer Performance

4

Elapsed Time Counts everything (disk and memory accesses,

I/O , etc.) A useful number, but often not good for compari-

son purposes

CPU time Doesn't count I/O or time spent running other pro-

grams Can be broken up into system time, and user time

Our focus: user CPU time Time spent executing the lines of code that are

“in” our program

Execution Time

5

For some program running on machine X,

PerformanceX = 1 / Execution timeX

“X is n times faster than Y”

PerformanceX / PerformanceY = n (rela-

tive performance)

Problem: Machine A runs a program in 20 seconds Machine B runs the same program in 25 seconds

Is this program faster?

Definition of Performance

6

Clock Cycles Instead of reporting execution time in seconds,

We often use cycles

Clock “ticks” indicate when to start activities (one abstraction):

Cycle time = time between ticks = seconds per cy-cle

Clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)

A 4 Ghz. clock has a cycle time

time

seconds

program

cycles

program

seconds

cycle

(ps) spicosecond 2501210 9104

1

7

How to Improve Performance Performance

So, to improve performance (everything else be-ing equal) You can either increase or decrease ________ the number of required cycles for a program, or ________ the clock cycle time or, said another way, ________ the clock rate.

8

seconds

program

cycles

program

seconds

cycle

How Many Cycles Are Required? Could assume that number of cycles equals number

of instructions

This assumption is incorrect Different instructions take different amounts of time on dif-

ferent machines Why?

(hint: remember that these are machine instructions, not lines of C code)9

time

1st inst

ruct

ion

2n

d inst

ruct

ion

3rd inst

ruct

ion

4th

5th

6th ...

Different Cycles for Different Instructions An instruction can take more cycles than others

Multiplication takes more cycles than addition Floating point instructions take longer than inte-

ger ones Accessing memory takes more time than access-

ing registers

Important point Changing the cycle time often changes the number of cy-

cles required for various instructions (more later) 10

time

One favorite program runs in 10 seconds on computer A This machine runs at 4 GHz clock frequency

Computer designer wants to build a new ma-chine B, that will run it in 6 seconds. The designer can use new (or perhaps more ex-

pensive) technology to substantially increase the clock rate.

But he has informed that the clock rate increase will affect the rest of the CPU design, causing ma-chine B to require 1.2 times as many clock cycles as machine A for the same program.

What clock rate should we tell the designer to tar-get?

Don't Panic, can easily work this out from ba-sic principles

Example

11

Now that we understand cycles A given program will require

Some number of instructions (machine instructions) Some number of cycles Some number of seconds

We have a vocabulary that relates these quanti-ties: Cycle time (seconds per cycle) Clock rate (cycles per second) CPI (cycles per instruction)

a floating point intensive application might have a higher CPI MIPS (millions of instructions per second)

this would be higher for a program using simple instructions

12

Performance Performance is determined by execution time

Do any of the other variables equal perfor-mance? # of cycles to execute program? # of instructions in program? # of cycles per second? Average # of cycles per instruction? Average # of instructions per second?

Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.13

cycle

seconds

ninsturctio

cycles

program

nsinstructio

cycle

seconds

program

cycles

program

seconds

CPI Example Suppose two implementations of the same in-

struction set architecture (ISA). For some pro-gram,

Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2

What machine is faster for this program, and by how much?

If two machines have the same ISA, which of our quantities will always be identical? Clock rate, CPI, execution time, # of instructions, MIPS

14

Compiler designer wants to decide between two code sequences for a particular machine. Based on the hardware implementation, three

classes of instructions exist Instructions in Class A require one cycle Instructions in Class B require two cycles Instructions in Class C require three cycles

Two code sequences The 1st code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C

The 2nd code sequence has 6 instructions: 4 of A, 1 of B, and 1 of C

Which sequence will be faster? How much? What is the CPI for each sequence?

Number of Instructions Example

15

Two different compilers are being tested for a 4 GHz machine. Three different classes of instructions: Class A,

Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

The first compiler's code uses 5 million Class A in-structions, 1 million Class B instructions, and 1 mil-lion Class C instructions.

The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

Which sequence will be faster according to MIPS?

Which sequence will be faster according to execution time?

MIPS Example

16

Benchmarks Performance best determined by running a real

application Use programs typical of expected workload Or, typical of expected class of applications

e.g., compilers/editors, scientific applications, graphics, etc.

Small benchmarks Nice for architects and designers Easy to standardize Can be abused

SPEC (Standard Performance Evaluation Corpora-tion) Companies have agreed on a set of real program and in-

puts Valuable indicator of performance (and compiler technol-

ogy) Can still be abused

17

Benchmark Games An embarrassed Intel Corp. acknowledged Friday that a bug in a

software program known as a compiler had led the company to over-state the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of “cheating” on standardized performance tests…The error was pointed out to In-tel two days ago by a competitor, Motorola …came in a test known as SPECint92…Intel acknowledged that it had “optimized” its com-piler to improve its test scores. The company had also said that it did not like the practice but felt to compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel’s problem is the practice of “tuning” compiler programs to rec-ognize certain computing problems in the test and then substituting special handwritten pieces of code…

Saturday, January 6, 1996 New York Times

18

SPEC ‘89 Compiler “enhancements” and performance

0

100

200

300

400

500

600

700

800

tomcatvfppppmatrix300eqntottlinasa7doducspiceespressogcc

BenchmarkCompiler

Enhanced compiler

SP

EC

pe

rfo

rman

ce r

atio

19

SPEC CPU2000 CINT200 – integer programs written in C and C++ CFP2000 – floating-point programs written in Fortran & C Now, SPEC CPU2006 replaces CPU2000.

20

Integer benchmarks (CINT2000)Name Type

FP benchmarks (CFP2000)Name Type

gzip data compression wupwise quantum chromodynamicsvpr FPGA circuit placement &

routingswim shallow water modeling

gcc C compiler mgrid multi-grid solver in 3D potential fieldmcf minimum cost network flow

solver (combinatorial opt.)applu parabolic/elliptic partial differential equa-

tionscrafty chess program mesa 3D graphics libraryparser natural language process-

inggalgel fluid dynamics

eon ray tracing art neural network simulationperlbmk perl application equake seismic wave propagation: finite element

simulationgap group theory, interpreter facerec face image recognitionvortex object oriented database ammp computational chemistrybzip2 data compression lucas primality testingtwolf place and route simulator fma3d finite element crash simulation

sixtrack particle accelerator modelapsi meteorology: pollutant distribution

SPEC CPU2000 Does doubling the clock rate double the per-

formance? Can a machine with a slower clock rate have

better performance?

21

Clock rate in MHz

500 1000 1500 30002000 2500 35000

200

400

600

800

1000

1200

1400

Pentium III CINT2000

Pentium 4 CINT2000

Pentium III CFP2000

Pentium 4 CFP2000

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

SPECINT2000 SPECFP2000 SPECINT2000 SPECFP2000 SPECINT2000 SPECFP2000

Always on/maximum clock Laptop mode/adaptiveclock

Minimum power/minimumclock

Benchmark and power mode

Pentium M @ 1.6/0.6 GHz

Pentium 4-M @ 2.4/1.2 GHz

Pentium III-M @ 1.2/0.8 GHz

SPECweb99 Throughput benchmark for web servers

SPEC CPU focuses on the execution time, but SPECweb fo-cuses on the maximum number of connections a web server can support.

SPEC CPU targets uni-processor performance, SPECweb often uses multiprocessors to measure web-server throughput

SPEC CPU primarily targets the performance of CPU and main memory, SPECweb includes disk system and net-work

Now, SPECweb2005 replaces SPECweb99

22

Other Benchmarks EEMBC

Applications on embedded systems such as communica-tion devices, automobiles, etc.

Mediabench Set of multimedia applications (i.e., codec, graphics, etc.)

NAS Parallel benchmarks from NASA

SPLASH, PARSEC Multithreaded benchmarks for mutiprocessors

Etc.

23

Execution time after improvement (Exec_timenew) Exec_timenew

= Exec_timeunaffected + (Exec_timeaffected /

Amount_of_improvement)

= Exec_timeorg X ( (1 – f )+ f / S )

Example:Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time.

How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

How about making it 5 times faster?

Amdahl's Law

24

1- f f

Exec_timeorg

Exec_timenew

Improved by S

25

Speedup and Amdahl’s Law Speedup

= Exec_timeorg / Exec_timenew

= 1 ( (1 – f )+ f / S )

Principles Make the common case fast

As f → 1, speedup → S Speedup is limited by the fraction of code that can be op-

timized As S → ∞, speedup → 1 / (1 – f )

Uncommon case can become the common one after im-provement

Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-

point enhancement is 10 seconds What will the speedup be, if half of the 10 seconds is spent ex-

ecuting floating-point instructions?

We are looking for a benchmark to show off the new float-ing-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with

the old floating-point hardware. How much of the execution time would floating-point instruc-

tions have to account for in this program in order to yield our desired speedup on this benchmark?

Example

26

Performance is specific to a particular pro-gram(s) Total execution time is a consistent summary of

the performance

For a given architecture, performance in-creases come from: Increases in clock rate (without adverse CPI af-

fects) Improvements in processor organization that lower

CPI Compiler enhancements that lower CPI and/or in-

struction count Algorithm/Language choices that affect instruction

count

Pitfall: Expecting improvement in one aspect of a ma-

chine’s performance to affect the total perfor-mance

Summary

27