ruby3x3: how are we going to measure 3x

Upload: matthew-gaudet

Post on 24-Jan-2018

Category:

Technology


TRANSCRIPT

Page 1: Ruby3x3: How are we going to measure 3x

Ruby3x3: How are we going to measure 3x?

Page 2: Ruby3x3: How are we going to measure 3x

Matthew Gaudet

Developer on the

Eclipse OMR project

Page 3: Ruby3x3: How are we going to measure 3x

Cross platform components for

building reliable, high performance language

runtimes

github.com/eclipse/omr

@eclipseOMR

Page 4: Ruby3x3: How are we going to measure 3x

Ruby 3x3: The Goal.

[Bar chart: Performance (y-axis, 0 to 3.5) for Ruby 2.0 and Ruby 3.0]

Page 5: Ruby3x3: How are we going to measure 3x

Agenda

Let’s talk about benchmarking!
• Some definitions
• Some philosophy
• Some pitfalls

Ruby 3x3
• Some Thoughts from Me.

Page 6: Ruby3x3: How are we going to measure 3x

Benchmarking.

Art + Science

Page 7: Ruby3x3: How are we going to measure 3x

Definition

Benchmark:

A piece of computer code run in

order to gather measurements

for comparison.

Page 8: Ruby3x3: How are we going to measure 3x

Definition

Benchmark:

Comparing the execution time of different

interpreters, or options.

Comparing the execution time of algorithms

Comparing the accuracy of different machine

learning algorithms

Page 9: Ruby3x3: How are we going to measure 3x

The Art of Benchmarking: What do you run?

Page 10: Ruby3x3: How are we going to measure 3x

The Benchmark Continuum

[Diagram: a continuum running from Microbenchmark, through Application Kernel, to Full Application]

Page 11: Ruby3x3: How are we going to measure 3x

Microbenchmarks

Pros

Often easy to set up and run.

Targeted to a particular aspect.

Fast acquisition of data.

Cons

Exaggerates effects.

Not typically generalizable.

A very small program written to explore the performance of one aspect of the system under test.
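As a minimal sketch of what such a program can look like: the operation under test here (`build_string`) and the repetition count are made up for illustration, and only Ruby's standard `benchmark` library is assumed.

```ruby
require "benchmark"

# Hypothetical operation under test: appending to a string with <<.
def build_string(n)
  s = +""
  n.times { |i| s << i.to_s }
  s
end

# Repeat the operation many times so the measured interval is
# comfortably larger than the timer's resolution.
def microbenchmark(repetitions)
  Benchmark.realtime do
    repetitions.times { build_string(100) }
  end
end

seconds = microbenchmark(10_000)
puts format("%.6f s for 10_000 repetitions", seconds)
```

A loop like this isolates one aspect (string appends) and says little about how a whole application behaves, which is exactly the trade-off listed above.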

Page 12: Ruby3x3: How are we going to measure 3x

Full Applications

Pros

Immediate and obvious real

world impact!

Cons

Small effects can be swamped in natural application variance.

Can be complicated to set up, or slow to run!

Benchmarking a whole application

Page 13: Ruby3x3: How are we going to measure 3x

Application Kernel

Pros

Tight connection to real

world code.

Typically more

generalizable.

Cons

Difficult to know how much of an application should be included vs. mocked.

A particular part of an application extracted for the express purpose of constructing a benchmark.

Page 14: Ruby3x3: How are we going to measure 3x

Pitfalls in benchmark design

Un-Ruby-Like Code:

Code that looks like another language.

“You can write FORTRAN in any language”

Code that never produces garbage.

Code without exceptions

Page 15: Ruby3x3: How are we going to measure 3x

Pitfalls in benchmark design

Input Data is a key part of many benchmarks: Watch out

for weird input data!

Imagine an MP3 compressor benchmark

– Inputs are:

1. Silence. Weird because most MP3s are not silence.

2. White noise. Weird because most MP3s have some structure.

– Reduces the generalizability of the results!

Page 16: Ruby3x3: How are we going to measure 3x

The Art of Benchmarking: What do you run?

What do you measure?

Page 17: Ruby3x3: How are we going to measure 3x

Time?

Throughput?

Latency?

Page 18: Ruby3x3: How are we going to measure 3x

Definition

Wall-clock time:

The measurement of time relative to a clock independent of the process being timed.

$ time sleep 1

real 0m1.003s
user 0m0.000s
sys 0m0.000s

Page 19: Ruby3x3: How are we going to measure 3x

Definition

CPU time:

Measurement of how much of

the CPU the process actually

used

$ time sleep 1

real 0m1.003s
user 0m0.000s
sys 0m0.000s
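The same distinction can be observed from inside a Ruby process. This is a sketch that assumes a platform where `Process::CLOCK_PROCESS_CPUTIME_ID` is available (typical on Linux and macOS); the `measure` helper and the two workloads are illustrative names, not part of any harness mentioned in the talk.

```ruby
# Measure wall-clock time and process CPU time around a block.
def measure
  wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_start  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  yield
  {
    wall: Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start,
    cpu:  Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_start
  }
end

sleeping = measure { sleep 0.2 }                       # waits, barely uses the CPU
spinning = measure { 2_000_000.times { |i| i * i } }   # keeps the CPU busy

puts format("sleep: wall %.3f s, cpu %.3f s", sleeping[:wall], sleeping[:cpu])
puts format("spin:  wall %.3f s, cpu %.3f s", spinning[:wall], spinning[:cpu])
```

As with `time sleep 1`, the sleeping block should show wall-clock time well above its CPU time, while the busy loop should show the two much closer together.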

Page 20: Ruby3x3: How are we going to measure 3x

Definition

Throughput:

A count of operations that occur

per unit of time.

Page 21: Ruby3x3: How are we going to measure 3x

Definition

Latency:

The time it takes for a response

to occur after stimulus.
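Throughput and latency can be read off the same run. The sketch below is an illustration only: `throughput_and_latency` and the workload in the block are hypothetical, and timing every single operation adds overhead of its own.

```ruby
def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Runs the block `operations` times and reports:
#   throughput    - operations per second over the whole run
#   mean_latency  - average time of one operation
#   worst_latency - slowest single operation
def throughput_and_latency(operations)
  started = now
  latencies = Array.new(operations) do
    t0 = now
    yield
    now - t0
  end
  elapsed = now - started
  {
    throughput: operations / elapsed,
    mean_latency: latencies.sum / operations,
    worst_latency: latencies.max
  }
end

stats = throughput_and_latency(1_000) { (1..500).reduce(:+) }
puts format("%.0f ops/s, mean %.6f s, worst %.6f s",
            stats[:throughput], stats[:mean_latency], stats[:worst_latency])
```

The two numbers can move independently: a change that batches work may raise throughput while making the worst-case latency longer.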

Page 22: Ruby3x3: How are we going to measure 3x

The Art of Benchmarking: What do you run?

What do you measure?

What do you report?

Page 23: Ruby3x3: How are we going to measure 3x

Raw Measurements?

Speedup?

Page 24: Ruby3x3: How are we going to measure 3x

Definition

Speedup:

A ratio computed between a

baseline and experimental time

measurement.

$$\text{Speedup} = \frac{T_{\text{baseline}}}{T_{\text{experimental}}}$$

Page 25: Ruby3x3: How are we going to measure 3x

The Science of Benchmarking

Page 26: Ruby3x3: How are we going to measure 3x

An aside on misleading with speedup.

Speedup:

A ratio computed between a

baseline and experimental time

measurement.

Page 28: Ruby3x3: How are we going to measure 3x

An aside on misleading with speedup.

“He who controls the baseline

controls the speedup”

Page 29: Ruby3x3: How are we going to measure 3x

An aside on misleading with speedup.

“Our parallelization system shows

linear speedup as the number of

threads increases”

Page 30: Ruby3x3: How are we going to measure 3x

An aside on misleading with speedup.

[Bar chart: Speedup (y-axis, 0 to 9) for 1, 2, 4, and 8 threads]

Page 31: Ruby3x3: How are we going to measure 3x

An aside on misleading with speedup.

Measurement                    Time (s)
Original Sequential Program    10.0
Parallelized 1 thread          100.0
Parallelized 2 threads         50.0
Parallelized 4 threads         25.0
Parallelized 8 threads         12.5

The distinction between relative speedup and absolute speedup.
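Working through the numbers on this slide makes the distinction concrete. In this sketch, relative speedup uses the parallelized 1-thread run as the baseline and absolute speedup uses the original sequential program; the helper name `speedup` is an assumption for illustration.

```ruby
# Speedup is baseline time divided by experimental time.
def speedup(baseline, experimental)
  baseline / experimental
end

sequential = 10.0                                          # original sequential program (s)
parallel = { 1 => 100.0, 2 => 50.0, 4 => 25.0, 8 => 12.5 } # threads => time (s)

parallel.each do |threads, time|
  relative = speedup(parallel[1], time)   # baseline: parallelized, 1 thread
  absolute = speedup(sequential, time)    # baseline: original sequential program
  puts format("%d thread(s): relative %.1fx, absolute %.1fx", threads, relative, absolute)
end
```

With these figures the relative speedup at 8 threads is 8x, yet the absolute speedup is 0.8x: the "linearly scaling" system is still slower than the program it replaced.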

Page 32: Ruby3x3: How are we going to measure 3x

How you measure affects what you measure.

Page 33: Ruby3x3: How are we going to measure 3x

Both of these are valid benchmarks!

$ cat test.rb
...
puts Benchmark.measure {
  1_000_000.times {
    compute_foo()
  }
}
$ for i in `seq 1 10`; do ruby test.rb; done

vs.

...
10.times {
  puts Benchmark.measure {
    1_000_000.times {
      compute_foo()
    }
  }
}

But they’re going to measure (and may encourage the optimization of) two different things!

Page 34: Ruby3x3: How are we going to measure 3x

Definition

Warmup:

The time from application start

until it hits peak performance.

[Chart: Time per Iteration (s) for iterations 1 to 11: 100, 64, 69, 36, 25, 30, 25, 26, 25, 26, 25]

Page 35: Ruby3x3: How are we going to measure 3x

When has warmup finished?

Despite this, even knowing warmup exists is important: It allows us to choose methodologies that can accommodate the possibility!
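One crude way to accommodate warmup is to record every iteration and look for the point where the times settle. The sketch below is a heuristic, not a rigorous test: the `window` and `tolerance` parameters are arbitrary assumptions, and the sample data is the per-iteration series from the warmup chart as read here (100, 64, 69, 36, 25, 30, 25, 26, 25, 26, 25).

```ruby
# Returns the index of the first iteration that starts a run of `window`
# consecutive times whose spread (max - min) is within `tolerance` of the
# smallest time in that run, or nil if no such run exists.
def warmup_end(times, window: 3, tolerance: 0.10)
  times.each_cons(window).find_index do |slice|
    (slice.max - slice.min) / slice.min.to_f <= tolerance
  end
end

times = [100, 64, 69, 36, 25, 30, 25, 26, 25, 26, 25]
index = warmup_end(times)
puts index ? "times look stable from iteration #{index + 1}" : "never stabilized"
```

A different window or tolerance will give a different answer, which is part of why "when has warmup finished?" is hard to settle in general.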

Page 36: Ruby3x3: How are we going to measure 3x

Definition

Run-to-Run Variance

The observed effect that identical runs do not have identical times.

$ for i in `seq 1 5`; do ruby -I../../lib/ string-equal.rb --loopn 1 1000; done;
1.347334558
1.348350632
1.30690478
1.314764977
1.323862345

Page 37: Ruby3x3: How are we going to measure 3x

Methodology:

An incomplete list of decisions that need to be made when

developing benchmarking methodology:

1. Does your methodology account for warmup?

2. How are you accounting for run-to-run variance?

3. How are you accounting for the effects of the garbage

collector?

Page 38: Ruby3x3: How are we going to measure 3x

Pitfalls in benchmark design

Accounting for warmup often means producing

intermediate scores, so you can see when they stabilize.

If you aren’t accounting for warmup, you may find

that you miss out on peak performance.

Page 39: Ruby3x3: How are we going to measure 3x

Pitfalls in benchmark design

Account for run-to-run variance by running multiple times, and presenting confidence intervals!

Be sure your methodology doesn’t encourage wild variations in performance though!
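As a sketch of what presenting a confidence interval involves, the code below computes a mean and a 95% interval for the five `string-equal.rb` timings shown on the run-to-run variance slide. It assumes the runs are independent and roughly normally distributed, and the critical value 2.776 (two-sided 95% Student's t for 4 degrees of freedom) is hard-coded rather than computed.

```ruby
def mean(xs)
  xs.sum / xs.size.to_f
end

# Sample standard deviation (divides by n - 1).
def sample_stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))
end

# t_critical must match the sample size: 2.776 is the two-sided 95%
# Student's t value for 5 samples (4 degrees of freedom).
def confidence_interval(xs, t_critical)
  half_width = t_critical * sample_stddev(xs) / Math.sqrt(xs.size)
  m = mean(xs)
  [m - half_width, m + half_width]
end

runs = [1.347334558, 1.348350632, 1.30690478, 1.314764977, 1.323862345]
low, high = confidence_interval(runs, 2.776)
puts format("mean %.4f s, 95%% CI [%.4f, %.4f]", mean(runs), low, high)
```

Two configurations whose intervals overlap heavily are hard to tell apart from this many runs; more runs narrow the interval.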

Page 40: Ruby3x3: How are we going to measure 3x

Be aware, benchmarks can act Weird

Page 41: Ruby3x3: How are we going to measure 3x

Garbage Collector Impact

ruby -J-Xmx330m -J-Xms330m -I../../lib/ connected.rb --loopn 10 1
0.426412300002994
0.35442964400863275
0.3484781830047723
0.36281039800087456
0.3565745719970437
0.36179181998886634
0.31713732800562866
0.3365019329939969
0.305397536008968
0.3006619710067753

ruby -J-Xmx33m -J-Xms33m -I../../lib/ connected.rb --loopn 10 1
0.5431441880064085
0.8410410610085819
0.7975159170018742
0.8458756269974401
0.9974212259985507
1.0887025539996102
1.067053010003292
1.057003531997907
1.0708161939983256
1.0480617069988512

Page 42: Ruby3x3: How are we going to measure 3x

Garbage Collector Impact

Garbage collector impact can make benchmarks incredibly difficult to

compare:

The Ruby+OMR Preview uses the OMR GC technology, including a change to move off-heap data onto the heap.

Side effect of this is that it’s crazy difficult to compare against the default

ruby: there’s an entirely different set of data on the heap!

If heap size adapts to machine memory, you’ll need to figure out how to

lock it to give good comparisons across machines

[Diagram: a string backed by malloc vs. a string backed by an OMRBuffer]
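One way to keep the collector visible in results is to record GC activity next to each timing. This sketch assumes CRuby, where `GC.count` and `GC.stat(:total_allocated_objects)` are available; other implementations may expose different counters, and `with_gc_stats` is a hypothetical helper name.

```ruby
# Runs the block and reports elapsed time together with how many GC runs
# and object allocations happened while it executed (CRuby counters).
def with_gc_stats
  GC.start # begin from a freshly collected heap
  runs_before = GC.count
  allocated_before = GC.stat(:total_allocated_objects)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  {
    seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started,
    gc_runs: GC.count - runs_before,
    allocated_objects: GC.stat(:total_allocated_objects) - allocated_before
  }
end

stats = with_gc_stats { 200_000.times.map { |i| "item #{i}" } }
puts format("%.3f s, %d GC runs, %d objects allocated",
            stats[:seconds], stats[:gc_runs], stats[:allocated_objects])
```

If two runs differ in time and also differ in GC runs, the heap configuration is a likely suspect before the code under test is.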

Page 43: Ruby3x3: How are we going to measure 3x

Benchmarking:

Page 44: Ruby3x3: How are we going to measure 3x

Page 46: Ruby3x3: How are we going to measure 3x

User Error

$ time ruby their_implementation.rb 100000

real 0m10.003s
user 0m08.001s
sys 0m02.007s

$ time ruby my_implementation.rb 10000

real 0m1.003s
user 0m0.801s
sys 0m0.206s

10x speedup!

Pro Tip: Use a harness that keeps you out of the benchmarking process.

Aim for reproducibility!
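A harness can rule out this particular mistake by generating every command line from one shared input size. The sketch below is hypothetical: `commands_for` and `time_command` are illustrative names, and the script names are the ones from the example above, which are assumed to accept the input size as their only argument.

```ruby
require "rbconfig"

# Build one command per script, all with the same input size, so the
# inputs cannot drift apart between implementations.
def commands_for(scripts, input_size)
  scripts.map { |script| [RbConfig.ruby, script, input_size.to_s] }
end

# Run a command in a fresh process and return its wall-clock time.
def time_command(command)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  ok = system(*command)
  raise "command failed: #{command.join(' ')}" unless ok
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

commands = commands_for(%w[their_implementation.rb my_implementation.rb], 100_000)
commands.each { |command| puts command.join(" ") }
```

Once the scripts exist, `commands.map { |command| time_command(command) }` would time both with identical arguments; the command list itself can be logged so a run can be reproduced later.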

Page 47: Ruby3x3: How are we going to measure 3x

[Chart: Time (s) vs. Iterations, annotated with "Unplugs Laptop", "Power Saving Mode", and "Return Power"]

Page 51: Ruby3x3: How are we going to measure 3x

Other Hardware Effects to watch for!

TurboBoost (and similar): Frequency scaling based on CPU temperature, which can in turn depend on the season, the location, and the rack.

Even in the cloud! [1]

[1]: http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html

Page 52: Ruby3x3: How are we going to measure 3x

Software Pitfalls

What about your backup service?

Long sequence of benchmarks… do you have

automatic software updates installed?

Do your system administrators know you are

benchmarking?

Page 53: Ruby3x3: How are we going to measure 3x

What about your screensaver?

Page 54: Ruby3x3: How are we going to measure 3x

Paranoia is a matter of Effect Sizes

Hardware Changes:

– Disable turbo boost,

– Disable hyperthreading.

Krun tool:

– Set ulimit for heap and stack.

– Reboot machine before execution

– Monitor dmesg for unexpected output

– Monitor temperature of machine.

– Disable pstates

– CPU Governor set to performance mode.

– Perf sample rate control.

– Disable ASLR.

– Create a new user account for each run

http://arxiv.org/pdf/1602.00602v1.pdf

Page 55: Ruby3x3: How are we going to measure 3x

Performance improvements compound!

3x
is 10 increases of 11%
is 25 increases of 4.5%
is 100 increases of 1.1%
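The arithmetic behind these figures is compound growth: n equal improvements of rate r multiply out to (1 + r)^n, so the rate needed for a 3x total is 3^(1/n) - 1. The helper names below are illustrative.

```ruby
# Per-step improvement needed so that `steps` equal improvements
# compound to `target` overall.
def required_increase(target, steps)
  target**(1.0 / steps) - 1.0
end

# Overall factor after `steps` improvements of `increase` each.
def compounded(increase, steps)
  (1.0 + increase)**steps
end

[10, 25, 100].each do |steps|
  percent = required_increase(3.0, steps) * 100
  puts format("%3d steps: %.2f%% per step", steps, percent)
end
```

The slide's percentages are rounded: ten steps of exactly 11% compound to a little under 3x, so the exact per-step rate for ten steps is slightly higher.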

Page 56: Ruby3x3: How are we going to measure 3x

Ruby 3x3: The Process

[Bar chart: Performance (y-axis, 0 to 3.5) for Ruby 2.0, 2.1, 2.2, 2.3, 2.4, six further releases labelled "Ruby 2.?", and Ruby 3.0]

(Made up data for illustration only)

Page 57: Ruby3x3: How are we going to measure 3x

Philosophizing

Page 58: Ruby3x3: How are we going to measure 3x

Philosophy

Benchmarks drive change.

– What you measure is what people try to change.

– What you don’t measure may not change how you want.

Page 59: Ruby3x3: How are we going to measure 3x

Squeezing a Water Balloon

Be sure to measure associated metrics to have a clear-headed view of tradeoffs:

For example: JIT Compilation:

Trade startup speed for peak speed.

Trade footprint for speed.

Page 60: Ruby3x3: How are we going to measure 3x

Benchmarks age!

Benchmarks can be wrung of all their possible

performance at some point.

Using the same benchmarks for too long can lead to

shortsighted decisions driven by old benchmarks.

Idiomatic code evolves in a language.

Benchmark use of language features can help drive

adoption!

– Be sure to benchmark desirable new language features!

Page 61: Ruby3x3: How are we going to measure 3x

Benchmarking 3x3

Page 62: Ruby3x3: How are we going to measure 3x

https://twitter.com/tenderlove/status/765288219931881472

Page 63: Ruby3x3: How are we going to measure 3x

Ruby Community has some great starting points!

Page 64: Ruby3x3: How are we going to measure 3x

Recall: Benchmarks drive change

Thought: Choose 9 application kernels that represent what we want from a future CRuby!

• Why 9?
• Too many benchmarks can diffuse effort.
• Also! 3x3 = 9!

¯\_(ツ)_/¯

Page 65: Ruby3x3: How are we going to measure 3x

Brainstorming on the nine?

1. Some CPU intensive applications:
   • OptCarrot, Neural Nets, Monte Carlo Tree Search, PSD filter pipeline?

2. Some memory intensive application:
   • Large tree mutation benchmark?

3. A startup benchmark:
   • time ruby -e "def foo; '100'; end; puts foo"?

4. Some web application framework benchmarks.

Page 66: Ruby3x3: How are we going to measure 3x

Choose a methodology that drives the change we want in CRuby.

Want great performance, but not huge warmup

times?

– Only run 5 iterations, and score the last one?

Don’t want to deal with warmup?

– Don’t run iterations: Score the first run!

I ♥ Error Bars
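The two scoring policies above differ only in which iteration counts. This sketch makes that choice explicit; `iteration_times`, `score`, the policy names, and the workload are all hypothetical.

```ruby
# Time each of `iterations` in-process runs of the block.
def iteration_times(iterations)
  Array.new(iterations) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# :first rewards fast startup; :last rewards performance after a short,
# bounded warmup.
def score(times, policy)
  case policy
  when :first then times.first
  when :last  then times.last
  else raise ArgumentError, "unknown policy: #{policy}"
  end
end

times = iteration_times(5) { (1..50_000).sum { |i| i * i } }
puts format("first-run score: %.6f s", score(times, :first))
puts format("last-of-5 score: %.6f s", score(times, :last))
```

Whichever policy is chosen, an implementation is rewarded only for what gets scored, so a last-of-5 policy tolerates at most four iterations of warmup.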

Page 67: Ruby3x3: How are we going to measure 3x

One last idea…

Page 68: Ruby3x3: How are we going to measure 3x

What about a more ambitious choice?

Page 69: Ruby3x3: How are we going to measure 3x

Use the ecosystem!

Add a standard performance harness to RubyGems.

Would allow VM developers to sample popular gems, and

run a perf suite written by gem authors.

With effort, time and $$$, we could make broad statements

about performance impact on the gem ecosystem.

Page 70: Ruby3x3: How are we going to measure 3x

Use the ecosystem!

Doesn’t just help VM developers

Gem authors get

1. Enabled for performance tracking!

2. Easier performance reporting with VM developers.

Page 71: Ruby3x3: How are we going to measure 3x

Credits

Headache: https://en.wikipedia.org/wiki/Headache#/media/File:Cruikshank_-_The_Head_Ache.png

@MattStudies

For more on software systems evaluation, be sure to visit The Evaluate Collaboratory @ http://evaluate.inf.usi.ch/