scaling ingest pipelines with high performance computing principles - rajiv kurian

Upload: SignalFx

Post on 10-Aug-2015

TRANSCRIPT

Page 1: Scaling ingest pipelines with high performance computing principles - Rajiv Kurian

SignalFx

Page 2

Scaling ingest pipelines with high performance computing principles

Rajiv Kurian, Software Engineer
[email protected]

Page 3

Agenda

1. Why we need to scale ingest
2. Basic properties and limitations of modern hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)

Page 4

Why we need to scale ingest

Page 5

SignalFx is an advanced monitoring platform for modern applications:

• High resolution: up to 1-second data
• Streaming analytics: charts and analytics update at 1-second resolution, in real time
• Multidimensional metrics: dimensions representing customer, server, etc.; filter and aggregate, e.g. 99th-pct-latency by service, customer

Page 6

Ingest pipeline

[Diagram: raw time series data → REST/rate control → rollups → persist → processed data to analytics]

Page 7

SignalFx ingest library

[Diagram: raw data in → one rollup per time series (TimeSeries 0 … TimeSeries 8) → rollup data out]

Page 8

Issues identified (before applying HPC techniques)

• Expensive: too many servers

• Exhibits parallel slowdown: more threads = worse performance

• What did the profile say?
  • Death by a thousand cuts
  • The core library = 35% of the profile

Page 9

Basic properties and limitations of modern hardware

Page 10

[Diagram: two cores, each with its own L1 data and L1 instruction caches and a private L2, sharing an L3 cache and main memory]

Page 11

Cache Lines

• Data is transferred between memory and cache in blocks of fixed size, called cache lines. Usually 64 bytes

• When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. In the case of:

• a cache hit, the processor immediately reads or writes the data in the cache line

• a cache miss, the cache allocates a new entry and copies in data from main memory, then the request (read or write) is fulfilled from the contents of the cache

• The memory subsystem makes two kinds of bets to help us:
  • Temporal locality
  • Spatial locality
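These two bets are easy to demonstrate. A minimal sketch (mine, not from the talk): summing a 2D Java array row-major walks each cache line end to end, while column-major strides across lines, so the first loop is typically several times faster over the same data.

```java
public class LocalityDemo {
    static final int N = 4096;
    static final int[][] grid = new int[N][N];

    // Row-major: consecutive elements share cache lines (spatial locality).
    static long sumRowMajor() {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += grid[i][j];
        return sum;
    }

    // Column-major: each access jumps to a different row array / cache line.
    static long sumColMajor() {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += grid[i][j];
        return sum;
    }

    public static void main(String[] args) {
        for (int[] row : grid) java.util.Arrays.fill(row, 1);
        long t0 = System.nanoTime();
        long a = sumRowMajor();
        long t1 = System.nanoTime();
        long b = sumColMajor();
        long t2 = System.nanoTime();
        System.out.printf("row-major: %d in %d ms, col-major: %d in %d ms%n",
                a, (t1 - t0) / 1_000_000, b, (t2 - t1) / 1_000_000);
    }
}
```

Both loops compute the same sum; only the traversal order, and therefore the cache behavior, differs.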

Page 12

Reference latency numbers for comparison

By Jeff Dean: http://research.google.com/people/jeff/

L1 Cache 0.5ns

Branch mispredict 5 ns

L2 Cache 7 ns 14x L1 Cache

Mutex lock/unlock 25 ns

Main memory 100 ns 20x L2 Cache, 200x L1 Cache

Compress 1K bytes (Zippy) 3,000 ns

Send 1K bytes over 1Gbps 10,000 ns 0.01 ms

Read 4K randomly from SSD 150,000 ns 0.15 ms

Read 1MB sequentially from memory 250,000 ns 0.25 ms

Round trip within same DC 500,000 ns 0.5 ms

Read 1MB sequentially from SSD 1,000,000 ns 1 ms 4x memory

Disk seek 10,000,000 ns 10 ms 20x DC roundtrip

Read 1MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20x SSD

Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

Page 13–15

[Diagrams: the core alongside L1, then L2, then main memory, emphasizing the growing distance at each level]

Page 16

Our optimization goal

Convert a memory bandwidth bound application to a CPU bound application

Page 17

Things we kept in mind

• Measure, measure, measure!

• Don’t rely on micro benchmarks alone

Page 18

Benchmark

Page 19

SignalFx library benchmark

1M time series: ID 0 → TimeSeries rollup 0, ID 1 → TimeSeries rollup 1, … , Key 1M → TimeSeries rollup 1M.

Raw data in, in random order, one point per time series, repeated 50x. Rollup data out.

Page 20

(Same benchmark as on the previous slide.)

35% of the profile of the entire application

Page 21

Techniques inspired by HPC that have improved our pipeline

Single threaded, event based architectures: parallelize by running multiple copies of single threaded code

Page 22

Single threaded event based architectures

• Threads work on their own private data (as much as possible)

• Communicate with other threads using events/messages

Page 23

[Diagram: network-in thread (receive data) → events → processor thread(s) processing against thread-local key/value data → events → network-out thread (write batched data)]

Page 24

[Same pipeline, with ring buffers as the event channels between the threads]

Page 25

Single threaded event based architectures advantages

• It enables many other optimal choices like:
  • Compact array based data structures
  • Buffer/object re-use

• Loosely coupled - easy to test

• Run multiple copies for parallelism
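The ring-buffer channels in these diagrams can be sketched as a single-producer/single-consumer queue. This is a toy version (class and method names are mine, not SignalFx's library), using a power-of-two capacity so the index wrap is a cheap bit-mask instead of a modulo.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer ring buffer.
// Exactly one thread calls offer(), exactly one calls poll(); no locks needed.
public class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    public SpscRingBuffer(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(T value) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false; // full
        slots[(int) (t & mask)] = value;
        tail.lazySet(t + 1); // publish the slot to the consumer
        return true;
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        int i = (int) (h & mask);
        T value = (T) slots[i];
        slots[i] = null;
        head.lazySet(h + 1); // free the slot for the producer
        return value;
    }
}
```

Production implementations such as the LMAX Disruptor additionally pad the head and tail counters onto separate cache lines to avoid false sharing; this sketch omits that for brevity.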

Page 26

[Diagram: network-in thread → (1) ring buffer → (2) worker thread(s) processing against local key/value data → (3) ring buffer → (4) network-out thread writing batched data]

Page 27

[Diagram: multiple independent worker threads; each receives data using async IO, processes it synchronously against its own partition of the key/value data (keys 1–4, 5–8, 9–12), and writes data using async IO]

Page 28

[Diagram: network thread (receive data) → ring buffers → processor thread(s) (process data) → ring buffers → async IO thread making batched IO calls; steps 1–7]

Page 29

Advice for threaded applications

• Threads should ideally reflect the actual parallelism of the system
  • Avoid gratuitous oversubscription
  • Exception: IO threads?

• DO NOT communicate unless you have to
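In Java, the first bullet is essentially one line. A hedged sketch (names are mine): size the worker pool to the hardware's parallelism, not to the number of tasks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPool {
    // One worker per hardware thread: no gratuitous oversubscription.
    public static ExecutorService create() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores);
    }
}
```

Blocking IO threads are the usual exception: they can be oversubscribed because they spend most of their time parked waiting on the kernel.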

Page 30

Techniques inspired by HPC that have improved our pipeline

Use compact, cache-conscious, array based data structures with minimal indirection

Page 31

[Cache hierarchy diagram repeated: two cores with private L1/L2 caches, shared L3, main memory]

Page 32

Basic principles

• Strive for smaller data structures
  • Extra computation is OK, e.g. compressing network data

• Design data structures that facilitate processing multiple entries: big arrays!

• Layout should reflect access patterns

Page 33

Hash maps

• Hash map lookups are NOT free!

• A lookup in a well implemented hash map is by definition a cache miss

• Popular implementations like java.util.HashMap can cause multiple cache misses

Page 34

[Diagram: a typical hash map implemented as an array of buckets, each a list of (key*, value*) nodes, with the keys and values allocated separately]

Page 35–39

Cache misses in a typical hash-map implementation, built up over several slides:
1. miss on the bucket array
2. miss following the pointer to the list node
3. miss dereferencing the key pointer
4. miss dereferencing the value pointer


Page 41

Array of co-located key/value

Keys and values co-located in a single array: (Key 0, Value 0), (Key 1, Value 1), … , (Key 7, Value 7)

Page 42

Cache misses with no collision: 1 (the key and its value share a cache line)

Page 43

Cache misses with collisions: 2 (probing scans the adjacent, contiguous entries)

Page 44

Hash map of key to index into an array of structs: a key→index table (Key 0 → 1, Key 1 → 6, Key 2 → 4, Key 3 → 8) alongside a contiguous array of values (Value 0 … Value 8)

Page 45–46

Cache misses with collision: one in the compact key→index table (collisions probe contiguous entries) plus one to fetch the value from the struct array

Page 47

New library memory layout: an ID→index table (ID 0 → 1, ID 1 → 6, ID 2 → 4, ID 3 → 8) pointing into a contiguous array of TimeSeries rollups (0–8). Raw data in, rollup out.
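A sketch of this layout (illustrative; class, method, and field names are invented, not the actual SignalFx code): a linear-probing long→int table maps a time-series ID to a slot, and the rollup state lives in parallel primitive arrays, so collisions probe contiguous entries and a hit touches only a couple of cache lines. Resizing is omitted for brevity, so the table must stay below its capacity.

```java
// Open-addressing long -> int table mapping time-series ID to a slot index,
// with rollup state held in parallel primitive arrays ("array of structs").
public class RollupIndex {
    private static final long EMPTY = -1;
    private final long[] ids;        // probe table: time-series IDs
    private final int[] indexes;     // probe table: slot for each ID
    public final double[] sums;      // rollup field, one per slot
    public final int[] counts;       // rollup field, one per slot
    private final int mask;
    private int nextSlot;

    public RollupIndex(int capacityPowerOfTwo) {
        ids = new long[capacityPowerOfTwo];
        java.util.Arrays.fill(ids, EMPTY);
        indexes = new int[capacityPowerOfTwo];
        sums = new double[capacityPowerOfTwo];
        counts = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Find the slot for an ID, claiming a fresh one on first sight.
    // Linear probing: collisions scan contiguous, cache-friendly entries.
    public int slotFor(long id) {
        int i = (int) (mix(id) & mask);
        while (ids[i] != id) {
            if (ids[i] == EMPTY) { ids[i] = id; indexes[i] = nextSlot++; break; }
            i = (i + 1) & mask;
        }
        return indexes[i];
    }

    public void record(long id, double value) {
        int slot = slotFor(id);
        sums[slot] += value;
        counts[slot]++;
    }

    private static long mix(long id) { // cheap hash to spread IDs
        id ^= id >>> 33;
        return id * 0xff51afd7ed558ccdL;
    }
}
```

Note there are no per-entry allocations: every insert and lookup works entirely within the four preallocated arrays.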

Page 48

Changing hash map implementations

• java.util.HashMap (uses separate chaining and boxes primitives) to make a long -> int lookup
  • Allocations galore

• net.openhft.koloboke primitive open hash map
  • 45% improvement

For the JVM, use libraries like https://github.com/OpenHFT/Koloboke. For C++, try https://github.com/preshing/CompareIntegerMaps or similar.

Page 49

Access patterns

[Diagram: objects 0–3, each laid out as fields 0–4, with hot data interleaved with cold data]

Page 50

Group fields accessed together

[Diagram: the hot fields (0–2) of all objects grouped together, with the cold fields (3–4) stored separately]

Page 51

Results of separating hot and cold data

A hot loop run about once every 500 ms:

• Old: hot and cold data kept together
  • 5 cache lines per time series
  • Took anywhere between 62–70 ms

• New: hot and cold data kept separate
  • 3 cache lines of hot data per time series
  • Took anywhere between 40–45 ms

• 35% improvement
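The split can be sketched like this (field names invented for illustration): the periodic loop streams only the hot arrays, so it pulls in fewer cache lines per time series.

```java
// Illustrative hot/cold split: the periodic hot loop only touches the hot
// arrays, so it streams through fewer cache lines per time series.
public class TimeSeriesStore {
    // Hot: read by the loop that runs every ~500 ms.
    final double[] lastValue;
    final double[] rollupSum;
    final long[] lastTimestampMs;

    // Cold: touched rarely (e.g. on registration or reconfiguration).
    final String[] metricName;
    final String[] dimensions;

    TimeSeriesStore(int capacity) {
        lastValue = new double[capacity];
        rollupSum = new double[capacity];
        lastTimestampMs = new long[capacity];
        metricName = new String[capacity];
        dimensions = new String[capacity];
    }

    // The hot loop walks only the hot arrays, sequentially.
    double totalRollup(int count) {
        double total = 0;
        for (int i = 0; i < count; i++) total += rollupSum[i];
        return total;
    }
}
```

Keeping the cold fields in separate arrays (rather than interleaved per object) is exactly the "group fields accessed together" layout from the previous slide.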

Page 52

Results (library)!

Page 53

Old vs New

• Concurrent -> single threaded
• Locks gone
• Array based data structures
• Zero allocations
• Extensive batching and hardware prefetching
• Multiple hash map lookups -> a single hash map lookup

Page 54–57

Old vs New: 76 K/sec vs 2.1 M/sec, a 27x improvement, from a library that was 35% of the profile

Page 58

Results (application)!

Page 59

[Chart: Amdahl's law with a 35% fraction; overall speedup (y-axis, 0–1.6) vs library speedup (x-axis, 1 to ∞)]

Page 60–61

[Charts: application CPU usage, old vs new: a 3.4x improvement]

Page 62

35% of the profile but 3.4x improvement?

• Amdahl’s law
  • Max 1.54x improvement if the 35% went to 0%

• Why 3.4x?
  • When you use less cache, you leave more for others, thus speeding up other code too

• Lesson
  • A profiler is a necessary tool, but not a substitute for informed design
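The 1.54x bound quoted above follows directly from Amdahl's law with an optimizable fraction $p = 0.35$:

```latex
S(s) = \frac{1}{(1 - p) + p/s}, \qquad
S(\infty) = \frac{1}{1 - p} = \frac{1}{0.65} \approx 1.54
```

The observed 3.4x exceeds this bound because the model assumes the other 65% is untouched, while in practice the reduced cache pressure sped that 65% up as well.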

Page 63

Heap growth

Page 64

Closing remarks / rant

• “Write code first, profile later” = BAD

• Excessive encapsulation leads to myopic decisions being made re: perf
  • allocations
  • “thread safe” code

• Beware of micro benchmarks

Page 66

Bonus slides

Page 67

Composition in C

[Diagram: Struct A holds four ints inline; Struct B embeds Struct B1 and Struct B2, each holding two ints, all in one contiguous block]

Page 68

Composition in Java

[Diagram: Object A holds four ints; Object B holds references B1 and B2 to separate objects, each holding two ints]

Page 69–76

Actual layout?

[Diagram, built up over several slides: Object B is an object header plus pointers B1* and B2*; B1* points to a separate allocation (B1 header plus two ints) and B2* to another (B2 header plus two ints)]

Page 77

Potential layout after GC

[Diagram: B, B1, and B2 separated by other data in the heap, so reaching each one can be a separate cache miss]

Page 78

Techniques inspired by HPC that have improved our pipeline

Separate the control and data planes

Page 79

A networking concept

[Diagram: packets in → routing data → packets out (frequent); a key/value routing table updated separately (infrequent)]

Page 80

What the control and data planes do

In networking terminology:
• Data plane: the part that decides what to do with packets arriving on an inbound interface (frequent)
• Control plane: the part concerned with drawing the network map or routing table (infrequent)

Page 81

The goal of control and data plane separation

DO NOT slow the frequent path because of the infrequent path

Page 82

Runtime configuration variables

[Diagram: a setter thread writes configuration variables (volatile/atomic flags 0–3); the worker thread reads them on every iteration]

while (1) {
    process_data_using_configuration_variables();
}

Page 83

Runtime configuration variables

[Diagram: the setter thread writes the volatile/atomic flags 0–3; the worker thread copies them into cached configuration variables once per loop]

while (1) {
    cache_configuration_variables();
    process_a_ton_of_stuff();
}

Page 84

Volatile/atomic flag vs cached local flag

• Before: all runtime flags (used on every data point) were volatile/atomic loads

• After: all runtime flags are cached and refreshed on each run loop

• About 8% improvement in datapoints/second; others might see more or less
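The before/after difference can be sketched in a few lines of Java (structure and names mine; the real worker processes network data): the batch loop reads the volatile flag once, then uses a plain local for every point in the batch.

```java
public class FlagCachingWorker {
    // Control plane: a setter thread flips this rarely (volatile write).
    volatile boolean doubleCount;

    // Data plane: read the volatile flag once per batch, not once per point.
    long processBatch(int[] points) {
        boolean cached = doubleCount; // single volatile load for the batch
        long total = 0;
        for (int p : points) total += cached ? 2L * p : p;
        return total;
    }
}
```

Because `cached` is a plain local, the JIT can keep it in a register for the whole loop, whereas a volatile read inside the loop would force a fresh load (and inhibit some optimizations) on every data point.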