scaling ingest pipelines with high performance computing principles - rajiv kurian
TRANSCRIPT
Scaling ingest pipelines with high performance computing principles
Rajiv Kurian, Software Engineer, [email protected]
Agenda
1. Why we need to scale ingest
2. Basic properties and limitations of modern hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)
Why we need to scale ingest
• High resolution: up to 1 sec
• Streaming analytics: charts/analytics update @ 1 sec, in real time
• Multidimensional metrics: dimensions representing customer, server, etc.; filter and aggregate, e.g. 99th-pct-latency-by-service,customer
SignalFx is an advanced monitoring platform for modern applications
Ingest pipeline
[Diagram: raw time series data flows through REST/rate control, rollup, and persistence stages; processed data flows to analytics.]
SignalFx ingest library
[Diagram: raw data in, rollup data out; the library maintains one rollup per time series (TimeSeries 0 through TimeSeries 8 shown).]
Issues identified (before applying HPC techniques)
• Expensive: too many servers
• Exhibits parallel slowdown: more threads = worse performance
• What did the profile say? Death by a thousand cuts; the core library = 35% of the profile
Basic properties and limitations of modern hardware
[Diagram: cache hierarchy. Each core has private L1 data and L1 instruction caches and a private L2; the cores share an L3 in front of main memory.]
Cache Lines
• Data is transferred between memory and cache in fixed-size blocks called cache lines, usually 64 bytes
• When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache:
• On a cache hit, the processor immediately reads or writes the data in the cache line
• On a cache miss, the cache allocates a new entry and copies in data from main memory; the request (read or write) is then fulfilled from the contents of the cache
• The memory subsystem makes two kinds of bets to help us: temporal locality and spatial locality
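A minimal Java sketch (not from the talk) that makes the spatial-locality bet concrete: both loops do the same number of additions over the same array, but the sequential walk amortizes each 64-byte line over 16 ints, while the strided walk pays roughly one line fill per access. The class name and constants are hypothetical.

// Illustration: the same work with two access patterns.
// With 4-byte ints, a 64-byte line holds 16 ints, so STRIDE = 16
// forces a new cache line on (almost) every access.
public class LocalityDemo {
    static final int N = 1 << 24;   // 16M ints = 64 MB, much larger than L3
    static final int STRIDE = 16;   // one int per cache line

    static long sumSequential(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];   // spatial locality
        return sum;
    }

    static long sumStrided(int[] a) {
        long sum = 0;
        for (int start = 0; start < STRIDE; start++)      // same N adds in total,
            for (int i = start; i < a.length; i += STRIDE)
                sum += a[i];                              // but a line-hostile order
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[N];
        java.util.Arrays.fill(a, 1);
        long t0 = System.nanoTime();
        long s1 = sumSequential(a);
        long t1 = System.nanoTime();
        long s2 = sumStrided(a);
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, strided: %d ms (checksums %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}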
Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
L1 cache reference                       0.5 ns
Branch mispredict                          5 ns
L2 cache reference                         7 ns             14x L1 cache
Mutex lock/unlock                         25 ns
Main memory reference                    100 ns             20x L2 cache, 200x L1 cache
Compress 1K bytes (Zippy)              3,000 ns
Send 1K bytes over 1 Gbps             10,000 ns   0.01 ms
Read 4K randomly from SSD            150,000 ns   0.15 ms
Read 1MB sequentially from memory    250,000 ns   0.25 ms
Round trip within same DC            500,000 ns   0.5 ms
Read 1MB sequentially from SSD     1,000,000 ns   1 ms      4x memory
Disk seek                         10,000,000 ns   10 ms     20x DC round trip
Read 1MB sequentially from disk   20,000,000 ns   20 ms     80x memory, 20x SSD
Send packet CA->Netherlands->CA  150,000,000 ns   150 ms
[Diagram: relative distance from the core: L1 is closest, L2 farther out, main memory farthest.]
Our optimization goal
Convert a memory-bandwidth-bound application into a CPU-bound application
Things we kept in mind
• Measure, measure, measure!
• Don’t rely on microbenchmarks alone
Benchmark
SignalFx library benchmark
[Diagram: benchmark setup. A map of one million entries, ID 0 through ID 1M, each holding a TimeSeries rollup. Raw data comes in, in random order, one data point per time series, repeated 50x; rollup data comes out.]
The library accounted for 35% of the profile of the entire application.
Techniques inspired by HPC that have improved our pipeline
Single-threaded, event-based architectures: parallelize by running multiple copies of single-threaded code
Single-threaded, event-based architectures
• Threads work on their own private data (as much as possible)
• Communicate with other threads using events/messages
[Diagram: a network-in thread receives data, processor thread(s) process it against thread-local key/value data, and a network-out thread writes batched data; the threads exchange events.]
[Diagram: the same pipeline with ring buffers carrying the events between the network-in, processor, and network-out threads.]
Advantages of single-threaded, event-based architectures (see the sketch below)
• Enables many other optimal choices, like compact array-based data structures and buffer/object re-use
• Loosely coupled, easy to test
• Run multiple copies for parallelism
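A minimal sketch of the pattern, using bounded BlockingQueues as stand-ins for the ring buffers in the diagrams (a production system would more likely use a lock-free ring buffer such as the LMAX Disruptor). The class name and the long[] event encoding are hypothetical; the point is that the processor thread owns its map outright, so nothing guards it with locks.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The processor thread's private state is touched by this thread only;
// other threads talk to it exclusively through the two queues.
public class SingleThreadedProcessor implements Runnable {
    private final BlockingQueue<long[]> in;   // {key, value} events from network-in
    private final BlockingQueue<long[]> out;  // batched results to network-out
    private final Map<Long, Long> local = new HashMap<>(); // thread-private data

    SingleThreadedProcessor(BlockingQueue<long[]> in, BlockingQueue<long[]> out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                long[] event = in.take();                    // receive an event
                long rollup = local.merge(event[0], event[1], Long::sum);
                out.put(new long[] { event[0], rollup });    // emit a result event
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();              // clean shutdown
        }
    }

    // Parallelism = N independent copies, each with its own queues and map.
    public static void main(String[] args) {
        int copies = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < copies; i++) {
            BlockingQueue<long[]> in = new ArrayBlockingQueue<>(1024);
            BlockingQueue<long[]> out = new ArrayBlockingQueue<>(1024);
            new Thread(new SingleThreadedProcessor(in, out), "processor-" + i).start();
        }
    }
}

Scaling comes from running several independent processor copies rather than from sharing one map across threads.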
[Diagram: the flow annotated with steps 1 through 4: data is received, handed to worker thread(s) over a ring buffer, processed against local key/value data, and handed over another ring buffer to be written out in batches.]
[Diagram: an alternative arrangement: three worker threads, each receiving data with async IO, processing it synchronously, and writing with async IO, each owning its own private key/value data (keys 1-4, 5-8, and 9-12 respectively).]
[Diagram: a variant with a dedicated async IO thread: the network thread receives data, processor thread(s) work against local data, and batched IO calls are handed to the async IO thread, all connected by ring buffers (steps 1 through 7).]
Advice for threaded applications
• Threads should ideally reflect the actual parallelism of the system; avoid gratuitous oversubscription (exception: IO threads?)
• DO NOT communicate unless you have to
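A tiny hedged illustration of the first bullet: size the CPU-bound worker pool to the machine's actual parallelism instead of oversubscribing.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // One CPU-bound worker per available core; IO threads, which mostly
        // block, are the usual exception to this rule.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        System.out.println("workers = " + workers);
        pool.shutdown();
    }
}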
Techniques inspired by HPC that have improved our pipeline
Use compact, cache-conscious, array-based data structures with minimal indirection
[Diagram: the cache hierarchy again: per-core L1 and L2 caches, a shared L3, and main memory.]
Basic principles
• Strive for smaller data structures; extra computation is OK (e.g., compressing network data)
• Design data structures that facilitate processing multiple entries: big arrays!
• Layout should reflect access patterns
Hash maps
• Hash map lookups are NOT free!
• A lookup in a well-implemented hash map is by definition a cache miss
• Popular implementations like java.util.HashMap can cause multiple cache misses
[Diagram: a typical hash map implemented as an array of lists of key*|value* pairs; the keys and values themselves live behind further pointers.]
[Diagram, built up over several slides: cache misses in a typical hash-map implementation: (1) load the bucket array slot, (2) follow the pointer to a list node, (3) follow the key pointer, (4) follow the value pointer: up to four cache misses for a single lookup.]
Array of co-located key/value pairs
[Diagram: a single flat array of co-located key/value slots, Key 0|Value 0 through Key 7|Value 7.]
Cache misses with no collision
[Diagram: the lookup touches one slot: a single cache miss (1).]
Cache misses with collisions
[Diagram: the probe steps to the neighboring slot, which can add a second cache miss (1, 2).]
Hash map of key to index into an array of structs
[Diagram: a hash map from key to index (Key 0 -> 1, Key 1 -> 6, Key 2 -> 4, Key 3 -> 8) pointing into a dense array of structs, Value 0 through Value 8.]
Cache misses with collision
[Diagram, two slides: (1) the key-to-index lookup, (2) the indexed read from the value array: two cache misses total, even with a collision in the key table.]
New library memory layout
[Diagram: raw data in, an ID-to-index map (ID 0 -> 1, ID 1 -> 6, ID 2 -> 4, ID 3 -> 8) into a dense array of TimeSeries rollups 0 through 8, rollup data out.]
Changing hash map implementations
• java.util.HashMap (uses separate chaining and boxes primitives) for a long -> int lookup: allocations galore
• net.openhft.koloboke primitive open hash map: 45% improvement
For the JVM, use libraries like https://github.com/OpenHFT/Koloboke. For C++, try https://github.com/preshing/CompareIntegerMaps or similar.
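For illustration, a hand-rolled sketch of the layout the last few diagrams describe (not the SignalFx code): a linear-probing, open-addressing map from a primitive long key to an int index, with all keys in one flat array so collision probes walk adjacent slots. The returned index points into a separate dense array of rollup slots.

// Minimal open-addressing long -> int index map with linear probing.
// Sketch only: no resizing, so keep the table well under half full.
public class LongToIndexMap {
    private static final long EMPTY = Long.MIN_VALUE; // reserved sentinel key
    private final long[] keys;    // flat key table: probes touch adjacent slots
    private final int[] indexes;  // parallel table of indexes into the value array
    private final int mask;

    public LongToIndexMap(int capacityPow2) {  // capacity must be a power of two
        keys = new long[capacityPow2];
        indexes = new int[capacityPow2];
        java.util.Arrays.fill(keys, EMPTY);
        mask = capacityPow2 - 1;
    }

    private static int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L;    // Fibonacci hashing
        return (int) (h >>> 40);               // top bits, always non-negative
    }

    public void put(long key, int index) {
        int slot = hash(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;          // linear probe: the next slot over
        }
        keys[slot] = key;
        indexes[slot] = index;
    }

    // Returns the stored index, or -1 if the key is absent.
    public int get(long key) {
        int slot = hash(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) return indexes[slot];
            slot = (slot + 1) & mask;
        }
        return -1;
    }
}

In production you would reach for a proven primitive map (Koloboke, as above) rather than hand-rolling one; the sketch only shows why the layout is cache-friendly.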
Access patterns
[Diagram: four objects, each with Fields 0 through 4 laid out together; hot and cold fields are interleaved within every object.]
Group fields accessed together
[Diagram: the same data with hot fields (Fields 0-2) packed together and cold fields (Fields 3-4) moved to a separate region.]
Results of separating hot and cold data
A hot loop run about once every 500 ms
• Old: hot and cold data kept together; 5 cache lines per time series; took anywhere between 62-70 ms
• New: hot and cold data kept separate; 3 cache lines of hot data per time series; took anywhere between 40-45 ms
• A 35% improvement (a sketch of the split follows)
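A sketch of what the split can look like in code, with hypothetical field names: hot fields move into flat parallel arrays (structure-of-arrays) that the 500 ms sweep streams through, while cold fields stay behind in a separate, rarely touched structure.

// Before: one object mixes fields the hot loop reads every sweep with
// fields touched only occasionally, dragging cold bytes through the cache.
class TimeSeriesRollupMixed {
    long lastValue;     // hot: read every sweep
    long sum;           // hot
    int count;          // hot
    String metricName;  // cold: only needed on create/emit
    long createdAtMs;   // cold
}

// After: hot fields live in flat parallel arrays indexed by the time
// series' slot; cold fields sit in a separate array of objects.
class RollupStore {
    final long[] lastValue;
    final long[] sum;
    final int[] count;
    final ColdData[] cold;   // rarely touched

    RollupStore(int capacity) {
        lastValue = new long[capacity];
        sum = new long[capacity];
        count = new int[capacity];
        cold = new ColdData[capacity];
    }

    // The hot loop: a pure sequential walk over tightly packed primitives.
    long sweep() {
        long total = 0;
        for (int i = 0; i < sum.length; i++) {
            total += sum[i];
        }
        return total;
    }

    static class ColdData {
        String metricName;
        long createdAtMs;
    }
}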
Results (library)!
Old vs new
• Concurrent -> single-threaded; locks gone
• Array-based data structures
• Zero allocations
• Extensive batching and hardware prefetching
• Multiple hash map lookups -> a single hash map lookup
Old vs new: 76 K/sec vs 2.1 M/sec: a 27x improvement, on the library that made up 35% of the application profile.
Results (application)!
[Chart: Amdahl's law for a 35% fraction: overall speedup (y-axis, 0 to 1.6) against library speedup (x-axis, 1 to ∞); the measured CPU improvement was 3.4x.]
35% of the profile, but a 3.4x improvement?
• Amdahl's law: at most a 1.54x improvement if that 35% drops to 0% (worked out below)
• Why 3.4x, then? When you use less cache, you leave more for others, thus speeding up other code too
• Lesson: a profiler is a necessary tool, but not a substitute for informed design
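The 1.54x bound comes straight from Amdahl's law, with p = 0.35 the library's share of the profile and s its speedup:

S_{\text{overall}} = \frac{1}{(1 - p) + \frac{p}{s}},
\qquad
\lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - 0.35} \approx 1.54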
Heap growth
[Chart: heap growth.]
Closing remarks / rant
• “Write code first, profile later” = BAD
• Excessive encapsulation leads to myopic performance decisions: allocations everywhere, "thread safe" code everywhere
• Beware of microbenchmarks
Thank you!
Rajiv Kurian
[email protected], @rzidane360
We're hiring: [email protected]
@SignalFx - signalfx.com
Bonus slides
Composition in C
[Diagram: Struct A is a flat run of ints; Struct B embeds Struct B1 and Struct B2 inline, so all of its ints are contiguous in a single allocation.]
Composition in Java
[Diagram: Object A is a flat run of ints, but Object B holds references to separately allocated Object B1 and Object B2.]
Actual layout?
[Diagram, built up over several slides: Object B is a header plus B1* and B2* pointers; Object B1 and Object B2 are separate allocations, each with its own header and two ints.]
Potential layout after GC
[Diagram: after garbage collection, B, B1, and B2 can end up separated by other data on the heap, so walking B -> B1 -> B2 touches scattered cache lines.]
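A sketch of the difference in Java (class names hypothetical): the nested version is three allocations and up to three dereferences per access, while the hand-flattened version packs the same ints behind one header, which is what a C embedded struct gives you for free.

// Java composition: B holds references, so B, B1, and B2 are three
// separate allocations with three headers, and the GC may scatter them.
class B1 { int x, y; }
class B2 { int p, q; }
class BNested {
    final B1 b1 = new B1();   // pointer to a separate object
    final B2 b2 = new B2();   // pointer to a separate object
    int sum() { return b1.x + b1.y + b2.p + b2.q; } // chases two pointers
}

// Hand-flattened equivalent of C's embedded structs: one allocation,
// one header, all fields contiguous. (Project Valhalla's value classes
// aim to give this layout without the manual flattening.)
class BFlat {
    int b1x, b1y;             // B1's fields, inlined
    int b2p, b2q;             // B2's fields, inlined
    int sum() { return b1x + b1y + b2p + b2q; }     // no pointer chasing
}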
Techniques inspired by HPC that have improved our pipeline
Separate the control and data planes: the data plane is frequent, the control plane infrequent
A networking concept
[Diagram: routing data (control plane) updates the routing table, a key/value structure; packets in and out (data plane) consult it.]
What the control and data planes do
In networking terminology:
• Data plane: the part that decides what to do with packets arriving on an inbound interface. Frequent.
• Control plane: the part concerned with drawing the network map or routing table. Infrequent.
The goal of control and data plane separation
DO NOT slow the frequent path because of the infrequent path
Runtime configuration variables
[Diagram: a setter thread writes configuration variables (volatile/atomic Flags 0-3); the worker thread reads them inside its loop:]

while (1) {
    process_data_using_configuration_variables();
}
Runtime configuration variables
[Diagram: the same setter thread and flags, but the worker snapshots them into cached configuration variables once per loop iteration:]

while (1) {
    cache_configuration_variables();
    process_a_ton_of_stuff();
}
Volatile/atomic flag vs cached local flag (sketch below)
• Before: all runtime flags (used on every data point) were volatile/atomic loads
• After: all runtime flags are cached and refreshed on each run loop
• About an 8% improvement in datapoints/second; others might see more or less
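A sketch of both variants in Java, with hypothetical flag names: the setter thread can flip the volatile fields at any time; the fast worker snapshots them once per batch instead of paying a volatile load per data point.

// Control plane: a setter thread flips these at runtime (infrequent).
// Data plane: the worker consults them on every data point (frequent).
public class ConfigFlags {
    volatile boolean compressionEnabled;  // hypothetical runtime flags
    volatile boolean extraValidation;

    // Slow variant: a volatile load per flag, per data point.
    void processSlow(long[] batch) {
        for (long point : batch) {
            if (compressionEnabled) { /* ... */ }
            if (extraValidation) { /* ... */ }
        }
    }

    // Fast variant: snapshot the flags once per run-loop iteration, then
    // process the whole batch against plain local copies.
    void processFast(long[] batch) {
        boolean compress = compressionEnabled;   // one volatile load
        boolean validate = extraValidation;      // one volatile load
        for (long point : batch) {
            if (compress) { /* ... */ }
            if (validate) { /* ... */ }
        }
    }
}

Updates are still observed, just with per-batch rather than per-point latency, which is exactly the control-plane/data-plane trade.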