scaling ingest pipelines with high performance computing principles - rajiv kurian
TRANSCRIPT
Scaling ingest pipelines with high performance computing principles
Rajiv Kurian, Software Engineer, [email protected]
Agenda
1. Why we need to scale ingest
2. Basic properties and limitations of modern hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)
Why we need to scale ingest
• High resolution: up to 1 sec
• Streaming analytics: charts/analytics update @ 1 sec, in real time
• Multidimensional metrics: dimensions representing customer, server, etc.; filter and aggregate, e.g. 99th-pct-latency-by-service,customer
SignalFx is an advanced monitoring platform for modern applications
Ingest pipeline
[Diagram: raw time series data flows through REST/rate control, rollup, and persistence stages; processed data flows to analytics.]
SignalFx ingest library
[Diagram: raw data in, rollup data out; the library maintains one rollup per time series (TimeSeries 0 through TimeSeries 8 shown).]
Issues identified (before applying HPC techniques)
• Expensive: too many servers
• Exhibits parallel slowdown: more threads = worse performance
• What did the profile say? Death by a thousand cuts; the core library = 35% of the profile
Basic properties and limitations of modern hardware
[Diagram: cache hierarchy. Each core has private L1 data and L1 instruction caches and a private L2; the cores share an L3 in front of main memory.]
Cache Lines
• Data is transferred between memory and cache in fixed-size blocks called cache lines, usually 64 bytes
• When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache:
• On a cache hit, the processor immediately reads or writes the data in the cache line
• On a cache miss, the cache allocates a new entry and copies in data from main memory; the request (read or write) is then fulfilled from the contents of the cache
• The memory subsystem makes two kinds of bets to help us: temporal locality and spatial locality
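A minimal Java sketch (not from the talk) that makes the spatial-locality bet concrete: both loops do the same number of additions over the same array, but the sequential walk amortizes each 64-byte line over 16 ints, while the strided walk pays roughly one line fill per access. The class name and constants are hypothetical.

// Illustration: the same work with two access patterns.
// With 4-byte ints, a 64-byte line holds 16 ints, so STRIDE = 16
// forces a new cache line on (almost) every access.
public class LocalityDemo {
    static final int N = 1 << 24;   // 16M ints = 64 MB, much larger than L3
    static final int STRIDE = 16;   // one int per cache line

    static long sumSequential(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];   // spatial locality
        return sum;
    }

    static long sumStrided(int[] a) {
        long sum = 0;
        for (int start = 0; start < STRIDE; start++)      // same N adds in total,
            for (int i = start; i < a.length; i += STRIDE)
                sum += a[i];                              // but a line-hostile order
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[N];
        java.util.Arrays.fill(a, 1);
        long t0 = System.nanoTime();
        long s1 = sumSequential(a);
        long t1 = System.nanoTime();
        long s2 = sumStrided(a);
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, strided: %d ms (checksums %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}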
Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
L1 cache reference                       0.5 ns
Branch mispredict                          5 ns
L2 cache reference                         7 ns             14x L1 cache
Mutex lock/unlock                         25 ns
Main memory reference                    100 ns             20x L2 cache, 200x L1 cache
Compress 1K bytes (Zippy)              3,000 ns
Send 1K bytes over 1 Gbps             10,000 ns   0.01 ms
Read 4K randomly from SSD            150,000 ns   0.15 ms
Read 1MB sequentially from memory    250,000 ns   0.25 ms
Round trip within same DC            500,000 ns   0.5 ms
Read 1MB sequentially from SSD     1,000,000 ns   1 ms      4x memory
Disk seek                         10,000,000 ns   10 ms     20x DC round trip
Read 1MB sequentially from disk   20,000,000 ns   20 ms     80x memory, 20x SSD
Send packet CA->Netherlands->CA  150,000,000 ns   150 ms
[Diagram: relative distance from the core: L1 is closest, L2 farther out, main memory farthest.]
Our optimization goal
Convert a memory-bandwidth-bound application into a CPU-bound application
Things we kept in mind
• Measure, measure, measure!
• Don’t rely on microbenchmarks alone
Benchmark
SignalFx library benchmark
[Diagram: benchmark setup. A map of one million entries, ID 0 through ID 1M, each holding a TimeSeries rollup. Raw data comes in, in random order, one data point per time series, repeated 50x; rollup data comes out.]
The library accounted for 35% of the profile of the entire application.
Techniques inspired by HPC that have improved our pipeline
Single-threaded, event-based architectures: parallelize by running multiple copies of single-threaded code
Single-threaded, event-based architectures
• Threads work on their own private data (as much as possible)
• Communicate with other threads using events/messages
[Diagram: a network-in thread receives data, processor thread(s) process it against thread-local key/value data, and a network-out thread writes batched data; the threads exchange events.]
[Diagram: the same pipeline with ring buffers carrying the events between the network-in, processor, and network-out threads.]
Advantages of single-threaded, event-based architectures (see the sketch below)
• Enables many other optimal choices, like compact array-based data structures and buffer/object re-use
• Loosely coupled, easy to test
• Run multiple copies for parallelism
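A minimal sketch of the pattern, using bounded BlockingQueues as stand-ins for the ring buffers in the diagrams (a production system would more likely use a lock-free ring buffer such as the LMAX Disruptor). The class name and the long[] event encoding are hypothetical; the point is that the processor thread owns its map outright, so nothing guards it with locks.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The processor thread's private state is touched by this thread only;
// other threads talk to it exclusively through the two queues.
public class SingleThreadedProcessor implements Runnable {
    private final BlockingQueue<long[]> in;   // {key, value} events from network-in
    private final BlockingQueue<long[]> out;  // batched results to network-out
    private final Map<Long, Long> local = new HashMap<>(); // thread-private data

    SingleThreadedProcessor(BlockingQueue<long[]> in, BlockingQueue<long[]> out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                long[] event = in.take();                    // receive an event
                long rollup = local.merge(event[0], event[1], Long::sum);
                out.put(new long[] { event[0], rollup });    // emit a result event
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();              // clean shutdown
        }
    }

    // Parallelism = N independent copies, each with its own queues and map.
    public static void main(String[] args) {
        int copies = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < copies; i++) {
            BlockingQueue<long[]> in = new ArrayBlockingQueue<>(1024);
            BlockingQueue<long[]> out = new ArrayBlockingQueue<>(1024);
            new Thread(new SingleThreadedProcessor(in, out), "processor-" + i).start();
        }
    }
}

Scaling comes from running several independent processor copies rather than from sharing one map across threads.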
[Diagram: the flow annotated with steps 1 through 4: data is received, handed to worker thread(s) over a ring buffer, processed against local key/value data, and handed over another ring buffer to be written out in batches.]
[Diagram: an alternative arrangement: three worker threads, each receiving data with async IO, processing it synchronously, and writing with async IO, each owning its own private key/value data (keys 1-4, 5-8, and 9-12 respectively).]
[Diagram: a variant with a dedicated async IO thread: the network thread receives data, processor thread(s) work against local data, and batched IO calls are handed to the async IO thread, all connected by ring buffers (steps 1 through 7).]
Advice for threaded applications
• Threads should ideally reflect the actual parallelism of the system; avoid gratuitous oversubscription (exception: IO threads?)
• DO NOT communicate unless you have to
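A tiny hedged illustration of the first bullet: size the CPU-bound worker pool to the machine's actual parallelism instead of oversubscribing.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // One CPU-bound worker per available core; IO threads, which mostly
        // block, are the usual exception to this rule.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        System.out.println("workers = " + workers);
        pool.shutdown();
    }
}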
Techniques inspired by HPC that have improved our pipeline
Use compact, cache-conscious, array-based data structures with minimal indirection
[Diagram: the cache hierarchy again: per-core L1 and L2 caches, a shared L3, and main memory.]
Basic principles
• Strive for smaller data structures; extra computation is OK (e.g., compressing network data)
• Design data structures that facilitate processing multiple entries: big arrays!
• Layout should reflect access patterns
Hash maps
• Hash map lookups are NOT free!
• A lookup in a well-implemented hash map is by definition a cache miss
• Popular implementations like java.util.HashMap can cause multiple cache misses
[Diagram: a typical hash map implemented as an array of lists of key*|value* pairs; the keys and values themselves live behind further pointers.]
[Diagram, built up over several slides: cache misses in a typical hash-map implementation: (1) load the bucket array slot, (2) follow the pointer to a list node, (3) follow the key pointer, (4) follow the value pointer: up to four cache misses for a single lookup.]
Array of co-located key/value pairs
[Diagram: a single flat array of co-located key/value slots, Key 0|Value 0 through Key 7|Value 7.]
Cache misses with no collision
[Diagram: the lookup touches one slot: a single cache miss (1).]
Cache misses with collisions
[Diagram: the probe steps to the neighboring slot, which can add a second cache miss (1, 2).]
Hash map of key to index into an array of structs
[Diagram: a hash map from key to index (Key 0 -> 1, Key 1 -> 6, Key 2 -> 4, Key 3 -> 8) pointing into a dense array of structs, Value 0 through Value 8.]
Cache misses with collision
[Diagram, two slides: (1) the key-to-index lookup, (2) the indexed read from the value array: two cache misses total, even with a collision in the key table.]
New library memory layout
[Diagram: raw data in, an ID-to-index map (ID 0 -> 1, ID 1 -> 6, ID 2 -> 4, ID 3 -> 8) into a dense array of TimeSeries rollups 0 through 8, rollup data out.]
Changing hash map implementations
• java.util.HashMap (uses separate chaining and boxes primitives) for a long -> int lookup: allocations galore
• net.openhft.koloboke primitive open hash map: 45% improvement
For the JVM, use libraries like https://github.com/OpenHFT/Koloboke. For C++, try https://github.com/preshing/CompareIntegerMaps or similar.
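For illustration, a hand-rolled sketch of the layout the last few diagrams describe (not the SignalFx code): a linear-probing, open-addressing map from a primitive long key to an int index, with all keys in one flat array so collision probes walk adjacent slots. The returned index points into a separate dense array of rollup slots.

// Minimal open-addressing long -> int index map with linear probing.
// Sketch only: no resizing, so keep the table well under half full.
public class LongToIndexMap {
    private static final long EMPTY = Long.MIN_VALUE; // reserved sentinel key
    private final long[] keys;    // flat key table: probes touch adjacent slots
    private final int[] indexes;  // parallel table of indexes into the value array
    private final int mask;

    public LongToIndexMap(int capacityPow2) {  // capacity must be a power of two
        keys = new long[capacityPow2];
        indexes = new int[capacityPow2];
        java.util.Arrays.fill(keys, EMPTY);
        mask = capacityPow2 - 1;
    }

    private static int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L;    // Fibonacci hashing
        return (int) (h >>> 40);               // top bits, always non-negative
    }

    public void put(long key, int index) {
        int slot = hash(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;          // linear probe: the next slot over
        }
        keys[slot] = key;
        indexes[slot] = index;
    }

    // Returns the stored index, or -1 if the key is absent.
    public int get(long key) {
        int slot = hash(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) return indexes[slot];
            slot = (slot + 1) & mask;
        }
        return -1;
    }
}

In production you would reach for a proven primitive map (Koloboke, as above) rather than hand-rolling one; the sketch only shows why the layout is cache-friendly.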
Access patterns
[Diagram: four objects, each with Fields 0 through 4 laid out together; hot and cold fields are interleaved within every object.]
Group fields accessed together
[Diagram: the same data with hot fields (Fields 0-2) packed together and cold fields (Fields 3-4) moved to a separate region.]
Results of separating hot and cold data
A hot loop run about once every 500 ms
• Old: hot and cold data kept together; 5 cache lines per time series; took anywhere between 62-70 ms
• New: hot and cold data kept separate; 3 cache lines of hot data per time series; took anywhere between 40-45 ms
• A 35% improvement (a sketch of the split follows)
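A sketch of what the split can look like in code, with hypothetical field names: hot fields move into flat parallel arrays (structure-of-arrays) that the 500 ms sweep streams through, while cold fields stay behind in a separate, rarely touched structure.

// Before: one object mixes fields the hot loop reads every sweep with
// fields touched only occasionally, dragging cold bytes through the cache.
class TimeSeriesRollupMixed {
    long lastValue;     // hot: read every sweep
    long sum;           // hot
    int count;          // hot
    String metricName;  // cold: only needed on create/emit
    long createdAtMs;   // cold
}

// After: hot fields live in flat parallel arrays indexed by the time
// series' slot; cold fields sit in a separate array of objects.
class RollupStore {
    final long[] lastValue;
    final long[] sum;
    final int[] count;
    final ColdData[] cold;   // rarely touched

    RollupStore(int capacity) {
        lastValue = new long[capacity];
        sum = new long[capacity];
        count = new int[capacity];
        cold = new ColdData[capacity];
    }

    // The hot loop: a pure sequential walk over tightly packed primitives.
    long sweep() {
        long total = 0;
        for (int i = 0; i < sum.length; i++) {
            total += sum[i];
        }
        return total;
    }

    static class ColdData {
        String metricName;
        long createdAtMs;
    }
}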
Results (library)!
Old vs new
• Concurrent -> single-threaded; locks gone
• Array-based data structures
• Zero allocations
• Extensive batching and hardware prefetching
• Multiple hash map lookups -> a single hash map lookup
Old vs new: 76 K/sec vs 2.1 M/sec: a 27x improvement, on the library that made up 35% of the application profile.
Results (application)!
[Chart: Amdahl's law for a 35% fraction: overall speedup (y-axis, 0 to 1.6) against library speedup (x-axis, 1 to ∞); the measured CPU improvement was 3.4x.]
35% of the profile, but a 3.4x improvement?
• Amdahl's law: at most a 1.54x improvement if that 35% drops to 0% (worked out below)
• Why 3.4x, then? When you use less cache, you leave more for others, thus speeding up other code too
• Lesson: a profiler is a necessary tool, but not a substitute for informed design
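The 1.54x bound comes straight from Amdahl's law, with p = 0.35 the library's share of the profile and s its speedup:

S_{\text{overall}} = \frac{1}{(1 - p) + \frac{p}{s}},
\qquad
\lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - 0.35} \approx 1.54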
Heap growth
[Chart: heap growth.]
Closing remarks / rant
• “Write code first, profile later” = BAD
• Excessive encapsulation leads to myopic performance decisions: allocations everywhere, "thread safe" code everywhere
• Beware of microbenchmarks
Thank you!
Rajiv Kurian
[email protected], @rzidane360
We're hiring: [email protected]
@SignalFx - signalfx.com
Bonus slides
Composition in C
[Diagram: Struct A is a flat run of ints; Struct B embeds Struct B1 and Struct B2 inline, so all of its ints are contiguous in a single allocation.]
Composition in Java
[Diagram: Object A is a flat run of ints, but Object B holds references to separately allocated Object B1 and Object B2.]
Actual layout?
[Diagram, built up over several slides: Object B is a header plus B1* and B2* pointers; Object B1 and Object B2 are separate allocations, each with its own header and two ints.]
Potential layout after GC
[Diagram: after garbage collection, B, B1, and B2 can end up separated by other data on the heap, so walking B -> B1 -> B2 touches scattered cache lines.]
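A sketch of the difference in Java (class names hypothetical): the nested version is three allocations and up to three dereferences per access, while the hand-flattened version packs the same ints behind one header, which is what a C embedded struct gives you for free.

// Java composition: B holds references, so B, B1, and B2 are three
// separate allocations with three headers, and the GC may scatter them.
class B1 { int x, y; }
class B2 { int p, q; }
class BNested {
    final B1 b1 = new B1();   // pointer to a separate object
    final B2 b2 = new B2();   // pointer to a separate object
    int sum() { return b1.x + b1.y + b2.p + b2.q; } // chases two pointers
}

// Hand-flattened equivalent of C's embedded structs: one allocation,
// one header, all fields contiguous. (Project Valhalla's value classes
// aim to give this layout without the manual flattening.)
class BFlat {
    int b1x, b1y;             // B1's fields, inlined
    int b2p, b2q;             // B2's fields, inlined
    int sum() { return b1x + b1y + b2p + b2q; }     // no pointer chasing
}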
Techniques inspired by HPC that have improved our pipeline
Separate the control and data planes: the data plane is frequent, the control plane infrequent
A networking concept
[Diagram: routing data (control plane) updates the routing table, a key/value structure; packets in and out (data plane) consult it.]
What the control and data planes do
In networking terminology:
• Data plane: the part that decides what to do with packets arriving on an inbound interface. Frequent.
• Control plane: the part concerned with drawing the network map or routing table. Infrequent.
The goal of control and data plane separation
DO NOT slow the frequent path because of the infrequent path
Runtime configuration variables
[Diagram: a setter thread writes configuration variables (volatile/atomic Flags 0-3); the worker thread reads them inside its loop:]

while (1) {
    process_data_using_configuration_variables();
}
Runtime configuration variables
[Diagram: the same setter thread and flags, but the worker snapshots them into cached configuration variables once per loop iteration:]

while (1) {
    cache_configuration_variables();
    process_a_ton_of_stuff();
}
Volatile/atomic flag vs cached local flag (sketch below)
• Before: all runtime flags (used on every data point) were volatile/atomic loads
• After: all runtime flags are cached and refreshed on each run loop
• About an 8% improvement in datapoints/second; others might see more or less
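A sketch of both variants in Java, with hypothetical flag names: the setter thread can flip the volatile fields at any time; the fast worker snapshots them once per batch instead of paying a volatile load per data point.

// Control plane: a setter thread flips these at runtime (infrequent).
// Data plane: the worker consults them on every data point (frequent).
public class ConfigFlags {
    volatile boolean compressionEnabled;  // hypothetical runtime flags
    volatile boolean extraValidation;

    // Slow variant: a volatile load per flag, per data point.
    void processSlow(long[] batch) {
        for (long point : batch) {
            if (compressionEnabled) { /* ... */ }
            if (extraValidation) { /* ... */ }
        }
    }

    // Fast variant: snapshot the flags once per run-loop iteration, then
    // process the whole batch against plain local copies.
    void processFast(long[] batch) {
        boolean compress = compressionEnabled;   // one volatile load
        boolean validate = extraValidation;      // one volatile load
        for (long point : batch) {
            if (compress) { /* ... */ }
            if (validate) { /* ... */ }
        }
    }
}

Updates are still observed, just with per-batch rather than per-point latency, which is exactly the control-plane/data-plane trade.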