[若渴計畫] Studying Concurrency


Studying Concurrency

2017.1.22

<[email protected]>

AJMachine

From being lost to convergence

Outline

• Why is writing concurrent code hard?

• Programmer-observable behavior

• Some examples of concurrency performance techniques

• Some concurrency security examples

Why Is Writing Concurrency Hard?

• Hardware optimizations

• Compiler optimizations

The result: unpredictable behavior

Hardware Optimizations - Write Buffer

• On a write, a processor simply inserts the write operation into the write buffer and proceeds without waiting for the write to complete

• The write buffer exists to effectively hide the latency of write operations

• Therefore, P1 and P2 can both end up inside their critical sections at the same time (mutual exclusion is broken)

Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”
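A minimal sketch of this Dekker-style situation, assuming flag variables named Flag1/Flag2 as in the tutorial: relaxed C11 atomics leave the hardware free to buffer each store, so both loads can return 0 and both threads can be in their critical sections at once.

/* Minimal sketch, assuming Flag1/Flag2 as in the tutorial's write-buffer example.
 * Relaxed atomics keep the hardware free to buffer each store, so both loads can
 * return 0 and both threads can enter their critical sections simultaneously. */
#include <stdatomic.h>

static atomic_int Flag1, Flag2;

void p1(void) {
    atomic_store_explicit(&Flag1, 1, memory_order_relaxed); /* may linger in P1's write buffer */
    if (atomic_load_explicit(&Flag2, memory_order_relaxed) == 0) {
        /* critical section: P2 may be here at the same time */
    }
}

void p2(void) {
    atomic_store_explicit(&Flag2, 1, memory_order_relaxed); /* may linger in P2's write buffer */
    if (atomic_load_explicit(&Flag1, memory_order_relaxed) == 0) {
        /* critical section */
    }
}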

Hardware Optimizations - Overlapped Writes

• Assume the Data and Head variables reside in different memory modules

• The write to Head may be injected into the network before the write to Data has reached its memory module

• Therefore, it is possible for another processor to observe the new value of Head and yet obtain the old value of Data

• This is a reordering of write operations

Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”

(coalesced write)
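A minimal sketch of the Data/Head pattern, reusing the tutorial's variable names: with relaxed ordering (or plain stores on a weakly ordered machine), the reader may observe Head == 1 and still read the old value of Data. A fixed version appears later in the C/C++11 memory model section.

/* Minimal sketch of the Data/Head message-passing pattern, assuming the
 * tutorial's variable names. Relaxed ordering allows the reader to see the new
 * Head but the old Data on ARM/POWER-like machines. */
#include <stdatomic.h>

static atomic_int Data, Head;

void producer(void) {
    atomic_store_explicit(&Data, 2000, memory_order_relaxed);  /* may reach memory late */
    atomic_store_explicit(&Head, 1, memory_order_relaxed);     /* may become visible first */
}

int consumer(void) {
    while (atomic_load_explicit(&Head, memory_order_relaxed) == 0)
        ;                                                      /* wait for the flag */
    return atomic_load_explicit(&Data, memory_order_relaxed);  /* can still be the old value 0 */
}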

Hardware Optimizations - Non−blocking Reads

• If P2 is allowed to issue its read operations in an overlapped fashion, the read of Data may arrive at its memory module before the write from P1, while the read of Head reaches its memory module after the write from P1

• Result: P2 observes the new value of Head and yet obtains the old value of Data

Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”

(coalesced read)

For a more detailed look at how this works in hardware, see "Memory Barriers: a Hardware View for Software Hackers"

So what can we do? Ideally:

• Sequential Consistency (the per-core program order is preserved in a single global order of all cores' operations)

– The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program

• There is no local reordering

• Each write becomes visible to all threads

Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”

Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”

In practice, SC is not guaranteed

Memory model            Example      Local ordering   Multiple-copy atomic
Total store ordering    Intel x86    X                O
Relaxed memory model    ARM          X                X
(O = guaranteed, X = not guaranteed)

Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”

Developers must write code themselves (barriers, fences) to manage the ordering of memory operations
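One portable way to manage the ordering yourself is the C11 fence API. The sketch below is my own illustration, not from the slides: a full sequentially consistent fence between the store and the load of the earlier write-buffer example forbids the "both threads read 0" outcome, and the compiler lowers it to whatever the target needs (an MFENCE or locked instruction on x86, a DMB on ARM).

/* Minimal sketch: seq_cst fences restore the intended behavior of the earlier
 * write-buffer (Dekker-style) example by ordering each flag store before the
 * following flag load. */
#include <stdatomic.h>

static atomic_int Flag1, Flag2;

int p1_try_enter(void) {
    atomic_store_explicit(&Flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);               /* order the buffered store before the load */
    return atomic_load_explicit(&Flag2, memory_order_relaxed) == 0;
}

int p2_try_enter(void) {
    atomic_store_explicit(&Flag2, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&Flag1, memory_order_relaxed) == 0;
}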

With all these hardware optimizations, how do I know how my program will actually behave (programmer-observable behavior)?

• Mathematically rigorous architecture definitions – Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”

• Hardware semantics – Shaked Flur et al., “Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA”

• C/C++11 memory model

• …?

Mathematically Rigorous Architecture Definitions – For Example

• Message Passing (MP)

Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”

P0: x = 1; y = 1        P1: r1 = y; r2 = x

Final state r1 = 1 ∧ r2 = 0: x86-TSO forbidden, ARM allowed

Partial-order Propagation

Does partial-order propagation always affect program behavior? Not necessarily.

• MP test harness

• m is the number of times that the final outcome of r1=1 ∧ r2=0 was observed in n trials
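A rough user-space harness for this m-out-of-n measurement might look like the sketch below. This is my own illustration; real litmus harnesses (e.g. the litmus7 tool) add affinity, alignment and timing tricks to provoke the relaxed outcome far more often.

/* Rough MP test harness sketch: run the two threads n times and count how often
 * the relaxed outcome r1 == 1 && r2 == 0 is observed. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int x, y;
static int r1, r2;

static void *writer(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* data */
    atomic_store_explicit(&y, 1, memory_order_relaxed);   /* flag */
    return NULL;
}

static void *reader(void *arg) {
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    long n = 100000, m = 0;
    for (long i = 0; i < n; i++) {
        atomic_store(&x, 0); atomic_store(&y, 0);
        pthread_t tw, tr;
        pthread_create(&tw, NULL, writer, NULL);
        pthread_create(&tr, NULL, reader, NULL);
        pthread_join(tw, NULL);
        pthread_join(tr, NULL);
        if (r1 == 1 && r2 == 0) m++;        /* forbidden on x86-TSO, allowed on ARM/POWER */
    }
    printf("r1=1 && r2=0 observed %ld times in %ld trials\n", m, n);
    return 0;
}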

Hardware Semantics

Shaked Flur et al., “Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA”


Web Site of Hardware Semantics

http://www.cl.cam.ac.uk/~sf502/popl16/help.html

Result of Hardware Semantics

http://www.cl.cam.ac.uk/~sf502/popl16/help.html

If there are concurrent accesses to the same location (for example, a badly written lock), the result information makes it possible to spot the problem early.

C/C++11 Memory Model

• At the language level, the standard defines keywords (atomic types and memory orders) so that every hardware platform must conform to the language's memory model. – https://www.youtube.com/watch?v=S-x-23lrRnc

• The video mentions that on ARM, in order to satisfy the C11 memory model, the compiler can end up emitting double barriers

• Reinoud Elhorst, “Lowering C11 Atomics for ARM in LLVM”

– Torvald Riegel, “Modern C/C++ concurrency”

• Semantics – Mark Batty, “Mathematizing C++ Concurrency”
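As a concrete illustration of what the language-level model buys you, here is a minimal C11 sketch reusing the earlier Data/Head example (the C++11 std::atomic version is analogous): a release store and an acquire load on Head are enough to make the reader see Data = 2000, and the compiler must emit whatever barriers each target needs (the double-barrier issue mentioned above arises from exactly this kind of lowering on ARM).

/* Minimal sketch: fixing the Data/Head example with C11 release/acquire. */
#include <stdatomic.h>

static int Data;                 /* plain data, protected by the Head flag */
static atomic_int Head;

void producer(void) {
    Data = 2000;
    atomic_store_explicit(&Head, 1, memory_order_release);   /* publishes Data */
}

int consumer(void) {
    while (atomic_load_explicit(&Head, memory_order_acquire) == 0)
        ;                                                      /* wait until published */
    return Data;                                               /* guaranteed to read 2000 */
}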

Mathematizing C++ Concurrency

• Uses Isabelle/HOL to write down the semantics of the C++ memory model

• For example: the definition of a release sequence

Some Examples of Concurrency Performance Techniques

• LMAX

• RCU

• Concurrent malloc(3)

• An Analysis of Linux Scalability to Many Cores

LMAX: New Financial Trading Platform

https://martinfowler.com/articles/lmax.html

LMAX Lock-free Techniques

http://mechanitis.blogspot.tw/2011/06/dissecting-disruptor-how-do-i-read-from.html

• Applying barriers here means replacing the original locks with a lock-free design; lock-free can be thought of as letting the hardware manage the "lock". The basic implementation concept is still much like a lock: consumers queue up and wait their turn.

• RingBuffer: improves response time
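To illustrate the "sequence numbers instead of locks" idea, here is a heavily simplified single-producer/single-consumer ring buffer sketch in C11. It is not the Disruptor itself (which is a Java library with batching, multiple consumers and further cache-line padding); all names are illustrative.

/* Heavily simplified SPSC ring buffer: head/tail sequence numbers instead of a lock. */
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024                       /* power of two */

struct ring {
    long buf[RING_SIZE];
    _Alignas(64) atomic_ulong head;          /* next slot the producer will write */
    _Alignas(64) atomic_ulong tail;          /* next slot the consumer will read  */
};

bool ring_push(struct ring *r, long v) {
    unsigned long head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned long tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;                        /* full: the producer waits its turn */
    r->buf[head % RING_SIZE] = v;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);  /* publish */
    return true;
}

bool ring_pop(struct ring *r, long *out) {
    unsigned long tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned long head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return false;                        /* empty */
    *out = r->buf[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);  /* free the slot */
    return true;
}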

Read Copy Update (RCU)

• Read-mostly situations

• Typical RCU splits an update into removal and reclamation phases – The removal phase (removing or replacing references to data items) can run concurrently with readers

– Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it

– RCU provides implicit low-overhead communication between readers and reclaimers (synchronize_rcu())

https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt https://lwn.net/Articles/262464/
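A minimal sketch of the canonical reader/updater pattern, essentially the example from whatisRCU.txt (kernel code, with a hypothetical struct foo and gbl_foo pointer):

/* Minimal RCU sketch after whatisRCU.txt: readers are cheap, the updater
 * publishes a new version (removal) and frees the old one after a grace
 * period (reclamation). */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
    int a;
};

static struct foo __rcu *gbl_foo;
static DEFINE_SPINLOCK(foo_lock);

int foo_get_a(void)                          /* reader */
{
    int a;

    rcu_read_lock();
    a = rcu_dereference(gbl_foo)->a;         /* safe snapshot of the pointer */
    rcu_read_unlock();
    return a;
}

void foo_update_a(int new_a)                 /* updater */
{
    struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
    struct foo *old_fp;

    spin_lock(&foo_lock);
    old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_lock));
    *new_fp = *old_fp;
    new_fp->a = new_a;
    rcu_assign_pointer(gbl_foo, new_fp);     /* removal: new readers see the new version */
    spin_unlock(&foo_lock);
    synchronize_rcu();                       /* wait for pre-existing readers (grace period) */
    kfree(old_fp);                           /* reclamation */
}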

What if the grace period is too long?

https://lwn.net/Articles/253651/

There are many more RCU variants; they will have to wait for another time

https://lwn.net/Articles/264090/

Concurrent malloc(3)

• How to avoid false cache sharing

– Modern multi-processor systems preserve a coherent view of memory on a per-cache-line basis

• How to reduce lock contention

Jason Evans, “A Scalable Concurrent malloc(3) Implementation for FreeBSD”
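As a minimal illustration of false cache-line sharing (my sketch, assuming 64-byte cache lines): two per-thread counters that land in the same line bounce between cores even though each thread only touches its own field; padding each field to its own line trades space for independence, which is exactly the fragmentation trade-off discussed on the next slide.

/* Minimal false-sharing sketch, assuming 64-byte cache lines. */
#include <stdatomic.h>

struct counters_shared_line {       /* both fields likely share one cache line */
    atomic_long a;                  /* updated by thread A */
    atomic_long b;                  /* updated by thread B: the line ping-pongs */
};

struct counters_padded {            /* each field gets its own cache line */
    _Alignas(64) atomic_long a;
    _Alignas(64) atomic_long b;
};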

jemalloc

• phkmalloc was specially optimized to minimize the working set of pages, whereas jemalloc must be more concerned with cache locality

• jemalloc first tries to minimize memory usage, and tries to allocate contiguously (weaker security)

• One way of fixing this issue is to pad allocations, but padding is in direct opposition to the goal of packing objects as tightly as possible; it can cause severe internal fragmentation. jemalloc instead relies on multiple allocation arenas to reduce the problem

• One of the main goals for this allocator was to reduce lock contention for multi-threaded applications: rather than using a single allocator lock, each free list had its own lock

• The solution was to use multiple arenas for allocation, and assign threads to arenas via hashing of the thread identifiers

Jason Evans, “A Scalable Concurrent malloc(3) Implementation for FreeBSD”
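A rough sketch of the "hash the thread identifier to pick an arena" idea follows. The names are illustrative, and jemalloc's real code is far more elaborate (per-thread caches, and later versions assign arenas round-robin rather than by hashing).

/* Illustrative sketch only: spread lock contention over NARENAS locks by
 * hashing the calling thread's identifier. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NARENAS 8

struct arena {
    pthread_mutex_t lock;
    /* ... free lists, runs, chunks ... */
};

static struct arena arenas[NARENAS];

void arenas_init(void) {
    for (int i = 0; i < NARENAS; i++)
        pthread_mutex_init(&arenas[i].lock, NULL);
}

static struct arena *choose_arena(void) {
    /* Cast assumes pthread_t is an integral type, as on Linux. */
    uintptr_t id = (uintptr_t)pthread_self();
    return &arenas[(id >> 4) % NARENAS];
}

void *my_malloc(size_t size) {
    struct arena *a = choose_arena();
    pthread_mutex_lock(&a->lock);
    void *p = NULL;                 /* ... allocate from this arena's data structures ... */
    pthread_mutex_unlock(&a->lock);
    return p;
}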

Scalability Collapse Caused by Non-scalable Locks

Linux Scalability to Many Cores - Per-core Mount Caches

Silas Boyd-Wickizer et al., “An Analysis of Linux Scalability to Many Cores”

• Observation: mount table is rarely modified

• Common case: cores access per-core tables

• Modify mount table: invalidate per-core tables

Linux Scalability to Many Cores - Sloppy Counters

• Because frequently updating a shared reference count is slow (the counter's cache line bounces between cores)

Silas Boyd-Wickizer et al., “An Analysis of Linux Scalability to Many Cores”
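A rough sketch of the sloppy-counter idea (my illustration, not the paper's code): each thread keeps a small batch of spare references locally so that most gets and puts never touch the shared counter. The invariant is that the global counter equals the real reference count plus all local spares.

/* Sloppy-counter sketch: global_refs == real references + all local spares. */
#include <stdatomic.h>

#define BATCH 16

static atomic_long global_refs;                 /* the shared counter */
static _Thread_local long local_spare;          /* references held back locally */

void ref_get(void) {
    if (local_spare > 0) {
        local_spare--;                          /* hand out a locally cached reference */
    } else {
        atomic_fetch_add(&global_refs, BATCH);  /* grab a batch from the shared counter */
        local_spare = BATCH - 1;
    }
}

void ref_put(void) {
    if (local_spare < BATCH) {
        local_spare++;                          /* keep the released reference locally */
    } else {
        atomic_fetch_sub(&global_refs, BATCH);  /* return a full batch */
        local_spare = 1;
    }
}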

Some Concurrency Security Examples

• Concurrency fuzzer

– Sebastian Burckhardt et al., “A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs”

• Timing side channel attack

– Yeongjin Jang et al., “Breaking Kernel Address Space Layout Randomization with Intel TSX”

Concurrency Fuzzer - Randomized Scheduler

Sebastian Burckhardt et al., “A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs”

Randomized Scheduler

Basically, read/write reordering in hardware is not simulated
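This is not the paper's PCT algorithm, but as a crude illustration of schedule perturbation, a fuzzer can simply inject random yields or short sleeps at instrumented synchronization points so that repeated runs explore different interleavings (and, as noted above, hardware read/write reordering is still not simulated):

/* Crude schedule-perturbation sketch, not the PCT algorithm: call fuzz_point()
 * before or after shared-memory accesses under test; seed rand() per run. */
#include <sched.h>
#include <stdlib.h>
#include <time.h>

void fuzz_point(void) {
    switch (rand() % 3) {
    case 0:
        sched_yield();                                   /* give another thread a turn */
        break;
    case 1: {
        struct timespec ts = {0, (rand() % 100) * 1000}; /* sleep up to ~0.1 ms */
        nanosleep(&ts, NULL);
        break;
    }
    default:
        break;                                           /* run through undisturbed */
    }
}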

Finding Violations (Ordering / Atomicity)

The slide deck “Concurrency: A problem and opportunity in the exploitation of memory corruptions” surveys several such fuzzers

Intel Transactional Synchronization Extensions

• The assembly instruction xbegin can return various results that represent the hardware's suggestions for how to proceed and the reasons for failure: success, a suggestion to retry, or a potential cause for the abort

• To effectively use TSX it's imperative to understand its implementation and limitations. TSX is implemented using the cache-coherence protocol, which x86 machines already implement. When a transaction begins, the processor starts tracking the read and write sets of cache lines which have been brought into the L1 cache. If at any point during a logical core's execution of a transaction another core modifies a cache line in the read or write set, then the transaction is aborted.

Nick Stanley, “Hardware Transactional Memory with Intel’s TSX”
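A minimal lock-elision sketch using the RTM intrinsics (_xbegin/_xend/_xabort from <immintrin.h>, compiled with -mrtm) is shown below; it is my illustration rather than code from the cited article. The status word returned by _xbegin() carries the abort reason and retry hint described above, and the transaction reads the fallback lock so that it aborts whenever some thread is inside the fallback path.

/* Minimal RTM lock-elision sketch; compile with -mrtm. */
#include <immintrin.h>
#include <stdatomic.h>

static atomic_int fallback_lock;     /* 0 = free, 1 = held */
static long shared_counter;

void increment(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (atomic_load_explicit(&fallback_lock, memory_order_relaxed) != 0)
            _xabort(0xff);           /* someone is on the fallback path: abort */
        shared_counter++;            /* transactional; conflicting accesses trigger an abort */
        _xend();                     /* commit */
        return;
    }
    /* Aborted: status encodes why (e.g. _XABORT_RETRY, _XABORT_CONFLICT).
     * Always keep a real lock as the fallback, since a transaction may never succeed. */
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        ;                            /* spin until the lock is free */
    shared_counter++;
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}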

Intel Transactional Synchronization Extensions - Suppressing exceptions

• A transaction aborts when such a hardware exception occurs during its execution. However, unlike normal situations where the OS intervenes and handles these exceptions gracefully, TSX instead invokes a user-specified abort handler without informing the underlying OS. More precisely, TSX treats these exceptions in a synchronous manner, immediately executing the abort handler while suppressing the exception itself. In other words, an exception inside the transaction is never communicated to the underlying OS. This allows a program to engage in abnormal behavior (e.g., attempting to access privileged, i.e., kernel, memory regions) without crashing. In DrK, this surprising behavior is turned into a timing channel that leaks the status (mapped or unmapped) of kernel pages.

Timing Side Channel Attack

• TSX instead invokes a user-specified abort handler, without informing the underlying OS

• In other words, from user space we can learn kernel addresses even under KASLR (!!!)

Yeongjin Jang et al., “Breaking Kernel Address Space Layout Randomization with Intel TSX”
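A rough sketch of the DrK-style probe described above (illustrative only; thresholds and the full measurement methodology are in the paper): touch a kernel address inside a transaction, let the suppressed fault abort it, and time the abort from user space. The paper reports that mapped kernel pages abort measurably faster than unmapped ones, and that difference leaks the mapping status.

/* DrK-style probe sketch: time how quickly a transaction aborts when it touches
 * a kernel address; compile with -mrtm on an RTM-capable CPU. */
#include <immintrin.h>
#include <x86intrin.h>
#include <stdint.h>

uint64_t probe_once(volatile char *kaddr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    if (_xbegin() == _XBEGIN_STARTED) {
        (void)*kaddr;                /* faults, but the fault is suppressed:      */
        _xend();                     /* the transaction just aborts, no signal    */
    }
    return __rdtscp(&aux) - start;   /* abort latency in cycles for this probe    */
}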

Reference

• Sarita V. Adve, Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial”

• Luc Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models”

• Shaked Flur et al., “Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA”

• https://www.youtube.com/watch?v=6QU37TwRO4w

• http://www.cl.cam.ac.uk/~sf502/popl16/help.html

• Jade Alglave et al., “The Semantics of Power and ARM Multiprocessor Machine Code”

• Paul E. McKenney, “Memory Barriers: a Hardware View for Software Hackers”

Reference

C/C++11 memory model

• https://www.youtube.com/watch?v=S-x-23lrRnc

• Reinoud Elhorst, “Lowering C11 Atomics for ARM in LLVM”

• Torvald Riegel, “Modern C/C++ concurrency”

• Mark Batty, “Mathematizing C++ Concurrency”

LMAX

• https://github.com/LMAX-Exchange/disruptor

• https://martinfowler.com/articles/lmax.html

• http://mechanitis.blogspot.tw/2011/06/dissecting-disruptor-how-do-i-read-from.html

RCU

• https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt

• https://lwn.net/Articles/262464/

• https://lwn.net/Articles/253651/

• https://lwn.net/Articles/264090/

Reference

Concurrent malloc(3)

• Jason Evans, “A Scalable Concurrent malloc(3) Implementation for FreeBSD”

Concurrency security

• Sebastian Burckhardt et al., “A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs”

• Ralf-Philipp Weinmann et al., “Concurrency: A problem and opportunity in the exploitation of memory corruptions”

• Yeongjin Jang et al., “Breaking Kernel Address Space Layout Randomization with Intel TSX”

• Nick Stanley, “Hardware Transactional Memory with Intel’s TSX” (includes suggested ways of writing concurrent code with Intel TSX)