Streaming Multimedia Applications Development in MPSoC
Embedded Platforms
Design flow issues
Communication Middleware and Dataflow support library
Alessandro Dalla Torre,
Martino Ruggiero,
Luca Benini
Andrea Acquaviva
• Peculiar Model of Computation (MoC):
– Regular and repeating computation; applications based on streaming data.
– Easily fits multimedia applications like Audio/Video CODECs, DSP, networking…
• Independent processes (filters) with explicit communication:
– Independent address spaces and multiple program counters.
– Composable patterns.
• Natural expression of parallelism:
– Task, data and pipeline parallelism.
• Producer/consumer dependencies are exposed.
[Figure: FM radio dataflow graph: AtoD, FMDemod, Scatter into LPF1/LPF2/LPF3 and HPF1/HPF2/HPF3, Gather, Adder, Speaker]
Dataflow/Streaming
Types of parallelism: recall
• Which paradigm of communication?
– Message passing, producer/consumer.
• Efficiency of communication is a major requirement.
• Portability of code has to be kept in mind as well.
– Low-level optimization may compromise it.
• Compiler behaviour may interfere.
• Need for support to the programmer:
– Communication middleware.
Streaming applications issues on embedded MPSoC
MP-Queue library
• Distributed queue middleware:
– Support library for streaming channels in an MPSoC environment.
– Producer/consumer paradigm.
– FIFOs are circular buffers.
• Three main contributions:
1. Configurability: the MP-Queue library matches several heterogeneous architectural templates.
2. An architecture-independent efficiency metric is given.
3. It achieves both high efficiency and portability:
• synch operations optimized for minimal interconnect utilization;
• data transfer optimized for performance through analysis of disassembled code;
• portable C is the final implementation language.
[Figure: message delivering between producers (P1, P2) and consumers (C1, C2) through circular buffers with read and write counters]
Communication library primitives
1. Autoinit_system()
– Every core has to call it at the very beginning.
– Allocates data structures and prepares the semaphore arrays.
2. Autoinit_producer()
– To be called by a producer core only.
– Requires a queue id.
– Creates the queue buffers and signals its position to n consumers.
3. Autoinit_consumer()
– To be called by a consumer core only.
– Requires a queue id.
– Waits for n producers to be bound to the consumer structures.
4. Read()
– Gets a message from the circular buffer (consumer only).
5. Write()
– Puts a message into the circular buffer (producer only).
Architectural Flexibility
1. Multi-core architectures with distributed memories (scratchpads).
2. Purely shared-memory-based architectures.
3. Hybrid platforms (MPARM cycle-accurate simulator, different settings).
Transaction Chart
• Shared bus accesses are minimized as much as possible:
– Local polling on scratchpad memories.
• Insertion and extraction indexes are stored into shared memory and protected by mutex.
• Data transfer section involves shared bus
– Critical for performance.
Sequence diagrams
• Parallel activity of 1 producer and 1 consumer, delivering 20 messages.
• Synch time vs. pure data transfer:
– local polling on the scratchpad semaphore;
– signaling to the remote core's scratchpad;
– "pure" data transfer to and from the FIFO buffer in shared memory.
[Diagrams: message size 8 words vs. message size 64 words]
Communication efficiency
• Comparison against ideal point-to-point communication:
– 1-1 queue,
– 1-N queue,
– interrupt-based queue.
• For small messages the library synchronization overhead prevails,
• while for 256-word messages we reach good performance.
• Monotonicity; difference in asymptotic behaviour.
• Abrupt variation of the curve slope.
• Interrupt based notification adds overhead (~15%).
Ratio between:
– the real number of bus cycles needed to complete a 1-word transfer through MP-Queue-based message passing,
and:
– the number of bus cycles needed to perform 1 write + 1 read operation (ideal 1-word transfer, no overhead).
Not absolute timings, which depend on the CPU clock frequency, but a normalized metric:
• an architecture-independent metric.
Low-level optimizations are critical!
[Charts: 16 words per token vs. 32 words per token; the multiple-load/multiple-store loop is not produced any more by the compiler]
• The gcc compiler avoids inserting the multiple-load/multiple-store loop from 32 words on.
• Code size would otherwise rise sharply.
• Up to 256 words per token the code size growth can be tolerated.
Compiler-aware optimization benefits
• Compiler may be forced to unroll data transfer loops.
– New efficiency curves are obtained.
• About 15% improvement with 32 word sized messages.
• A typical JPEG 8x8 block is encoded in a 32-word struct.
– 8 x 8 x 16-bit data.
• Above 256 words, "loop unrolling" yields no further improvement.
Shared cacheable memory management with flush
• Flushing is needed to avoid "thrashing" if no cache coherency support is given.
• With small messages, flushing compromises performance.
• 64 words is the break-even size for cacheable-shared-based communication.
• Efficiency asymptotically rises to 98%!
JPEG Decoding parallelized through MP-Queue
• Starting from a sequential JPEG decoder:
1. Huffman extraction.
2. Inverse DCT.
3. Dequantization.
4. Frame reconstruction.
• Split-join parallelization.
• After step 1, the reconstructed 240 x 160 image is:
– split by the master core into slices;
– delivered to N worker cores through MP-Queue messages.
• 2 different architectural templates are explored.
[Figure: split-join topology: Splitter, N WORKER cores, Joiner]
Experimental part: metrics
[Chart: cost in bus transactions vs. number of parallel cores (serial, 1, 2, 4, 8); bars decomposed into initialization, serial region, parallel ideal, extra parallel overhead and communication overhead]
• Extra parallel overhead = parallel part exec time - (parallelizable part exec time / no. of parallel cores).
• Communication overhead = overall exec time - sequential part exec time - parallel part exec time.
JPEG split-join with one 1-N buffer in shared memory (cacheable)
[Chart: cost in bus transactions vs. number of parallel cores (serial, 1, 2, 4, 8)]
• We explored configurations with 1, 2, 4 and 8 workers (parallel cores).
• Execution time (cost in terms of bus transactions) scales down.
• The shared bus becomes quite busy with 8 parallel cores: the destination bottleneck negates the speed-up.
JPEG split-join with locally mapped (scratchpad) 1-1 queues
[Chart: cost in bus transactions vs. number of parallel cores (serial, 1, 2, 4, 8)]
• This version performs significantly better when N increases beyond 2.
• It eliminates the waiting time due to contention:
– on the bus,
– on the shared memory.
[Charts: side-by-side cost breakdowns in bus transactions vs. number of parallel cores (serial, 1, 2, 4, 8) for the two templates]
Comparison of the two templates
• In the second experiment (using scratchpad-located 1-1 queues) the communication overhead becomes negligible.
• Extra parallel overhead remains:
– it is mostly due to a synchronization mismatch;
– it could be removed through a pipelined execution of the JPEG decoding.
[Chart labels: CACHEABLE SHARED vs. SCRATCHPAD]
Synchronous Dataflow
• Each process has a firing rule:
– it consumes and produces a fixed number of tokens every time it fires.
• Predictable communication: easy scheduling.
• Well-suited for multi-rate signal processing.
• A subset of Kahn process networks: deterministic.
[Figure: SDF graph with port rates on the channels and an initial token (delay) on one channel]
MP-Queue middleware evolution: API extension towards the Dataflow MoC
1. Add_input_to_actor()
– Transforms the FIFO into an input channel for the specified actor.
2. Add_output_to_actor()
– Transforms the FIFO into an output channel for the specified actor.
3. Set_compute_function_for_actor()
– By means of a function pointer, associates a portion of code with an actor. That code is executed whenever the firing conditions are met, consuming the available input tokens.
4. Try_and_fire()
– Cyclic checking of the firing conditions.
5. Fire()
– Invoked by try_and_fire().
SDF3 tool
• Developed by Eindhoven University of Technology (TU/e).
– Sander Stuijk, Marc Geilen and Twan Basten, "SDF3: SDF For Free", in Proc. ACSD '06 (2006).
• Synchronous DataFlow extensive library– SDFG analysis.
– Transformation algorithms.
– Visualization functionality.
Analysis techniques: example 1
• Compute the repetition vector:
– how often each actor must fire, relative to the other actors, without a net change in the token distribution.
– in this example, the repetition vector indicates that:
• actor A should fire 3 times,
• actor B 2 times,
• actor C 1 time.
Analysis techniques: example 2
• Checking consistency:
– A non-consistent SDFG requires unbounded memory to execute, or it deadlocks during its execution.
– An SDFG is called consistent if its repetition vector is not equal to the null vector.
– This example shows an SDFG that is inconsistent:
– actor A produces too few tokens on the channel between A and B to allow an infinite execution of the graph.
Analysis techniques: example 3
• SDFG to HSDFG (Homogeneous SDFG) conversion:
– Any SDFG can be converted to a corresponding HSDFG in which all rates are equal to one.
– The number of actors in the resulting HSDFG depends on the actors' entries in the repetition vector.
– The conversion can lead to an exponential increase in the number of actors in the HSDFG.
Analysis techniques: example 4
• MCM (Maximum Cycle Mean) analysis:
– The most commonly used means to compute the throughput of an SDFG.
– Each cycle mean is computed as the cycle's total execution time over the number of tokens in that cycle.
– The critical cycle (the cycle limiting the throughput) is the upper cycle in the HSDFG.
– The execution time of the actors on that cycle equals two time-units, and only one token is present on the cycle.
Design flow
1. Give an XML representation of the dataflow.
2. Invoke the SDF3 tool to produce the abstract internal representation.
3. Produce a graphical output.
4. Compute statistics and execute some analysis techniques.
5. Generate C code.
6. Cross-compile.
7. Simulate on the virtual platform.
Mapping and scheduling issues
• How to map several actors onto a single processor tile in an optimal way?
• Scheduling issues; need for:
– either an embedded OS with preemption and time-slicing,
– or static analysis techniques to compute a feasible schedule that will not provoke any deadlocks.