Optimization Techniques for Developing Embedded Applications on Single-Chip Multiprocessor Platforms

Luca Benini
DEIS, Università di Bologna

Post on 13-Nov-2021



[Figure: microprocessor market shares (pie chart). General-purpose systems: 1%; embedded systems: 99%.]


Example Area: Automotive Electronics

What is “automotive electronics”?

Vehicle functions implemented with electronics:
• Body electronics
• System electronics: chassis, engine
• Information/entertainment

Automotive Electronics Market Size

Year    Market ($ billions)
1998    8.9
1999    10.5
2000    13.1
2001    14.1
2002    15.8
2003    17.4
2004    19.3
2005    21.0

[Figure: the same chart also plots the cost of electronics per car in dollars, on an axis from 0 to 1400; the per-year values are not recoverable from the transcript.]

90% of future innovations in vehicles will be based on electronic embedded systems.

2006: 25% of the total cost of a car will be electronics.


Automotive Electronics Platform Example

Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002

Digital Convergence – Mobile Example

Converging functions: broadcasting, telematics, imaging, computing, communication, entertainment.

• One device, multiple functions
• Center of the ubiquitous media network
• Smart mobile device: next driver for the semiconductor industry


4th Gen and Next-Gen Networks

Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS

SoC: Enabler for Digital Convergence

[Figure: SoC evolution from today to the future. Legible labels: > 100X performance, low power, complexity, storage, 4G/5G, DMB, WiBro, etc.]


Application pull [IMEC]

[Figure: applications placed by year of introduction (2005 to 2015) against the required power efficiency (5 GOPS/W, 100 GOPS/W, 1 TOPS/W). Applications shown: sign recognition, A/V streaming, adaptive route, collision avoidance, autonomous driving, 3D projected display, HMI by motion, gesture detection, ubiquitous navigation, Si X-ray, Gbit radio, UWB, 802.11n, structured encoding, structured decoding, 3D TV, 3D gaming, H264 encoding, H264 decoding, image recognition, full recognition (security), auto personalization, dictation, 3D ambient interaction, language, emotion recognition, gesture recognition, expression recognition, mobile base-band.]

MPSoC Platform Evolution

[Figure: tile-based MPSoC platform. Legible labels: 45 nm, <4 mm, <1 GHz, 30 Mtr, I/O peripherals, 3D stacked main memory, local memory hierarchy, NetInt, router, power/test management, bus-based multiprocessor. Layers above the hardware: applications; software optimization; middleware, RTOS, API, run-time controller; mapping; V, Vt, Fclk, IL.]

• Today’s SoCs could fit in 1 tile!
• Tile-based design


Multicores Are Here! [Amarasinghe06]

[Figure: number of cores per chip (1 to 512) versus year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]

MPSoC: 2005 ITRS roadmap [Martin06]

Year    Number of processing engines
2005    16
2006    23
2007    32
2008    46
2009    63
2010    79
2011    101
2012    133
2013    161
2014    212
2015    268
2016    348
2017    424
2018    526
2019    669
2020    878

[Figure: the chart also plots total logic size and total memory size, both normalized to 2005; their values are not recoverable from the transcript.]


Target System Application

[Figure: paired hardware and software levels. System/Service and Application S/W; Mobile Terminal and Middleware; Module and RTOS; Chip and HAL; Process and S/W IP.]

Requires design of hardware AND software.

SoC as Solution-on-a-Chip: system + e-SW + chip.

Design as optimization

• Design space: the set of “all” possible design choices
• Constraints: solutions that we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)
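The three ingredients above (design space, constraints, cost function) can be sketched as a brute-force search. This is only an illustration: the design parameters (core count, clock frequency) and both toy models are assumptions made up for the example, not taken from the slides.

```python
from itertools import product

# Hypothetical design space: every (cores, frequency in MHz) combination.
CORES = [1, 2, 4]
FREQS_MHZ = [50, 100, 200]

def exec_time_ms(cores, freq_mhz, work_mcycles=40.0):
    """Toy model: perfectly parallel work, time = cycles / (cores * f)."""
    return work_mcycles / (cores * freq_mhz) * 1000.0

def power_mw(cores, freq_mhz):
    """Toy model: power grows linearly with cores, quadratically with f."""
    return cores * (freq_mhz / 50.0) ** 2 * 10.0

def best_design(deadline_ms):
    """Return the feasible design point with minimum power, or None."""
    design_space = product(CORES, FREQS_MHZ)          # "all" design choices
    feasible = [d for d in design_space
                if exec_time_ms(*d) <= deadline_ms]    # constraint
    # Cost function: power of the design point.
    return min(feasible, key=lambda d: power_mw(*d), default=None)
```

Under these toy models, a tighter deadline pushes the search towards more cores or a faster clock, and an impossible deadline leaves no feasible point at all.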


Hardware synthesis

[Figure: an ALGORITHM (filter signal-flow graphs with adders, delays D and coefficients c1 to c8, k; amplitude in dB versus frequency) and an APPLICATION are mapped by HIGH-LEVEL SYNTHESIS onto an ARCHITECTURE (interconnect, ASIC, GP signal processor, MCM, memory), followed by LOGIC AND PHYSICAL SYNTHESIS.]

Behavioral synthesis

[Figure: a Control/Data Flow Graph (CDFG) is turned into an implementation built from registers, a multiplier and an adder.]


Allocation, Assignment, and Scheduling

[Figure: data-flow graph with delay (D), add (+), subtract (-) and shift (>>) nodes.]

• Allocation: how much? (2 adders, 1 shifter, 24 registers)
• Assignment: where? (e.g. Shifter 1)
• Schedule: when? (e.g. Time Slot 4)

Techniques well understood and mature.

Resource constraints

[Figure: two schedules of the same graph (additions +1, +2, +3 and multiplications *1, *2, *3) over 4 control steps.]

Control step    Schedule 1    Schedule 2
1               +1            +3
2               +2            +1 *2
3               +3 *1         +2 *3
4               *2 *3         *1


Scheduling under resource constraints

Intractable problem. Algorithms:
• Exact:
  • Integer linear program
  • Hu (restrictive assumptions)
• Approximate:
  • List scheduling
  • Force-directed scheduling
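As an illustration of the approximate family, here is a minimal list-scheduling sketch for unit-delay operations. The priority rule (longest path to a sink) is one common choice and an assumption here, and the small graph is hypothetical rather than the one drawn on the slides.

```python
def list_schedule(ops, preds, limits):
    """Greedy list scheduling with unit-delay operations.

    ops:    {name: resource type}, e.g. {"m1": "mul"}
    preds:  {name: set of predecessor names}
    limits: {resource type: units available per control step}
    Returns {name: control step}, with steps numbered from 1.
    """
    # Priority of an operation = length of the longest path to a sink.
    succs = {o: [s for s in ops if o in preds.get(s, ())] for o in ops}
    depth = {}

    def longest(o):
        if o not in depth:
            depth[o] = 1 + max((longest(s) for s in succs[o]), default=0)
        return depth[o]

    for o in ops:
        longest(o)

    start, step = {}, 1
    while len(start) < len(ops):
        used = {}
        # Ready = unscheduled, with every predecessor done in an earlier step.
        ready = [o for o in ops if o not in start
                 and all(start.get(p, step) < step for p in preds.get(o, ()))]
        for o in sorted(ready, key=lambda o: -depth[o]):
            kind = ops[o]
            if used.get(kind, 0) < limits[kind]:
                start[o] = step
                used[kind] = used.get(kind, 0) + 1
        step += 1
    return start

# Hypothetical graph: three multiplications and three additions.
OPS = {"m1": "mul", "m2": "mul", "m3": "mul",
       "a1": "add", "a2": "add", "a3": "add"}
PREDS = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
schedule = list_schedule(OPS, PREDS, {"mul": 1, "add": 1})
```

With one multiplier and one adder, the greedy pass fills each control step with the most urgent ready operations; the result is feasible but, unlike the ILP, carries no optimality guarantee.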

ILP formulation

Binary decision variables:
$$X = \{\, x_{il} : i = 1, 2, \dots, n;\ l = 1, 2, \dots, \lambda + 1 \,\}$$
• $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
• $\lambda$ is an upper bound on latency

Start time of operation $v_i$:
$$t_i = \sum_l l \cdot x_{il}$$

ILP formulation constraints

• Operations start only once:
$$\sum_l x_{il} = 1, \qquad i = 1, 2, \dots, n$$
• Sequencing relations must be satisfied:
$$t_i \ge t_j + d_j \qquad \forall (v_j, v_i) \in E$$
$$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \qquad \forall (v_j, v_i) \in E$$
• Resource bounds must be satisfied. Simple case (unit delay):
$$\sum_{i : T(v_i) = k} x_{il} \le a_k, \qquad k = 1, 2, \dots, n_{res};\ \forall l$$

ILP Formulation

$$\min \sum_l l \cdot x_{nl} \quad \text{such that}$$
$$\sum_l x_{il} = 1, \qquad i = 1, 2, \dots, n$$
$$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \qquad i, j = 1, 2, \dots, n,\ (v_j, v_i) \in E$$
$$\sum_{i : T(v_i) = k}\ \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \qquad k = 1, 2, \dots, n_{res};\ l = 0, 1, \dots, \lambda$$
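To make the three constraint families concrete, the sketch below builds the binary variables from a given schedule and checks each family in turn. It is a checker, not a solver, and the function name and argument layout are assumptions made for this example.

```python
def check_ilp_constraints(start, delay, kind, edges, avail, lam):
    """Check a schedule against the three ILP constraint families.

    start: {op: start step t_i}, expected in 1..lam+1
    delay: {op: d_i}; kind: {op: resource type k}
    edges: iterable of (vj, vi), meaning vj must finish before vi starts
    avail: {k: a_k}; lam: upper bound on latency
    """
    steps = range(1, lam + 2)
    # x[i][l] is 1 iff operation i starts in step l.
    x = {i: {l: int(start[i] == l) for l in steps} for i in start}

    # Operations start only once.
    starts_once = all(sum(x[i].values()) == 1 for i in x)
    # Sequencing: t_i - t_j - d_j >= 0 with t_i = sum_l l * x_il.
    t = {i: sum(l * x[i][l] for l in steps) for i in x}
    sequencing = all(t[i] - t[j] - delay[j] >= 0 for j, i in edges)
    # Resource bounds: operations of type k running in step l.
    resources = all(
        sum(x[i][m] for i in x if kind[i] == k
            for m in range(l - delay[i] + 1, l + 1) if m in x[i]) <= a_k
        for k, a_k in avail.items() for l in steps)
    return starts_once and sequencing and resources
```

For example, with operations a and c on a single adder and b on a single multiplier, and c depending on both a and b, starting a and b in step 1 and c in step 2 passes all three checks, while starting c in step 1 fails.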


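Example

Resource constraints:
• 2 ALUs; 2 multipliers
• $a_1 = 2$; $a_2 = 2$

Single-cycle operations:
• $d_i = 1 \quad \forall i$

[Figure: sequencing graph with source and sink NOP nodes (0 and n) and operations 1 to 11: multiplications, additions, subtractions and a comparison (<).]

Operations start only once:
$$x_{1,1} = 1$$
$$x_{6,1} + x_{6,2} = 1$$
…

Sequencing relations must be satisfied:
$$x_{6,1} + 2x_{6,2} - 2x_{7,2} - 3x_{7,3} + 1 \le 0$$
$$2x_{9,2} + 3x_{9,3} + 4x_{9,4} - 5x_{N,5} + 1 \le 0$$
…

Resource bounds must be satisfied:
$$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$$
$$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$$
…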


Example

[Figure: the sequencing graph (operations 1 to 11 between the source and sink NOP nodes) scheduled over TIME 1 to TIME 4.]

Resource-Efficient Application Mapping for MPSoCs

MULTIMEDIA APPLICATIONS

Given a platform:
1. Achieve a specified throughput
2. Minimize usage of shared resources


Optimization Development

• The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made by optimization tools.
• New methodology for multi-task application development on MPSoCs.

Application design flow

[Figure: flow with the boxes Platform Modelling, Optimization Analysis, Optimal Solution, Starting Implementation, Platform Execution and Final Implementation; an "abstraction gap" is marked in the flow.]

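Resource assignment and scheduling

[Figure: THE SYSTEM. Nodes 1 to N, each with a processor, a tightly coupled memory and a bus interface, connected by a SHARED SYSTEM BUS to an on-chip memory. Tasks A, B, …, N with worst-case execution times Ta, Tb, …, Tn are placed on a time wheel of period T.]

Annotations:
• Tightly coupled memory: limited size
• Shared bus: max bus bandwidth
• Time wheel: max time (period T)
• On-chip memory: assumed to be infinite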


The application

Signal processing pipeline: tasks T0, T1, T2, T3, …, T7, subject to a throughput constraint.

• Queues for inter-processor communication: in TCM for efficiency reasons
• Program data: in TCM (if space) or on-chip memory
• Internal state: in TCM (if space) or on-chip memory

Each task is characterized by:
• WCET
• Memory requirements

Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs

[Figure: the signal processing pipeline mapped onto a message-oriented MPSoC architecture: ARM7 cores, each with a local scratchpad memory and a private memory, connected by a bus.]

• Simplifying assumptions vs. predictability
• Efficient solutions in reasonable time
• Pure ILP formulations suitable for small task sets
• Widespread use of heuristics


Master Problem model

Assignment of tasks and memory slots (master problem)

• $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
• $Y_{ij} = 1$ if task $i$ allocates program data on processor $j$ memory, 0 otherwise
• $Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$ memory, 0 otherwise
• $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise

Each task must be allocated to exactly one processor:
$$\sum_j T_{ij} = 1 \qquad \forall i$$

Link between variables $X$ and $T$ (can be linearized):
$$X_{ij} = |T_{ij} - T_{i+1,j}| \qquad \forall i, j$$

If a task is not allocated to a processor, neither are its required memories:
$$T_{ij} = 0 \Rightarrow Y_{ij} = 0 \text{ and } Z_{ij} = 0$$

Objective function (communication cost, to be minimized):
$$\sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i \frac{X_{ij}}{2} \right]$$

Improvement of the model

With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all required memory on the local memory, so as to have ZERO communication cost: a TRIVIAL SOLUTION.

To improve the model, we add a relaxation of the subproblem to the master problem model.

For each set $S$ of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that they cannot all run on the same processor $j$:
$$\sum_{i \in S} WCET_i > RT \;\Rightarrow\; \sum_{i \in S} T_{ij} \le |S| - 1$$
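The relaxation can be generated mechanically from the WCET values. The sketch below enumerates, for each starting task, the shortest run of consecutive tasks whose total WCET exceeds the real-time requirement; the function name and the choice to emit only the shortest run per starting task are assumptions of this example.

```python
def consecutive_cuts(wcet, rt):
    """Yield runs of consecutive tasks whose total WCET exceeds rt.

    Each run S stands for the cut  sum_{i in S} T[i][j] <= |S| - 1  on
    every processor j: the tasks in S cannot all share one processor.
    Only the shortest violating run per starting task is produced.
    """
    n = len(wcet)
    for first in range(n):
        total = 0
        for last in range(first, n):
            total += wcet[last]
            if total > rt:
                yield list(range(first, last + 1))
                break  # longer runs from this start are implied by this one

# Hypothetical pipeline of four tasks with a real-time requirement of 7.
cuts = list(consecutive_cuts([4, 3, 5, 2], rt=7))
```

Each returned index list becomes one linear constraint per processor in the master problem, which rules out the trivial single-processor packing whenever that packing would violate the real-time requirement.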


Sub-Problem model

Task scheduling with static resource assignment (subproblem)

We have to schedule tasks, so we have to decide when they start.

• Activity starting time: $Start_i :: [0..Deadline_i]$
• Precedence constraints: $Start_i + Dur_i \le Start_j$
• Real-time constraints, for all activities running on the same processor: $\sum_i (Start_i + Dur_i) \le RT$
• Cumulative constraints on resources:
  • processors are unary resources: cumulative([Start], [Dur], [1], 1)
  • memories are additive resources: cumulative([Start], [Dur], [MR], C)
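The meaning of the cumulative constraint can be illustrated with a small checker: at any time point, the total requirement of the running activities must not exceed the resource capacity. This is a plain feasibility test written for illustration, not the propagation algorithm a CP solver would use.

```python
def cumulative(starts, durs, reqs, capacity):
    """True if the running activities never need more than `capacity`.

    Activity i occupies the half-open interval [starts[i], starts[i] + durs[i])
    and consumes reqs[i] units while it runs. The load can only increase at
    a start time, so checking the start times is sufficient.
    """
    for t in starts:
        load = sum(r for s, d, r in zip(starts, durs, reqs) if s <= t < s + d)
        if load > capacity:
            return False
    return True
```

A processor is the special case with all requirements equal to 1 and capacity 1, so `cumulative([0, 2, 5], [2, 3, 1], [1, 1, 1], 1)` holds while two overlapping activities on the same processor do not. A memory uses the actual memory requirements and the memory size as capacity.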

What about the bus??


Bus model

[Figure: bus bandwidth (bit/sec) over time, bounded by the max bus bandwidth, during the execution of task i and task j, showing the state reads and state writes of both tasks.]

• Unary resource: granularity of one clock cycle
• An arbitration mechanism decides the bus allocation

Additive bus model

[Figure: bus bandwidth (bit/sec) over time, bounded by the max bus bandwidth; labels: size of program data, task execution time, state reads and state writes of task i and task j.]

• Task0 accesses input data: BW = MaxBW/NoProc
• The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
• Bus traffic has to be minimized
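One way to read the additive model is that each task requests an average bandwidth over its execution time, and the requests of concurrent tasks add up. The sketch below checks the summed request against the 65% validity threshold quoted above; treating a task's request as bytes transferred divided by duration is an assumption of this example.

```python
def bus_load_ok(tasks, max_bw, threshold=0.65):
    """Check the additive bus model against its validity threshold.

    tasks: list of (start, duration, bytes_transferred); each task is assumed
    to spread its traffic evenly over its execution time.
    Returns True if the summed bandwidth request never exceeds
    threshold * max_bw.
    """
    limit = threshold * max_bw
    for t, _, _ in tasks:
        load = sum(nbytes / dur for s, dur, nbytes in tasks
                   if s <= t < s + dur)
        if load > limit:
            return False
    return True
```

For instance, two concurrent tasks that each request 30% of the bus stay within the model, while adding a third overlapping task that requests another 10% crosses the threshold.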


No-good generation

[Figure: the master problem (assignment of tasks and memory slots, IP solver) passes its solution to the sub-problem (task scheduling with static resource assignment, CP solver), which returns either a solution or a no-good.]

If no feasible schedule exists for the allocation provided by the master, a no-good is generated.

We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES $CR$. For each $R \in CR$, let $S_{TR}$ be the set of tasks allocated on $R$:
$$\sum_{i \in S_{TR}} T_{iR} \le |S_{TR}| - 1$$

Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract.
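The master/sub-problem interaction can be sketched as a loop. In this toy version a brute-force enumeration stands in for the IP solver and a per-processor load check stands in for the CP scheduler; both stand-ins, the cost model and all names are assumptions of the example, and only the no-good rule follows the slide.

```python
from itertools import product

def solve_lbbd(durations, comm, n_procs, rt):
    """Toy master/sub-problem loop with no-good generation.

    durations[i]: WCET of task i; comm[i]: cost paid when tasks i and i+1
    sit on different processors; rt: real-time bound per processor.
    Returns a tuple task -> processor, or None if nothing is feasible.
    """
    n = len(durations)
    no_goods = []  # (processor R, set of tasks that may not all be on R)

    def cost(assign):
        return sum(comm[i] for i in range(n - 1) if assign[i] != assign[i + 1])

    def violates(assign):
        # No-good (R, S): sum_{i in S} T_iR <= |S| - 1
        return any(all(assign[i] == r for i in s) for r, s in no_goods)

    while True:
        # Master problem: cheapest assignment not excluded by a no-good.
        candidates = [a for a in product(range(n_procs), repeat=n)
                      if not violates(a)]
        if not candidates:
            return None
        assign = min(candidates, key=cost)
        # Sub-problem: find the conflicting resources for this allocation.
        conflicts = [r for r in range(n_procs)
                     if sum(d for d, p in zip(durations, assign) if p == r) > rt]
        if not conflicts:
            return assign
        for r in conflicts:
            no_goods.append(
                (r, frozenset(i for i in range(n) if assign[i] == r)))
```

With three tasks of duration 4, a bound of 8 and two processors, the master first proposes packing everything on one processor, receives a no-good for each processor in turn, and then settles on the cheapest split.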

Computational efficiency

• CP and IP formulations simplified
• Hybrid approach clearly outperforms pure CP and IP techniques
• Search time bounded to 15 minutes
• Pure CP and IP found a solution in 50% of the instances or fewer
• Hybrid approach always found a solution


Validation of bus model

• Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
• Lower threshold (50%) in the presence of communication hotspots
• Benefits of the additive model:
  • task execution time almost independent of bus utilization
  • performance predictability greatly enhanced

Validation of optimizer solutions

• MAX error lower than 10%
• AVG error equal to 4.7%, with standard deviation of 0.08
• The optimizer turns out to be conservative in predicting infeasibility
• The flow was successfully applied to a GSM benchmark


Energy-Efficient Application Mapping for MPSoCs

MULTIMEDIA APPLICATIONS

Given a platform:
1. Achieve a specified throughput
2. Minimize power consumption

Application Mapping

• The problem of allocation, scheduling and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
• New tool flows for efficient mapping of multi-task applications onto hardware platforms

[Figure: a task graph (T1 to T8) is allocated onto processors Proc. 1 to Proc. N, each with a private memory, connected by an INTERCONNECT; scheduling and frequency selection then place the tasks on a resources-versus-time chart bounded by the deadline.]


Exploiting Voltage Supply

Supply voltage impacts power and performance:
• Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
• Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$

Just-in-time computation:
• Stretch execution time up to the max tolerable

[Figure: power versus available time, comparing fixed voltage + shutdown with variable voltage.]
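The two formulas above can be combined into a small calculation of the energy needed to run a fixed number of cycles at a given frequency. The parameter values used in the usage note (K = 1, Vt = 0, a = 1, Ceff = 1) are normalized assumptions chosen to make the numbers easy to follow, not technology data.

```python
def vdd_for_frequency(f, k, vt, a):
    """Invert T = 1/f = K / (Vdd - Vt)**a to get the supply voltage."""
    return vt + (k * f) ** (1.0 / a)

def power(f, ceff, k, vt, a):
    """P = Ceff * Vdd^2 * f at the lowest voltage that sustains f."""
    vdd = vdd_for_frequency(f, k, vt, a)
    return ceff * vdd ** 2 * f

def energy(cycles, f, ceff, k, vt, a):
    """E = P * t with t = cycles / f."""
    return power(f, ceff, k, vt, a) * (cycles / f)
```

With the normalized parameters, halving the frequency from 2 to 1 halves the voltage, cuts power from 8 to 1 (the cubic saving) and cuts the energy for 10 cycles from 40 to 10, at the price of doubling the execution time. This is why stretching execution up to the deadline beats running fast and shutting down.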

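Scheduling & Voltage Scaling

• Energy/speed trade-offs: varying the voltages ($V_{dd}$, $V_{bs}$) of the CPU
• Different voltages: different frequencies ($f_1$, $f_2$, $f_3$)
• Mapping and scheduling: given (fastest frequency)

[Figure: power versus time for tasks τ1, τ2, τ3 against a deadline. At the fastest frequency the tasks leave slack before the deadline; with voltage scaling they run at frequencies f1, f2, f3.]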


Target architecture - 2

• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches)
  • Tightly coupled software-controlled scratch-pad memories (SPM)
  • AMBA AHB
  • DMA engine
  • RTEMS OS
  • Technology-homogeneous (0.13 µm) industrial power models (ST)
• Variable voltage/frequency cores with discrete ($V_{dd}$, $f$) pairs
• Frequency dividers scale down the baseline 200 MHz system clock
• Cores use non-cacheable shared memory to communicate
• Semaphore and interrupt facilities are used for synchronization
• Private on-chip memory to store data

[Figure: tiles, each with a synchronization module and a private memory, connected through an AMBA AHB INTERCONNECT to a shared memory and an interrupt slave; a CLOCK TREE GENERATOR with a programmable register derives CLOCK 1 to CLOCK N and the interconnect clock (Int_CLK) from the system clock.]

Application model

A task graph represents:
• A group of tasks T
• Task dependencies
• Execution times expressed in clock cycles: WCN(Ti)
• Communication times (writes and reads) expressed as: WCN(WTiTj) and WCN(RTiTj)
• These values can be back-annotated from functional simulation

[Figure: six-task graph with edges Task1→Task2, Task1→Task3, Task2→Task4, Task3→Task5, Task4→Task6 and Task5→Task6. Each node Ti is annotated with WCN(Ti) and each edge Ti→Tj with WCN(WTiTj) and WCN(RTiTj).]
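The application model maps naturally onto a small data structure. The sketch below encodes the six-task graph from the figure and computes the worst-case number of cycles along its longest path, then converts cycles to time for a given clock divider. All cycle counts are made-up placeholders, and charging every edge its write and read cycles regardless of allocation is a simplifying assumption.

```python
BASE_CLOCK_HZ = 200e6  # baseline system clock of the target platform

# Edges of the six-task example graph; cycle counts are placeholders.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6)]
WCN_TASK = {1: 1000, 2: 2000, 3: 1500, 4: 500, 5: 3000, 6: 1000}
WCN_WRITE = {e: 100 for e in EDGES}  # WCN(WTiTj)
WCN_READ = {e: 100 for e in EDGES}   # WCN(RTiTj)

def critical_path_cycles():
    """Longest path through the graph, counting task, write and read cycles."""
    finish = {}
    for t in sorted(WCN_TASK):  # task ids happen to be in topological order
        ready = max((finish[i] + WCN_WRITE[(i, j)] + WCN_READ[(i, j)]
                     for i, j in EDGES if j == t), default=0)
        finish[t] = ready + WCN_TASK[t]
    return finish[max(WCN_TASK)]

def seconds(cycles, divider):
    """Execution time when the 200 MHz clock is divided by `divider`."""
    return cycles * divider / BASE_CLOCK_HZ
```

Because execution is expressed in cycles rather than seconds, the same annotations can be reused for every frequency setting: only the divider changes.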


Efficient Application Development Support

• Optimization tools generally rely on many simplifying assumptions.
• Neglecting these assumptions in the software implementation can:
  • generate unpredictable and undesired system-level interactions;
  • make the overall system error-prone.
• We propose a complete framework to help programmers with software implementation:
  • a generic customizable application template (OFFLINE SUPPORT);
  • a set of high-level APIs (ONLINE SUPPORT).
• The main goals of our development framework are:
  • exact and reliable execution of the application after the optimization step;
  • guarantees about high performance and constraint satisfaction.

Customizable Application Template

• Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
• Programmers can intuitively translate the high-level representation into C code using our facilities and library.
• Users can specify:
  • the number of tasks included in the target application;
  • their nature (e.g. branch, fork, or-node, and-node);
  • their precedence constraints (e.g. due to data communication);
  thus quickly drawing the application's CTG schema.
• Programmers can focus on the functionality of the tasks:
  • the main effort goes to the most specific and critical sections of the application.


OS-level and Task-level APIs

• Users can easily reproduce optimizer solutions, thus:
  • indirectly neglecting the optimizer’s abstractions:
    • task model;
    • communication model;
    • OS overheads;
  • obtaining the needed application constraint satisfaction.
• Programmers can allocate to the right hardware resources:
  • tasks;
  • program data;
  • queues.
• Scheduling support APIs:
  • frequency and voltage selection.
• Communication issues:
  • shared queues;
  • semaphores;
  • interrupts.

// Node behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer [..] [..] = {
    {0,1,1,0,..},
    {0,0,0,1,1,.},
    {0,0,0,0,0,1,1..},
    {0,0,0,0,..}..};

// Node type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

Example

• Number of nodes: 12
• Graph of activities
• Node type: Normal, Branch, Conditional, Terminator
• Node behaviour: Or, And, Fork, Branch
• Number of CPUs: 2
• Task allocation
• Task scheduling
• Arc priorities
• Frequency and voltage

[Figure: 12-task graph (T1 to T12) with typed nodes N1, B2, B3, C4, C5, C6, C7, N8, N9, N10, N11, T12, node behaviours (fork, or, and, branch) and arcs a1 to a14, together with its allocation on processors P1 and P2 and a resources-versus-time schedule bounded by the deadline.]

#define TASK_NUMBER 12


Queue ordering optimization

Communication ordering affects system performance.

[Figure, two animation steps: tasks T1 to T6 on CPU1 and CPU2 exchange data through queues C1 to C5; depending on the order in which the queues are served, a task either waits ("Wait!") or can run ("RUN!").]


Synchronization among tasks

[Figure: tasks T1 to T4 with queues C1 to C3, mapped on Proc. 1 and Proc. 2. T4 is suspended and later re-activated.]

• Non-blocking semaphores

Logic-Based Benders Decomposition

[Figure: allocation and frequency assignment (INTEGER PROGRAMMING; objective function: communication cost and energy consumption; memory constraints) passes a valid allocation to scheduling (CONSTRAINT PROGRAMMING; real-time constraint), which returns a no-good as a linear constraint.]

Decomposes the problem into 2 sub-problems:
• Allocation and assignment of frequency settings → IP
  • Objective function: minimizing energy consumption during execution and communication of tasks
• Scheduling → CP
  • Objective function: minimizing energy consumption during frequency switching


Solver Performance

• Hundreds of decision variables
• Far beyond ILP solver or CP solver capability

Allocation problem model

• $X_{tfp} = 1$ if task $t$ executes on processor $p$ at frequency $f$
• $W_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores, and task $i$ on core $p$ writes data to $j$ at frequency $f$
• $R_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores, and task $j$ on core $p$ reads data from $i$ at frequency $f$

Each task can execute on only one processor at one frequency:
$$\sum_{p=1}^{P} \sum_{f=1}^{M} X_{tfp} = 1 \qquad \forall t$$

Communication between two tasks can execute only once per execution, and one write corresponds to one read:
$$\sum_{p=1}^{P} \sum_{f=1}^{M} W_{ijfp} \le 1 \qquad \forall i, j \in T$$
$$\sum_{p=1}^{P} \sum_{f=1}^{M} R_{ijfp} \le 1 \qquad \forall i, j \in T$$
$$\sum_{p=1}^{P} \sum_{f=1}^{M} (W_{ijfp} - R_{ijfp}) = 0 \qquad \forall i, j \in T$$

The objective function minimizes the energy consumption associated with task execution and communication:
$$OF = En_{Comp} + En_{Read} + En_{Write}$$


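Allocation problem model

[Figure: two CPUs and a memory connected by a bus.]

Computation energy for all tasks in the system:
$$En_{Comp} = \sum_{p=1}^{P} \sum_{f=1}^{M} \sum_{t=1}^{T} X_{tfp} \cdot WCN_t \cdot E_f$$

Communication energy for writes to shared memory (writes are carried out at the same frequency as the task):
$$En_{Write} = \sum_{p=1}^{P} \sum_{f=1}^{M} \sum_{t=1}^{T} \left( X_{ifp} \cdot WCN_{W_{ij}Loc} + W_{ijfp} \cdot (WCN_{W_{ij}Rem} - WCN_{W_{ij}Loc}) \right) E_f$$

Communication energy for reads from shared memory (reads are carried out at the same frequency as the task):
$$En_{Read} = \sum_{p=1}^{P} \sum_{f=1}^{M} \sum_{t=1}^{T} \left( X_{ifp} \cdot WCN_{R_{ij}Loc} + R_{ijfp} \cdot (WCN_{R_{ij}Rem} - WCN_{R_{ij}Loc}) \right) E_f$$

Objective function:
$$OF = En_{Comp} + En_{Read} + En_{Write}$$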
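A hedged sketch of how such an energy objective could be evaluated for one candidate allocation follows. It assumes that a communication costs its local cycle count when producer and consumer share a core and its remote cycle count otherwise, and that a write and the matching read take the same number of cycles; the data layout and all names are invented for the example.

```python
def total_energy(assign, wcn, edges, e_cycle):
    """Energy of one allocation and frequency assignment.

    assign:  {task: (processor, freq)}
    wcn:     {task: worst-case number of execution cycles}
    edges:   {(i, j): (local_cycles, remote_cycles)}, used for both the
             write by i and the read by j (simplifying assumption)
    e_cycle: {freq: energy per cycle at that frequency}
    """
    en_comp = sum(wcn[t] * e_cycle[f] for t, (_, f) in assign.items())
    en_write = en_read = 0.0
    for (i, j), (loc, rem) in edges.items():
        (pi, fi), (pj, fj) = assign[i], assign[j]
        cycles = loc if pi == pj else rem  # Loc + (Rem - Loc) across cores
        en_write += cycles * e_cycle[fi]   # writer runs at its own frequency
        en_read += cycles * e_cycle[fj]    # reader runs at its own frequency
    return en_comp + en_read + en_write
```

Evaluating the same task pair once on a shared core and once on two cores shows the trade-off the IP model has to resolve: spreading tasks buys scheduling freedom but pays the remote communication cycles.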

Scheduling problem model

Five phases behaviour:
• INPUT = input data reading
• EXEC = computation activity
• OUTPUT = output data writing

Atomic activities: INPUT, EXEC, OUTPUT

The objective function minimizes the energy consumption associated with frequency switching.

• Processors are modelled as unary resources
• The bus is modelled as an additive resource

The duration of task i is now fixed, since its mode is fixed.

[Figure: task i and task j, each split into a reading phase (input), exec, and a writing phase (output), with fork and join of several input and output activities.]

Precedence constraints between task i and task j:
• Tasks running on the same processor at the same frequency:
$$Start_i + duration_i \le Start_j$$
• Tasks running on the same processor at different frequencies:
$$Start_i + duration_i + T_i \le Start_j$$
• Tasks running on different processors:
$$Start_i + duration_i + dWrite_{ij} + dRead_{ij} \le Start_j$$


Application Development Methodology

[Figure: the CTG goes through a characterization phase (simulator) that produces application profiles; the optimization phase (optimizer) produces the allocation and scheduling; application development support yields the optimal SW application implementation, which is then executed on the platform.]

Validation of optimizer solutions: Throughput

Optimizer → Optimal Allocation & Schedule → Virtual Platform validation (250 instances)

• MAX error lower than 10%
• AVG error equal to 4.51%, with standard deviation of 1.94
• All the deadline constraints are satisfied

[Figure: histogram of the throughput difference (%) over 250 instances; x axis from -5% to 11%, y axis: probability from 0 to 0.25.]


Validation of optimizer solutions: Power

Optimizer → Optimal Allocation & Schedule → Virtual Platform validation (250 instances)

• MAX error lower than 10%
• AVG error equal to 4.80%, with standard deviation of 1.71

[Figure: histogram of the energy consumption difference (%) over 250 instances; x axis from -5% to 11%, y axis: probability from 0 to 0.35.]

GSM Encoder

• Throughput required: 1 frame/10 ms
• With 2 processors and 4 possible frequency and voltage settings
• Task graph:
  • 10 computational tasks
  • 15 communication tasks
• Without optimizations: 50.9 μJ
• With optimizations: 17.1 μJ (-66.4%)
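As a quick check, the quoted saving follows directly from the two energy figures:

$$\frac{50.9 - 17.1}{50.9} = \frac{33.8}{50.9} \approx 0.664 = 66.4\%$$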


Summary & future work

Energy-optimal task mapping:
• Strong optimization engine (complete)
• Programmer support (design and execution time)
• Validation: accuracy and optimality

Future work:
• Conditional task graphs
• Dealing with multiple use cases
• Variable execution times
• Aggressive communication scheduling