OSCAR Parallelizing and Power Reducing Compiler for Multicores
Hironori Kasahara, Professor, Dept. of Computer Science & Engineering,
Director, Advanced Multicore Processor Research Institute, Waseda University (早稲田大学), Tokyo, Japan
IEEE Computer Society President-Elect 2017, President 2018. URL: http://www.kasahara.cs.waseda.ac.jp/
[Die photo of the RP-2 8-core chip: Core#0–Core#7 (each with I$/D$ and ILRAM/DLRAM/URAM), snoop controllers SNC0/SNC1, DBSC, DDR pads, CPG, CSM, local bus, SC, SHWY, VSWC]
Multicores for Performance and Low Power
IEEE ISSCC08: Paper No. 4.5, M. Ito, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”
Power ∝ Frequency × Voltage², and Voltage ∝ Frequency, so Power ∝ Frequency³.
If frequency is reduced to 1/4 (e.g., 4 GHz → 1 GHz), power is reduced to 1/64, while performance falls only to 1/4.
<Multicores> If 8 cores are integrated on a chip, power is still 1/8 and performance becomes 2 times.
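Restating the slide's arithmetic as a worked equation (added here for clarity, in LaTeX):

\[ P \propto f V^2, \quad V \propto f \;\Rightarrow\; P \propto f^3 \]
\[ f \to \tfrac{f}{4}: \quad P \to \tfrac{P}{64}, \quad \text{performance} \to \tfrac{1}{4} \]
\[ \text{8 cores at } \tfrac{f}{4}: \quad P_{\text{chip}} = 8 \cdot \tfrac{P}{64} = \tfrac{P}{8}, \quad \text{performance} = 8 \cdot \tfrac{1}{4} = 2\times \]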
Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the “K” computer draws more than 10 MW).
Earthquake Simulation “GMS” on the Fujitsu M9000 SPARC CC-NUMA Server
With 128 cores, the OSCAR compiler gave us a 100 times speedup against 1-core execution, and a 211 times speedup against 1-core execution using the Sun (Oracle) Studio compiler.
Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler

[Chart: MPEG2 decoding with 8 CPU cores (cores 0–7). Without power control (voltage 1.4 V): average power 5.73 W. With power control (frequency control and resume standby: power shutdown & voltage lowering 1.4 V–1.0 V): average power 1.52 W, a 73.5% power reduction.]
Demo of NEDO Multicore for Real Time Consumer Electronics at the Council of Science and Engineering Policy on April 10, 2008
CSTP Members: Prime Minister Mr. Y. FUKUDA; Minister of State for Science, Technology and Innovation Policy Mr. F. KISHIDA; Chief Cabinet Secretary Mr. N. MACHIMURA; Minister of Internal Affairs and Communications Mr. H. MASUDA; Minister of Finance Mr. F. NUKAGA; Minister of Education, Culture, Sports, Science and Technology Mr. K. TOKAI; Minister of Economy, Trade and Industry Mr. A. AMARI
<R&D Target> Hardware, software and applications for super-low-power manycore processors; more than 64 cores; natural air cooling (no fan); cool, compact, clear, quiet; operational by solar panel
<Industry, Government, Academia> Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, etc.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products (consumer electronics, automobiles, servers)
Green Computing Systems R&D Center, Waseda University
Supported by METI (completed March 2011)
Beside the subway Waseda Station, near the Waseda Univ. main campus
[Slide: application targets of the OSCAR multicore and OSCAR many-core chip, via industry-government-academia collaboration in R&D aimed at practical applications:
green supercomputers (Hitachi SR16000: Power7 128-core SMP; Fujitsu M9000: SPARC VII 256-core SMP); solar-powered smartphones; cameras; robots; cool desktop servers; multicore engine ECUs, ADAS (driver assistance), self-driving, HV, EV, FCV; consumer electronics such as Internet TV/DVD; solar-powered, fan-less, cool, quiet servers, operated and recharged by solar cells.]
Waseda University: R&D of many-core system technologies with ultra-low power consumption: compiler, API, OSCAR technology, vector accelerators.
Protect lives: disaster-survival supercomputers (earthquakes, tsunami, fire spreading); medical servers with the National Institute of Radiological Sciences (heavy-particle radiation planning, cerebral infarction); capsule inner cameras.
Protect the environment, for smart life: camcorders, intelligent home appliances, supercomputers and servers, with industry.
OSCAR Parallelizing Compiler
To improve effective performance, cost-performance and software productivity, and to reduce power:
• Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism
• Data localization: automatic data management for distributed shared memory, cache and local memory
• Data transfer overlapping: data transfer overlapping using data transfer controllers (DMAs)
• Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support
Generation of Coarse Grain Tasks

Macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine

[Figure: a program is decomposed hierarchically into BPA/RB/SB macro-tasks. BPAs get near-fine-grain parallelization; RBs get loop-level parallelization or near-fine-grain parallelization of the loop body; SBs are recursively decomposed by coarse-grain parallelization, giving a 1st layer, 2nd layer, 3rd layer, and so on across the total system.]

A hedged C illustration of these task kinds follows.
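Here is a made-up C fragment (not OSCAR input or output) with comments marking what each region would become:

#include <stddef.h>

/* SB: a called subroutine becomes a Subroutine Block macro-task; its
   body can itself be decomposed into macro-tasks at the next layer. */
static void sb_smooth(double *x, size_t n) {
    for (size_t i = 1; i < n; i++)        /* RB inside the SB (2nd layer) */
        x[i] = 0.5 * (x[i] + x[i - 1]);
}

void example(double *a, const double *b, size_t n) {
    double scale = 2.0;                   /* BPA: a run of straight-line  */
    double offset = 1.0;                  /* assignments (a basic block)  */

    for (size_t i = 0; i < n; i++)        /* RB: a natural loop; loop-    */
        a[i] = scale * b[i] + offset;     /* level parallelism applies    */

    sb_smooth(a, n);                      /* SB: subroutine block         */
}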
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

[Figure: a program's original control flow, its Macro Flow Graph (MFG), and the resulting Macro Task Graph (MTG). Nodes are BPAs (blocks of pseudo-assignment statements) and RBs (repetition blocks), e.g., MT1–MT15. In the MFG, edges are data dependencies and control flow, with markers for conditional branches; in the MTG, control flow is replaced by data-dependence and extended control-dependence edges, combined with AND/OR conditions at conditional branches.]

A hedged C sketch of what this analysis reasons about follows.
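The task numbering in this sketch is hypothetical, not taken from the graphs above:

double prepare(double *x);              /* hypothetical task bodies */
void heavy_a(double *x, double t);
void heavy_b(double *x, double t);
void consume(double *x);

void example(int cond, double *x) {
    double t = prepare(x);   /* MT1: a BPA ending in a conditional branch */
    if (cond)
        heavy_a(x, t);       /* MT2: control-dependent on MT1; it may be
                                dispatched as soon as MT1's branch
                                direction is determined                   */
    else
        heavy_b(x, t);       /* MT3: the other branch arm                 */
    consume(x);              /* MT4: its earliest executable condition is
                                an OR of AND terms: (branch selected MT2
                                AND MT2 finished) OR (branch selected MT3
                                AND MT3 finished)                          */
}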
Automatic Processor Assignment in 103.su2cor
• Using 14 processors: coarse-grain parallelization within DO400.

[Figure: MTG of su2cor LOOPS-DO400, with DOALL, sequential loop, BB and SB nodes. Coarse-grain parallelism PARA_ALD = 4.3]
Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and Seq) into CARs and LRs, considering inter-loop data dependence.
– Most data in an LR can be passed through local memory (LM).
– LR: Localizable Region, CAR: Commonly Accessed Region

C     RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C     RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C     RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

[Figure: aligned decomposition of the three loops:
RB1: LR I=1–33, CAR I=34–35, LR I=36–66, CAR I=67–68, LR I=69–101
RB2: LR I=1–33, CAR I=34, LR I=35–66, CAR I=67, LR I=68–100
RB3: LR I=2–34, LR I=35–67, LR I=68–100]

A hedged C rendering of one decomposed slice follows.
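Boundary elements computed in the CARs are assumed to be exchanged through shared memory:

/* PE0's slices: RB1 I=1..33, RB2 I=1..33, RB3 I=2..34. Arrays use
   1-based indices as in the Fortran source (element 0 of B is assumed
   initialized, as in the original loop). */
void pe0_localizable_region(double A[102], double B[101], double C[101]) {
    for (int i = 1; i <= 33; i++)      /* RB1 (Doall) slice */
        A[i] = 2.0 * i;
    /* A[34] and A[35] are RB1's CAR, produced and shared separately */
    for (int i = 1; i <= 33; i++)      /* RB2 (Doseq) slice; at i=33 it
                                          reads A[34] from the CAR      */
        B[i] = B[i - 1] + A[i] + A[i + 1];
    for (int i = 2; i <= 34; i++)      /* RB3 (Doall) slice; at i=34 it
                                          reads B[34] from RB2's CAR    */
        C[i] = B[i] + B[i - 1];
}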
Data Localization

[Figure: the MTG before division (macro-tasks 1–15) and the MTG after loop division (macro-tasks 1–33), with the divided macro-tasks grouped into data-localization groups dlg0–dlg3, and a schedule of the divided macro-tasks on two processors (PE0, PE1).]
An Example of Data Localization for SPEC95 Swim

(a) An example of a target loop group for data localization:

      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1   TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2   +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1   *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2   -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1   -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE
      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE
      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

(b) [Figure: image of the alignment of the arrays (U, V, P, UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H) on the 4 MB cache as accessed by the target loops.]
Cache line conflicts occur among arrays that share the same location in the cache.
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 Swim:

before padding:
      PARAMETER (N1=513, N2=513)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

after padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: the box shows the access range of DLG0 within the 4 MB cache, before and after padding.]

A hedged C analogue of the padding idea follows.
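The constants mirror the Fortran above; the modulo check is purely illustrative:

#include <stdio.h>

#define N1 513
#define N2 544   /* padded outer dimension; 513 before padding */

/* A struct guarantees consecutive layout, like the Fortran COMMON block.
   Padding N2 changes each array's starting address modulo the 4 MB cache,
   staggering the arrays so their lines no longer collide. */
static struct {
    double U[N2][N1], V[N2][N1], P[N2][N1]; /* ...remaining arrays omitted */
} blk;

int main(void) {
    size_t cache_bytes = (size_t)4 << 20;   /* 4 MB cache */
    size_t dist = (size_t)((char *)blk.V - (char *)blk.U);
    printf("per-array %zu bytes; U-V start distance mod 4MB: %zu\n",
           sizeof blk.U, dist % cache_bytes);
    return 0;
}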
Multicore Program Development Using OSCAR API V2.0
(OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface)

A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) goes into the Waseda OSCAR parallelizing compiler, which performs coarse-grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating. Manual parallelization and power reduction are also possible.

The compiler outputs a parallelized Fortran or C program annotated with the OSCAR API: directives for thread generation (e.g., code with directives for Proc0/Thread 0 and Proc1/Thread 1), memory placement, data transfer using DMA, and power management, targeting homogeneous and/or heterogeneous multicores and manycores.

From this single program, parallel machine codes are generated using existing sequential compilers, so the result is executable on various multicores:
• Low-power homogeneous multicore code generation: API analyzer (available from Waseda) + existing sequential compiler, for homogeneous multicores from Vendor A (SMP servers)
• Low-power heterogeneous multicore code generation: API analyzer + existing sequential compiler, for heterogeneous multicores from Vendor B, together with accelerator 1 and accelerator 2 code
• Server code generation: OpenMP compiler, for shared-memory servers
• For accelerators, the accelerator compiler or the user adds “hint” directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks

Participants: Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, and 3 universities.
Parallel Execution
• Start of parallel execution
– #pragma omp parallel sections (C)
– !$omp parallel sections (Fortran)
• Specifying a critical section
– #pragma omp critical (C)
– !$omp critical (Fortran)
• Enforcing an order on memory operations
– #pragma omp flush (C)
– !$omp flush (Fortran)
• These are from OpenMP.
Thread Execution Model

#pragma omp parallel sections
{
  #pragma omp section
  main_vpc0();
  #pragma omp section
  main_vpc1();
  #pragma omp section
  main_vpc2();
  #pragma omp section
  main_vpc3();
}

Parallel execution: main_vpc0() through main_vpc3() run in parallel, with a thread and a core bound one-to-one.
VPC: Virtual Processor Core
Memory Mapping
• Placing variables on an on-chip centralized shared memory (on-chip CSM)
– #pragma oscar onchipshared (C)
– !$oscar onchipshared (Fortran)
• Placing variables on a local data memory (LDM)
– #pragma omp threadprivate (C)
– !$omp threadprivate (Fortran)
– This directive is an extension to OpenMP
• Placing variables on a distributed shared memory (DSM)
– #pragma oscar distributedshared (C)
– !$oscar distributedshared (Fortran)
A hedged usage sketch follows.
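Only threadprivate is standard OpenMP; the parenthesized variable-list operand for the oscar pragmas is an assumption made here for illustration:

int lut[1024];
#pragma oscar onchipshared (lut)             /* on-chip CSM; operand form
                                                assumed                   */
int work_buf[256];
#pragma omp threadprivate (work_buf)         /* local data memory (OSCAR
                                                reinterprets the OpenMP
                                                directive)                */
int sync_flags[8];
#pragma oscar distributedshared (sync_flags) /* DSM; operand form assumed */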
Data Transfer
• Specifying data transfer lists
– #pragma oscar dma_transfer (C)
– !$oscar dma_transfer (Fortran)
– Containing the following parameter directives
• Specifying a contiguous data transfer
– #pragma oscar dma_contiguous_parameter (C)
– !$oscar dma_contiguous_parameter (Fortran)
• Specifying a stride data transfer
– #pragma oscar dma_stride_parameter
– !$oscar dma_stride_parameter
– This can be used for scatter/gather data transfers
• Data transfer synchronization
– #pragma oscar dma_flag_check
– !$oscar dma_flag_check
A hedged usage sketch follows.
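Only the directive names come from the list above; the operand forms and buffer names (ldm_buf, csm_src) are assumptions, sketching how a DMA transfer could be overlapped with computation:

void stage(float *ldm_buf, const float *csm_src, int n) {
    /* start a transfer described by the parameter directive below */
    #pragma oscar dma_transfer
    /* contiguous burst of n elements from csm_src to ldm_buf
       (operand form assumed) */
    #pragma oscar dma_contiguous_parameter (ldm_buf, csm_src, n)

    /* ...compute on previously fetched data while the DMA runs... */

    /* block until the transfer's completion flag is set */
    #pragma oscar dma_flag_check
}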
Power Control
• Putting a module into a specified frequency/voltage state
– #pragma oscar fvcontrol (C)
– !$oscar fvcontrol (Fortran)
– State examples:
  100: max frequency
  50: half frequency
  0: clock off
  -1: power off
• Getting the frequency/voltage state of a module
– #pragma oscar get_fvstatus (C)
– !$oscar get_fvstatus (Fortran)
[Figure: timeline of one core stepping through fvcontrol(100) → fvcontrol(-1) → fvcontrol(50) → fvcontrol(100)]
A hedged usage sketch follows.
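The (OSCAR_CPU(), state) operand form follows the compiler-generated code image shown later in this deck; the surrounding function is illustrative:

void low_rate_phase(void) {
    #pragma oscar fvcontrol ((OSCAR_CPU(), 50))   /* this core to half
                                                     frequency            */
    /* ...deadline-insensitive work... */
    #pragma oscar fvcontrol ((OSCAR_CPU(), 100))  /* restore max frequency */
}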
National Institute of Radiological Sciences (NIRS): Cancer Treatment with Carbon Ion Radiotherapy
8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000).
55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000).
(The previous best was a 2.5 times speedup on 16 processors with hand optimization.)
Renesas-Hitachi-Waseda Low-Power 8-Core RP2, Developed in 2007 in a METI/NEDO Project
Process technology: 90 nm, 8-layer, triple-Vth CMOS
Chip size: 104.8 mm² (10.61 mm × 9.88 mm)
CPU core size: 6.6 mm² (3.36 mm × 1.96 mm)
Supply voltage: 1.0 V–1.4 V (internal), 1.8/3.3 V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC08: Paper No. 4.5, M. Ito, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”
8-Core RP2 Chip Block Diagram

[Block diagram: two clusters on the on-chip system bus (SuperHyway). Cluster #0 holds Core#0–Core#3 with snoop controller 0, LCPG0 and PCR0–PCR3; Cluster #1 holds Core#4–Core#7 with snoop controller 1, LCPG1 and PCR4–PCR7. Each core contains a CPU, FPU, 16 KB I-cache, 16 KB D-cache, local memory (I: 8 KB, D: 32 KB), 64 KB user RAM (URAM) and CCN/BAR; barrier synchronization lines connect the cores. The bus also attaches SRAM control, DMA control and DDR2 control (DBSC).
LCPG: local clock pulse generator. PCR: power control register. CCN/BAR: cache controller / barrier register. URAM: user RAM (distributed shared memory).]
Engine Control by Multicore with Denso
Hard real-time automobile engine control by multicore.
Though parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor (1 core vs. 2 cores).
Macro Task Fusion for Static Task Scheduling

[Figures: the MFG of a sample program before macro task fusion; the MFG after macro task fusion; and the MTG after macro task fusion. Legend: data dependency, control flow, conditional branch. Conditional branches and their succeeding tasks are fused into fused blocks, so the resulting MTG contains only data dependencies.]

A hedged C sketch of the idea follows.
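The task bodies here are made up; the point is that the branch and its successor tasks become one fused macro-task:

double do_a(const double *x);   /* hypothetical successor task bodies */
double do_b(const double *x);

/* One fused macro-task: the conditional branch is hidden inside, so the
   edges into and out of this block in the MTG are plain data
   dependencies that a static scheduler can handle. */
double fused_block(int cond, const double *x) {
    return cond ? do_a(x) : do_b(x);
}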
Evaluation Environment: Embedded Multicore Processor RPX (SH-4A 648 MHz × 8)
As a first step, we use just two SH-4A cores, because the target dual-core processors for next-generation automobiles are currently under design.
Evaluation of a Crankshaft Program with Multicore Processors
We attained a 1.54 times speedup on RPX. The program has no loops, only many conditional branches and small basic blocks, so it is difficult to parallelize. This result shows the potential of multicore processors for engine control programs.
[Chart: 1 core: speedup 1.00, execution time 0.57 µs; 2 cores: speedup 1.54, execution time 0.37 µs]
OSCAR Compile Flow for Simulink Applications
A Simulink model is converted to C code using Embedded Coder; the OSCAR compiler then:
(1) generates the MTG → parallelism
(2) generates a Gantt chart → scheduling on a multicore
(3) generates parallelized C code using the OSCAR API → multiplatform execution (Intel, ARM, SH, etc.)
Speedups of MATLAB/Simulink Image Processing on Various 4-Core Multicores (Intel Xeon, ARM Cortex-A15 and Renesas SH4A)
Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706‐buoy‐detection‐using‐simulink
Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114‐fast‐edges‐of‐a‐color‐image‐‐actual‐color‐‐not‐converting‐to‐grayscale‐/
Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990‐retinal‐blood‐vessel‐extraction/
OSCAR Compiler Accelerates Various Multicores' Performance Several Times, Including Intel and IBM

[Chart: speedup ratios of the Intel compiler Ver. 10.1 vs. OSCAR on SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi)]

2.1 times speedup on an Intel 4-core multicore (quad-core Xeon), compared with the Intel compiler.
3.3 times speedup on an IBM P6 595 server, compared with the IBM compiler (IBM Power servers such as Power 4, 5, 6 (4.2 GHz), 7 and later).
Performance of the OSCAR Compiler on an Intel Core i7 Notebook PC
• The OSCAR compiler accelerates code about 2.0 times on average over the Intel compiler.

[Chart: speedup ratios, Intel compiler Ver. 14.0 vs. OSCAR: SPEC95 su2cor 1.00 vs. 1.70; SPEC95 hydro2d 1.18 vs. 2.24; SPEC95 mgrid 2.70 vs. 4.12; SPEC95 turb3d 1.00 vs. 2.91; AAC Encoder 1.00 vs. 2.53]

CPU: Intel Core i7 3720QM (quad-core); MEM: 32 GB DDR3-SODIMM PC3-12800; OS: Ubuntu 12.04 LTS
Parallelization of the 2D Rendering Engine SKIA on 3 Cores of a Google Nexus 7

[Charts: frames per second for normal 1-core execution vs. parallelized 3-core execution. DrawRect: 22.82 FPS (1 core) to 43.57 FPS (3 cores), a 1.91 times speedup; DrawImage: 27.16 FPS (1 core) to 52.88 FPS (3 cores), a 1.95 times speedup.]

On the Nexus 7, 3-core parallelization gave us a 1.91 times speedup for DrawRect and a 1.95 times speedup for DrawImage.
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
110 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7-based 128-core Linux SMP)
Automatic Parallelization of Still Image Encoding Using JPEG-XR for Next-Generation Cameras and the Drinkable Inner Camera (Waseda U. & Olympus)

[Chart: speedups on the TILEPro64 manycore: 1.00, 1.96, 3.95, 7.86, 15.82, 30.79 and 55.11 for 1, 2, 4, 8, 16, 32 and 64 cores; execution time 10.0 s on 1 core down to 0.18 s on 64 cores]

55 times speedup with 64 cores against 1 core.
Parallel Processing of Face Detection on Manycore, High-End and PC Servers

The OSCAR compiler gives us an 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

[Chart: speedup ratio vs. number of cores (1, 2, 4, 8, 16) for automatic parallelization of face detection:
TILEPro64 (gcc): 1.00, 1.72, 3.01, 5.74, 9.30
SR16000 (Power7, 8 cores × 4 CPUs × 4 nodes, xlc): 1.00, 1.93, 3.57, 6.46, 11.55
rs440 (Intel Xeon, 8 cores × 4 CPUs, icc): 1.00, 1.93, 3.67, 6.46, 10.92]
OSCAR Heterogeneous Multicore
• DTU: Data Transfer Unit
• LPM: Local Program Memory
• LDM: Local Data Memory
• DSM: Distributed Shared Memory
• CSM: Centralized Shared Memory
• FVR: Frequency/Voltage Control Register
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Figure: Gantt chart of macro-tasks over time]
33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

[Chart: speedups against a single SH processor: 1SH 1.00; 2SH 2.29; 4SH 3.09; 8SH 5.4; 2SH+1FE 18.85; 4SH+2FE 26.71; 8SH+4FE 32.65 (3.4 fps on 1SH up to 111 fps on 8SH+4FE)]
Power Reduction in Real-Time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
Without power reduction: average 1.76 W. With power reduction by the OSCAR compiler: average 0.54 W, a 70% power reduction. (1 cycle: 33 ms, i.e., 30 fps)
Low-Power Optimization with the OSCAR API

Scheduled result by the OSCAR compiler: VC0 executes MT1 then MT3; VC1 executes MT2, sleeps, then executes MT4.

Code image generated by the OSCAR compiler:

void
main_VC0() {
  MT1
  #pragma oscar fvcontrol \
  (1,(OSCAR_CPU(),100))
  MT3
}

void
main_VC1() {
  MT2
  #pragma oscar fvcontrol \
  ((OSCAR_CPU(),0))      /* Sleep */
  MT4
}
Automatic Power Reduction for MPEG2 Decoding on an Android Multicore (ODROID-X2, ARM Cortex-A9, 4 cores)
• On 3 cores, automatic power-reduction control reduced power to 1/7 compared with execution without power-reduction control.
• 3 cores with compiler power-reduction control reduced power to 1/3 of ordinary 1-core execution.

[Chart: power consumption (W) vs. number of processor cores.
Without power reduction: 0.97 W (1 core), 1.88 W (2 cores), 2.79 W (3 cores).
With power reduction: 0.63 W (1 core, 2/3 or -35.0%), 0.46 W (2 cores, 1/4 or -75.5%), 0.37 W (3 cores, 1/7 or -86.7%; 1/3 or -61.9% of ordinary 1-core execution).]

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
Power Reduction on Intel Haswell for Real-Time Optical Flow (Intel Core i7 4770K)
For HD 720p (1280×720) moving pictures at 15 fps (deadline 66.6 ms/frame):
Power was reduced to 1/4 (9.60 W) by the compiler power optimization on the same 3 cores (41.58 W without power control). Power with 3 cores was also reduced to 1/3 (9.60 W) against 1 core (29.29 W).

[Chart: average power consumption (W) vs. number of PEs.
Without power control: 29.29 W (1PE), 36.59 W (2PE), 41.58 W (3PE).
With power control: 24.17 W (1PE), 12.21 W (2PE), 9.60 W (3PE).]
Performance of OSCAR Compiler Software Coherence Control

[Chart: speedup against sequential processing vs. number of processor cores (1, 2, 4, 8) on RP2, comparing SMP (hardware coherent cache, up to 4 cores) with the non-coherent cache using software coherence:
AAC Encoder: SMP 1.00, 1.89, 3.54; non-coherent 1.02, 1.92, 3.59, 5.90
MPEG2 Decoder: SMP 1.00, 1.62, 2.54; non-coherent 1.01, 1.61, 2.45, 3.36
MPEG2 Encoder: SMP 1.00, 1.85, 3.34; non-coherent 1.02, 2.10, 3.90, 6.63]

Software coherence gives faster or equal processing performance up to 4 cores compared with the hardware coherence mechanism on RP2, and gives correct execution without any hardware coherence mechanism on 8 cores. A hedged sketch of the idea follows.
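This sketches what compiler-controlled software coherence inserts on a non-coherent-cache machine; the cache-operation intrinsics, barrier and task bodies are hypothetical, not the OSCAR API:

#include <stddef.h>

extern void cache_writeback(const void *p, size_t bytes);  /* hypothetical */
extern void cache_invalidate(const void *p, size_t bytes); /* hypothetical */
extern void barrier_wait(void);                            /* hypothetical */
extern double produce(int i);
extern void use(const double *p, size_t n);

static double shared_buf[1024];

void producer_task(void) {               /* e.g., on core 0 */
    for (int i = 0; i < 1024; i++)
        shared_buf[i] = produce(i);
    cache_writeback(shared_buf, sizeof shared_buf);  /* push dirty lines
                                                        to memory before
                                                        synchronizing    */
    barrier_wait();
}

void consumer_task(void) {               /* e.g., on core 1 */
    barrier_wait();
    cache_invalidate(shared_buf, sizeof shared_buf); /* drop possibly
                                                        stale cached
                                                        copies           */
    use(shared_buf, 1024);
}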
OSCAR Technology, started up on Feb. 28, 2013: licensing all the patents and the OSCAR compiler from Waseda Univ.
CEO: Dr. T. Ono (ex-CEO of a First-Section-listed company, VP of a national univ., invited Prof. of Waseda U.)
Executives: Mr. T. Ito (visiting Prof., Tokyo Agricult. and Eng. U.); Prof. K. Shirai (ex-President of Waseda U., Chairman of the Japanese Open Univ.)
CTO: Mr. M. Takamura (ex-Fellow of Fujitsu Lab.; Fujitsu VPP500, VPP5000 & NWT development leader)
Mr. K. Ashida (ex-VP of Sumitomo Trading; Ashida Consulting CEO; a leader of the business world)
Auditor: Dr. S. Matsuda (Prof. Emeritus of Waseda U.; ex-President of the Ventures and Entrepreneurs Society)
Advisors: Dr. T. Sato (patent attorney; ex-President of the Patent Attorneys Association; gov. IP committee); Ms. K. Ishiguro (lawyer, Supreme Court trainer); Mr. A. Fukuda (leader of the alumni association); Prof. K. Kimura (Waseda Univ.); Prof. H. Kasahara (Waseda Univ.)
[Photo: Fujitsu VPP5000]
OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology

Target: solar-powered operation with compiler power reduction; fully automatic parallelization and vectorization, including local memory management and data transfer.

[Diagram: a multicore chip (×4 chips) in which each core holds a CPU, a vector unit, a data transfer unit, a local memory, a distributed shared memory and a power control unit, connected by a compiler-co-designed interconnection network to on-chip shared memory and a centralized shared memory.]

[Photo: Fujitsu VPP500/NWT: PE, unit, cabinet, cabinet (open). Copyright 2008 FUJITSU LIMITED]
Summary
Waseda University's Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance green multicore hardware, software and applications with government and industry, including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus and OSCAR Technology.
The OSCAR automatic parallelizing and power-reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications including earthquake wave propagation, medical applications including cancer treatment using carbon ions and the drinkable inner camera, and industry applications including automobile engine control, smartphones and wireless-communication baseband processing, on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas and Fujitsu.
In automatic parallelization: a 110 times speedup for the earthquake wave propagation simulation on 128 cores of IBM Power7 against 1 core; a 55 times speedup for carbon-ion radiotherapy cancer treatment on 64 cores of IBM Power7; 1.95 times for automobile engine control on Renesas 2 cores using SH4A or V850; and 55 times for JPEG-XR encoding for capsule inner cameras on the Tilera 64-core TILEPro64 manycore.
The compiler will be available on the market from OSCAR Technology.
In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex-A9 and Intel Haswell, and to 1/4 using 8 cores of Renesas SH4A, against ordinary single-core execution.