OSCAR Parallelizing and Power Reducing Compiler for Multicores
Hironori Kasahara, Professor, Dept. of Computer Science & Engineering,
Director, Advanced Multicore Processor Research Institute, Waseda University (早稲田大学), Tokyo, Japan
IEEE Computer Society President-Elect 2017, President 2018. URL: http://www.kasahara.cs.waseda.ac.jp/
[Die photo of the RP-2 8-core chip: Core#0–Core#7 (each with I$/D$ and ILRAM/DLRAM/URAM), snoop controllers SNC0/SNC1, DBSC, DDR pads, CPG, CSM, local bus, SC, SHWY, VSWC]
Multicores for Performance and Low Power
IEEE ISSCC08: Paper No. 4.5, M. Ito, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”
Power ∝ Frequency × Voltage², and Voltage ∝ Frequency, so Power ∝ Frequency³.
If frequency is reduced to 1/4 (e.g., 4 GHz → 1 GHz), power is reduced to 1/64, while performance falls only to 1/4.
<Multicores> If 8 cores are integrated on a chip, power is still 1/8 and performance becomes 2 times.
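Restating the slide's arithmetic as a worked equation (added here for clarity, in LaTeX):

\[ P \propto f V^2, \quad V \propto f \;\Rightarrow\; P \propto f^3 \]
\[ f \to \tfrac{f}{4}: \quad P \to \tfrac{P}{64}, \quad \text{performance} \to \tfrac{1}{4} \]
\[ \text{8 cores at } \tfrac{f}{4}: \quad P_{\text{chip}} = 8 \cdot \tfrac{P}{64} = \tfrac{P}{8}, \quad \text{performance} = 8 \cdot \tfrac{1}{4} = 2\times \]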
Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the “K” computer draws more than 10 MW).
Earthquake Simulation “GMS” on the Fujitsu M9000 SPARC CC-NUMA Server
With 128 cores, the OSCAR compiler gave us a 100 times speedup against 1-core execution, and a 211 times speedup against 1-core execution using the Sun (Oracle) Studio compiler.
Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler

[Chart: MPEG2 decoding with 8 CPU cores (cores 0–7). Without power control (voltage 1.4 V): average power 5.73 W. With power control (frequency control and resume standby: power shutdown & voltage lowering 1.4 V–1.0 V): average power 1.52 W, a 73.5% power reduction.]
Demo of NEDO Multicore for Real Time Consumer Electronics at the Council of Science and Engineering Policy on April 10, 2008
CSTP Members: Prime Minister Mr. Y. FUKUDA; Minister of State for Science, Technology and Innovation Policy Mr. F. KISHIDA; Chief Cabinet Secretary Mr. N. MACHIMURA; Minister of Internal Affairs and Communications Mr. H. MASUDA; Minister of Finance Mr. F. NUKAGA; Minister of Education, Culture, Sports, Science and Technology Mr. K. TOKAI; Minister of Economy, Trade and Industry Mr. A. AMARI
<R&D Target> Hardware, software and applications for super-low-power manycore processors; more than 64 cores; natural air cooling (no fan); cool, compact, clear, quiet; operational by solar panel
<Industry, Government, Academia> Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, etc.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products (consumer electronics, automobiles, servers)
Green Computing Systems R&D Center, Waseda University
Supported by METI (completed March 2011)
Beside the subway Waseda Station, near the Waseda Univ. main campus
[Slide: application targets of the OSCAR multicore and OSCAR many-core chip, via industry-government-academia collaboration in R&D aimed at practical applications:
green supercomputers (Hitachi SR16000: Power7 128-core SMP; Fujitsu M9000: SPARC VII 256-core SMP); solar-powered smartphones; cameras; robots; cool desktop servers; multicore engine ECUs, ADAS (driver assistance), self-driving, HV, EV, FCV; consumer electronics such as Internet TV/DVD; solar-powered, fan-less, cool, quiet servers, operated and recharged by solar cells.]
Waseda University: R&D of many-core system technologies with ultra-low power consumption: compiler, API, OSCAR technology, vector accelerators.
Protect lives: disaster-survival supercomputers (earthquakes, tsunami, fire spreading); medical servers with the National Institute of Radiological Sciences (heavy-particle radiation planning, cerebral infarction); capsule inner cameras.
Protect the environment, for smart life: camcorders, intelligent home appliances, supercomputers and servers, with industry.
OSCAR Parallelizing Compiler
To improve effective performance, cost-performance and software productivity, and to reduce power:
• Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism
• Data localization: automatic data management for distributed shared memory, cache and local memory
• Data transfer overlapping: data transfer overlapping using data transfer controllers (DMAs)
• Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support
Generation of Coarse Grain Tasks

Macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine

[Figure: a program is decomposed hierarchically into BPA/RB/SB macro-tasks. BPAs get near-fine-grain parallelization; RBs get loop-level parallelization or near-fine-grain parallelization of the loop body; SBs are recursively decomposed by coarse-grain parallelization, giving a 1st layer, 2nd layer, 3rd layer, and so on across the total system.]

A hedged C illustration of these task kinds follows.
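Here is a made-up C fragment (not OSCAR input or output) with comments marking what each region would become:

#include <stddef.h>

/* SB: a called subroutine becomes a Subroutine Block macro-task; its
   body can itself be decomposed into macro-tasks at the next layer. */
static void sb_smooth(double *x, size_t n) {
    for (size_t i = 1; i < n; i++)        /* RB inside the SB (2nd layer) */
        x[i] = 0.5 * (x[i] + x[i - 1]);
}

void example(double *a, const double *b, size_t n) {
    double scale = 2.0;                   /* BPA: a run of straight-line  */
    double offset = 1.0;                  /* assignments (a basic block)  */

    for (size_t i = 0; i < n; i++)        /* RB: a natural loop; loop-    */
        a[i] = scale * b[i] + offset;     /* level parallelism applies    */

    sb_smooth(a, n);                      /* SB: subroutine block         */
}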
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

[Figure: a program's original control flow, its Macro Flow Graph (MFG), and the resulting Macro Task Graph (MTG). Nodes are BPAs (blocks of pseudo-assignment statements) and RBs (repetition blocks), e.g., MT1–MT15. In the MFG, edges are data dependencies and control flow, with markers for conditional branches; in the MTG, control flow is replaced by data-dependence and extended control-dependence edges, combined with AND/OR conditions at conditional branches.]

A hedged C sketch of what this analysis reasons about follows.
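The task numbering in this sketch is hypothetical, not taken from the graphs above:

double prepare(double *x);              /* hypothetical task bodies */
void heavy_a(double *x, double t);
void heavy_b(double *x, double t);
void consume(double *x);

void example(int cond, double *x) {
    double t = prepare(x);   /* MT1: a BPA ending in a conditional branch */
    if (cond)
        heavy_a(x, t);       /* MT2: control-dependent on MT1; it may be
                                dispatched as soon as MT1's branch
                                direction is determined                   */
    else
        heavy_b(x, t);       /* MT3: the other branch arm                 */
    consume(x);              /* MT4: its earliest executable condition is
                                an OR of AND terms: (branch selected MT2
                                AND MT2 finished) OR (branch selected MT3
                                AND MT3 finished)                          */
}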
Automatic Processor Assignment in 103.su2cor
• Using 14 processors: coarse-grain parallelization within DO400.

[Figure: MTG of su2cor LOOPS-DO400, with DOALL, sequential loop, BB and SB nodes. Coarse-grain parallelism PARA_ALD = 4.3]
Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and Seq) into CARs and LRs, considering inter-loop data dependence.
– Most data in an LR can be passed through local memory (LM).
– LR: Localizable Region, CAR: Commonly Accessed Region

C     RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C     RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C     RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

[Figure: aligned decomposition of the three loops:
RB1: LR I=1–33, CAR I=34–35, LR I=36–66, CAR I=67–68, LR I=69–101
RB2: LR I=1–33, CAR I=34, LR I=35–66, CAR I=67, LR I=68–100
RB3: LR I=2–34, LR I=35–67, LR I=68–100]

A hedged C rendering of one decomposed slice follows.
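Boundary elements computed in the CARs are assumed to be exchanged through shared memory:

/* PE0's slices: RB1 I=1..33, RB2 I=1..33, RB3 I=2..34. Arrays use
   1-based indices as in the Fortran source (element 0 of B is assumed
   initialized, as in the original loop). */
void pe0_localizable_region(double A[102], double B[101], double C[101]) {
    for (int i = 1; i <= 33; i++)      /* RB1 (Doall) slice */
        A[i] = 2.0 * i;
    /* A[34] and A[35] are RB1's CAR, produced and shared separately */
    for (int i = 1; i <= 33; i++)      /* RB2 (Doseq) slice; at i=33 it
                                          reads A[34] from the CAR      */
        B[i] = B[i - 1] + A[i] + A[i + 1];
    for (int i = 2; i <= 34; i++)      /* RB3 (Doall) slice; at i=34 it
                                          reads B[34] from RB2's CAR    */
        C[i] = B[i] + B[i - 1];
}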
Data Localization

[Figure: the MTG before division (macro-tasks 1–15) and the MTG after loop division (macro-tasks 1–33), with the divided macro-tasks grouped into data-localization groups dlg0–dlg3, and a schedule of the divided macro-tasks on two processors (PE0, PE1).]
An Example of Data Localization for SPEC95 Swim

(a) An example of a target loop group for data localization:

      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1   TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2   +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1   *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2   -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1   -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE
      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE
      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

(b) [Figure: image of the alignment of the arrays (U, V, P, UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H) on the 4 MB cache as accessed by the target loops.]
Cache line conflicts occur among arrays that share the same location in the cache.
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 Swim:

before padding:
      PARAMETER (N1=513, N2=513)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

after padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: the box shows the access range of DLG0 within the 4 MB cache, before and after padding.]

A hedged C analogue of the padding idea follows.
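The constants mirror the Fortran above; the modulo check is purely illustrative:

#include <stdio.h>

#define N1 513
#define N2 544   /* padded outer dimension; 513 before padding */

/* A struct guarantees consecutive layout, like the Fortran COMMON block.
   Padding N2 changes each array's starting address modulo the 4 MB cache,
   staggering the arrays so their lines no longer collide. */
static struct {
    double U[N2][N1], V[N2][N1], P[N2][N1]; /* ...remaining arrays omitted */
} blk;

int main(void) {
    size_t cache_bytes = (size_t)4 << 20;   /* 4 MB cache */
    size_t dist = (size_t)((char *)blk.V - (char *)blk.U);
    printf("per-array %zu bytes; U-V start distance mod 4MB: %zu\n",
           sizeof blk.U, dist % cache_bytes);
    return 0;
}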
Multicore Program Development Using OSCAR API V2.0
(OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface)

A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) goes into the Waseda OSCAR parallelizing compiler, which performs coarse-grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating. Manual parallelization and power reduction are also possible.

The compiler outputs a parallelized Fortran or C program annotated with the OSCAR API: directives for thread generation (e.g., code with directives for Proc0/Thread 0 and Proc1/Thread 1), memory placement, data transfer using DMA, and power management, targeting homogeneous and/or heterogeneous multicores and manycores.

From this single program, parallel machine codes are generated using existing sequential compilers, so the result is executable on various multicores:
• Low-power homogeneous multicore code generation: API analyzer (available from Waseda) + existing sequential compiler, for homogeneous multicores from Vendor A (SMP servers)
• Low-power heterogeneous multicore code generation: API analyzer + existing sequential compiler, for heterogeneous multicores from Vendor B, together with accelerator 1 and accelerator 2 code
• Server code generation: OpenMP compiler, for shared-memory servers
• For accelerators, the accelerator compiler or the user adds “hint” directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks

Participants: Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, and 3 universities.
Parallel Execution
• Start of parallel execution
– #pragma omp parallel sections (C)
– !$omp parallel sections (Fortran)
• Specifying a critical section
– #pragma omp critical (C)
– !$omp critical (Fortran)
• Enforcing an order on memory operations
– #pragma omp flush (C)
– !$omp flush (Fortran)
• These are from OpenMP.
Thread Execution Model

#pragma omp parallel sections
{
  #pragma omp section
  main_vpc0();
  #pragma omp section
  main_vpc1();
  #pragma omp section
  main_vpc2();
  #pragma omp section
  main_vpc3();
}

Parallel execution: main_vpc0() through main_vpc3() run in parallel, with a thread and a core bound one-to-one.
VPC: Virtual Processor Core
Memory Mapping
• Placing variables on an on-chip centralized shared memory (on-chip CSM)
– #pragma oscar onchipshared (C)
– !$oscar onchipshared (Fortran)
• Placing variables on a local data memory (LDM)
– #pragma omp threadprivate (C)
– !$omp threadprivate (Fortran)
– This directive is an extension to OpenMP
• Placing variables on a distributed shared memory (DSM)
– #pragma oscar distributedshared (C)
– !$oscar distributedshared (Fortran)
A hedged usage sketch follows.
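Only threadprivate is standard OpenMP; the parenthesized variable-list operand for the oscar pragmas is an assumption made here for illustration:

int lut[1024];
#pragma oscar onchipshared (lut)             /* on-chip CSM; operand form
                                                assumed                   */
int work_buf[256];
#pragma omp threadprivate (work_buf)         /* local data memory (OSCAR
                                                reinterprets the OpenMP
                                                directive)                */
int sync_flags[8];
#pragma oscar distributedshared (sync_flags) /* DSM; operand form assumed */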
Data Transfer
• Specifying data transfer lists
– #pragma oscar dma_transfer (C)
– !$oscar dma_transfer (Fortran)
– Containing the following parameter directives
• Specifying a contiguous data transfer
– #pragma oscar dma_contiguous_parameter (C)
– !$oscar dma_contiguous_parameter (Fortran)
• Specifying a stride data transfer
– #pragma oscar dma_stride_parameter
– !$oscar dma_stride_parameter
– This can be used for scatter/gather data transfers
• Data transfer synchronization
– #pragma oscar dma_flag_check
– !$oscar dma_flag_check
A hedged usage sketch follows.
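Only the directive names come from the list above; the operand forms and buffer names (ldm_buf, csm_src) are assumptions, sketching how a DMA transfer could be overlapped with computation:

void stage(float *ldm_buf, const float *csm_src, int n) {
    /* start a transfer described by the parameter directive below */
    #pragma oscar dma_transfer
    /* contiguous burst of n elements from csm_src to ldm_buf
       (operand form assumed) */
    #pragma oscar dma_contiguous_parameter (ldm_buf, csm_src, n)

    /* ...compute on previously fetched data while the DMA runs... */

    /* block until the transfer's completion flag is set */
    #pragma oscar dma_flag_check
}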
Power Control
• Putting a module into a specified frequency/voltage state
– #pragma oscar fvcontrol (C)
– !$oscar fvcontrol (Fortran)
– State examples:
  100: max frequency
  50: half frequency
  0: clock off
  -1: power off
• Getting the frequency/voltage state of a module
– #pragma oscar get_fvstatus (C)
– !$oscar get_fvstatus (Fortran)
[Figure: timeline of one core stepping through fvcontrol(100) → fvcontrol(-1) → fvcontrol(50) → fvcontrol(100)]
A hedged usage sketch follows.
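The (OSCAR_CPU(), state) operand form follows the compiler-generated code image shown later in this deck; the surrounding function is illustrative:

void low_rate_phase(void) {
    #pragma oscar fvcontrol ((OSCAR_CPU(), 50))   /* this core to half
                                                     frequency            */
    /* ...deadline-insensitive work... */
    #pragma oscar fvcontrol ((OSCAR_CPU(), 100))  /* restore max frequency */
}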
National Institute of Radiological Sciences (NIRS): Cancer Treatment with Carbon Ion Radiotherapy
8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000).
55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000).
(The previous best was a 2.5 times speedup on 16 processors with hand optimization.)
Renesas-Hitachi-Waseda Low-Power 8-Core RP2, Developed in 2007 in a METI/NEDO Project
Process technology: 90 nm, 8-layer, triple-Vth CMOS
Chip size: 104.8 mm² (10.61 mm × 9.88 mm)
CPU core size: 6.6 mm² (3.36 mm × 1.96 mm)
Supply voltage: 1.0 V–1.4 V (internal), 1.8/3.3 V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC08: Paper No. 4.5, M. Ito, … and H. Kasahara, “An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler”
8-Core RP2 Chip Block Diagram

[Block diagram: two clusters on the on-chip system bus (SuperHyway). Cluster #0 holds Core#0–Core#3 with snoop controller 0, LCPG0 and PCR0–PCR3; Cluster #1 holds Core#4–Core#7 with snoop controller 1, LCPG1 and PCR4–PCR7. Each core contains a CPU, FPU, 16 KB I-cache, 16 KB D-cache, local memory (I: 8 KB, D: 32 KB), 64 KB user RAM (URAM) and CCN/BAR; barrier synchronization lines connect the cores. The bus also attaches SRAM control, DMA control and DDR2 control (DBSC).
LCPG: local clock pulse generator. PCR: power control register. CCN/BAR: cache controller / barrier register. URAM: user RAM (distributed shared memory).]
Engine Control by Multicore with Denso
Hard real-time automobile engine control by multicore.
Though parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor (1 core vs. 2 cores).
Macro Task Fusion for Static Task Scheduling

[Figures: the MFG of a sample program before macro task fusion; the MFG after macro task fusion; and the MTG after macro task fusion. Legend: data dependency, control flow, conditional branch. Conditional branches and their succeeding tasks are fused into fused blocks, so the resulting MTG contains only data dependencies.]

A hedged C sketch of the idea follows.
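The task bodies here are made up; the point is that the branch and its successor tasks become one fused macro-task:

double do_a(const double *x);   /* hypothetical successor task bodies */
double do_b(const double *x);

/* One fused macro-task: the conditional branch is hidden inside, so the
   edges into and out of this block in the MTG are plain data
   dependencies that a static scheduler can handle. */
double fused_block(int cond, const double *x) {
    return cond ? do_a(x) : do_b(x);
}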
Evaluation Environment: Embedded Multicore Processor RPX (SH-4A 648 MHz × 8)
As a first step, we use just two SH-4A cores, because the target dual-core processors for next-generation automobiles are currently under design.
Evaluation of a Crankshaft Program with Multicore Processors
We attained a 1.54 times speedup on RPX. The program has no loops, only many conditional branches and small basic blocks, so it is difficult to parallelize. This result shows the potential of multicore processors for engine control programs.
[Chart: 1 core: speedup 1.00, execution time 0.57 µs; 2 cores: speedup 1.54, execution time 0.37 µs]
OSCAR Compile Flow for Simulink Applications
A Simulink model is converted to C code using Embedded Coder; the OSCAR compiler then:
(1) generates the MTG → parallelism
(2) generates a Gantt chart → scheduling on a multicore
(3) generates parallelized C code using the OSCAR API → multiplatform execution (Intel, ARM, SH, etc.)
Speedups of MATLAB/Simulink Image Processing on Various 4-Core Multicores (Intel Xeon, ARM Cortex-A15 and Renesas SH4A)
Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706‐buoy‐detection‐using‐simulink
Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114‐fast‐edges‐of‐a‐color‐image‐‐actual‐color‐‐not‐converting‐to‐grayscale‐/
Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990‐retinal‐blood‐vessel‐extraction/
OSCAR Compiler Accelerates Various Multicores' Performance Several Times, Including Intel and IBM

[Chart: speedup ratios of the Intel compiler Ver. 10.1 vs. OSCAR on SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi)]

2.1 times speedup on an Intel 4-core multicore (quad-core Xeon), compared with the Intel compiler.
3.3 times speedup on an IBM P6 595 server, compared with the IBM compiler (IBM Power servers such as Power 4, 5, 6 (4.2 GHz), 7 and later).
Performance of the OSCAR Compiler on an Intel Core i7 Notebook PC
• The OSCAR compiler accelerates code about 2.0 times on average over the Intel compiler.

[Chart: speedup ratios, Intel compiler Ver. 14.0 vs. OSCAR: SPEC95 su2cor 1.00 vs. 1.70; SPEC95 hydro2d 1.18 vs. 2.24; SPEC95 mgrid 2.70 vs. 4.12; SPEC95 turb3d 1.00 vs. 2.91; AAC Encoder 1.00 vs. 2.53]

CPU: Intel Core i7 3720QM (quad-core); MEM: 32 GB DDR3-SODIMM PC3-12800; OS: Ubuntu 12.04 LTS
Parallelization of the 2D Rendering Engine SKIA on 3 Cores of a Google Nexus 7

[Charts: frames per second for normal 1-core execution vs. parallelized 3-core execution. DrawRect: 22.82 FPS (1 core) to 43.57 FPS (3 cores), a 1.91 times speedup; DrawImage: 27.16 FPS (1 core) to 52.88 FPS (3 cores), a 1.95 times speedup.]

On the Nexus 7, 3-core parallelization gave us a 1.91 times speedup for DrawRect and a 1.95 times speedup for DrawImage.
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
110 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7-based 128-core Linux SMP)
Automatic Parallelization of Still Image Encoding Using JPEG-XR for Next-Generation Cameras and the Drinkable Inner Camera (Waseda U. & Olympus)

[Chart: speedups on the TILEPro64 manycore: 1.00, 1.96, 3.95, 7.86, 15.82, 30.79 and 55.11 for 1, 2, 4, 8, 16, 32 and 64 cores; execution time 10.0 s on 1 core down to 0.18 s on 64 cores]

55 times speedup with 64 cores against 1 core.
Parallel Processing of Face Detection on Manycore, High-End and PC Servers

The OSCAR compiler gives us an 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

[Chart: speedup ratio vs. number of cores (1, 2, 4, 8, 16) for automatic parallelization of face detection:
TILEPro64 (gcc): 1.00, 1.72, 3.01, 5.74, 9.30
SR16000 (Power7, 8 cores × 4 CPUs × 4 nodes, xlc): 1.00, 1.93, 3.57, 6.46, 11.55
rs440 (Intel Xeon, 8 cores × 4 CPUs, icc): 1.00, 1.93, 3.67, 6.46, 10.92]
OSCAR Heterogeneous Multicore
• DTU: Data Transfer Unit
• LPM: Local Program Memory
• LDM: Local Data Memory
• DSM: Distributed Shared Memory
• CSM: Centralized Shared Memory
• FVR: Frequency/Voltage Control Register
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Figure: Gantt chart of macro-tasks over time]
33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

[Chart: speedups against a single SH processor: 1SH 1.00; 2SH 2.29; 4SH 3.09; 8SH 5.4; 2SH+1FE 18.85; 4SH+2FE 26.71; 8SH+4FE 32.65 (3.4 fps on 1SH up to 111 fps on 8SH+4FE)]
Power Reduction in Real-Time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
Without power reduction: average 1.76 W. With power reduction by the OSCAR compiler: average 0.54 W, a 70% power reduction. (1 cycle: 33 ms, i.e., 30 fps)
Low-Power Optimization with the OSCAR API

Scheduled result by the OSCAR compiler: VC0 executes MT1 then MT3; VC1 executes MT2, sleeps, then executes MT4.

Code image generated by the OSCAR compiler:

void
main_VC0() {
  MT1
  #pragma oscar fvcontrol \
  (1,(OSCAR_CPU(),100))
  MT3
}

void
main_VC1() {
  MT2
  #pragma oscar fvcontrol \
  ((OSCAR_CPU(),0))      /* Sleep */
  MT4
}
Automatic Power Reduction for MPEG2 Decoding on an Android Multicore (ODROID-X2, ARM Cortex-A9, 4 cores)
• On 3 cores, automatic power-reduction control reduced power to 1/7 compared with execution without power-reduction control.
• 3 cores with compiler power-reduction control reduced power to 1/3 of ordinary 1-core execution.

[Chart: power consumption (W) vs. number of processor cores.
Without power reduction: 0.97 W (1 core), 1.88 W (2 cores), 2.79 W (3 cores).
With power reduction: 0.63 W (1 core, 2/3 or -35.0%), 0.46 W (2 cores, 1/4 or -75.5%), 0.37 W (3 cores, 1/7 or -86.7%; 1/3 or -61.9% of ordinary 1-core execution).]

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
Power Reduction on Intel Haswell for Real-Time Optical Flow (Intel Core i7 4770K)
For HD 720p (1280×720) moving pictures at 15 fps (deadline 66.6 ms/frame):
Power was reduced to 1/4 (9.60 W) by the compiler power optimization on the same 3 cores (41.58 W without power control). Power with 3 cores was also reduced to 1/3 (9.60 W) against 1 core (29.29 W).

[Chart: average power consumption (W) vs. number of PEs.
Without power control: 29.29 W (1PE), 36.59 W (2PE), 41.58 W (3PE).
With power control: 24.17 W (1PE), 12.21 W (2PE), 9.60 W (3PE).]
Performance of OSCAR Compiler Software Coherence Control

[Chart: speedup against sequential processing vs. number of processor cores (1, 2, 4, 8) on RP2, comparing SMP (hardware coherent cache, up to 4 cores) with the non-coherent cache using software coherence:
AAC Encoder: SMP 1.00, 1.89, 3.54; non-coherent 1.02, 1.92, 3.59, 5.90
MPEG2 Decoder: SMP 1.00, 1.62, 2.54; non-coherent 1.01, 1.61, 2.45, 3.36
MPEG2 Encoder: SMP 1.00, 1.85, 3.34; non-coherent 1.02, 2.10, 3.90, 6.63]

Software coherence gives faster or equal processing performance up to 4 cores compared with the hardware coherence mechanism on RP2, and gives correct execution without any hardware coherence mechanism on 8 cores. A hedged sketch of the idea follows.
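This sketches what compiler-controlled software coherence inserts on a non-coherent-cache machine; the cache-operation intrinsics, barrier and task bodies are hypothetical, not the OSCAR API:

#include <stddef.h>

extern void cache_writeback(const void *p, size_t bytes);  /* hypothetical */
extern void cache_invalidate(const void *p, size_t bytes); /* hypothetical */
extern void barrier_wait(void);                            /* hypothetical */
extern double produce(int i);
extern void use(const double *p, size_t n);

static double shared_buf[1024];

void producer_task(void) {               /* e.g., on core 0 */
    for (int i = 0; i < 1024; i++)
        shared_buf[i] = produce(i);
    cache_writeback(shared_buf, sizeof shared_buf);  /* push dirty lines
                                                        to memory before
                                                        synchronizing    */
    barrier_wait();
}

void consumer_task(void) {               /* e.g., on core 1 */
    barrier_wait();
    cache_invalidate(shared_buf, sizeof shared_buf); /* drop possibly
                                                        stale cached
                                                        copies           */
    use(shared_buf, 1024);
}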
OSCAR Technology, started up on Feb. 28, 2013: licensing all the patents and the OSCAR compiler from Waseda Univ.
CEO: Dr. T. Ono (ex-CEO of a First-Section-listed company, VP of a national univ., invited Prof. of Waseda U.)
Executives: Mr. T. Ito (visiting Prof., Tokyo Agricult. and Eng. U.); Prof. K. Shirai (ex-President of Waseda U., Chairman of the Japanese Open Univ.)
CTO: Mr. M. Takamura (ex-Fellow of Fujitsu Lab.; Fujitsu VPP500, VPP5000 & NWT development leader)
Mr. K. Ashida (ex-VP of Sumitomo Trading; Ashida Consulting CEO; a leader of the business world)
Auditor: Dr. S. Matsuda (Prof. Emeritus of Waseda U.; ex-President of the Ventures and Entrepreneurs Society)
Advisors: Dr. T. Sato (patent attorney; ex-President of the Patent Attorneys Association; gov. IP committee); Ms. K. Ishiguro (lawyer, Supreme Court trainer); Mr. A. Fukuda (leader of the alumni association); Prof. K. Kimura (Waseda Univ.); Prof. H. Kasahara (Waseda Univ.)
[Photo: Fujitsu VPP5000]
OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology

Target: solar-powered operation with compiler power reduction; fully automatic parallelization and vectorization, including local memory management and data transfer.

[Diagram: a multicore chip (×4 chips) in which each core holds a CPU, a vector unit, a data transfer unit, a local memory, a distributed shared memory and a power control unit, connected by a compiler-co-designed interconnection network to on-chip shared memory and a centralized shared memory.]

[Photo: Fujitsu VPP500/NWT: PE, unit, cabinet, cabinet (open). Copyright 2008 FUJITSU LIMITED]
Summary
Waseda University's Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance green multicore hardware, software and applications with government and industry, including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus and OSCAR Technology.
The OSCAR automatic parallelizing and power-reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications including earthquake wave propagation, medical applications including cancer treatment using carbon ions and the drinkable inner camera, and industry applications including automobile engine control, smartphones and wireless-communication baseband processing, on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas and Fujitsu.
In automatic parallelization: a 110 times speedup for the earthquake wave propagation simulation on 128 cores of IBM Power7 against 1 core; a 55 times speedup for carbon-ion radiotherapy cancer treatment on 64 cores of IBM Power7; 1.95 times for automobile engine control on Renesas 2 cores using SH4A or V850; and 55 times for JPEG-XR encoding for capsule inner cameras on the Tilera 64-core TILEPro64 manycore.
The compiler will be available on the market from OSCAR Technology.
In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex-A9 and Intel Haswell, and to 1/4 using 8 cores of Renesas SH4A, against ordinary single-core execution.