IoT Platforms (物聯網平台) - Chi-Sheng Shih


  • IoT Platforms (物聯網平台)

    ‣ Chi-Sheng Shih

  • !2

    Outline

    ‣ IoT Platforms

    ‣ Core of IoT Platforms

    ‣ Micro-Controller based Platforms

    ‣ Micro-Processor based Platforms

    ‣ Execution Model

  • !3

    IoT Platforms

  • !4

    Core of platforms - Microprocessor

    ‣ Microprocessor

    ‣ is an IC that contains only a CPU.

    ‣ does not have RAM, ROM, or other peripherals on the chip.

    ‣ A system designer must add these peripherals externally to make the system functional.

    ‣ Examples: Intel i3, i5, i7 processors, ARM700/710/720, ARM810, ARM920T, ARM940T, and Cortex-M0 processors.

  • !5

    Core of platforms - Micro-controller

    ‣ Microcontroller

    ‣ has a processing unit, along with a fixed amount of RAM, ROM, and other peripherals, all embedded on a single chip.

    ‣ is designed to perform specific tasks where the relationship between input and output is well defined.

    ‣ The resources the application requires are known in advance and can be packed together on a single chip.

    ‣ Examples:

    ‣ Micro-controllers for keyboard, mouse, washing machines, digicam, etc.

    ‣ Micro-controllers based on

    ‣ Cortex-M0: Cypress PSoC4, Infineon XMC 1000, STMicroelectronics STM32F0, NXP LPC 1100

    ‣ ARM7: Atmel AT91SAM7, NXP LPC2100/LPC2200

    ‣ ARM9: Atmel AT91SAM9, NXP LPC 2900

  • !6

    Core of the Platforms - SoC

    ‣ System-on-Chip (SoC)

    ‣ is a total solution for specific applications,

    ‣ has processing units (including CPU, GPU, FPGA, or DSP) and application-specific peripherals (such as modem, GPS, etc.)

    ‣ There is no clear line between SoC and MCU.

    ‣ MCUs are often used for applications that require limited computation power, to reduce energy consumption. Controllers for CNC machines are examples.

    ‣ SoCs are often used for applications that require a full computer. Smartphones are examples.

  • !7

    Micro-Controller Based Platform

  • !8

    Arduino-based Platform

    TinyDuino PanStamps Arduino Uno

    RFduino XinoRF Arduino Yun

  • !9

    Arduino

    ‣ Arduino Uno:

    ‣ MCU: ATmega328P

    ‣ Flash: 32KB

    ‣ SRAM: 2KB

    ‣ EEPROM: 1KB

    ‣ Clock: 16MHz

    ‣ Digital I/O: 14

    ‣ Analog: 6

    ‣ Arduino Mega:

    ‣ MCU: ATmega2560

    ‣ Flash: 256KB

    ‣ SRAM: 8KB

    ‣ EEPROM: 4KB

    ‣ Clock: 16MHz

    ‣ Digital I/O: 54

    ‣ Analog: 16

  • !10

    Arduino Family

    LiLyPad

  • !11

    Performance and Power Consumption

    ‣ The device achieves a throughput of 1 MIPS per MHz.

    ‣ ATmega328 can operate at up to 20MHz.

    ‣ Can we make it run for six months on battery?

    ‣ Arduino Uno draws 45mA@5V and can run for less than one day on a 9V battery (1,200mAh).

    ‣ To run for six months, the board may only draw 0.05mA@5V. On an Arduino Pro Mini:

    ‣ Unmodified: 9.9mA@Active, 3.14mA@Sleep

    ‣ No Power LED: 16.9mA@Active, 0.0232mA@Sleep

    ‣ No Power LED, no power regulator: 12.7mA@Active, 0.0058mA@Sleep

    http://www.home-automation-community.com/arduino-low-power-how-to-run-atmega328p-for-a-year-on-coin-cell-battery/
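The arithmetic behind these lifetime claims is simple duty-cycle averaging. Below is a back-of-the-envelope sketch in C using the stripped-down Pro Mini figures quoted above; the 1000mAh battery capacity and 1% duty cycle are illustrative assumptions, not numbers from the slides.

```c
#include <assert.h>

/* Average current for a node that is active a fraction `duty` of the time
 * and asleep otherwise; lifetime is simply capacity / average current. */
double lifetime_hours(double capacity_mah, double active_ma,
                      double sleep_ma, double duty) {
    double avg_ma = duty * active_ma + (1.0 - duty) * sleep_ma;
    return capacity_mah / avg_ma;
}
```

With the no-LED, no-regulator figures (12.7mA active, 0.0058mA asleep), a hypothetical 1000mAh battery, and a 1% duty cycle, the average draw is about 0.13mA, giving well over 300 days of lifetime.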

  • !12

    Curie

    ‣ Announced at CES 2015 to be used for wearable devices.

    ‣ Using Quark SE core

    ‣ Pentium x86 ISA compatible without x87 floating point unit

    ‣ 32-bit Processor with 32-bit Data Bus.

    ‣ 32 MHz clock frequency

    ‣ 384 KB of on-die flash.

    ‣ 80 KB of on-die SRAM

    ‣ Sensors: 6-axis accelerometer,

    ‣ Communication: BLE (Bluetooth low energy)

    ‣ Peripheral: USB, UART, I2C, SPI, GPIO, RTC, ADC

  • !13

  • !14

    MediaTek LinkIt One

    ‣ ARM Cortex-M4 with floating point architecture

    ‣ Comprehensive peripheral interface support, with a common Hardware Abstraction Layer API

    ‣ FreeRTOS with additional middleware components supporting

  • !15

    Head to head

  • !16

    General Purpose Processor Based Platform

  • !17

    BeagleBone Black

    ‣ BeagleBone Black is a low-cost, community-supported development platform for developers and hobbyists.

    ‣ Boots Linux in under 10 seconds; get started on development in less than 5 minutes with just a single USB cable.

    ‣ Processor: AM335x 1GHz ARM® Cortex-A8

    ‣ 512MB DDR3 RAM

    ‣ 4GB 8-bit eMMC on-board flash storage

    ‣ 3D graphics accelerator

    ‣ NEON floating-point accelerator

    ‣ 2x PRU 32-bit micro-controllers

  • !18

    Raspberry Pi 3

    ‣ Hardware Spec:

    ‣ A 1.2GHz 64-bit quad-core ARMv8 CPU

    ‣ 802.11n Wireless LAN

    ‣ Bluetooth 4.1

    ‣ Bluetooth Low Energy (BLE)

    ‣ 1GB RAM

    ‣ 4 USB ports

    ‣ 40 GPIO pins

    ‣ Full HDMI port

    ‣ Ethernet port

    ‣ Combined 3.5mm audio jack and composite video

    ‣ Camera interface (CSI)

    ‣ Display interface (DSI)

    ‣ Micro SD card slot (now push-pull rather than push-push)

    ‣ VideoCore IV 3D graphics core

  • !19

    Raspberry Pi Zero

    ‣ Hardware Specification

    ‣ BCM2835, 1GHz, Single-core CPU

    ‣ By Broadcom

    ‣ Using ARM1176 (ARMv6)

    ‣ 512MB RAM

    ‣ Mini HDMI and USB On-The-Go ports

    ‣ Micro USB power

    ‣ HAT-compatible 40-pin header

    ‣ Composite video and reset headers

  • !20

    Intel Galileo

    ‣ Galileo (released in 2013)

    ‣ Arduino-certified development boards based on Intel X86 architecture

    ‣ Gen1: Intel Quark X1000 32-bit 400MHz SoC

    ‣ Gen2: Intel Quark X1000 32-bit 400MHz SoC + Ethernet PoE + 12-bit PWM + USB UART adapter

    ‣ Single Core, single thread, Pentium instruction set

    ‣ Industry standard I/O interface: ACPI, PCI express, Micro SD, USB 2.0

  • !21

    Edison

    ‣ Released in 2014

    ‣ Hardware Specification:

    ‣ Intel Atom Tangier: 2 x Atom cores at 500MHz

    ‣ 1 x Intel Quark at 100MHz (for RTOS)

    ‣ 1GB RAM

    ‣ 4GB Flash

    ‣ Onboard WiFi

    ‣ Bluetooth 4.0

    ‣ USB

    ‣ Software:

    ‣ Yocto Linux supporting Arduino IDE, Eclipse (C, C++, Python), Intel XDK, and Wolfram.

  • !22

    Execution Model

  • !23

    Execution Model

    ‣ On general-purpose computers,

    ‣ we assume that computers are mostly connected to power sources or have time to shut down gracefully.

    ‣ Data and binary code are loaded into memory to improve performance.

    ‣ On embedded systems and MCU-based systems,

    ‣ many are NOT connected to power sources and can shut down or reboot without notice at any time.

    ‣ Only data are loaded into memory; binary code and state information are stored on flash or storage.

    ‣ Access latency between DRAM and SSD differs by 100x to 1000x.

    ‣ The difference greatly prolongs execution time and increases power consumption.

  • !24

    Example

    ‣ The SHIMMER wearable sensor platform:

    ‣ TI MSP430 microcontroller

    ‣ 802.15.4 radio by Chipcon CC2420

    ‣ Storage: MicroSD 2GB

    ‣ Battery: 250mAh rechargeable battery

    ‣ Triaxial MEMS accelerometer

    ‣ Gyroscope, ECG, EMG, and other sensors

    ‣ On such platforms,

    ‣ The microcontroller is not the major energy consumer.

    ‣ Sensors such as the gyroscope, and the radio, are.

    ‣ What if your code runs 10x or 100x slower?

    Konrad Lorincz et al., "Mercury: a wearable sensor network platform for high-fidelity motion analysis", in SenSys '09: Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems (Nov. 2009), pp. 183-196.

  • !25

    Lifetime for Wearable Devices

    ‣ With the node continuously sampling and logging accelerometer and gyroscope data, maintaining time sync, but performing no data transfers to the base station, the achievable lifetime with a 250mAh battery is 12.5h.

    ‣ Adding an activity filter on the CPU reduces the amount of transmission and prolongs the lifetime: up to 17h at 50% activity, and to 89h (4x) when downloading features only.

  • !26

    Sensor node VMs - Motivation

    ‣ Target device class:

    • MSP430, Cortex M0, AVR

    • Up to a few hundred KB of flash

    • Up to tens of KB of RAM

    ‣ Advantages of running a VM:

    • Using higher-level languages instead of C reduces development cost.

    • IoT applications are expected to consist of many different hardware platforms. Platform independence reduces deployment cost.

    • VMs can offer a safe execution environment.

  • !27

    Sensor node VMs - Related Works

    ‣ Application specific

    • Maté: the first sensor node VM

    • VM*: a Java-based framework to build VMs tailored to specific applications

    ‣ General purpose

    • Java: NanoVM, TakaTuka, Darjeeling, SwissQM, leJOS

    • Python: pyMite, Micropython

    • LISP: SensorScheme

    • Others: DVM, TinyVM

    ‣ They differ in their system requirements and the features they provide. Sensor node JVMs all sacrifice some features, most commonly reflection and support for floating-point operations.

  • !28

    Sensor node VM performance

    ‣ Not many VMs publish detailed performance figures.

    ‣ For the ones that do, the exact performance depends on the code being executed,

    ‣ but the general picture is clear: all are 1 to 2 orders of magnitude slower than native code.

    VM            Source                          Platform                    Instruction set  Performance vs native C
    Darjeeling    Delft University of Technology  ATmega128                   JVM              30x-113x slower
    TakaTuka      University of Freiburg          Mica2 (AVR), JCreate (MSP)  JVM              230x slower
    TinyVM        Yonsei University, Seoul        ATmega128                   nesC             14x-72x slower
    DVM           UCLA                            ATmega128L                  SOS              108x slower
    SensorScheme  University of Twente            MSP430                      LISP             4x-105x slower

    N. Brouwers et al., "Darjeeling, a feature-rich VM for the resource poor", SenSys '09.
    F. Aslam et al., "Introducing TakaTuka: a Java virtual machine for motes", SenSys '08.
    K. Hong et al., "TinyVM: an energy-efficient execution infrastructure for sensor networks", Technical Report CS-TR-9003, Department of Computer Science, Yonsei University, Seoul, Korea.
    R. Balani et al., "Multi-level Software Reconfiguration for Sensor Networks", EMSOFT '06.
    L. Evers, "Concise and Flexible Programming of Wireless Sensor Networks", PhD thesis, University of Twente, 2010.

  • !29

    Does the slowdown matter?

    ‣ At a 10-230x slowdown:

    • The energy consumption of running application code increases accordingly (one of the main reasons to use tiny devices is their low power consumption).

    • Time critical tasks such as periodic sensing and data processing may not finish in time.

    ‣ For the example on the right, the 'compute features' and 'activity filter' steps become the largest component when multiplied by 10 to 100. Doing the FFT on the node will most likely be impossible.

    Lorincz et al., "Mercury: a wearable sensor network platform for high-fidelity motion analysis", SenSys '09.


  • !30

    Improving performance by compiling to native code

    ‣ Compiling bytecode to native code has been a common technique on desktops, especially since the JVM was released.

    ‣ Three types:

    • Offline (before sending code to the device): better code can be generated, but at the cost of losing platform independence.

    • Just-in-time (at run-time): common on desktops, but impractical on a sensor node since many can only execute code from flash memory.

    • Therefore we use Ahead-of-time compilation: compile the whole application to native code, on the device, at load time.

  • !31

    Challenges

    • Restricted flash memory means the size of the compiler shouldn’t be much larger than the interpreter

    • Restricted RAM means we cannot store complex data structures to analyze the bytecode

  • !32

    AOT compiler for JVM bytecode

    ‣ We build on earlier work by Joshua Ellul:

    • When the JVM bytecode is loaded, replace each instruction with a native equivalent,

    • using the native stack as the JVM operand stack,

    • then do some simple but effective peephole optimizations.

    J. Ellul, "Run-time compilation techniques for wireless sensor networks", PhD thesis, University of Southampton, 2012.

    1) Initial translation:

    Operation                        JVM   Native
    adding two shorts                IADD  POP R10; POP R11; POP R12; POP R13;
                                           ADD R10, R12; ADC R11, R13;
                                           PUSH R11; PUSH R10
    duplicate the top stack element  IDUP  POP R10; POP R11; PUSH R11; PUSH R10;
                                           PUSH R11; PUSH R10

    2) Peephole optimisation:

    Before                      Cost               After          Cost
    PUSH R10; POP R10           4 bytes, 4 cycles  (eliminated)   0 bytes, 0 cycles
    PUSH R10; POP R12           4 bytes, 4 cycles  MOV R12, R10   2 bytes, 1 cycle
    MOV R10, R12; MOV R11, R13  4 bytes, 2 cycles  MOVW R10, R12  2 bytes, 1 cycle
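The peephole pass above can be sketched in a few lines of C. This is an illustrative reimplementation, not the actual compiler code: it scans the emitted instruction list and collapses each adjacent PUSH/POP pair into nothing (same register) or a single MOV (different registers, written destination-first).

```c
#include <assert.h>

typedef enum { PUSH, POP, MOV, ADD } Op;
typedef struct { Op op; int dst, src; } Ins;   /* PUSH uses src, POP uses dst */

/* Collapse each PUSH Rsrc; POP Rdest pair:
 *   same register      -> drop both (0 bytes, 0 cycles)
 *   different registers -> MOV Rdest, Rsrc (2 bytes, 1 cycle) */
int peephole(const Ins *in, int n, Ins *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && in[i].op == PUSH && in[i + 1].op == POP) {
            if (in[i].src != in[i + 1].dst)
                out[m++] = (Ins){ MOV, in[i + 1].dst, in[i].src };
            i++;   /* the POP is consumed together with the PUSH */
        } else {
            out[m++] = in[i];
        }
    }
    return m;
}
```

For example, the sequence PUSH R10; POP R10; PUSH R10; POP R12 shrinks to the single instruction MOV R12, R10.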

  • !33

    AOT performance

    ‣ This approach significantly improves performance: only 4.27x slower than native C.

    ‣ But the trade-off is a significant increase in code size: 3.15x larger than native C

    • Since flash memory is scarce on a sensor node, this reduces the size of the programs we can load on a device.

    ‣ Causes of JVM slowdown

    • Interpreter overhead (eliminated by Ellul's AOT)

    • Stack overhead (reduced by our optimizations)

    - Push / pop overhead

    - Load / store overhead

    • Instruction set mismatch

    All experiments were done:

    • using the Avrora cycle-accurate simulator,

    • simulating an ATmega128 CPU,

    • on a set of 7 different benchmarks: bubble sort, heap sort, binary search, FFT, MD5, RC5, XXTEA

             Performance        Code size
    C        1x                 1x
    JVM      10x - 100x slower  0.5x
    AOT      4.27x              3.15x

  • !34

    Overhead after basic AOT

    ‣ The remaining overhead can be measured by looking at the types of instructions executed in the AOT-compiled version, compared to the native C version.

    ‣ We group them into three categories:

    • PUSH / POP: Stack overhead.

    - Each instruction pops its operands from the stack, and pushes its result. Only a limited number can be eliminated by the peephole optimizer.

    • LOAD / STORE: Stack overhead.

    - The same variables are often used multiple times in succession, but since each instruction consumes its operands, we repeatedly have to load and store the same values.

    • Others: Instruction set mismatch.

    - This includes many cases that can be expressed more efficiently in native AVR, such as looping over arrays.

    - In our optimizations we will only target one case: bit shifts by a constant number of bits.

    - JVM shifts take the number of bits to shift by as an operand.

    - Flexible, but inefficient if the number is fixed.

  • !35

    Overhead after basic AOT

    Example Java code: do { A >>>= 1; } while (A > B);

    cycles
    total:          48
    push/pop:       17
    load/store:     16
    instr. set:      9
    total overhead: 42
    native cost:     6

  • !36

    Overview of Proposed Optimizations

    ‣ Combined, they reduce ~80% of the remaining overhead.

    1. Push / pop overhead:

    a) improved peephole optimizer

    b) stack caching

    2. Load / store overhead:

    a) popped value caching

    b) mark loops

    3. Instruction set overhead:

    constant shift optimization

    ‣ However, they do increase the code size.

    48 -> 8 cycles (only 2 more than native C)

                    Before  After
    total:          48      8
    push/pop:       17      0
    load/store:     16      2
    instr. set:      9      0
    total overhead: 42      2
    native cost:     6      6

    do { A >>>= 1; } while (A > B);

  • !37

    Experimental setup

    ‣ AOT compiler built on top of the Darjeeling VM

    • a JVM for resource-constrained devices

    • Modifies JVM bytecode to make it more suitable for sensor nodes

    ‣ Avrora cycle-accurate simulator, extended by adding traces to monitor the compilation process and runtime performance

    • trace AOT compilation by writing debug output to specific memory location monitored by Avrora

    • keep track of cycles spent in every memory location

    ‣ Simulating an ATmega128 CPU

    • 4KB RAM, 128KB flash, 32 8-bit registers

    ‣ Set of 7 benchmarks with different characteristics

    • bubble sort, heap sort, binary search, FFT, MD5, RC5, XXTEA

    ‣ Simulation output

    ‣ darjeeling.S: disassembly of native C code

    ‣ profilerdata.xml: cycles spent per flash address

    ‣ rtcdata.xml: trace of AOT compilation showing how each instruction is translated

    ‣ stdoutlog.txt: to monitor benchmark success and timer output to count total cycles spent in native and AOT compiled code

    ‣ jlib_bm.debug: modified JVM bytecode after Darjeeling's transformations

    ‣ Combined by F# scripts to form a detailed performance report, split by opcode

  • !38

    Results: Performance overhead per instruction type

    [Bar chart: overhead (% of native C run time), broken down into push/pop, mov(w), load/store, other, and total, for each optimisation stage: simple peephole, improved peephole, stack caching, popped value caching, mark loops, and constant shift. Average of 7 benchmarks.]

  • !39

    Results per benchmark

    [Bar chart: overhead (% of optimised native C run time) per optimisation stage (simple peephole, improved peephole, stack caching, popped value caching, mark loops, constant shift) for each benchmark: bubble sort, heap sort, binary search, fft, xxtea, md5, rc5.]

  • !40

    Overhead per benchmark for 1-7 pinned registers

    [Line chart: overhead (% of optimised native C run time) for 1 to 7 pinned registers, per benchmark: bubble sort, heap sort, binary search, fft, xxtea, md5, rc5.]

  • !41

    Xxtea overhead per instruction type for 1-7 pinned registers

    [Line chart: overhead (% of optimised native C run time) for 1 to 7 pinned registers, broken down into push/pop, move, load/store, other, and total.]

    ‣ xxtea has a lower average stack depth. More pinned registers reduce load/store overhead but increase the cost of spilling pushed/popped stack values.

  • !42

    Results: Code size overhead per instruction type

    [Bar chart: increase in code size as a % of native C size, broken down into push/pop, mov(w), load/store, other, and total, for each optimisation stage: simple peephole, improved peephole, stack caching, popped value caching, mark loops, and constant shift. Average of 7 benchmarks.]

  • !43

    Conclusion

    ‣ From a code size perspective, the interpreter can't be beaten, but it suffers from an often unacceptable performance penalty.

    ‣ Previous work on Ahead-of-Time compilation to native code improves performance, but at the expense of a large increase in code size, reducing the amount of code we can load in the limited memory available on a sensor node.

    ‣ Our optimizations further improve the performance of AOT-compiled bytecode, and significantly reduce the code size overhead, leading to code that is only

    • 68% slower than native C, and

    • 88% larger than native C.

                   Performance        Code size
    C              1x                 1x
    JVM            10x - 100x slower  0.5x
    AOT            4.27x              3.15x
    Optimised AOT  1.68x              1.88x

  • !44

    Questions?

    https://github.com/wukong-m2m/wukong-darjeeling/tree/aot-compiler

  • !45

    Causes of JVM slowdown

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

    For each JVM instruction:

    1. Fetch

    • Read the instruction from memory

    • from Flash, not RAM!

    2. Decode

    • Find the right label in a switch statement with about 200 cases

    • probably uses a jump table

    3. Execute

    • Load operands from the stack into registers

    • Do the operation

    • Store the result back on the operand stack

    4. Loop

    • Update the pc

    • Should we terminate?
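These four steps form the classic bytecode dispatch loop. A minimal sketch in C (with a made-up three-opcode instruction set, not real JVM encodings) shows how every single bytecode pays for the fetch, the switch-based decode, and the operand-stack traffic:

```c
#include <stdint.h>

enum { OP_CONST, OP_ADD, OP_HALT };   /* illustrative opcodes, not real JVM ones */

/* Minimal interpreter loop: fetch, decode, execute, loop. */
int16_t run(const uint8_t *code) {
    int16_t stack[16];
    int sp = 0, pc = 0;
    for (;;) {
        uint8_t op = code[pc++];              /* 1. fetch */
        switch (op) {                         /* 2. decode */
        case OP_CONST:                        /* 3. execute */
            stack[sp++] = (int8_t)code[pc++]; /*    push an 8-bit constant */
            break;
        case OP_ADD: {
            int16_t b = stack[--sp];          /*    load operands from stack */
            int16_t a = stack[--sp];
            stack[sp++] = a + b;              /*    store result back */
            break;
        }
        default:                              /* OP_HALT */
            return stack[--sp];
        }
    }                                         /* 4. loop: pc already updated */
}
```

Running the program { OP_CONST, 2, OP_CONST, 3, OP_ADD, OP_HALT } executes several memory reads and stack operations just to compute 2 + 3, which native code would do in a single register add.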

  • !46

    Causes of JVM slowdown

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

    Native AVR for a += b; (assuming a and b aren't in registers already, and a needs to be stored back to memory):

    LD r1, &a
    LD r2, &b
    ADD r1, r2
    ST &a, r1

  • !47

    Causes of JVM slowdown

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

    JVM for a += b;:

    ILOAD_0   (push a)
    ILOAD_1   (push b)
    IADD      (pop, pop, push)
    ISTORE_0  (pop, store a)

    Pushes and pops aren't to the real stack, but to the JVM operand stack: push a -> operand_stack[sp++] = a

  • !48

    Causes of JVM slowdown

    • JVM is very simple, but some things can’t be expressed efficiently. For example, lack of pointers slows down array processing:

    • AVR: LD Rd, X+

    • Native code would use the AVR's load-with-auto-increment instruction: it loads the byte at the location pointed to by register X into Rd, and increments X by 1. 2 cycles.


    • JVM: ALOAD, ILOAD, IALOAD, IINC

    • Stack overhead: ALOAD and ILOAD

    • IALOAD needs to calculate the address for every access

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

  • !49

    Overview of Proposed optimizations

    ‣ Combined, they reduce ~80% of remaining overhead

    1. Push / pop overhead:

    a) improved peephole optimizer b) stack caching

    2. Load / store overhead:

    a) popped value caching b) mark loops

    3. Instruction set overhead:

    constant shift optimization

    Optimisation           Overhead reduction (as % of native C)
    Baseline               327%
    improved peephole      -86% (243%)
    stack caching          -61% (182%)
    popped value caching   -45% (137%)
    mark loops             -40% (97%)
    constant shift         -29% (68%)

  • !50

    1a: improved peephole optimizer

    Improved peephole optimisation:

    Before                Cost               After            Cost
    PUSH Rx; POP Rx       4 bytes, 4 cycles  (eliminated)     0 bytes, 0 cycles
    PUSH Rsrc; POP Rdest  4 bytes, 4 cycles  MOV Rdest, Rsrc  2 bytes, 1 cycle

    ‣ The earlier optimizer only replaces (blocks of) consecutive push/pop pairs.

    ‣ More general: a push/pop pair can be replaced iff the target register of the pop is not used between the push and the pop.

  • !51

    1a: improved peephole optimizer

    48 -> 41 cycles

  • !52

    1b: stack caching

    ‣ M. A. Ertl, "Stack caching for interpreters", PLDI '95. Originally aimed at Forth interpreters.

    • Usually the last pushed value will be popped again soon.

    • Keep the top of the stack in registers, so we can avoid memory accesses.

    • Only push to the real stack if we run out of registers.

    • Keep a 'cache state' which tells how much of the stack is in registers.

    • For interpreters, the cache state can't be made too complicated or the overhead will exceed the gains; for AOT it is only used at load time to generate better code.

    JVM: a += b;

    In-memory stack:

    Stack slot 4: @SP+6
    Stack slot 3: @SP+4
    Stack slot 2: @SP+2
    Stack slot 1: @SP

    ADD: pop rA; pop rB; add rA, rB; push rA

    Stack caching (cache state = 3):

    Stack slot 4: @SP
    Stack slot 3: cpu R1
    Stack slot 2: cpu R2
    Stack slot 1: cpu R3

    ADD: add rA, rB; cache state := 2

  • !53

    1b: stack caching

    JVM: a += b; (assuming just 1 register available for stack caching)

    Before (in-memory stack):

    ILOAD_0:  load a, push           (a : @SP)
    ILOAD_1:  load b, push           (b : @SP, a : @SP+2)
    IADD:     pop, pop, push         (a+b : @SP)
    ISTORE_0: pop, store a

    After (stack caching):

    ILOAD_0:  load a                 (a : cpu R1)
    ILOAD_1:  push a to @SP, load b  (a : @SP, b : cpu R1)
    IADD:     pop a, add             (a+b : cpu R1)
    ISTORE_0: store a

  • !54

    1b: stack caching

    ‣ AOT compiler implementation: • add a ‘cache manager’ component

    - uint8_t sc_getfreereg();

    - uint8_t sc_pop();

    - void sc_push(uint8_t reg);

    • cache manager will use registers when possible, or emit a real push or pop if necessary

    • code generation is now split between the old instruction generators, and the cache manager

    • note that the cache manager is only needed at load time, so we can spend more time manipulating it
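A toy version of such a cache manager, built around the three calls listed above, might look as follows. The register pool size, the spill policy, and the emitted-instruction counters are assumptions for illustration; a real implementation would also release operand registers after each instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_SC_REGS 4              /* registers reserved for stack caching (assumption) */

enum rstate { FREE, INUSE, CACHED };
static enum rstate state[NUM_SC_REGS];
static uint8_t slot[NUM_SC_REGS];  /* slot[0] = deepest cached stack slot */
static int depth;                  /* stack slots currently held in registers */
static int pushes_emitted, pops_emitted;

static void emit(const char *op, uint8_t r) {
    if (strcmp(op, "PUSH") == 0) pushes_emitted++; else pops_emitted++;
    printf("%s R%u\n", op, r);
}

/* Hand out a register for a new value; spill the deepest cached
 * stack slot with a real PUSH when no register is free. */
uint8_t sc_getfreereg(void) {
    for (int r = NUM_SC_REGS - 1; r >= 0; r--)
        if (state[r] == FREE) { state[r] = INUSE; return (uint8_t)r; }
    assert(depth > 0);             /* sketch: operands must be released in a real VM */
    uint8_t victim = slot[0];
    emit("PUSH", victim);
    memmove(slot, slot + 1, (size_t)--depth);
    state[victim] = INUSE;
    return victim;
}

/* The value in `reg` becomes the new top of the operand stack. */
void sc_push(uint8_t reg) {
    slot[depth++] = reg;
    state[reg] = CACHED;
}

/* Return a register holding the top of stack; only emit a real POP
 * when the top of the stack lives in memory. */
uint8_t sc_pop(void) {
    if (depth > 0) {
        uint8_t r = slot[--depth];
        state[r] = INUSE;
        return r;
    }
    uint8_t r = sc_getfreereg();
    emit("POP", r);
    return r;
}
```

Pushing four values touches no memory at all; only the fifth push forces a real PUSH to spill the deepest cached slot.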

  • !55

    1b: stack caching

    41 -> 31 cycles

    *: register is used by the current instruction
    IntX: register is on the operand stack, at position X (higher = deeper)

  • !56

    2a: popped value caching

    ‣ Stack caching removes most push / pop overhead, but not load / store overhead. Since each operand is popped off the stack, JVM code often repeatedly loads the same variables.

    ‣ Some loads may be unnecessary if the value is already in a register.

    ‣ We add a 'value tag' to the cache state of each register to remember which value it holds after it has been popped from the stack.

    • The 'value tag' indicates what value is currently held there, for example "local short 0" or "constant int 42".

    • We can eliminate loads if the required value is already in a register.

    • We need to clear the cache at every branch target (non-taken branches are ok):

    - the instruction could be reached from two places, so we can't assume anything about the contents of the registers

    JVM: a += b;

    ILOAD_0:  load a    (a : R1)   <- unnecessary if a is already in a register
    ILOAD_1:  load b    (b : R2)
    IADD:               (a+b : R1)
    ISTORE_0: store a

    After this fragment, a and b are no longer on the stack, but both are still in registers.
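The value-tag bookkeeping can be sketched as a small C simulation. The tag encoding and the round-robin register allocation are invented for illustration; the point is only that a load is emitted just once per cached value, and that clearing the tags at a branch target forces a reload.

```c
#include <assert.h>

#define NREGS 4
#define TAG_NONE (-1)

static int tag[NREGS] = { TAG_NONE, TAG_NONE, TAG_NONE, TAG_NONE };
static int next_reg;
static int loads_emitted;

/* Return a register holding local variable `local`; emit a load only
 * if no register already caches that value. */
int load_local(int local) {
    for (int r = 0; r < NREGS; r++)
        if (tag[r] == local) return r;   /* hit: the load is eliminated */
    int r = next_reg;                    /* naive round-robin allocation */
    next_reg = (next_reg + 1) % NREGS;
    loads_emitted++;                     /* stands in for an emitted LD */
    tag[r] = local;
    return r;
}

/* A branch target can be reached from several places, so nothing can be
 * assumed about register contents: clear all value tags. */
void clear_tags(void) {
    for (int r = 0; r < NREGS; r++) tag[r] = TAG_NONE;
}
```

Loading local 0 twice in a row emits only one load; after clear_tags() the same request emits a fresh load.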

  • !57

    2a: popped value caching

    31 -> 27 cycles

  • !58

    2b: mark loops

    ‣ Popped value caching helps, but needs to reset its cache on every branch target.

    • Inner loops are often the most performance critical.

    • Each iteration will often use the same variables, but will have to reload them every time.

    ‣ We add an extra instruction to the VM: MARKLOOP, placed at the beginning and end of every inner loop.

    • MARKLOOP contains a list of variables used in the loop.

    • The VM can decide to pin a number of variables to registers, for the duration of the loop.

    • A loop prologue and epilogue are added to load and store the variables, if they are live at the loop start or end, respectively.

    • Loads and stores now access those registers, instead of main memory, but they are no longer available for normal stack caching.

    1: BRTARGET(1)
    2: ...
    3: ...
    4: ...
    5: ...
    6: ...
    7: GOTO 1

    Value tags get cleared at each branch target, so the cache needs to warm up for each iteration.

  • !59

    2b: mark loops

    setup: 12 cycles
    inner loop: 27 -> 17 cycles

  • !60

    3: constant shifts

    ‣ Most transformations from JVM to more optimal native code are either too complex, or too specific and only help in rare cases.

    ‣ Constant shifts however, are very common: 6 out of our 7 benchmarks.

    • They appear both as real constant shifts, and as multiplication/division by powers of two.

    • They are also easy to recognize and transform:

    - We replace all constant loads directly followed by a bit shift instruction with a special case.

    Java: a >>> 1

    JVM: SLOAD_0 SCONST_1 SUSHR

    The constant load is unnecessary, and the shift is implemented as a loop, which is inefficient if the number of bits to shift by is already known.
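The load-time rewrite can be sketched as a tiny pattern matcher over the bytecode stream. The opcode values and the single-bit N_LSR output are illustrative, not the real Darjeeling encodings, and the generic constant-push path is elided in this sketch.

```c
#include <assert.h>

/* Illustrative opcode values, not the actual bytecode encoding. */
enum { SLOAD_0, SCONST, SUSHR, N_LOAD, N_LSR, N_SHIFT_LOOP };

/* A constant load immediately followed by a shift is replaced by that
 * many single-bit native shifts, instead of pushing the constant and
 * running the generic shift-by-register loop. */
int translate(const int *in, int n, int *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        switch (in[i]) {
        case SLOAD_0:
            out[m++] = N_LOAD;
            break;
        case SCONST:                          /* operand follows the opcode */
            if (i + 2 < n && in[i + 2] == SUSHR) {
                for (int b = 0; b < in[i + 1]; b++)
                    out[m++] = N_LSR;         /* special-cased constant shift */
                i += 2;                       /* skip operand and SUSHR */
            } else {
                i += 1;                       /* generic constant push elided */
            }
            break;
        case SUSHR:
            out[m++] = N_SHIFT_LOOP;          /* generic shift loop */
            break;
        }
    }
    return m;
}
```

For the fragment SLOAD_0 SCONST_1 SUSHR, the translation emits just a load and one single-bit shift.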

  • !61

    3: constant shifts

    ‣ Transformations from JVM to more optimal native code are often either too hard or too specific.

    ‣ Constant shifts however, are very common: 6 out of our 7 benchmarks, and easy to recognise and transform

    ‣ We replace all constant loads directly followed by a bit shift with a special case, skipping normal code generator.

    ‣ avr-gcc doesn’t optimize all cases

    17 -> 8 cycles in the inner loop (only 2 more than native C)

  • !62

    Mark loops trade off

    ‣ This optimization has a tradeoff: registers used to pin variables can’t be used for stack caching.

    • More pinned registers: cheap access to more pinned variables

    • Fewer free registers: the stack cache may have to spill to memory more often

  • !63

    Summary

    ‣ A large number of IoT platforms are available for different needs.

    ‣ There is no single solution for all IoT applications.

    ‣ Choosing the right platform is critical.

    ‣ Knowing the platforms is the first step in designing IoT systems.

    ‣ How to program IoT applications across these different types of platforms remains an open question.