IoT Platforms (物聯網平台) - Chi-Sheng Shih


  • IoT Platforms (物聯網平台)

    ‣ Chi-Sheng Shih

  • !2

    Outline

    ‣ IoT Platforms

    ‣ Core of IoT Platforms

    ‣ Micro-Controller based Platforms

    ‣ Micro-Processor based Platforms

    ‣ Execution Model

  • !3

    IoT Platforms

  • !4

    Core of platforms - Microprocessor

    ‣ Microprocessor

    ‣ is an IC that contains only a CPU.

    ‣ does not have RAM, ROM, or other peripherals on the chip.

    ‣ A system designer must add these peripherals externally to make the system functional.

    ‣ Examples: Intel i3, i5, i7 processors, ARM700/710/720, ARM810, ARM920T, ARM940T, and Cortex-M0 processors.

  • !5

    Core of platforms - Micro-controller

    ‣ Microcontroller

    ‣ has a processing unit, along with a fixed amount of RAM, ROM, and other peripherals, all embedded on a single chip.

    ‣ is designed to perform specific tasks where the relationship between input and output is well defined.

    ‣ The resources the application requires are known in advance and can be packed together on a single chip.

    ‣ Examples:

    ‣ Micro-controllers for keyboard, mouse, washing machines, digicam, etc.

    ‣ Micro-controllers based on

    ‣ Cortex-M0: Cypress PSoC4, Infineon XMC 1000, STMicroelectronics STM32F0, NXP LPC 1100

    ‣ ARM7: Atmel AT91SAM7, NXP LPC2100/LPC2200

    ‣ ARM9: Atmel AT91SAM9, NXP LPC 2900

  • !6

    Core of the Platforms - SoC

    ‣ System-on-Chip (SoC)

    ‣ is a total solution for specific applications,

    ‣ has processing units (including CPU, GPU, FPGA, or DSP) and application-specific peripherals (such as modem, GPS, etc.)

    ‣ There is no clear line between SoC and MCU.

    ‣ MCUs are often used for applications that require limited computation power, to reduce energy consumption. Controllers for CNC machines are examples.

    ‣ SoCs are often used for applications that require a full computer. Smartphones are examples.

  • !7

    Micro-Controller Based Platform

  • !8

    Arduino-based Platform

    TinyDuino PanStamps Arduino Uno

    RFduino XinoRF Arduino Yun

  • !9

    Arduino

    ‣ Arduino Uno:

    ‣ MCU: ATmega328P

    ‣ Flash: 32KB

    ‣ SRAM: 2KB

    ‣ EEPROM: 1KB

    ‣ Clock: 16MHz

    ‣ Digital I/O: 14

    ‣ Analog: 6

    ‣ Arduino Mega:

    ‣ MCU: ATmega2560

    ‣ Flash: 256KB

    ‣ SRAM: 8KB

    ‣ EEPROM: 4KB

    ‣ Clock: 16MHz

    ‣ Digital I/O: 54

    ‣ Analog: 16

  • !10

    Arduino Family

    LiLyPad

  • !11

    Performance and Power Consumption

    ‣ The device achieves a throughput of 1 MIPS per MHz.

    ‣ ATmega328 can operate at up to 20MHz.

    ‣ Can we make it run for six months on battery?

    ‣ Arduino Uno draws 45mA@5V and can run for less than one day on a 9V battery (1,200mAh).

    ‣ To run for six months, the board may only draw 0.05mA@5V. On an Arduino Pro Mini:

    ‣ Unmodified: 9.9mA@Active, 3.14mA@Sleep

    ‣ No Power LED: 16.9mA@Active, 0.0232mA@Sleep

    ‣ No Power LED, no power regulator: 12.7mA@Active, 0.0058mA@Sleep

    http://www.home-automation-community.com/arduino-low-power-how-to-run-atmega328p-for-a-year-on-coin-cell-battery/
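The arithmetic behind these lifetime claims is simple duty-cycle averaging. Below is a back-of-the-envelope sketch in C using the stripped-down Pro Mini figures quoted above; the 1000mAh battery capacity and 1% duty cycle are illustrative assumptions, not numbers from the slides.

```c
#include <assert.h>

/* Average current for a node that is active a fraction `duty` of the time
 * and asleep otherwise; lifetime is simply capacity / average current. */
double lifetime_hours(double capacity_mah, double active_ma,
                      double sleep_ma, double duty) {
    double avg_ma = duty * active_ma + (1.0 - duty) * sleep_ma;
    return capacity_mah / avg_ma;
}
```

With the no-LED, no-regulator figures (12.7mA active, 0.0058mA asleep), a hypothetical 1000mAh battery, and a 1% duty cycle, the average draw is about 0.13mA, giving well over 300 days of lifetime.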

  • !12

    Curie

    ‣ Announced at CES 2015 to be used for wearable devices.

    ‣ Using Quark SE core

    ‣ Pentium x86 ISA compatible without x87 floating point unit

    ‣ 32-bit Processor with 32-bit Data Bus.

    ‣ 32 MHz clock frequency

    ‣ 384 KB of on-die flash.

    ‣ 80 KB of on-die SRAM

    ‣ Sensors: 6-axis accelerometer,

    ‣ Communication: BLE (Bluetooth low energy)

    ‣ Peripheral: USB, UART, I2C, SPI, GPIO, RTC, ADC

  • !13

  • !14

    MediaTek LinkIt One

    ‣ ARM Cortex-M4 with floating point architecture

    ‣ Comprehensive peripheral interface support, with a common Hardware Abstraction Layer API

    ‣ FreeRTOS with additional middleware components supporting

  • !15

    Head to head

  • !16

    General Purpose Processor Based Platform

  • !17

    BeagleBone Black

    ‣ BeagleBone Black is a low-cost, community-supported development platform for developers and hobbyists.

    ‣ Boots Linux in under 10 seconds; get started on development in less than 5 minutes with just a single USB cable.

    ‣ Processor: AM335x 1GHz ARM® Cortex-A8

    ‣ 512MB DDR3 RAM

    ‣ 4GB 8-bit eMMC on-board flash storage

    ‣ 3D graphics accelerator

    ‣ NEON floating-point accelerator

    ‣ 2x PRU 32-bit micro-controllers

  • !18

    Raspberry Pi 3

    ‣ Hardware Spec:

    ‣ A 1.2GHz 64-bit quad-core ARMv8 CPU

    ‣ 802.11n Wireless LAN

    ‣ Bluetooth 4.1

    ‣ Bluetooth Low Energy (BLE)

    ‣ 1GB RAM

    ‣ 4 USB ports

    ‣ 40 GPIO pins

    ‣ Full HDMI port

    ‣ Ethernet port

    ‣ Combined 3.5mm audio jack and composite video

    ‣ Camera interface (CSI)

    ‣ Display interface (DSI)

    ‣ Micro SD card slot (now push-pull rather than push-push)

    ‣ VideoCore IV 3D graphics core

  • !19

    Raspberry Pi Zero

    ‣ Hardware Specification

    ‣ BCM2835, 1GHz, Single-core CPU

    ‣ By Broadcom

    ‣ Using ARM1176 (ARMv6)

    ‣ 512MB RAM

    ‣ Mini HDMI and USB On-The-Go ports

    ‣ Micro USB power

    ‣ HAT-compatible 40-pin header

    ‣ Composite video and reset headers

  • !20

    Intel Galileo

    ‣ Galileo (released in 2013)

    ‣ Arduino-certified development boards based on Intel X86 architecture

    ‣ Gen1: Intel Quark X1000 32-bit 400MHz SoC

    ‣ Gen2: Intel Quark X1000 32-bit 400MHz SoC + Ethernet PoE + 12-bit PWM + USB UART adapter

    ‣ Single Core, single thread, Pentium instruction set

    ‣ Industry standard I/O interface: ACPI, PCI express, Micro SD, USB 2.0

  • !21

    Edison

    ‣ Released in 2014

    ‣ Hardware Specification:

    ‣ Intel Atom Tangier: 2 x Atom cores at 500MHz

    ‣ 1 x Intel Quark at 100MHz (for RTOS)

    ‣ 1GB RAM

    ‣ 4GB Flash

    ‣ Onboard WiFi

    ‣ Bluetooth 4.0

    ‣ USB

    ‣ Software:

    ‣ Yocto Linux supporting Arduino IDE, Eclipse (C, C++, Python), Intel XDK, and Wolfram.

  • !22

    Execution Model

  • !23

    Execution Model

    ‣ On general-purpose computers,

    ‣ we assume that computers are mostly connected to power sources or have time to shut down gracefully.

    ‣ Data and binary code are loaded into memory to improve performance.

    ‣ On embedded systems and MCU-based systems,

    ‣ many are NOT connected to power sources and can shut down or reboot without notice at any time.

    ‣ Only data are loaded into memory; binary code and state information are stored on flash or storage.

    ‣ Access latency between DRAM and SSD differs by 100x to 1000x.

    ‣ The difference greatly prolongs execution time and increases power consumption.

  • !24

    Example

    ‣ The SHIMMER wearable sensor platform:

    ‣ TI MSP430 microcontroller

    ‣ 802.15.4 radio by Chipcon CC2420

    ‣ Storage: MicroSD 2GB

    ‣ Battery: 250mAh rechargeable battery

    ‣ Triaxial MEMS accelerometer

    ‣ Gyroscope, ECG, EMG, and other sensors

    ‣ On such platforms,

    ‣ The microcontroller is not the major energy consumer.

    ‣ Sensors such as the gyroscope, and the radio, are.

    ‣ What if your code runs 10x or 100x slower?

    Konrad Lorincz et al., "Mercury: a wearable sensor network platform for high-fidelity motion analysis", in SenSys '09: Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems (Nov. 2009), pp. 183-196.

  • !25

    Lifetime for Wearable Devices

    ‣ With the node continuously sampling and logging accelerometer and gyroscope data, maintaining time sync, but performing no data transfers to the base station, the achievable lifetime with a 250mAh battery is 12.5h.

    ‣ Adding an activity filter on the CPU reduces the amount of transmission and prolongs the lifetime: up to 17h at 50% activity, and to 89h (4x) when downloading features only.

  • !26

    Sensor node VMs - Motivation

    ‣ Target device class:

    • MSP430, Cortex M0, AVR

    • Up to a few hundred KB of flash

    • Up to tens of KB of RAM

    ‣ Advantages of running a VM:

    • Using higher-level languages instead of C reduces development cost.

    • IoT applications are expected to consist of many different hardware platforms. Platform independence reduces deployment cost.

    • VMs can offer a safe execution environment.

  • !27

    Sensor node VMs - Related Works

    ‣ Application specific

    • Maté: the first sensor node VM

    • VM*: a Java-based framework to build VMs tailored to specific applications

    ‣ General purpose

    • Java: NanoVM, TakaTuka, Darjeeling, SwissQM, leJOS

    • Python: pyMite, Micropython

    • LISP: SensorScheme

    • Others: DVM, TinyVM

    ‣ They differ in their system requirements and the features they provide. Sensor node JVMs all sacrifice some features, most commonly reflection and support for floating-point operations.

  • !28

    Sensor node VM performance

    ‣ Not many VMs publish detailed performance figures.

    ‣ For the ones that do, the exact performance depends on the code being executed,

    ‣ but the general picture is clear: all are 1 to 2 orders of magnitude slower than native code.

    VM            Source                          Platform                    Instruction set  Performance vs native C
    Darjeeling    Delft University of Technology  ATmega128                   JVM              30x-113x slower
    TakaTuka      University of Freiburg          Mica2 (AVR), JCreate (MSP)  JVM              230x slower
    TinyVM        Yonsei University, Seoul        ATmega128                   nesC             14x-72x slower
    DVM           UCLA                            ATmega128L                  SOS              108x slower
    SensorScheme  University of Twente            MSP430                      LISP             4x-105x slower

    N. Brouwers et al., "Darjeeling, a feature-rich VM for the resource poor", SenSys '09.
    F. Aslam et al., "Introducing TakaTuka: a Java virtual machine for motes", SenSys '08.
    K. Hong et al., "TinyVM: an energy-efficient execution infrastructure for sensor networks", Technical Report CS-TR-9003, Department of Computer Science, Yonsei University, Seoul, Korea.
    R. Balani et al., "Multi-level Software Reconfiguration for Sensor Networks", EMSOFT '06.
    L. Evers, "Concise and Flexible Programming of Wireless Sensor Networks", PhD thesis, University of Twente, 2010.

  • !29

    Does the slowdown matter?

    ‣ At a 10-230x slowdown:

    • The energy consumption of running application code increases accordingly (one of the main reasons to use tiny devices is their low power consumption).

    • Time critical tasks such as periodic sensing and data processing may not finish in time.

    ‣ For the example on the right, the 'compute features' and 'activity filter' steps become the largest component when multiplied by 10 to 100. Doing the FFT on the node will most likely be impossible.

    Lorincz et al., "Mercury: a wearable sensor network platform for high-fidelity motion analysis", SenSys '09.


  • !30

    Improving performance by compiling to native code

    ‣ Compiling bytecode to native code has been a common technique on desktops, especially since the JVM was released.

    ‣ Three types:

    • Offline (before sending code to the device): better code can be generated, but at the cost of losing platform independence.

    • Just-in-time (at run-time): common on desktops, but impractical on a sensor node since many can only execute code from flash memory.

    • Therefore we use Ahead-of-time compilation: compile the whole application to native code, on the device, at load time.

  • !31

    Challenges

    • Restricted flash memory means the size of the compiler shouldn’t be much larger than the interpreter

    • Restricted RAM means we cannot store complex data structures to analyze the bytecode

  • !32

    AOT compiler for JVM bytecode

    ‣ We build on earlier work by Joshua Ellul:

    • When the JVM bytecode is loaded, replace each instruction with a native equivalent,

    • using the native stack as the JVM operand stack,

    • then do some simple but effective peephole optimizations.

    J. Ellul, "Run-time compilation techniques for wireless sensor networks", PhD thesis, University of Southampton, 2012.

    1) Initial translation:

    Operation                        JVM   Native
    adding two shorts                IADD  POP R10; POP R11; POP R12; POP R13;
                                           ADD R10, R12; ADC R11, R13;
                                           PUSH R11; PUSH R10
    duplicate the top stack element  IDUP  POP R10; POP R11; PUSH R11; PUSH R10;
                                           PUSH R11; PUSH R10

    2) Peephole optimisation:

    Before                      Cost               After          Cost
    PUSH R10; POP R10           4 bytes, 4 cycles  (eliminated)   0 bytes, 0 cycles
    PUSH R10; POP R12           4 bytes, 4 cycles  MOV R12, R10   2 bytes, 1 cycle
    MOV R10, R12; MOV R11, R13  4 bytes, 2 cycles  MOVW R10, R12  2 bytes, 1 cycle
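The peephole pass above can be sketched in a few lines of C. This is an illustrative reimplementation, not the actual compiler code: it scans the emitted instruction list and collapses each adjacent PUSH/POP pair into nothing (same register) or a single MOV (different registers, written destination-first).

```c
#include <assert.h>

typedef enum { PUSH, POP, MOV, ADD } Op;
typedef struct { Op op; int dst, src; } Ins;   /* PUSH uses src, POP uses dst */

/* Collapse each PUSH Rsrc; POP Rdest pair:
 *   same register      -> drop both (0 bytes, 0 cycles)
 *   different registers -> MOV Rdest, Rsrc (2 bytes, 1 cycle) */
int peephole(const Ins *in, int n, Ins *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && in[i].op == PUSH && in[i + 1].op == POP) {
            if (in[i].src != in[i + 1].dst)
                out[m++] = (Ins){ MOV, in[i + 1].dst, in[i].src };
            i++;   /* the POP is consumed together with the PUSH */
        } else {
            out[m++] = in[i];
        }
    }
    return m;
}
```

For example, the sequence PUSH R10; POP R10; PUSH R10; POP R12 shrinks to the single instruction MOV R12, R10.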

  • !33

    AOT performance

    ‣ This approach significantly improves performance: only 4.27x slower than native C.

    ‣ But the trade-off is a significant increase in code size: 3.15x larger than native C

    • Since flash memory is scarce on a sensor node, this reduces the size of the programs we can load on a device.

    ‣ Causes of JVM slowdown

    • Interpreter overhead (eliminated by Ellul's AOT)

    • Stack overhead (reduced by our optimizations)

    - Push / pop overhead

    - Load / store overhead

    • Instruction set mismatch

    All experiments were done:

    • using the Avrora cycle-accurate simulator,

    • simulating an ATmega128 CPU,

    • on a set of 7 different benchmarks: bubble sort, heap sort, binary search, FFT, MD5, RC5, XXTEA

             Performance        Code size
    C        1x                 1x
    JVM      10x - 100x slower  0.5x
    AOT      4.27x              3.15x

  • !34

    Overhead after basic AOT

    ‣ The remaining overhead can be measured by looking at the types of instructions executed in the AOT-compiled version, compared to the native C version.

    ‣ We group them into three categories:

    • PUSH / POP: Stack overhead.

    - Each instruction pops its operands from the stack, and pushes its result. Only a limited number can be eliminated by the peephole optimizer.

    • LOAD / STORE: Stack overhead.

    - The same variables are often used multiple times in succession, but since each instruction consumes its operands, we repeatedly have to load and store the same values.

    • Others: Instruction set mismatch.

    - This includes many cases that can be expressed more efficiently in native AVR, such as looping over arrays.

    - In our optimizations we will only target one case: bit shifts by a constant number of bits.

    - JVM shifts take the number of bits to shift by as an operand.

    - Flexible, but inefficient if the number is fixed.

  • !35

    Overhead after basic AOT

    Example Java code: do { A >>>= 1; } while (A > B);

    cycles
    total:          48
    push/pop:       17
    load/store:     16
    instr. set:      9
    total overhead: 42
    native cost:     6

  • !36

    Overview of Proposed Optimizations

    ‣ Combined, they reduce ~80% of the remaining overhead.

    1. Push / pop overhead:

    a) improved peephole optimizer

    b) stack caching

    2. Load / store overhead:

    a) popped value caching

    b) mark loops

    3. Instruction set overhead:

    constant shift optimization

    ‣ However, they do increase the code size.

    48 -> 8 cycles (only 2 more than native C)

                    Before  After
    total:          48      8
    push/pop:       17      0
    load/store:     16      2
    instr. set:      9      0
    total overhead: 42      2
    native cost:     6      6

    do { A >>>= 1; } while (A > B);

  • !37

    Experimental setup

    ‣ AOT compiler built on top of the Darjeeling VM

    • a JVM for resource-constrained devices

    • Modifies JVM bytecode to make it more suitable for sensor nodes

    ‣ Avrora cycle-accurate simulator, extended by adding traces to monitor the compilation process and runtime performance

    • trace AOT compilation by writing debug output to specific memory location monitored by Avrora

    • keep track of cycles spent in every memory location

    ‣ Simulating an ATmega128 CPU

    • 4KB RAM, 128KB flash, 32 8-bit registers

    ‣ Set of 7 benchmarks with different characteristics

    • bubble sort, heap sort, binary search, FFT, MD5, RC5, XXTEA

    ‣ Simulation output

    ‣ darjeeling.S: disassembly of native C code

    ‣ profilerdata.xml: cycles spent per flash address

    ‣ rtcdata.xml: trace of AOT compilation showing how each instruction is translated

    ‣ stdoutlog.txt: to monitor benchmark success and timer output to count total cycles spent in native and AOT compiled code

    ‣ jlib_bm.debug: modified JVM bytecode after Darjeeling's transformations

    ‣ Combined by F# scripts to form a detailed performance report, split by opcode

  • !38

    Results: Performance overhead per instruction type

    [Bar chart: overhead (% of native C run time), broken down into push/pop, mov(w), load/store, other, and total, for each optimisation stage: simple peephole, improved peephole, stack caching, popped value caching, mark loops, and constant shift. Average of 7 benchmarks.]

  • !39

    Results per benchmark

    [Bar chart: overhead (% of optimised native C run time) per optimisation stage (simple peephole, improved peephole, stack caching, popped value caching, mark loops, constant shift) for each benchmark: bubble sort, heap sort, binary search, fft, xxtea, md5, rc5.]

  • !40

    Overhead per benchmark for 1-7 pinned registers

    [Line chart: overhead (% of optimised native C run time) for 1 to 7 pinned registers, per benchmark: bubble sort, heap sort, binary search, fft, xxtea, md5, rc5.]

  • !41

    Xxtea overhead per instruction type for 1-7 pinned registers

    [Line chart: overhead (% of optimised native C run time) for 1 to 7 pinned registers, broken down into push/pop, move, load/store, other, and total.]

    ‣ xxtea has a lower average stack depth. More pinned registers reduce load/store overhead but increase the cost of spilling pushed/popped stack values.

  • !42

    Results: Code size overhead per instruction type

    [Bar chart: increase in code size as a % of native C size, broken down into push/pop, mov(w), load/store, other, and total, for each optimisation stage: simple peephole, improved peephole, stack caching, popped value caching, mark loops, and constant shift. Average of 7 benchmarks.]

  • !43

    Conclusion

    ‣ From a code size perspective, the interpreter can't be beaten, but it suffers from an often unacceptable performance penalty.

    ‣ Previous work on Ahead-of-Time compilation to native code improves performance, but at the expense of a large increase in code size, reducing the amount of code we can load in the limited memory available on a sensor node.

    ‣ Our optimizations further improve the performance of AOT-compiled bytecode, and significantly reduce the code size overhead, leading to code that is only

    • 68% slower than native C, and

    • 88% larger than native C.

                   Performance        Code size
    C              1x                 1x
    JVM            10x - 100x slower  0.5x
    AOT            4.27x              3.15x
    Optimised AOT  1.68x              1.88x

  • !44

    Questions?

    https://github.com/wukong-m2m/wukong-darjeeling/tree/aot-compiler

  • !45

    Causes of JVM slowdown

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

    For each JVM instruction:

    1. Fetch

    • Read the instruction from memory

    • from Flash, not RAM!

    2. Decode

    • Find the right label in a switch statement with about 200 cases

    • probably uses a jump table

    3. Execute

    • Load operands from the stack into registers

    • Do the operation

    • Store the result back on the operand stack

    4. Loop

    • Update the pc

    • Should we terminate?
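These four steps form the classic bytecode dispatch loop. A minimal sketch in C (with a made-up three-opcode instruction set, not real JVM encodings) shows how every single bytecode pays for the fetch, the switch-based decode, and the operand-stack traffic:

```c
#include <stdint.h>

enum { OP_CONST, OP_ADD, OP_HALT };   /* illustrative opcodes, not real JVM ones */

/* Minimal interpreter loop: fetch, decode, execute, loop. */
int16_t run(const uint8_t *code) {
    int16_t stack[16];
    int sp = 0, pc = 0;
    for (;;) {
        uint8_t op = code[pc++];              /* 1. fetch */
        switch (op) {                         /* 2. decode */
        case OP_CONST:                        /* 3. execute */
            stack[sp++] = (int8_t)code[pc++]; /*    push an 8-bit constant */
            break;
        case OP_ADD: {
            int16_t b = stack[--sp];          /*    load operands from stack */
            int16_t a = stack[--sp];
            stack[sp++] = a + b;              /*    store result back */
            break;
        }
        default:                              /* OP_HALT */
            return stack[--sp];
        }
    }                                         /* 4. loop: pc already updated */
}
```

Running the program { OP_CONST, 2, OP_CONST, 3, OP_ADD, OP_HALT } executes several memory reads and stack operations just to compute 2 + 3, which native code would do in a single register add.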

  • !46

    Causes of JVM slowdown

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

    Native AVR for a += b; (assuming a and b aren't in registers already, and a needs to be stored back to memory):

    LD r1, &a
    LD r2, &b
    ADD r1, r2
    ST &a, r1

  • !47

    Causes of JVM slowdown

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

    JVM for a += b;:

    ILOAD_0   (push a)
    ILOAD_1   (push b)
    IADD      (pop, pop, push)
    ISTORE_0  (pop, store a)

    Pushes and pops aren't to the real stack, but to the JVM operand stack: push a -> operand_stack[sp++] = a

  • !48

    Causes of JVM slowdown

    • JVM is very simple, but some things can’t be expressed efficiently. For example, lack of pointers slows down array processing:

    • AVR: LD Rd, X+

    • Native code would use the AVR's load-with-auto-increment instruction: it loads the byte at the location pointed to by register X into Rd, and increments X by 1. 2 cycles.


    • JVM: ALOAD, ILOAD, IALOAD, IINC

    • Stack overhead: ALOAD and ILOAD

    • IALOAD needs to calculate the address for every access

    ‣ Interpreter loop overhead

    ‣ Stack overhead

    ‣ Instruction set mismatch

  • !49

    Overview of Proposed optimizations

    ‣ Combined, they reduce ~80% of remaining overhead

    1. Push / pop overhead:

    a) improved peephole optimizer b) stack caching

    2. Load / store overhead:

    a) popped value caching b) mark loops

    3. Instruction set overhead:

    constant shift optimization

    Optimisation           Overhead reduction (as % of native C)
    Baseline               327%
    improved peephole      -86% (243%)
    stack caching          -61% (182%)
    popped value caching   -45% (137%)
    mark loops             -40% (97%)
    constant shift         -29% (68%)

  • !50

    1a: improved peephole optimizer

    Improved peephole optimisation:

    Before                Cost               After            Cost
    PUSH Rx; POP Rx       4 bytes, 4 cycles  (eliminated)     0 bytes, 0 cycles
    PUSH Rsrc; POP Rdest  4 bytes, 4 cycles  MOV Rdest, Rsrc  2 bytes, 1 cycle

    ‣ The earlier optimizer only replaces (blocks of) consecutive push/pop pairs.

    ‣ More general: a push/pop pair can be replaced iff the target register of the pop is not used between the push and the pop.

  • !51

    1a: improved peephole optimizer

    48 -> 41 cycles

  • !52

    1b: stack caching

    ‣ M. A. Ertl, "Stack caching for interpreters", PLDI '95. Originally aimed at Forth interpreters.

    • Usually the last pushed value will be popped again soon.

    • Keep the top of the stack in registers, so we can avoid memory accesses.

    • Only push to the real stack if we run out of registers.

    • Keep a 'cache state' which tells how much of the stack is in registers.

    • For interpreters, the cache state can't be made too complicated or the overhead will exceed the gains; for AOT it is only used at load time to generate better code.

    JVM: a += b;

    In-memory stack:

    Stack slot 4: @SP+6
    Stack slot 3: @SP+4
    Stack slot 2: @SP+2
    Stack slot 1: @SP

    ADD: pop rA; pop rB; add rA, rB; push rA

    Stack caching (cache state = 3):

    Stack slot 4: @SP
    Stack slot 3: cpu R1
    Stack slot 2: cpu R2
    Stack slot 1: cpu R3

    ADD: add rA, rB; cache state := 2

  • !53

    1b: stack caching

    JVM: a += b; (assuming just 1 register available for stack caching)

    Before (in-memory stack):

    ILOAD_0:  load a, push           (a : @SP)
    ILOAD_1:  load b, push           (b : @SP, a : @SP+2)
    IADD:     pop, pop, push         (a+b : @SP)
    ISTORE_0: pop, store a

    After (stack caching):

    ILOAD_0:  load a                 (a : cpu R1)
    ILOAD_1:  push a to @SP, load b  (a : @SP, b : cpu R1)
    IADD:     pop a, add             (a+b : cpu R1)
    ISTORE_0: store a

  • !54

    1b: stack caching

    ‣ AOT compiler implementation: • add a ‘cache manager’ component

    - uint8_t sc_getfreereg();

    - uint8_t sc_pop();

    - void sc_push(uint8_t reg);

    • cache manager will use registers when possible, or emit a real push or pop if necessary

    • code generation is now split between the old instruction generators, and the cache manager

    • note that the cache manager is only needed at load time, so we can spend more time manipulating it
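A toy version of such a cache manager, built around the three calls listed above, might look as follows. The register pool size, the spill policy, and the emitted-instruction counters are assumptions for illustration; a real implementation would also release operand registers after each instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_SC_REGS 4              /* registers reserved for stack caching (assumption) */

enum rstate { FREE, INUSE, CACHED };
static enum rstate state[NUM_SC_REGS];
static uint8_t slot[NUM_SC_REGS];  /* slot[0] = deepest cached stack slot */
static int depth;                  /* stack slots currently held in registers */
static int pushes_emitted, pops_emitted;

static void emit(const char *op, uint8_t r) {
    if (strcmp(op, "PUSH") == 0) pushes_emitted++; else pops_emitted++;
    printf("%s R%u\n", op, r);
}

/* Hand out a register for a new value; spill the deepest cached
 * stack slot with a real PUSH when no register is free. */
uint8_t sc_getfreereg(void) {
    for (int r = NUM_SC_REGS - 1; r >= 0; r--)
        if (state[r] == FREE) { state[r] = INUSE; return (uint8_t)r; }
    assert(depth > 0);             /* sketch: operands must be released in a real VM */
    uint8_t victim = slot[0];
    emit("PUSH", victim);
    memmove(slot, slot + 1, (size_t)--depth);
    state[victim] = INUSE;
    return victim;
}

/* The value in `reg` becomes the new top of the operand stack. */
void sc_push(uint8_t reg) {
    slot[depth++] = reg;
    state[reg] = CACHED;
}

/* Return a register holding the top of stack; only emit a real POP
 * when the top of the stack lives in memory. */
uint8_t sc_pop(void) {
    if (depth > 0) {
        uint8_t r = slot[--depth];
        state[r] = INUSE;
        return r;
    }
    uint8_t r = sc_getfreereg();
    emit("POP", r);
    return r;
}
```

Pushing four values touches no memory at all; only the fifth push forces a real PUSH to spill the deepest cached slot.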

  • !55

    1b: stack caching

    41 -> 31 cycles

    *: register is used by the current instruction
    IntX: register is on the operand stack, at position X (higher = deeper)

  • !56

    2a: popped value caching

    ‣ Stack caching removes most push / pop overhead, but not load / store overhead. Since each operand is popped off the stack, JVM code often repeatedly loads the same variables.

    ‣ Some loads may be unnecessary if the value is already in a register.

    ‣ We add a 'value tag' to the cache state of each register to remember which value it holds after it has been popped from the stack.

    • The 'value tag' indicates what value is currently held there, for example "local short 0" or "constant int 42".

    • We can eliminate loads if the required value is already in a register.

    • We need to clear the cache at every branch target (non-taken branches are ok):

    - the instruction could be reached from two places, so we can't assume anything about the contents of the registers

    JVM: a += b;

    ILOAD_0:  load a    (a : R1)   <- unnecessary if a is already in a register
    ILOAD_1:  load b    (b : R2)
    IADD:               (a+b : R1)
    ISTORE_0: store a

    After this fragment, a and b are no longer on the stack, but both are still in registers.
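The value-tag bookkeeping can be sketched as a small C simulation. The tag encoding and the round-robin register allocation are invented for illustration; the point is only that a load is emitted just once per cached value, and that clearing the tags at a branch target forces a reload.

```c
#include <assert.h>

#define NREGS 4
#define TAG_NONE (-1)

static int tag[NREGS] = { TAG_NONE, TAG_NONE, TAG_NONE, TAG_NONE };
static int next_reg;
static int loads_emitted;

/* Return a register holding local variable `local`; emit a load only
 * if no register already caches that value. */
int load_local(int local) {
    for (int r = 0; r < NREGS; r++)
        if (tag[r] == local) return r;   /* hit: the load is eliminated */
    int r = next_reg;                    /* naive round-robin allocation */
    next_reg = (next_reg + 1) % NREGS;
    loads_emitted++;                     /* stands in for an emitted LD */
    tag[r] = local;
    return r;
}

/* A branch target can be reached from several places, so nothing can be
 * assumed about register contents: clear all value tags. */
void clear_tags(void) {
    for (int r = 0; r < NREGS; r++) tag[r] = TAG_NONE;
}
```

Loading local 0 twice in a row emits only one load; after clear_tags() the same request emits a fresh load.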

  • !57

    2a: popped value caching

    31 -> 27 cycles

  • !58

    2b: mark loops

    ‣ Popped value caching helps, but needs to reset its cache on every branch target.

    • Inner loops are often the most performance critical.

    • Each iteration will often use the same variables, but will have to reload them every time.

    ‣ We add an extra instruction to the VM: MARKLOOP, placed at the beginning and end of every inner loop.

    • MARKLOOP contains a list of variables used in the loop.

    • The VM can decide to pin a number of variables to registers, for the duration of the loop.

    • A loop prologue and epilogue are added to load and store the variables, if they are live at the loop start or end, respectively.

    • Loads and stores now access those registers, instead of main memory, but they are no longer available for normal stack caching.

    1: BRTARGET(1)
    2: ...
    3: ...
    4: ...
    5: ...
    6: ...
    7: GOTO 1

    Value tags get cleared at each branch target, so the cache needs to warm up for each iteration.

  • !59

    2b: mark loops

    setup: 12 cycles
    inner loop: 27 -> 17 cycles

  • !60

    3: constant shifts

    ‣ Most transformations from JVM to more optimal native code are either too complex, or too specific and only help in rare cases.

    ‣ Constant shifts however, are very common: 6 out of our 7 benchmarks.

    • They appear both as real constant shifts, and as multiplication/division by powers of two.

    • They are also easy to recognize and transform:

    - We replace all constant loads directly followed by a bit shift instruction with a special case.

    Java: a >>> 1

    JVM: SLOAD_0 SCONST_1 SUSHR

    The constant load is unnecessary, and the shift is implemented as a loop, which is inefficient if the number of bits to shift by is already known.
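The load-time rewrite can be sketched as a tiny pattern matcher over the bytecode stream. The opcode values and the single-bit N_LSR output are illustrative, not the real Darjeeling encodings, and the generic constant-push path is elided in this sketch.

```c
#include <assert.h>

/* Illustrative opcode values, not the actual bytecode encoding. */
enum { SLOAD_0, SCONST, SUSHR, N_LOAD, N_LSR, N_SHIFT_LOOP };

/* A constant load immediately followed by a shift is replaced by that
 * many single-bit native shifts, instead of pushing the constant and
 * running the generic shift-by-register loop. */
int translate(const int *in, int n, int *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        switch (in[i]) {
        case SLOAD_0:
            out[m++] = N_LOAD;
            break;
        case SCONST:                          /* operand follows the opcode */
            if (i + 2 < n && in[i + 2] == SUSHR) {
                for (int b = 0; b < in[i + 1]; b++)
                    out[m++] = N_LSR;         /* special-cased constant shift */
                i += 2;                       /* skip operand and SUSHR */
            } else {
                i += 1;                       /* generic constant push elided */
            }
            break;
        case SUSHR:
            out[m++] = N_SHIFT_LOOP;          /* generic shift loop */
            break;
        }
    }
    return m;
}
```

For the fragment SLOAD_0 SCONST_1 SUSHR, the translation emits just a load and one single-bit shift.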

  • !61

    3: constant shifts

    ‣ Transformations from JVM to more optimal native code are often either too hard or too specific.

    ‣ Constant shifts however, are very common: 6 out of our 7 benchmarks, and easy to recognise and transform

    ‣ We replace all constant loads directly followed by a bit shift with a special case, skipping normal code generator.

    ‣ avr-gcc doesn’t optimize all cases

    17 -> 8 cycles in the inner loop (only 2 more than native C)

  • !62

    Mark loops trade off

    ‣ This optimization has a tradeoff: registers used to pin variables can’t be used for stack caching.

    • More pinned registers: cheap access to more pinned variables

    • Fewer free registers: the stack cache may have to spill to memory more often

  • !63

    Summary

    ‣ A large number of IoT platforms are available for different needs.

    ‣ There is no single solution for all IoT applications.

    ‣ Choosing the right platform is critical.

    ‣ Knowing the platforms is the first step in designing IoT systems.

    ‣ How to program IoT applications across these different types of platforms remains an open question.