CAPS team Compilation et Architecture pour les Processeurs Superscalaires et Spécialisés Compiler and Architecture for superscalar and embedded processors

  • CAPS team: Compilation et Architecture pour les Processeurs Superscalaires et Spécialisés

    Compiler and Architecture for superscalar and embedded processors

    CAPS project

    CAPS members

    • 2 INRIA researchers: A. Seznec, P. Michaud
    • 2 professors: F. Bodin, J. Lenfant
    • 11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec
    • 3 engineers: S. Bihan, P. Villalon, J. Simonnet

    CAPS themes

    Two interacting activities

    High performance microprocessor architecture

    Performance oriented compilation

    CAPS Grail: performance at the best cost

    Progress in computer science and applications is driven by performance

    CAPS path to the Grail

    Defining the tradeoffs between:
    • what should be done through hardware
    • what can be done by the compiler

    for maximum performance, or for minimum cost, or for minimum size, power, ...

    Need for high-performance processors

    • Current applications
        • general purpose: scientific, multimedia, databases
        • embedded systems: cell phones, automotive, set-top boxes, ...
    • Future applications
        • don't worry: users have a lot of imagination!
    • New software engineering techniques are CPU hungry:
        • reusability, generality
        • portability, extensibility (indirections, virtual machines)
        • safety (run-time verifications)
        • encryption/decryption

    CAPS (ancient) background

    • Ancient background in hardware and software management of ILP
        • decoupled pipeline architectures
        • OPAC, a hardware matrix floating-point coprocessor
        • software pipelining for LIW
    • Supercomputing background
        • interleaved memories
        • Fortran-S

    CAPS background in architecture

    • Solid knowledge of microprocessor architecture
        • technological watch on microprocessors
        • A. Seznec worked with the Alpha Development Group in 1999-2000
    • Research in cache architecture
    • Research in branch prediction mechanisms

    CAPS background in compilers

    • Software optimizations for cache memories
        • numerical algorithms on dense structures
        • optimizing data layout
    • Many prototype environments for parallel compilers:
        • CT++ (with CEA): image processing C++ library for a SIMD architecture
        • Menhir: a parallel compiler for MatLab
        • IPF (with Thomson-LER): Fortran compiler for image processing on MasPar
        • Sage (with Indiana): infrastructure for source-level transformation

    We build on

    • SALTO: System for Assembly-Language Transformations and Optimizations
        • retargetable assembly source-to-source preprocessor
        • Erven Rohou's Ph.D.
    • TSF: scripting language for program transformation on top of ForeSys (Simulog)
        • Yann Mevel's Ph.D.

    Salto overview

    • Assembly source-to-source preprocessor
    • Fine-grain machine description
    • Independent from compilers

    Compiler activities

    • Code optimizations for embedded applications
        • infrastructures rather than compilers
        • optimizing compiler strategies rather than new code optimizations
    • Global constraints
        • performance / code size / low power (starting)
    • Focus on interactive tools rather than automatic code tuning
        • case-based reasoning
        • assembly code optimizations

    Computer aided hand tuning

    • Automatic optimization has many shortcomings
        • rather provide the user with a testbed to hand-tune applications
    • Target applications
        • Fortran codes and embedded C applications
    • Our approach
        • case-based reasoning
        • static code analysis and pattern matching
        • profiling
        • learning techniques
        • the user is ultimately responsible

    CAHT

    Prototype built on:
    • Foresys: Fortran interactive front-end (from Simulog)
    • TSF: scripting language for program transformation
    • Sage++: infrastructure for source-level transformation

    Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)

    • ATLLAS objectives:
        • Has the compiler done a good job?
        • Try to match source and optimized assembly at fine grain
    • Development/analysis environment:
        • Models for both source and assembly
        • Global and local analysis (WCET, ...) at both levels
        • Interactive environment for code visualization and manual/automatic analysis and optimization
    • Built using Salto and Sage++:
        • Retargetable with compilers and architectures

    ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: Tuning method

    [Figure: tuning-method flowchart. Labels: Source Code, Half-Automatic or Manual Source Optimisations, compilation, Assembly Code, Half-Automatic or Manual Assembly Optimisations, profiling, Good? / Yes, Atllas, Processing, Post-Processing, Support.]

    Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)

    • ALISE: enhanced SALTO for code optimization:
        • better integration with code generation
        • interface with front-end
        • interface for profiling data
        • targets global optimization
        • based on component software optimization engines
    • Answer to a real need from industry: a retargetable infrastructure

    ALISE

    • Environment for:
        • global assembly code optimization
        • providing optimization alternatives
    • Support for new embedded processor ISAs with ILP support (VLIW, EPIC)
        • predicated instructions
        • functional unit clusters, ...

    ALISE

    [Figure: ALISE architecture diagram. Labels: Architecture Description, D to M, Architecture Model, Text Input, P to IR, Intermediate representation, Opt 1, Opt 2, Opt n, IR to Ass (Emit), Optimized Program, High Level API, Interfaces, External Infrastructure, User interface, G.U.I., Intermediate Code.]

    Preprocessor for media processors (MEDEA+ Mesa project)

    • Multimedia instructions on embedded and general-purpose processors, but:
        • no consensus on MMD instructions among vendors: saturated arithmetic or not, different instructions, ...
    • Multimedia instructions are not well handled by compilers:
        • but performance is very dependent on them

    Preprocessor for media processors: our approach

    • C source-to-source preprocessor
    • User-oriented idiom recognition:
        • easy to retarget
        • target-dedicated recognition
    • Exploiting loop parallelism
        • vectorization techniques
        • multiprocessor systems
    • Available soon
    • Collaboration with STMicroelectronics

    Iterative compilation

    • Embedded systems:
        • Compile time is not critical
        • Performance/code size/power are critical
        • One can often rely on profiling
    • Classical compiler: local optimizations, but constraints are GLOBAL
    • Proof of concept for code size (Rohou's Ph.D.)
        • new Ph.D. beginning in September 2000
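The idea on this slide, spending compile time on a search over optimization choices that are judged against a global constraint, can be sketched as follows. This is only an illustration: the knobs (`UNROLL_FACTORS`, `INLINE_CHOICES`), the toy cost model in `evaluate`, and the size budgets are all hypothetical, and a real system would compile and profile each variant instead of using a formula.

```python
from itertools import product

# Hypothetical optimization knobs; a real system would drive an actual compiler.
UNROLL_FACTORS = (1, 2, 4, 8)
INLINE_CHOICES = (False, True)


def evaluate(unroll, inline):
    """Toy stand-in for 'compile, then profile': returns (cycles, code_size)."""
    cycles = 1000 // unroll + (0 if inline else 120)
    code_size = 200 * unroll + (150 if inline else 0)
    return cycles, code_size


def iterative_compile(size_budget):
    """Try every variant; keep the fastest one that fits the global size budget.

    Returns (cycles, code_size, unroll, inline), or None if nothing fits.
    """
    best = None
    for unroll, inline in product(UNROLL_FACTORS, INLINE_CHOICES):
        cycles, size = evaluate(unroll, inline)
        if size > size_budget:
            continue  # violates the global code-size constraint
        if best is None or cycles < best[0]:
            best = (cycles, size, unroll, inline)
    return best


print(iterative_compile(900))
print(iterative_compile(1000))
```

With this toy model, loosening the budget from 900 to 1000 lets the search pick a variant that also inlines. A purely local heuristic would not see this trade, because the constraint applies to the whole program.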

    High performance instruction set simulation

    • Embedded processors:
        • parallel development of silicon, ISA, compiler and applications
    • Need for flexible instruction set simulation:
        • high performance
        • simulation of large codes
        • debugging
        • retargetable, to experiment with new ISAs and various microarchitecture options
    • First results: up to 50x faster than an ad-hoc simulator

    ABSCISS: Assembly Based System for Compiled Instruction Set Simulation

    [Figure: ABSCISS tool flow. Labels: C Source, tmcc, TriMedia Assembly, tmas, TriMedia Binary, tmsim, Architecture Description, ABSCISS, C/C++ Source, gcc, Compiled simulator.]
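The general idea of compiled instruction set simulation is to translate the target program once into host code, instead of decoding every instruction each time it executes. A minimal sketch follows. It is not ABSCISS itself: the five-instruction toy ISA, the `PROGRAM` example and both function names are invented for illustration, and Python source stands in for the C/C++ that the figure labels suggest the real tool generates.

```python
# Toy program: sum 5 + 4 + 3 + 2 + 1 into r0, using r1 as the loop counter.
PROGRAM = [
    ("li", 0, 0),        # r0 = 0
    ("li", 1, 5),        # r1 = 5
    ("add", 0, 0, 1),    # r0 = r0 + r1
    ("addi", 1, 1, -1),  # r1 = r1 - 1
    ("bnez", 1, 2),      # if r1 != 0: jump to instruction 2
    ("halt",),
]


def interpret(program, nregs=4):
    """Interpretive simulation: decode each instruction every time it runs."""
    regs = [0] * nregs
    pc = 0
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "addi":
            regs[args[0]] = regs[args[1]] + args[2]
        elif op == "bnez":
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "halt":
            return regs
        else:
            raise ValueError(f"unknown opcode {op!r}")


def compile_simulator(program, nregs=4):
    """Compiled simulation: translate the (non-empty) program once into host code."""
    templates = {
        "li": "r[{0}] = {1}",
        "add": "r[{0}] = r[{1}] + r[{2}]",
        "addi": "r[{0}] = r[{1}] + {2}",
        "bnez": "pc = {1} if r[{0}] != 0 else pc",
        "halt": "return r",
    }
    lines = ["def run():", f"    r = [0] * {nregs}", "    pc = 0", "    while True:"]
    for index, (op, *args) in enumerate(program):
        keyword = "if" if index == 0 else "elif"
        lines.append(f"        {keyword} pc == {index}:")
        lines.append(f"            pc = {index + 1}")
        lines.append("            " + templates[op].format(*args))
    namespace = {}
    exec("\n".join(lines), namespace)  # decoding happens here, once
    return namespace["run"]


simulate = compile_simulator(PROGRAM)
print(interpret(PROGRAM), simulate())
```

Both simulators leave the same register state. The sketch only shows where the decoding work moves, and makes no claim about the speedup a real compiled simulator achieves.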

    Enabling superscalar processor simulation

    • Complete O-O-O microprocessor simulation:
        • 10000-100000x slower than real hardware
        • cannot simulate realistic applications, only slices
        • even fast-mode emulation is slow (50-100x):
            • simulation generally limited to slices at the beginning of the application
            • representativeness?
    • Calvin2 + DICE:
        • combines direct execution with simulation
        • really fast mode: 1-2x slowdown
        • enables simulating slices distributed over the whole application
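A short worked calculation shows why the speed of the fast mode decides whether slices can be spread over the whole application. The model below is a simplification that assumes every instruction takes the same native time. The function name and the 0.1% sampling ratio are hypothetical, while the slowdown factors are taken from the ranges quoted on the slide.

```python
def overall_slowdown(detailed_fraction, detailed_slowdown, fast_slowdown):
    """Run time relative to native execution when a fraction of the
    instructions is simulated in detail and the rest runs in a fast mode.

    Assumes uniform native time per instruction (a simplification).
    """
    if not 0.0 <= detailed_fraction <= 1.0:
        raise ValueError("detailed_fraction must be between 0 and 1")
    return (detailed_fraction * detailed_slowdown
            + (1.0 - detailed_fraction) * fast_slowdown)


# Hypothetical run: 0.1% of the instructions simulated in detail at 10000x.
with_emulation = overall_slowdown(0.001, 10_000, 100)       # fast mode: emulation at 100x
with_direct_execution = overall_slowdown(0.001, 10_000, 2)  # fast mode: direct execution at 2x
print(with_emulation, with_direct_execution)
```

Under these assumed numbers the whole run costs about 110x native time with emulation as the fast mode, against about 12x with direct execution. With emulation, skipping ahead costs roughly ten times as much as the detailed simulation itself.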

    Calvin2 + DICE

    Moving tools to IA64

    • New 64-bit ISA from Intel/HP:
        • Explicitly Parallel Instruction Computing
        • Predicated execution
        • Advanced loads (i.e. speculative)
    • A very interesting platform for research!
    • Porting SALTO and the Calvin2+DICE approach to IA64
    • Exploring new trade-offs enabled by instruction sets:
        • predicting the predicates?
        • advanced loads against predicting dependencies
        • ultimate out-of-order execution against compiler

    Low power, compilation, architecture, ... (just beginning :=)

    • Power consumption becomes a major issue:
        • embedded and general purpose
    • Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
        • Is it different from performance optimization?
        • Global constraint optimization
        • Instruction Set Architecture support?
    • Architecture:
        • High-order bits are generally null:
            • registers and memory
            • ALUs

    Caches and branch predictors

    • International CAPS visibility in architecture = skewed associative cache + decoupled sectored cache + multiple block ahead branch prediction + skewed branch predictor
    • Continue recurrent work on these topics: multiple block ahead + complexity/accuracy tradeoffs

    Simultaneous Multithreading

    • Sharing functional units among several processes
    • Among the first groups working on this topic
        • S. Hily's Ph.D.
        • SMT behavior well understood for independent threads
        • now, focus on parallel threads from a single application
    • Current research directions:
        • speculative multithreading
        • ultimate performance with a single thread through predicting threads
        • performance/complexity tradeoffs: SMT/CMP/hybrid

    Enlarging the instruction window (supported by Intel)

    • In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.
    • Limitations are:
        • size of the window
        • number of physical registers
    • Prescheduling: separate data flow scheduling from resource arbitration.
        • coarser units of work?
    • Reducing the number of physical registers:
        • how to detect when a physical register is dead?
        • per-group validation?
    • Revisiting the CISC/RISC war?

    Unwritten rule on superscalar processor designs

    For general purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit

    4-cluster WSRS architecture (supported by Intel)

    • Half the read ports, one fourth the write ports
    • Register file:
        • Silicon area x 1/8
        • Power x 1/2
        • Access time x 0.6
    • Gains on:
        • bypass network
        • selection logic

    Multiprocessor on a chip

    Not just replicating board level solutions !

    A way to manage a large on-chip cache capacity:
    • how can a sequential application efficiently use a distributed cache?
    • architectural support for distributing a sequential application on several processors?
    • how should instructions and data be distributed?

    HIPSOR: HIgh Performance SOftware Random number generation

    • Need for unpredictable random number generation:
        • sequences that cannot be reproduced
    • State of the art:
        • < 100 bit/s using the operating system
        • 75 Kbit/s using the hardware generator on Pentium III
    • Internal state of a superscalar processor cannot be reproduced
        • use this state to generate unpredictable random numbers

    HIPSOR (2)

    • 1000s of unmonitorable states modified by OS interrupts
    • Hardware clock counter to indirectly probe these states
    • Combined with in-line pseudo-random number generation
    • 100 Mbit/s unpredictable random numbers

    ARC INRIA with CODES
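The combination described for HIPSOR, a clock counter used as an indirect probe and mixed into an in-line pseudo-random generator, can be sketched as follows. This is not the HIPSOR algorithm. It assumes `time.perf_counter_ns()` as a stand-in for the hardware cycle counter and uses a 64-bit xorshift step as the in-line generator. The function names are hypothetical, and the sketch has not been evaluated for unpredictability or throughput.

```python
import time

MASK64 = (1 << 64) - 1


def xorshift64(state):
    """One step of a 64-bit xorshift generator (shifts 13, 7, 17).

    The state must be non-zero; a non-zero state never maps to zero.
    """
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state & MASK64


def unpredictable_bytes(count):
    """Return `count` bytes, folding a fresh timer reading into every step.

    The timer stands in for a hardware clock counter: its low-order bits
    depend on execution timing that the program does not control.
    """
    state = (time.perf_counter_ns() & MASK64) or 1
    out = bytearray()
    while len(out) < count:
        state ^= time.perf_counter_ns() & MASK64  # probe the timing state
        state = xorshift64(state or 1)            # in-line pseudo-random step
        out.append(state & 0xFF)
    return bytes(out)


print(unpredictable_bytes(8).hex())
```

The design point illustrated is that most of the bandwidth comes from the cheap in-line generator, while the timer readings keep perturbing its state so that the sequence cannot be replayed from a known seed.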