Bulldozer: An Approach to Multithreaded Compute Performance

by Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas. This paper appears in: Micro, IEEE, March/April 2011 (vol. 31 no. 2) pp. 6-15. Microprocessor Architecture, speaker: 박세준


Page 1: Bulldozer: An Approach to multithreaded Compute Performance

by Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas

This paper appears in: Micro, IEEE, March/April 2011 (vol. 31 no. 2) pp. 6-15

Bulldozer: An Approach to Multithreaded Compute Performance

Microprocessor Architecture, speaker: 박세준

Page 2: Bulldozer: An Approach to multithreaded Compute Performance

Contents

1. Motivation
2. Introduction
3. Block diagram
4. Key features
5. Function block highlights
6. Bulldozer-based SoC

Page 3: Bulldozer: An Approach to multithreaded Compute Performance

Motivation

AMD has been focusing on core count and highly parallel server workloads.

Two basic observations
1. Future SoCs will support multiple execution threads.
• Goal: the smallest possible building module
2. Cores will operate in a constrained power environment.
• Power reduction techniques: filtering, speculation reduction, data movement minimization

Performance per watt!!

Page 4: Bulldozer: An Approach to multithreaded Compute Performance

Introduction

Bulldozer is a new direction in microarchitecture

• Bulldozer is the first x86 design to share substantial hardware between multiple cores
• Bulldozer is a hierarchical design with sharing at nearly every level
• Bulldozer is a high-frequency optimized CPU
• The design aims to raise average performance rather than peak performance

Page 5: Bulldozer: An Approach to multithreaded Compute Performance

Introduction

Major contribution

• Scaling the core structures
• Aggressive frequency goal
• Low gates per clock

Page 6: Bulldozer: An Approach to multithreaded Compute Performance

Block diagram

A module combines two independent cores
• Implementation of a shared level 2 cache
• Improved area and power efficiency

The module can fetch and decode up to four x86 instructions per clock.

Each core can service two loads per cycle.

Shared front end
• Decoupled predict and fetch pipelines

Page 7: Bulldozer: An Approach to multithreaded Compute Performance

Block diagram

• ALU performance 33% decrease, FPU performance 33% increase
• ALU performance 33% increase, FPU performance 33% increase

Page 8: Bulldozer: An Approach to multithreaded Compute Performance

Key features

1. Multithreading microarchitecture
• Appropriate use of replication and shared hardware
• The main advantage comes from sharing the instruction cache and branch prediction
• Strengthened front end (larger ROB and BTB)

2. Branch prediction decoupled from the instruction fetch pipeline
• Enables instruction prefetch using the prediction queue
• Instruction control unit (reorder buffer) increased to 128 entries

3. Register renaming and operand delivery
• Scheduling and operand handling are the biggest power consumers in the integer execution unit
• PRF-based renaming microarchitecture for power efficiency
• Eliminates data replication

4. FMAC and media extensions
• FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth
• One is provided per module, like a coprocessor
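The decoupling in key feature 2 can be pictured as a producer and a consumer joined by a queue: the predictor keeps pushing predicted fetch addresses while the fetch stage drains them, so predictions pile up during a fetch stall and can serve as prefetch hints. The Python sketch below is only an illustration of that idea. The queue depth, the sequential `predict_next` function, and the stall pattern are assumptions made for the example and are not taken from the paper.

```python
from collections import deque

QUEUE_DEPTH = 4  # assumed depth, for illustration only


def predict_next(addr):
    """Hypothetical predictor: assumes sequential 32-byte fetch blocks."""
    return addr + 32


def simulate(cycles, fetch_stalled):
    """Run predict and fetch as decoupled stages joined by a queue.

    fetch_stalled is a set of cycle numbers in which fetch cannot consume
    (for example, while waiting on an instruction-cache miss).
    Returns (fetched addresses, addresses still queued as prefetch hints).
    """
    queue = deque()
    fetched = []
    next_addr = 0
    for cycle in range(cycles):
        # Fetch stage: consume the oldest prediction unless stalled.
        if cycle not in fetch_stalled and queue:
            fetched.append(queue.popleft())
        # Predict stage: keep running ahead while the queue has room.
        if len(queue) < QUEUE_DEPTH:
            queue.append(next_addr)
            next_addr = predict_next(next_addr)
    return fetched, list(queue)


fetched, pending = simulate(cycles=8, fetch_stalled={2, 3, 4})
```

In this run the predictor fills the queue during the three stalled cycles, and the addresses left in `pending` are the ones a prefetcher could request ahead of the fetch stage.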

Page 9: Bulldozer: An Approach to multithreaded Compute Performance

Function block highlights

Branch prediction: multilevel BTB

Instruction cache: 64 Kbyte, two-way set-associative, shared between both threads
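To make the instruction cache geometry concrete, the sketch below splits an address into tag, set index, and line offset for a 64 Kbyte, two-way set-associative cache. The capacity and associativity come from the slide. The 64-byte line size is an assumption, since the slide does not state it, and `split_address` is a hypothetical helper name.

```python
CACHE_BYTES = 64 * 1024  # from the slide
WAYS = 2                 # from the slide
LINE_BYTES = 64          # assumption: the slide does not give the line size

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2 of the line size
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2 of the number of sets


def split_address(addr):
    """Return (tag, set index, byte offset within the line) for an address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345)
```

Under these assumptions the cache has 512 sets, and addresses 32 Kbyte apart fall into the same set, where at most two lines can be resident at once.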

Page 10: Bulldozer: An Approach to multithreaded Compute Performance

Function block highlights

Decode: branch fusion (Intel: macro fusion), four x86 instructions per cycle

Figure: Bulldozer execution pipeline
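Branch fusion can be sketched as a decode pass that merges a compare with the conditional jump that immediately follows it, so the pair occupies one operation slot instead of two. The Python example below only illustrates that idea. The sets of fusible mnemonics and the string representation of instructions are assumptions, and the slide does not state Bulldozer's actual fusion rules.

```python
COMPARES = {"cmp", "test"}                            # assumed fusible first halves
COND_JUMPS = {"je", "jne", "jl", "jge", "jg", "jle"}  # assumed fusible second halves


def fuse_branches(instructions):
    """Merge each compare that is immediately followed by a conditional jump."""
    ops = []
    i = 0
    while i < len(instructions):
        mnemonic = instructions[i].split()[0]
        if mnemonic in COMPARES and i + 1 < len(instructions):
            next_mnemonic = instructions[i + 1].split()[0]
            if next_mnemonic in COND_JUMPS:
                # Emit the pair as one fused operation and skip both inputs.
                ops.append(instructions[i] + " ; " + instructions[i + 1])
                i += 2
                continue
        ops.append(instructions[i])
        i += 1
    return ops


program = ["mov eax, [rdi]", "cmp eax, 10", "jl loop", "add ebx, 1"]
fused = fuse_branches(program)
```

Here four instructions become three operations, which is how fusion relieves pressure on downstream structures.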

Page 11: Bulldozer: An Approach to multithreaded Compute Performance

Function block highlights

Integer scheduler and execution: renaming by PRF (physical register file)

Floating point: the FPU is a coprocessor shared between the two integer cores

L2 cache: the two cores share the unified L2 cache
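PRF-based renaming can be illustrated with a map table and a free list: each result is written once into a physical register, and the map table records which physical register currently holds each architectural register, so values do not need to be copied between structures. The Python sketch below shows only this general technique. The `Renamer` class, the register names, and the physical register count are hypothetical and are not Bulldozer's actual parameters.

```python
class Renamer:
    """Toy register renamer built on a physical register file (PRF)."""

    def __init__(self, arch_regs, num_phys):
        # Initially, architectural register i lives in physical register i.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        """Return (new physical dest, physical sources, previous dest mapping)."""
        # Sources must be looked up before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; rename must stall")
        new_phys = self.free.pop(0)
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return new_phys, phys_srcs, old_phys

    def retire(self, old_phys):
        """Recycle the previous mapping once the renaming instruction retires."""
        self.free.append(old_phys)


renamer = Renamer(["rax", "rbx", "rcx"], num_phys=6)
first = renamer.rename("rax", ["rbx", "rcx"])   # rax = rbx op rcx
second = renamer.rename("rbx", ["rax"])         # rbx = op rax
third = renamer.rename("rax", ["rax", "rbx"])   # rax = rax op rbx
renamer.retire(first[2])                        # first instruction retires
```

The second instruction reads physical register 3, the fresh location of `rax`, which shows how renaming passes only register numbers between instructions while the data stays in one place.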

Page 12: Bulldozer: An Approach to multithreaded Compute Performance

Bulldozer-based SoC

Summary
1. Single-thread peak performance is sacrificed in exchange for increased throughput
2. In single threading, the FPU is more important
3. Servers need ALU performance

Bulldozer can deliver a significant performance improvement within the same power budget.

Page 13: Bulldozer: An Approach to multithreaded Compute Performance

The end