Multi-Core

Chikara Fukunaga

Multi-Core
• Background of Multi-Core
• CMOS transistor
• Typical structures of Multi-Core
• Technical issues
• Cache coherence control
• Parallelization is difficult


Background of Multi-Core

• Problems arising from the fine-structure process of transistors
  • Leak current, which comes from the CMOS structure, has grown to a level that can no longer be ignored → upper limit on the drive frequency
  • Core 2 is built in CMOS with a gate length of 45 nm (→ 22 nm)
• Problems arising from power consumption
  • TDP (Thermal Design Power, the maximum heat dissipation) is at its limit with the present drive frequency
  • Such a processor can no longer be adopted for mobile devices
  • Heat generation >> heat dissipation capacity
• Problems arising from single-core design
  • Limit of h/w design complexity beyond super-scalar/pipeline
  • IPC is unlikely to exceed four (is IPC > 4 out of reach?)

CMOS structure and principle

• CMOS = Complementary Metal Oxide Semiconductor
• pMOS and nMOS transistors together make up logic circuits

[Figure: cross-section of a CMOS structure. Labels: pMOS, nMOS, gate (poly-silicon), oxide, well, source, drain, substrate, metal wiring, insulator, guard]
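As a rough illustration of how pMOS and nMOS transistors combine into logic circuits, the following switch-level sketch models each transistor as an on/off switch. It assumes the usual idealization (a pMOS conducts when its gate is 0, an nMOS when its gate is 1); the function names are hypothetical and it is not a circuit simulation.

```python
# Switch-level sketch of CMOS gates (idealized model, not a circuit simulator).
# Assumption: a pMOS conducts when its gate is 0, an nMOS when its gate is 1.

def pmos_on(gate):
    return gate == 0

def nmos_on(gate):
    return gate == 1

def cmos_inverter(a):
    pull_up = pmos_on(a)      # pMOS connects the output to VDD
    pull_down = nmos_on(a)    # nMOS connects the output to GND
    assert pull_up != pull_down   # complementary: exactly one side conducts
    return 1 if pull_up else 0

def cmos_nand(a, b):
    pull_up = pmos_on(a) or pmos_on(b)      # two pMOS in parallel
    pull_down = nmos_on(a) and nmos_on(b)   # two nMOS in series
    assert pull_up != pull_down
    return 1 if pull_up else 0

truth_table = [(a, b, cmos_nand(a, b)) for a in (0, 1) for b in (0, 1)]
```

In this model exactly one of the pull-up and pull-down networks conducts for any input, which is the complementary property the name CMOS refers to.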


Towards Multi-Core

• As the design rule shrinks, many cores can be put on one chip.
• With multi-core, performance improvement can be sustained as before.
• The computing performance required of each core is kept as low as 1/(number of cores).
• Lower supply voltage and lower power consumption.
• The process can be given more margin; for example, a thicker gate oxide reduces leak current.

Source: Nikkei Electronics, 30 August 2004

[Figure: processor performance versus design rule (gate width), from Pentium 4 at 180 nm (2000) to Pentium D at 90 nm (2005). Single core: speed-up with the design rule, then the design rule brings no further speed-up. Multi-core: parallel with frequency 30% lower; parallel with frequency 50% lower]
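The trade-off between frequency and core count can be sketched with the standard first-order model of CMOS dynamic power, P ∝ C·V²·f. The sketch below additionally assumes that the supply voltage can be lowered in proportion to the frequency and that the work is perfectly parallel; both are simplifying assumptions, not figures from the slides.

```python
# Relative dynamic power of CMOS logic: P ~ C * V^2 * f (first-order model).
# Assumption for this sketch: supply voltage scales in proportion to frequency.

def relative_power(cores, freq_scale):
    voltage_scale = freq_scale               # assumed V proportional to f
    return cores * voltage_scale ** 2 * freq_scale

def relative_throughput(cores, freq_scale):
    return cores * freq_scale                # assumes perfectly parallel work

# Single core at full frequency, then two cores at 30% and 50% lower frequency.
for cores, scale in [(1, 1.0), (2, 0.7), (2, 0.5)]:
    print(cores, scale,
          round(relative_throughput(cores, scale), 3),
          round(relative_power(cores, scale), 3))
```

Under these assumptions, two cores at 70% of the frequency deliver about 1.4 times the throughput for roughly 0.69 times the power of the single core.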


Issues of Multi-Core Implementation

• Multi-core brings not only advantages; many technical issues also need attention.
• Software technology (parallelization) for multi-cores is still problematic as well.

Original: Nikkei Electronics, 30 August 2004


Multi-Core configuration

• Shared bus coupling
  • Centralized shared memory type
  • Distributed memory type
• Interconnection network
  • For example, the network of TPcore (TPcore is the flagship processor of Fukunaga's lab, developed since 2005)


Shared bus coupling (1)

• Centralized shared memory type
• Programming is relatively easy because data are shared
• The heavy load on the shared bus makes scheduling and arbitration of bus mastership among the cores difficult
• Cache coherency (agreement of data among the cores' caches and the shared memory) is difficult to maintain
• If MPU1…n are all the same kind of processor, this is called a Symmetric Multi-Processor (SMP) configuration, or UMA (Uniform Memory Architecture)
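Arbitration of the shared bus can be pictured with a simple round-robin arbiter. The slides do not say which policy is used, so round-robin is an assumption chosen for illustration, and the class and method names are hypothetical.

```python
# Round-robin bus arbiter sketch (one possible policy; the policy itself is
# an assumption, not something stated in the slides).

class RoundRobinArbiter:
    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.last_granted = n_cores - 1   # so that core 0 is considered first

    def grant(self, requests):
        """Return the core that gets the bus this cycle, or None if idle."""
        for offset in range(1, self.n_cores + 1):
            candidate = (self.last_granted + offset) % self.n_cores
            if candidate in requests:
                self.last_granted = candidate
                return candidate
        return None

arbiter = RoundRobinArbiter(4)
# Cores 0, 2 and 3 request the bus in every cycle.
grants = [arbiter.grant({0, 2, 3}) for _ in range(4)]
```

Each requesting core is served in turn, so no core is starved, but every additional core lengthens the wait of the others, which is the bus-load problem noted above.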

Shared bus coupling (2)

• Distributed memory type
• Each core's own memory space reduces access conflicts on the shared bus
• The programming burden increases somewhat; the distributed memories are handled as if they were one virtually unified memory
• Also called NUMA (Non-Uniform Memory Architecture)
• Most actual chips are realized as a mixture of the shared memory and distributed memory types
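The idea of distributed memories treated as one virtual address space can be sketched as a mapping from address to home core, with a different cost for local and remote accesses. The region size and the cost values below are hypothetical numbers chosen only to make the non-uniformity visible.

```python
# NUMA sketch: one virtual address space split into per-core local regions.
# REGION_SIZE and the two cost values are hypothetical illustration numbers.
REGION_SIZE = 0x1000
LOCAL_COST = 1     # access to the core's own memory
REMOTE_COST = 5    # access that has to cross the shared bus

def home_node(address):
    """Core whose local memory holds this address."""
    return address // REGION_SIZE

def access_cost(core, address):
    return LOCAL_COST if home_node(address) == core else REMOTE_COST
```

A program sees a single address range, but placing data in the region of the core that uses it most keeps accesses off the shared bus.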


Examples of Multi-Core Architecture (1)

• Renesas SH-4 (RISC) multi-core SH7786: SH-4A core × 2 (configurable as SMP or Anti-SMP)
• Mixed architecture of local memory and shared memory

[Figure: SH7786 block diagram. Labels: 26-bit address and 32-bit data bus, external memories, 533 MHz]

Examples of Multi-Core Architecture (2)

• CELL chip (IBM, Toshiba, Sony, Sony Computer Entertainment; SCEI)
• PowerPC Processor Element; PPE (main) (×1)
• Synergistic Processor Element; SPE (sub) (×8)
• Asymmetric Multi-Processor (ASMP) configuration
• EIB (Element Interconnect Bus) 128 bit × 4


CELL chip processor elements

• PPE (64-bit PowerPC)
  • Executes the OS and the main part of applications
  • Controls the external main memory, I/O and the SPEs
  • In-order 2-way super-scalar, 2-way multi-thread
• SPE for arithmetic calculation and multimedia
  • 128-bit SIMD-type RISC, in-order 2-way

[Figure: CELL block diagram. Labels: 32 kB, 32 kB, 512 kB, 256 kB local memory, for access of other SPE data]


Cache problem with Multi-Core

Assume m1 and m2 are the cache copies, held by their respective processors, of a block m in main storage (MS).

(1) MPU1 changes m1 to a (store). What should MPU2 do with m2?

(2) MPUn wants to read the address of the original m from shared memory into its own cache. After (1), which copy should it refer to?

(3) MPUn misses in its cache on a write access to the address of the original m and, because the cache is write-through, wants to rewrite the data directly in (shared) memory. What should it do to the original m?
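Cases (1) and (2) can be reproduced with a deliberately broken model: private write-back caches with no coherence mechanism at all. The class and variable names are hypothetical; the point is only to show where the stale values come from.

```python
# Private write-back caches with NO coherence mechanism (deliberately broken).
main_storage = {"m": 0}

class PrivateCache:
    def __init__(self):
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:            # miss: refill from main storage
            self.lines[addr] = main_storage[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value              # write back: MS is not updated yet

mpu1, mpu2, mpun = PrivateCache(), PrivateCache(), PrivateCache()
mpu1.load("m")
mpu2.load("m")                   # m1 and m2 are now both copies of m
mpu1.store("m", "a")             # case (1): MPU1 writes a to m1
stale_m2 = mpu2.load("m")        # MPU2 still reads its old copy
stale_refill = mpun.load("m")    # case (2): MPUn refills the old value from MS
```

Both `stale_m2` and `stale_refill` hold the old value, because nothing told MPU2 or main storage about the store in MPU1.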

Cache Coherency

• A processor must always obtain the newest data when it makes a read access to any memory (shared or distributed).
• This must be guaranteed by the processor h/w.

Cache write control: restoring rules for cache

1. Write back is normally adopted for multi-core caches, because it reduces the load on the shared bus.
2. At refill on a cache R/W miss:
• Write Update: an "update" request is sent to all processors that hold the block in their caches.
• Write Invalidate: an "invalidate" request is sent to all processors that hold the block in their caches.
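The difference between the two policies can be sketched by modelling each cache as a dictionary in which a missing key means the block is invalid. The function names are hypothetical and the bus itself is not modelled.

```python
# Caches are modelled as dicts {address: value}; a missing key means "invalid".

def write_update(caches, writer, addr, value):
    for i, cache in enumerate(caches):
        if i == writer or addr in cache:
            cache[addr] = value          # every sharer receives the new data

def write_invalidate(caches, writer, addr, value):
    for i, cache in enumerate(caches):
        if i == writer:
            cache[addr] = value
        else:
            cache.pop(addr, None)        # every sharer drops its copy

caches_u = [{"m": 0}, {"m": 0}, {}]
write_update(caches_u, 0, "m", "a")

caches_i = [{"m": 0}, {"m": 0}, {}]
write_invalidate(caches_i, 0, "m", "a")
```

With update, the second cache keeps a fresh copy at the cost of sending data; with invalidate, it loses its copy and has to refill on its next access.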



Communication and confirmation of cache changes with a directory

• Directory method (unified control)
• Each processor has a table that registers which processors share a copy of its memory block.
• If a processor modifies a block, it can quickly identify the processors to which the change must be reported.
• However, this method is effective mainly for distributed memory architectures. For shared bus architectures, distributed control by snooping, described next, is mainly used.
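A directory table of this kind might look like the following minimal sketch. The class and method names are hypothetical, and the sketch assumes an invalidate policy, so after a modification the writer is recorded as the only holder.

```python
# Directory sketch: for each block, the set of processors holding a copy.
# Names are hypothetical; an invalidate policy is assumed.

class Directory:
    def __init__(self):
        self.sharers = {}                        # block -> set of processor ids

    def register(self, block, proc):
        self.sharers.setdefault(block, set()).add(proc)

    def on_modify(self, block, writer):
        """Return the processors to notify; the writer becomes sole holder."""
        targets = self.sharers.get(block, set()) - {writer}
        self.sharers[block] = {writer}
        return sorted(targets)

directory = Directory()
for proc in (0, 2, 3):
    directory.register("m", proc)
notified = directory.on_modify("m", 0)
```

Only processors 2 and 3 are contacted, so no broadcast is needed; this is why the method suits systems without a shared bus.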

Check of cache state with snoop

• Broadcast snoop (snoop = prying, nosing around)
• On a cache R/W miss, a coherence request is sent to all processors over the shared bus.
• Every processor snoops its own cache to check whether it holds the block to be refilled and, if so, whether the block is clean or dirty.
• If the block is found dirty, its data should be returned by write back; the block moves to the owner state.
• If the block is found clean, it is put into the invalid or shared state.
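The snoop response of a single cache can be written as a small function from the current state of the block to its new state. The single-letter states follow the M/O/E/S/I naming used in these slides; treating a remote read as leading to the shared state and a remote write as leading to the invalid state is an assumption about how the "invalid or shared" choice is made.

```python
# Snoop response of one cache to a request seen on the shared bus.
# States: M, O, E, S, I. The read/write split below is an assumed reading
# of the "invalid or shared" rule.

def snoop_read(state):
    """Another core had a read miss. Return (new_state, supplies_data)."""
    if state in ("M", "O"):       # dirty copy: return the data, act as owner
        return "O", True
    if state in ("E", "S"):       # clean copy: now shared with the requester
        return "S", False
    return "I", False             # block not present in this cache

def snoop_write(state):
    """Another core had a write miss: a dirty copy returns its data, then
    every copy is invalidated."""
    return "I", state in ("M", "O")
```

For example, `snoop_read("M")` yields the owner state with data supplied, while `snoop_read("E")` only downgrades the block to shared.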


State transition of a store-in (write back) cache (single core)

[Figure: local cache and main storage, cache with direct-mapped architecture. Dark blue cells are dirty, light blue cells are clean]

Management of Multi-core cache with state transition diagram

• M(odified): data mismatch between MS (Main Storage) and cache (dirty); the block is not found in caches of other processors
• S(hared): data match between MS and cache (clean); the block is found in caches of other processors
• E(xclusive): data match between MS and cache (clean); the block is not found in caches of other processors
• I(nvalid): state right after reset or after an "invalidate" command (no data available)
• O(wner or Owned): data mismatch between MS and cache (dirty); the block is found in caches of other processors
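The five states are determined by three yes/no properties of a block: whether it is valid, whether it is dirty, and whether other caches also hold it. A minimal sketch of that classification, with a hypothetical function name:

```python
def moesi_state(valid, dirty, in_other_caches):
    """Map the three properties of a cache block to its state letter."""
    if not valid:
        return "I"
    if dirty:
        return "O" if in_other_caches else "M"
    return "S" if in_other_caches else "E"

table = {
    (dirty, shared): moesi_state(True, dirty, shared)
    for dirty in (False, True)
    for shared in (False, True)
}
```

The `table` dictionary lists the four valid states by (dirty, shared) pair; any invalid block maps to I regardless of the other two properties.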




MSI protocol

• The clean (Shared) state does not distinguish whether or not the block is shared with caches of other processors.
• A bus snoop is necessary on both read and write misses.
• Even when only this processor holds the block and no other cache has a copy, a bus snoop is still issued, so unnecessary snoops (bus traffic) occur.


MESI protocol

• The clean state is split into two: Shared and Exclusive. Most blocks are expected to be Exclusive; in that case no snoop is issued even on a read miss, so no unnecessary traffic is generated on the bus.
• Adopted by many multi-cores at present, e.g. PowerPC and Intel Core 2.
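The traffic saving from the Exclusive state can be illustrated by counting bus transactions for one core that reads a private block and then writes it. This sketch follows the common textbook formulation, in which the read miss itself always goes on the bus and the saving appears on the later write; that detail, like the function name, is an assumption beyond what the slides state.

```python
# Count bus transactions for one core that accesses a block no other cache holds.
# Assumption: the read miss always goes on the bus (textbook behaviour).

def bus_transactions(protocol, ops):
    state, count = "I", 0
    for op in ops:
        if op == "read" and state == "I":
            count += 1                                   # read miss: fetch block
            state = "E" if protocol == "MESI" else "S"
        elif op == "write":
            if state in ("I", "S"):
                count += 1                               # invalidate other copies
            state = "M"                                  # E -> M stays off the bus
    return count

msi = bus_transactions("MSI", ["read", "write"])
mesi = bus_transactions("MESI", ["read", "write"])
```

Under MSI the write to a clean block must still be announced because other sharers cannot be ruled out; under MESI the Exclusive state already guarantees there are none.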


Simultaneous Multi-Thread (SMT): review

• In SMT, the OS or dedicated h/w controls the execution of multiple threads within a single core.
• SMT made more effective use of the super-scalar: resources that would otherwise sit idle can run simultaneously and independently, which improved IPC.
• Progress in thread-level parallelization (TLP) is expected to raise the efficiency of SMT processors further.


From SMT to Multi-Core

• Multi-process (multi-task) systems have traditionally been controlled by the OS, and techniques that prevent contention for resources among processes have been developed (spin lock, semaphore, CSP, etc.).
• The issue is whether these techniques can be moved down from the OS level to the hardware level, so that many threads are distributed appropriately over the processors of a multi-core and an optimized parallel processing environment is realized.
• Of course, developing this technology requires solving the various problems of parallel processing systems, an old yet new challenge.
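As a reminder of what a spin lock does, here is a software-level sketch built on a non-blocking try-acquire. It is an illustration in Python threads, not a hardware implementation; real hardware would rely on an atomic instruction such as test-and-set, and the class name is hypothetical.

```python
import threading
import time

class SpinLock:
    """Busy-waiting lock built on a non-blocking try-acquire."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            time.sleep(0)             # spin, giving other threads a chance to run

    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1                  # critical section on the shared resource
        lock.release()

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock serializes the updates to the shared counter, so no increment from the four threads is lost.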


Many problems for parallel programming

• From the slides of Y. Lin of Sun Microsystems at Hot Chips 2006: various issues he pointed out in constructing multi-threaded parallel programs. The following are discussed:
• How to find or create tasks that can run in parallel
• How to map tasks onto threads
• How to achieve scalability
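One standard way to reason about the scalability question is Amdahl's law, which bounds the speed-up of a program by its serial part. It is not taken from the slides and is offered here only as a hedged illustration; the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speed-up over one core when only `parallel_fraction` of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Speed-up of a program that is 90% parallel on 2, 4, 8 and 16 cores.
speedups = [round(amdahl_speedup(0.9, n), 2) for n in (2, 4, 8, 16)]
```

However many cores are added, a program that is 90% parallel can never run more than ten times faster, which is why finding and creating parallel tasks comes first in the list above.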
