
CSC.T363 Computer Architecture, Department of Computer Science, TOKYO TECH

Supplement to the Computer Architecture Exercises

Kenji Kise (吉瀬 謙二), Department of Computer Science
kise_at_c.titech.ac.jp

Ver. 2020-10-05a, academic year 2020 (Reiwa 2) edition

Course number: CSC.T363

Visit our support page https://www.arch.cs.titech.ac.jp/lecture/CA/


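[Important] Reserving a server in the ACRi Room

• Log in with your ACRi Room account at the following URL.

• https://gw.acri.c.titech.ac.jp/

• From the top of the reservation page, reserve a server whose name starts with vs for both the 12:00-15:00 slot and the 15:00-18:00 slot on Tuesday, October 6.

• Reserve the same server for both consecutive slots.

• Note that the slot starting at 12:00 must be reserved by 11:30, 30 minutes before it begins.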


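Classic components of a computer

[Figure: a computer consists of a processor (control and datapath), memory, input, and output. Related topics shown around it: interface, compiler, performance evaluation.]

Instruction Set Architecture (ISA)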


UART (Universal Asynchronous Receiver/Transmitter)

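• A UART is an integrated circuit that converts serial signals in the start-stop (asynchronous) format into parallel signals, and parallel signals back into serial ones. It sends and receives data in units of 8 bits (1 byte).

• A UART provides a simple way to exchange data between an FPGA and a computer.

• For example, the character 'a' is 8'h61 = 8'b01100001 (see the ASCII table on the next slide), so to send it the transmit line TXD is driven with the timing shown in the figure below.

• TXD is held at 1 until data is sent.

• First, a 0 (the start bit, shown in blue) is sent to mark the start of the transmission.

• Next, the data 8'b01100001 is sent one bit at a time, starting from the least significant bit (shown in yellow).

• Finally, a 1 (the stop bit, shown in red) is sent.

• The sender and the receiver use the same time interval per bit. The corresponding rate is called the baud rate. For example, at 1000 baud one bit is sent every 1 msec.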

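[Figure: TXD waveform over time for the character 'a'. The line idles at 1, then carries the start bit 0, the data bits 1 0 0 0 0 1 1 0 (least significant bit first), and the stop bit 1, after which it stays at 1.]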
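The framing described above can be checked with a short Python sketch. It is only an illustration of the bit order, not part of the course material; the helper names `uart_frame` and `bit_time_seconds` are made up for this example.

```python
def uart_frame(byte):
    """Return the bits of one UART frame in the order they appear on TXD."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must fit in 8 bits")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # least significant bit first
    return [0] + data_bits + [1]                     # start bit, data bits, stop bit


def bit_time_seconds(baud):
    """Duration of one bit for a given baud rate."""
    return 1.0 / baud


print(uart_frame(ord("a")))    # frame for 'a' (8'h61)
print(bit_time_seconds(1000))  # bit interval at 1000 baud, in seconds
```

For 'a' the frame is the start bit 0, then 1 0 0 0 0 1 1 0, then the stop bit 1, which matches the waveform in the figure.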


ASCII Table



Serial transmitter circuit m_UartTx

• A transmitter circuit that assumes a 100 MHz system clock and 1 Mbaud.

• The top module m_main sends the character a once every 2 seconds.

/************************************************************************************/
/* code201.v For CSC.T363 Computer Architecture, Archlab TOKYO TECH                 */
/************************************************************************************/
`timescale 1ns/100ps
`default_nettype none
/************************************************************************************/
module m_main (w_clk, w_txd);
  input  wire w_clk;
  output wire w_txd;
  reg [31:0] r_cnt = 1;
  /* r_cnt wraps about every 200,000,000 clocks, i.e. about every 2 s at 100 MHz */
  always@(posedge w_clk) r_cnt <= (r_cnt>(200_000_000 - 1)) ? 0 : r_cnt+1;
  /* send 8'h61 ('a'); the write enable is asserted for one clock when r_cnt==0 */
  m_UartTx m_UartTx0(w_clk, 8'h61, (r_cnt==0), w_txd);
endmodule
/************************************************************************************/
module m_UartTx (w_clk, w_din, w_we, w_txd);
  input  wire       w_clk, w_we;
  input  wire [7:0] w_din;
  output wire       w_txd;
  reg [8:0] r_buf = 9'b111111111; /* {data, start bit}; all ones while idle */
  reg [7:0] r_wait= 0;            /* counts 100 clocks per bit (100 MHz / 1 Mbaud) */
  always@(posedge w_clk) begin
    r_wait <= (w_we) ? 0 : (r_wait>=99) ? 0 : r_wait + 1;
    /* load on w_we; otherwise shift right once per bit time, filling with 1 */
    r_buf  <= (w_we) ? {w_din, 1'b0} : (r_wait>=99) ? {1'b1, r_buf[8:1]} : r_buf;
  end
  assign w_txd = r_buf[0];
endmodule
/************************************************************************************/

code201.v

Source code will be available in /home/tu_kise/ca/src/
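To see what m_UartTx puts on the line, the following Python sketch models its two registers clock by clock. It is a behavioral approximation written for this note, not a replacement for simulating code201.v; the class name `UartTxModel` and the helper `transmit` are hypothetical.

```python
class UartTxModel:
    """Clock-by-clock model of the r_buf / r_wait registers in m_UartTx."""

    def __init__(self, clocks_per_bit=100):
        self.cpb = clocks_per_bit  # 100 MHz / 1 Mbaud = 100 (assumed as in code201.v)
        self.buf = 0x1FF           # r_buf: all ones while idle
        self.wait = 0              # r_wait

    @property
    def txd(self):
        return self.buf & 1        # w_txd = r_buf[0]

    def tick(self, din=0, we=False):
        """Advance one clock edge."""
        if we:                               # load {w_din, 1'b0}
            self.wait = 0
            self.buf = (din & 0xFF) << 1
        elif self.wait >= self.cpb - 1:      # one bit time has elapsed: shift in a 1
            self.wait = 0
            self.buf = 0x100 | (self.buf >> 1)
        else:
            self.wait += 1


def transmit(byte, clocks_per_bit=100):
    """Return TXD sampled on every clock for the 10 bit times after a write."""
    tx = UartTxModel(clocks_per_bit)
    tx.tick(din=byte, we=True)
    wave = []
    for _ in range(10 * clocks_per_bit):
        wave.append(tx.txd)
        tx.tick()
    return wave


wave = transmit(0x61)
print(wave[50::100])  # one sample from the middle of each bit time
```

Sampling the middle of each 100-clock bit time gives the start bit, the eight data bits of 'a' with the least significant bit first, and the stop bit.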


Inside main20.xdc

• The XDC (Xilinx Design Constraints) file used in this project.

• w_txd is an output signal of the FPGA (and an input signal of the computer).

• w_rxd is an input signal of the FPGA (and an output signal of the computer).

main20.xdc

set_property -dict { PACKAGE_PIN E3 IOSTANDARD LVCMOS33} [get_ports { w_clk }];
create_clock -add -name sys_clk -period 10.00 [get_ports {w_clk}];

set_property -dict { PACKAGE_PIN H5 IOSTANDARD LVCMOS33} [get_ports { w_led[0] }];
set_property -dict { PACKAGE_PIN J5 IOSTANDARD LVCMOS33} [get_ports { w_led[1] }];
set_property -dict { PACKAGE_PIN T9 IOSTANDARD LVCMOS33} [get_ports { w_led[2] }];
set_property -dict { PACKAGE_PIN T10 IOSTANDARD LVCMOS33} [get_ports { w_led[3] }];

set_property -dict { PACKAGE_PIN A9 IOSTANDARD LVCMOS33} [get_ports { w_rxd }];
set_property -dict { PACKAGE_PIN D10 IOSTANDARD LVCMOS33} [get_ports { w_txd }];



Using GtkTerm

• Log in to the ACRi Room with Remote Desktop Connection.

• Start GtkTerm with the command gtkterm &.


GtkTerm settings

• Select Port from the Configuration menu.

• Select /dev/ttyUSB1 as the Port.

• Enter 1000000 as the Baud Rate to select 1 Mbaud. If you change the baud rate of the circuit, adjust this value to match.

• Press the OK button.
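If the baud rate is changed, the constant 99 in the Verilog (100 clocks per bit, minus one) has to change along with the terminal setting. The sketch below computes that constant for a few baud rates, assuming the 100 MHz system clock used in this project; the function names are made up for this example.

```python
def clocks_per_bit(clock_hz, baud):
    """Number of system clocks in one bit time, rounded to the nearest integer."""
    return round(clock_hz / baud)


def counter_width(cpb):
    """Bits needed for a counter that counts from 0 to cpb - 1."""
    return max(1, (cpb - 1).bit_length())


CLOCK_HZ = 100_000_000  # 100 MHz system clock, as assumed in the slides

for baud in (1_000_000, 115_200, 9_600):
    cpb = clocks_per_bit(CLOCK_HZ, baud)
    print(baud, "baud:", cpb, "clocks per bit, compare against", cpb - 1,
          "and use a counter of at least", counter_width(cpb), "bits")
```

At 1 Mbaud this gives 100 clocks per bit, so the comparison against 99 fits in the 8-bit r_wait register. At lower rates such as 115200 baud the count exceeds 255, so r_wait would also need to be widened.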


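Generating a bitstream with Vivado and configuring the FPGA

• Create a Vivado project.

• Design file: code201.v

• Constraint file: main20.xdc

• With these files, generate the bitstream file and configure the FPGA.

• GtkTerm displays the character a once every 2 seconds.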


Serial receiver circuit m_UartRx

module m_UartRx (w_clk, w_rxd, w_dout, r_en);
  input  wire       w_clk, w_rxd;
  output wire [7:0] w_dout;
  output reg        r_en = 0;

  reg [2:0] r_detect_cnt = 0; /* to detect the start bit */
  always @(posedge w_clk) r_detect_cnt <= (w_rxd) ? 0 : r_detect_cnt + 1;
  wire w_detected = (r_detect_cnt>2); /* w_rxd has been 0 for 3 or more clocks */

  reg       r_busy = 0;
  reg [3:0] r_cnt  = 0; /* number of bits sampled so far */
  reg [7:0] r_wait = 0; /* counts 100 clocks per bit (100 MHz / 1 Mbaud) */
  always@(posedge w_clk) r_wait <= (r_busy==0) ? 0 : (r_wait>=99) ? 0 : r_wait + 1;

  reg [8:0] r_data = 0;
  always@(posedge w_clk) begin
    if (r_busy==0) begin
      {r_data, r_cnt, r_en} <= 0;
      if(w_detected) r_busy <= 1;
    end
    else if (r_wait>= 99) begin
      /* sample once per bit time: 8 data bits, then the stop bit */
      r_cnt  <= r_cnt + 1;
      r_data <= {w_rxd, r_data[8:1]};
      if (r_cnt==8) begin r_en <= 1; r_busy <= 0; end
    end
  end
  assign w_dout = r_data[7:0];
endmodule

code202.v

• A receiver circuit that assumes a 100 MHz system clock and 1 Mbaud.

• It outputs the received 8-bit data on w_dout and sets r_en to 1 to signal that the data has arrived.
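The receiver can be modeled in the same clock-by-clock style to check where it samples the line. The Python sketch below mirrors the registers of m_UartRx, computing every next value from the current state as Verilog nonblocking assignments do. It is a hand-written approximation for illustration; `UartRxModel` and `receive` are hypothetical names, and the input waveform is an ideal frame with exactly 100 clocks per bit.

```python
class UartRxModel:
    """Clock-by-clock model of the registers in m_UartRx."""

    def __init__(self, clocks_per_bit=100):
        self.cpb = clocks_per_bit
        self.detect_cnt = 0  # r_detect_cnt (3 bits)
        self.busy = 0        # r_busy
        self.cnt = 0         # r_cnt (4 bits)
        self.wait = 0        # r_wait
        self.data = 0        # r_data (9 bits)
        self.en = 0          # r_en

    @property
    def dout(self):
        return self.data & 0xFF  # w_dout = r_data[7:0]

    def tick(self, rxd):
        """Advance one clock edge; next values depend only on the current state."""
        detected = self.detect_cnt > 2
        bit_done = self.wait >= self.cpb - 1
        n_detect = 0 if rxd else (self.detect_cnt + 1) & 0x7
        n_wait = 0 if (not self.busy or bit_done) else self.wait + 1
        n_busy, n_cnt, n_data, n_en = self.busy, self.cnt, self.data, self.en
        if not self.busy:
            n_cnt, n_data, n_en = 0, 0, 0
            if detected:
                n_busy = 1
        elif bit_done:
            n_cnt = (self.cnt + 1) & 0xF
            n_data = ((rxd & 1) << 8) | (self.data >> 1)
            if self.cnt == 8:          # ninth sample is the stop bit
                n_en, n_busy = 1, 0
        self.detect_cnt, self.wait = n_detect, n_wait
        self.busy, self.cnt, self.data, self.en = n_busy, n_cnt, n_data, n_en


def receive(byte, clocks_per_bit=100):
    """Feed one ideal frame to the model and return w_dout when r_en becomes 1."""
    frame = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    wave = [bit for bit in frame for _ in range(clocks_per_bit)]
    rx = UartRxModel(clocks_per_bit)
    for rxd in wave:
        rx.tick(rxd)
        if rx.en:
            return rx.dout
    return None


print(hex(receive(0x61)))
```

In this model the start bit is detected a few clocks after the falling edge, so each later sample lands a few clocks into its bit time rather than at the center. Note that w_dout is only meaningful in the clock where r_en is 1, because r_data is cleared again once the receiver returns to the idle state.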



Computer Architecture

2. Trends in Performance and Power

Ver. 2020-10-05a, academic year 2020 (Reiwa 2) edition

Course number: CSC.T363

www.arch.cs.titech.ac.jp/lecture/CA/
Tue 14:20 - 16:00, 16:15 - 17:55
Fri 14:20 - 16:00

Kenji Kise (吉瀬 謙二), Department of Computer Science
kise _at_ c.titech.ac.jp


Growth in clock rate of microprocessors

From CAQA 5th edition

Growth in processor performance

From CAQA 5th edition


Which is faster?

• Time to run the task (ExTime)
  – Execution time, response time, latency

• Tasks per day, hour, week, sec, ns … (Performance)
  – Throughput, bandwidth

Plane          Speed                        DC to Paris   Passengers   Throughput (p x mph)
Boeing 747     610 mph (about 980 km/h)     6.5 hours     470          286,700 (470 x 610)
BAC Concorde   1350 mph (about 2170 km/h)   3 hours       132          178,200 (132 x 1350)

From the lecture slides of David E. Culler. mph: miles per hour.
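The two notions of "faster" in the table can be reproduced with a few lines of Python. This is only a worked restatement of the numbers above; the dictionary layout and function name are made up for this example.

```python
planes = {
    "Boeing 747":   {"speed_mph": 610,  "hours": 6.5, "passengers": 470},
    "BAC Concorde": {"speed_mph": 1350, "hours": 3.0, "passengers": 132},
}


def throughput(plane):
    """Passenger throughput as defined in the table: passengers x speed (mph)."""
    return plane["passengers"] * plane["speed_mph"]


for name, plane in planes.items():
    print(name, "latency:", plane["hours"], "hours,",
          "throughput:", throughput(plane), "passenger-mph")

# Ratios in each direction
latency_ratio = planes["Boeing 747"]["hours"] / planes["BAC Concorde"]["hours"]
throughput_ratio = throughput(planes["Boeing 747"]) / throughput(planes["BAC Concorde"])
print("Concorde is", round(latency_ratio, 2), "times faster in latency")
print("747 is", round(throughput_ratio, 2), "times higher in throughput")
```

The Concorde wins on latency while the 747 wins on throughput, so the answer to "which is faster" depends on which metric matters.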


Which is faster?

• Time to run the task (ExTime)
  – Execution time, response time, latency

• Tasks per day, hour, week, sec, ns … (Performance)
  – Throughput, bandwidth

From Tokyo to Hiroshima

Vehicle      Max speed   Distance   Time   Cost         Passengers   Throughput (P x S)
Boeing 737   800 km/h    670 km     1:20   32,000 yen   170          85,510 (170 x 503)
Nozomi       270 km/h    820 km     4:00   18,000 yen   1,300        266,500 (1,300 x 205)

S is the average speed in km/h (distance / time).

From the lecture slides of David E. Culler


Defining (Speed) Performance

◼ Normally interested in reducing
  ◼ Response time (execution time) – the time between the start and the completion of a task or a program
    ◼ Important to individual users
  ◼ Thus, to maximize performance, need to minimize execution time

◼ Throughput – the total amount of work done in a given time
  ◼ Important to data center managers

◼ Decreasing response time almost always improves throughput

$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$

If X is n times faster than Y, then

$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$
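As a small numeric check of the definition above, the sketch below computes n for two execution times. The times (10 s and 15 s) are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
def performance(execution_time):
    """performance = 1 / execution_time"""
    return 1.0 / execution_time


def times_faster(time_x, time_y):
    """n such that X is n times faster than Y: execution_time_Y / execution_time_X."""
    return time_y / time_x


time_x, time_y = 10.0, 15.0  # seconds, assumed values
n = times_faster(time_x, time_y)
print(n)
print(performance(time_x) / performance(time_y))  # same ratio, computed via performance
```

Both expressions give the same ratio, since the performance ratio is by definition the inverse of the execution time ratio.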


Pipelined Processor

• Non-pipelined (multi-cycle)

• Pipelined

Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005


Pipelined Processor

Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005


Inside module m_proc12 (pipelined processor)

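• A pipelined version that supports the add, addi, lw, sw, and bne instructions.

• Although omitted from the figure, forwarding is also required for the branch comparison and for the input data of the data memory.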

[Figure: datapath of m_proc12. Stages: IF, ID, EX, MEM, WB. Pipeline registers: IfId, IdEx, ExMe, MeWb. Memories: m_imem (32bit x 2048, addressed by r_pc[12:2]), m_regs (32bit x 32), m_dmem (32bit x 2048). Other labeled elements: r_pc, w_npc, w_pc4, w_tpc, w_taken (!= comparator), sign extension of w_imm (16 bits) to w_imm32, shift left 2, w_rslt, w_rslt2, MeWb_ldd.]

See Computer Logic Design Lecture 13


Performance Factors

◼ Want to distinguish elapsed time and the time spent on our task

◼ CPU execution time (CPU time): time the CPU spends working on a task
  ◼ Does not include time waiting for I/O or running other programs

$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$

or

$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

◼ Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
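The two ways of improving performance can be compared numerically. The sketch below uses made-up values (10 billion clock cycles on a 2 GHz clock) purely for illustration; `cpu_time` is a hypothetical helper that implements the clock-rate form of the equation.

```python
def cpu_time(clock_cycles, clock_rate_hz):
    """CPU execution time = # CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


base = cpu_time(10e9, 2.0e9)           # baseline (assumed values)
fewer_cycles = cpu_time(8e9, 2.0e9)    # 20% fewer clock cycles, same clock
faster_clock = cpu_time(10e9, 2.5e9)   # same cycles, shorter clock cycle

print(base, fewer_cycles, faster_clock)
```

With these numbers, cutting the cycle count by 20% and raising the clock rate by 25% shorten the CPU time by the same amount.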


Performance Factors

$$\text{Performance} = f \times \text{IPC}$$

f: frequency (clock rate)

IPC: retired instructions per cycle

$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

int flag = 1;

int foo() {
    while (flag);
}

$$\text{Performance} = \text{clock rate} \times \frac{1}{\text{\# CPU clock cycles for a program}}$$


Pollack’s Rule

• Pollack's Rule states that microprocessor "performance increase due to microarchitecture advances is roughly proportional to the square root of the increase in complexity". Complexity in this context means processor logic, i.e. its area.

• Superscalar, vector

• Instruction level parallelism, data level parallelism
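Pollack's Rule can be turned into a quick estimate. The sketch below treats the rule as an exact square-root law, which is a simplification, and the comparison with several small cores assumes a perfectly parallel workload. The function name is made up for this example.

```python
import math


def pollack_performance(complexity_ratio):
    """Relative single-core performance under Pollack's Rule (idealized)."""
    return math.sqrt(complexity_ratio)


for ratio in (2, 4):
    print("complexity x", ratio, "-> performance x",
          round(pollack_performance(ratio), 2))

# Same area budget spent on 4 original-size cores instead of 1 large core,
# assuming a perfectly parallel workload (an optimistic assumption).
one_big_core = pollack_performance(4)
four_small_cores = 4 * pollack_performance(1)
print(one_big_core, four_small_cores)
```

Under these assumptions, four times the logic in one core buys roughly twice the performance, while the same area spent on four simple cores could offer up to four times the throughput. This is one motivation for the move toward multi-core designs.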


From multi-core era to many-core era

Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction, MICRO-36


Power within a Microprocessor

• $\text{Power}_{dynamic} = 1/2 \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$

• $P_{dynamic} = 1/2 \times C \times V^2 \times F$

• Power required per transistor

• The first 32-bit microprocessors, such as the Intel 80386, consumed less than two watts.

• A 3.3 GHz Intel Core i7 consumes 130 watts.
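The dynamic power equation shows why lowering voltage and frequency together is so effective. The sketch below evaluates it for made-up values of C, V, and F (they do not describe any real processor) and then scales V and F by 0.85.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """P_dynamic = 1/2 x C x V^2 x F"""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


C = 1e-9    # farads, assumed value for illustration
V = 1.0     # volts, assumed
F = 3.3e9   # hertz, assumed

baseline = dynamic_power(C, V, F)
scaled = dynamic_power(C, 0.85 * V, 0.85 * F)  # reduce voltage and frequency by 15%
ratio = scaled / baseline

print(baseline, scaled, round(ratio, 3))
```

Because power depends on V squared times F, scaling both by 0.85 reduces dynamic power to about 0.85 cubed, or roughly 61% of the baseline, at the cost of a 15% lower clock rate.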


From multi-core era to many-core era

Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, 2005


Intel Sandy Bridge, January 2011

4 to 8 cores


Classifying architectures from a different viewpoint

Flynn's classification (1966) of parallel computers, based on their instruction streams and data streams

SISD (Single Instruction stream, Single Data stream)

SIMD (Single Instruction stream, Multiple Data stream)

MISD (Multiple Instruction stream, Single Data stream)

MIMD (Multiple Instruction stream, Multiple Data stream)

[Figure: instruction streams and data streams for SISD, SIMD, MISD, and MIMD]


