TRANSCRIPT
¡ Introduction to FPGAs § Techniques in Lower-Level Coding, Resources § Methodology, Operation-Flow, Physical Complications § Why You Should Care
¡ FFT Algorithm (Cooley-Tukey) (Re-Cap) ¡ Porting to FPGA By Example:
§ Challenges in Constant-Throughput
¡ (Time Permitting) Alternative Implementation Strategies § Pipeline versus Burst § Higher-Order RADIX
FFT Implementations on FPGAs 2
¡ Field-Programmable Gate Array: § Comparisons: ▪ CPU: One operation at a time (hence the power inefficiency and the need for very high clock speeds) ▪ GPU: One ‘type’ of operation, performed in parallel ▪ IC/ASIC: One specific (set of) operations. Every operation can be performed simultaneously* (some exclusions apply) ▪ FPGA: A “programmable” IC; a user-specified set of operations performed with (nearly) the speed and efficiency of an IC
§ Used For? ▪ Anything an IC could be used for, and more. ▪ FAST → Digital Signal Processing. ▪ Flexible → Can be re-programmed on the PCB
FFT Implementations on FPGAs 3
¡ Logic Cell: § 1 Logic Cell == 1 LUT, 1 flip-flop, connections to adjacent cells
¡ Logic Slice: 2 Logic Cells ¡ Configurable Logic Block (CLB): 4 Logic Slices
§ Configurable because they can use their LUTs as Distributed RAM storage. [Distributed RAM is *slow* storage]
¡ DSP (Extreme) Slice: § Digital Signal Processing blocks integrated into the fabric of the FPGA. § Typically a pre-adder and 1 multiply-and-accumulate with memory registers § Extremely fast. Independent clock (typically 8x clocking)
¡ Block RAM
FFT Implementations on FPGAs 4
¡ Xilinx Virtex-7 Board (top-of-the-line): § 1,139,200 Dedicated Logic Slices § CLBs (Can be Mixed and Matched): ▪ Additional 178,000 logic slices ▪ OR 17,700 Kb Distributed RAM
§ 3,360 DSP Slices § 67,680 Kb Block RAM
¡ Between 4-24 Clock Regions (up to 200 MHz)
FFT Implementations on FPGAs 5
¡ VHDL (Very Hard Description Language) § Several “stages” to get to VHDL synthesis (writing to the chip): ▪ “Behavioral Level” → Uses ‘cores’ of higher-level operations, and describes their relationships and the information flow/operations.
▪ “Explicit Level” → Some of these cores can be treated like chips, with pins in and out, but ultimately the only things you can ‘do’ are flip bits and build chains of basic logic operations.
FFT Implementations on FPGAs 6
¡ VHDL (Very Hard Description Language) § Several “stages” to get to VHDL synthesis (writing to the chip): ▪ “Simulation” → Self-explanatory. Does your ‘coding’ do anything?
▪ “Component Level” → Match your operations and cores to specific chip resources. Constrain their relationships by defining clocking regions.
FFT Implementations on FPGAs 7
¡ VHDL (Very Hard Description Language) § Several “stages” to get to VHDL synthesis (writing to the chip): ▪ “Placement and Routing” → A multi-dimensional minimization problem ▪ Aside from SLLs, routing takes up the same resources you would use for logic ▪ Precision timing is crucial. Nothing is over-write safe. Nothing waits without more logic telling it to do so. ▪ Physical distances come into play. (Simulate clock regions with distances, etc.)
FFT Implementations on FPGAs 8
¡ Not line-by-line execution. § Even object-oriented languages are, technically, line-by-line executions.
§ Though the ‘explicit’ syntax may be reminiscent of C – execution is more like LabView. A logic chain, or block, will do whatever it has been written to do as soon as it gets a stimulus to do so (a bit gets flipped). ▪ This means you can theoretically have branching numbers of parallel tasks.
§ The fastest thing on the board is the clock – so we write in “Frames”
FFT Implementations on FPGAs 10
¡ “Won’t some (Genius) Engineer just do this for me?” ¡ Maybe, but having some understanding of these things helps you: § It makes the “precision” and limitations in your code/operations transparent ▪ Forces you to think about bit-accurate math
§ We’ve learned a lot about the implicit assumptions we make about numbers when using computers – we do that a lot on the algorithmic level too, and this is how:
FFT Implementations on FPGAs 11
¡ Though it doesn’t seem like it – a lot of our higher-level programming is still rife with ‘abstract’ math. This is most true of real-time computation, signal processing, data capture and handling, and control.
¡ The best way to explain this is through example: ▪ Let’s do a step-by-step “toy” FFT problem. ▪ Assume fixed-point 8-bit math.
FFT Implementations on FPGAs 12
¡ 8-Point Cooley-Tukey FFT Algorithm – efficiently compute the DFT (N=8)
¡ Let’s define a Twiddle Factor – a number on the complex unit circle:

W_N = e^{-i 2\pi / N} \;\Rightarrow\; W_N^{nk} = e^{-i 2\pi nk / N}   (1)

F_k = \sum_{n=0}^{N-1} f(n)\, e^{-i 2\pi nk / N}   (2)
FFT Implementations on FPGAs 13
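The twiddle-factor definition above is easy to sanity-check numerically. A minimal sketch in Python (not part of the original slides; `twiddle` is a name of my own choosing):

```python
import cmath

def twiddle(N, k):
    """W_N^k = e^(-i*2*pi*k/N): a point on the complex unit circle (Eq. 1)."""
    return cmath.exp(-2j * cmath.pi * k / N)

# Twiddles only rotate -- they never change magnitude -- and W_N^N wraps to 1.
W8_1 = twiddle(8, 1)
full_turn = twiddle(8, 8)
```

Because |W_N^k| = 1, multiplying by a twiddle never grows a value's magnitude; this matters later when we worry about fixed-point bit growth.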
¡ Two ways to do this – separate the sum by even and odd values (Decimation in Time), or by lowest and highest halves (Decimation in Frequency)
¡ “Decimation” refers to which set of values ends up in bit-reversed order
FFT Implementations on FPGAs 14
¡ Reduce the sum to even and odd components, pull the “Twiddle Factor” out of the odd sums, and repeat until the DFTs are reduced to 1 operation in the sum. § For an X-point FFT, log2(X) twiddle factors. § The smallest operations are E+(W*C)
FFT Implementations on FPGAs 15
F_k = \sum_{n=0}^{N-1} f(n)\, W_N^{nk} \;\to\; \sum_{n=0}^{N/2-1} f(2n)\, W_{N/2}^{nk} + \sum_{n=0}^{N/2-1} f(2n+1)\, W_{N/2}^{nk}\, W_N^{k}   (3)

F_k = \sum_{n=0}^{N/2-1} f(2n)\, W_{N/2}^{nk} + W_N^{k} \left[ \sum_{n=0}^{N/2-1} f(2n+1)\, W_{N/2}^{nk} \right]   (4)

F_k = \sum_{n=0}^{N/4-1} f(4n)\, W_{N/4}^{nk} + W_{N/2}^{k} \left[ \sum_{n=0}^{N/4-1} f(4n+2)\, W_{N/4}^{nk} \right] + W_N^{k} \left( \sum_{n=0}^{N/4-1} f(4n+1)\, W_{N/4}^{nk} + W_{N/2}^{k} \left[ \sum_{n=0}^{N/4-1} f(4n+3)\, W_{N/4}^{nk} \right] \right)   (5)
FFT Implementations on FPGAs 16
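The even/odd reduction of Eqs. (3)-(5) can be sketched as a recursive routine and checked against the direct DFT of Eq. (2). This is a behavioral sketch in Python, not FPGA code; the function names are mine:

```python
import cmath

def dft(f):
    """Direct evaluation of Eq. (2): O(N^2) multiply-accumulates."""
    N = len(f)
    return [sum(f[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dit(f):
    """Radix-2 decimation-in-time: split into even/odd halves (Eqs. 3-5)."""
    N = len(f)
    if N == 1:
        return list(f)
    E = fft_dit(f[0::2])                      # even-indexed samples
    O = fft_dit(f[1::2])                      # odd-indexed samples
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]
    # Smallest operation is E + W*O -- the E+(W*C) form from the slide.
    return ([E[k] + W[k] * O[k] for k in range(N // 2)] +
            [E[k] - W[k] * O[k] for k in range(N // 2)])
```

Each recursion level pulls one twiddle factor out of the odd sum, which is why an X-point FFT has log2(X) stages of twiddles.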
¡ Until, for an 8-point FFT, you can write the equation the following way:
¡ Notice here:

F_k = \left( f_0 + f_4 W_2^k \right) + W_4^k \left( f_2 + f_6 W_2^k \right) + W_8^k \left[ \left( f_1 + f_5 W_2^k \right) + W_4^k \left( f_3 + f_7 W_2^k \right) \right]   (6)

W_2^k = e^{-i\pi k} = \begin{cases} -1 & k \in \{\text{odds}\} \\ +1 & k \in \{\text{evens}\} \end{cases}   (7)
FFT Implementations on FPGAs 17
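Equation (6) is just the fully unrolled 8-point sum; a quick numerical check (my own sketch, with hypothetical function names) confirms the factoring:

```python
import cmath

def W(M, k):
    """Twiddle factor W_M^k = e^(-i*2*pi*k/M)."""
    return cmath.exp(-2j * cmath.pi * k / M)

def fft8_factored(f, k):
    """F_k from the fully factored 8-point expression of Eq. (6)."""
    even = (f[0] + f[4] * W(2, k)) + W(4, k) * (f[2] + f[6] * W(2, k))
    odd  = (f[1] + f[5] * W(2, k)) + W(4, k) * (f[3] + f[7] * W(2, k))
    return even + W(8, k) * odd
```

For a constant input every bin except F_0 vanishes, and an impulse input gives 1 in every bin, exactly as the DFT requires.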
¡ Fundamentally the same equation – just a different order of operations. § In these equations the smallest parts become (E ± C)*W
§ Iterating in this way will eventually give you the equation on the last slide. This is because:

e^{-i\pi k} = W_2^k = (-1)^k   (8)

F_k = \sum_{n=0}^{N-1} f(n)\, W_N^{nk} \;\to\; \sum_{n=0}^{N/2-1} f(n)\, W_N^{nk} + \sum_{n=0}^{N/2-1} f(n + N/2)\, W_N^{nk}\, W_N^{(N/2)k}   (9)

F_k = \sum_{n=0}^{N/2-1} \left[ f(n) + (-1)^k f(n + N/2) \right] W_N^{nk}   (10)

FFT Implementations on FPGAs 18
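The DIF reduction of Eq. (10) can likewise be verified numerically: half-length sums with a (−1)^k sign on the shifted half reproduce the full DFT. A sketch (function name is mine):

```python
import cmath

def dif_first_stage(f):
    """One DIF reduction, Eq. (10):
    F_k = sum_{n<N/2} [f(n) + (-1)^k * f(n + N/2)] * W_N^{nk}."""
    N = len(f)
    out = []
    for k in range(N):
        s = sum((f[n] + (-1) ** k * f[n + N // 2]) *
                cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N // 2))
        out.append(s)
    return out
```

Note the order of operations relative to DIT: the add/subtract happens first, the twiddle multiply afterwards.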
¡ Notice that every twiddle factor within each stage, or layer, differs from every other twiddle factor only by k. § For an 8-point FFT, there are three stages: ▪ 1st stage: there are 4 unique twiddle factors, but 2 of these differ from the other 2 only by a sign. These are ±1 and ±(√2/2)(1 − i)
▪ 2nd stage has only ±1, ±i ▪ 3rd stage has only ±1
§ Every frequency bin calculation can then be made from combinations of the twiddle factors and their signs.
FFT Implementations on FPGAs 19
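The sign symmetry claimed on this slide, that advancing k by N/2 only flips a twiddle's sign, follows from W_N^{N/2} = −1 and is easy to confirm (a sketch; the helper name is mine):

```python
import cmath

def W(M, k):
    """Twiddle factor W_M^k = e^(-i*2*pi*k/M)."""
    return cmath.exp(-2j * cmath.pi * k / M)

# Advancing k by N/2 = 4 flips the sign: W_8^{k+4} = -W_8^k, so the second
# half of each stage's twiddles comes for free (a negation, not a multiply).
stage1 = [W(8, k) for k in range(4)]   # includes 1 and (sqrt(2)/2)(1 - i)
stage2 = [W(4, k) for k in range(2)]   # 1 and -i
stage3 = [W(2, 0)]                     # 1 (with -1 from the sign flip)
```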
BUTTERFLY: INTRO
Single butterfly operation. Image is for Decimation in Time. The second term is multiplied by the twiddle factor, then the two terms are added and subtracted. The E+WC term continues on the top row. The E−WC term continues on the bottom row. For DIF, the multiply follows the add/subtract.
( figure 2)FFT Implementations on FPGAs 20
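The two butterfly orderings described above can be sketched as two tiny functions (hypothetical names; the slides only show them pictorially): DIT multiplies before the add/subtract, DIF multiplies after.

```python
def butterfly_dit(E, C, W):
    """Decimation-in-time: multiply the second input by the twiddle first,
    then add and subtract. Top output E+WC, bottom output E-WC."""
    t = W * C
    return E + t, E - t

def butterfly_dif(E, C, W):
    """Decimation-in-frequency: add/subtract first, then multiply the
    difference by the twiddle."""
    return E + C, (E - C) * W
```

Either way, each butterfly is one complex multiply plus an add and a subtract, which is why the hardware counts it as 2 multiply-accumulate operations.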
BUTTERFLY (DIT)
Notation here is slightly different. Key:

W_N^0 = 1,\quad W_N^1 = W_8^1,\quad W_N^2 = W_8^2,\quad W_N^3 = W_8^3
( figure 3)FFT Implementations on FPGAs 21
BUTTERFLY (DIF)
Notation here is slightly different. Key:

W_N^0 = 1,\quad W_N^1 = W_8^1,\quad W_N^2 = W_8^2,\quad W_N^3 = W_8^3
( figure 4)FFT Implementations on FPGAs 22
¡ Either DIF or DIT can be used, but DIF is generally preferred for the following reason: § Naturally ordered inputs avoid shuffle resources
¡ DIT can be used just as well with relatively few resources, depending on point length § “Burst I/Os” use delay-shuffling to reorder without loss of resources.
FFT Implementations on FPGAs 23
¡ Let’s focus on DIF pipelining for now. § If we have time, I’ll go over Burst I/O
¡ Imagine a 4x4-bit (real/imaginary) 8-point FFT, with incoming data being streamed from an ADC
¡ Say we are interested in frequencies up to 100 MHz – so our samples will be streaming in at a rate of 200 MHz
¡ We want frequency-bin outputs streaming out at the same rate as data streams in
FFT Implementations on FPGAs 24
Some Visual Aids: Pipeline-I/O Layout
“Grouping” is not relevant for now. Note that for true constant throughput each “Radix-2 Butterfly” operation is in fact 2 multiply-accumulate operations, using the same ROM memory and twiddle factors for every FFT stage except the final one. For RADIX-2, loading in N points of data takes N cycles in the first stage, N/2 cycles in each intermediate stage, and N cycles for the final transform. For an 8-point FFT this means a latency of ~10 cycles.
( figure 5)
FFT Implementations on FPGAs 25
¡ F1-‐5: § Load x0-‐x4 into M1
¡ F6: § load x5 into M1 § Stage 1 process [x0,x4]
▪ Output as P0, P4 into M2
¡ F7: § Load x6 into M1 § Stage 1 process [x1,x5]
▪ Output as P1, P5 into M2
¡ F1-‐5: § M1: [x0,x1,x2,x3,x4,-‐-‐,-‐-‐,-‐-‐]
¡ F6: § M1: [x0,x1,x2,x3,x4,x5,-‐-‐,-‐-‐] § M2: [P0,-‐-‐,-‐-‐,-‐-‐, P4,-‐-‐,-‐-‐,-‐-‐]
¡ F7: § M1: [x0,x1,x2,x3,x4,x5,x6,-‐-‐] § M2: [P0,P1,-‐-‐,-‐-‐,P4,P5,-‐-‐,-‐-‐]
FFT Implementations on FPGAs 26
¡ F8: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,-‐-‐,P4,P5,P6,-‐-‐]
¡ F8: § Load x7 into M1 § Stage 1 process [x2,x6]
▪ Output as P2, P6 into M2 ▪ Can now start Stage 2 Processing
FFT Implementations on FPGAs 27
¡ F9: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,-‐-‐,Q2,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐]
¡ F9: § Load x0 (new set!) into M1 § Stage 1 process [x3,x7]
▪ Output as P3, P7 into M2 ▪ (last Stage 1 process for 1st set)
§ Stage 2 process [P0,P2] ▪ Output as Q0, Q2 into M3
FFT Implementations on FPGAs 28
¡ F10: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,-‐-‐,-‐-‐,-‐-‐,-‐-‐]
¡ F10: § Load x1 § Stage 2 process [P1,P3]
▪ Output as Q1, Q3 into M3 ▪ Now have [Q0, Q1] & [Q2, Q3] and
can begin Stage 3 Processing
FFT Implementations on FPGAs 29
¡ F11: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,-‐-‐,Q6,-‐-‐] § Output:
[F0,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐]
¡ F11: § load x2 into M1 § Stage 2 process [P4,P6]
▪ Output as Q4, Q6 into M3
§ Stage 3 Process [Q0, Q1]+ ▪ Output Fbin [F0]
FFT Implementations on FPGAs 30
¡ F12: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,-‐-‐,-‐-‐,-‐-‐,F4,-‐-‐,-‐-‐,-‐-‐]
¡ F12: § Load x3 into M1 § Stage 2 process [P5,P7]
▪ Output as Q5, Q7 into M3 ▪ Finished Set 1 Stage 2 processing
§ Stage 3 Process [Q0, Q1]− ▪ NOTE: still only 1 Multiply-Accumulate operation performed ▪ Output Fbin [F4]
FFT Implementations on FPGAs 31
¡ F13: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,-‐-‐,F2,-‐-‐,F4,-‐-‐,-‐-‐,-‐-‐]
¡ F13: § Load x4 into M1 § Stage 3 Process [Q2, Q3]+
▪ Output Fbin [F2]
FFT Implementations on FPGAs 32
¡ F14: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,-‐-‐,F2,-‐-‐,F4,-‐-‐,F6,-‐-‐]
¡ F14: § Load x5 § Stage 1 process [x0,x4]
▪ Output as P0, P4 into M2 § Stage 3 Process [Q2, Q3]-‐
▪ Output Fbins [F6]
FFT Implementations on FPGAs 33
¡ F15: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,-‐-‐,F4,-‐-‐,F6,-‐-‐]
¡ F15: § load x6 into M1 § Stage 1 process [x1,x5]
▪ Output as P1, P5 into M2 § Stage 3 Process [Q4, Q5]+
▪ Output Fbins [F1]
FFT Implementations on FPGAs 34
¡ F16: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,-‐-‐,F4,F5,F6,-‐-‐]
¡ F16: § Load x7 into M1 § Stage 1 process [x2,x6]
▪ Output as P2, P6 into M2 ▪ Can now start S2 Processing on Set
2
§ Stage 3 Process [Q4, Q5]-‐ ▪ Output Fbin [F5]
FFT Implementations on FPGAs 35
¡ F17: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,F3,F4,F5,F6,-‐-‐]
¡ F17: § Load x0 (Set3) into M1 § Stage 1 process [x3,x7]
▪ Output as P3, P7 into M2 ▪ (last Stage 1 process for 2nd set)
§ Stage 2 process [P0,P2] ▪ Output as Q0, Q2 into M3
§ Stage 3 Process [Q6, Q7]+ ▪ Output Fbin [F3]
FFT Implementations on FPGAs 36
¡ F18: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,F3,F4,F5,F6,F7]
¡ F18: § Load x1 § Stage 2 process [P1,P3]
▪ Output as Q1, Q3 into M3 ▪ Now have [Q0, Q1] & [Q2, Q3] and
can begin S3 Processing on Set 2
§ Stage 3 Process [Q6, Q7]-‐ ▪ Output Fbins [F7]
FFT Implementations on FPGAs 37
¡ F19: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,F3,F4,F5,F6,F7]
¡ F19: § load x2 into M1 § Stage 2 process [P4,P6]
▪ Output as Q4, Q6 into M3
§ Stage 3 Process [Q0, Q1]+ ▪ Output Fbin [F0] (First Fbin for Set
2)
FFT Implementations on FPGAs 38
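The frame-by-frame walkthrough above computes exactly three DIF stages (M1 → M2 → M3 → output), with the F bins emerging in bit-reversed order (F0, F4, F2, F6, F1, F5, F3, F7). A behavioral sketch of those stages in Python (my own reconstruction of the walkthrough, not the original VHDL):

```python
import cmath

def W(M, k):
    """Twiddle factor W_M^k = e^(-i*2*pi*k/M)."""
    return cmath.exp(-2j * cmath.pi * k / M)

def dif_pipeline_8pt(x):
    """The three DIF stages from the walkthrough. Returns the F bins in
    natural order; the hardware emits them bit-reversed."""
    # Stage 1: butterfly [x_n, x_{n+4}] -> P_n, P_{n+4}
    P = [0j] * 8
    for n in range(4):
        P[n]     = x[n] + x[n + 4]
        P[n + 4] = (x[n] - x[n + 4]) * W(8, n)
    # Stage 2: butterfly [P_n, P_{n+2}] within each half -> Q
    Q = [0j] * 8
    for base in (0, 4):
        for n in range(2):
            Q[base + n]     = P[base + n] + P[base + n + 2]
            Q[base + n + 2] = (P[base + n] - P[base + n + 2]) * W(4, n)
    # Stage 3: adjacent sums/differences; outputs land in bit-reversed slots
    order = [0, 4, 2, 6, 1, 5, 3, 7]
    F = [0j] * 8
    for i, base in enumerate((0, 2, 4, 6)):
        F[order[2 * i]]     = Q[base] + Q[base + 1]
        F[order[2 * i + 1]] = Q[base] - Q[base + 1]
    return F
```

The `order` table encodes the bit-reversed positions in which the walkthrough emitted its bins ([Q0,Q1] → F0/F4, [Q2,Q3] → F2/F6, and so on).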
¡ Imagine running this 8-bit, 8-point FFT on an 8-bit machine. Will it work as presented?
FFT Implementations on FPGAs 39
¡ Imagine running this 8-bit, 8-point FFT on an 8-bit machine. Will it work as presented? § Nope! § There is an implicit overflow problem here
¡ Each addition/subtraction threatens, at worst, to add 1 bit to the values we are operating on. § The safest rescale strategy is to rescale (truncate) by 1 bit each stage. For an 8-point FFT, this means truncating by 3 bits. ▪ Bit growth can be a real problem for long calculation chains. Imagine a 65,536-point FFT (16 bits of rescaling).
FFT Implementations on FPGAs 40
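The overflow argument can be illustrated with a toy fixed-point butterfly on real integers (a sketch under my own assumptions: signed 8-bit samples, no twiddle multiply):

```python
def butterfly_with_rescale(a, b, bits=8):
    """Toy fixed-point butterfly on signed `bits`-bit integers (real data,
    no twiddle): the sum of two full-scale values needs bits+1 bits, so we
    shift right by one each stage to stay in range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    s, d = a + b, a - b            # each can need bits+1 bits
    s >>= 1                        # rescale: truncate one bit per stage
    d >>= 1
    assert lo <= s <= hi and lo <= d <= hi
    return s, d

# Without the shift, two full-scale 8-bit samples overflow:
a, b = 127, 127
print(a + b)                          # 254 -- does not fit in signed 8 bits
print(butterfly_with_rescale(a, b))   # (127, 0) -- back in range
```

Truncating one bit per stage keeps every intermediate inside the 8-bit range at the cost of up to half an LSB of error per stage; applying the same policy to a 65,536-point FFT is the 16-bit rescale mentioned above.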
¡ This normalization is independent of the standard FFT normalization we think of, and is related to the maximum precision § For an 8-bit-in-8-bit-out system (like the ones used on the CHIME boards) it is strictly about controlling overflow and keeping resource use to a minimum
§ What if you have an 8-bit ADC (or other data-capture system), but are capable of doing higher-precision math on your FPGA? ▪ Bit scaling becomes more of a dynamic question. ▪ Can you afford not to scale, and keep every bit? ▪ Can you make assumptions about your data?
FFT Implementations on FPGAs 41
¡ Xilinx DSP Spec and FFT LogiCORE Sheet: § http://www.xilinx.com/support/documentation/data_sheets/
ds180_7Series_Overview.pdf § http://www.xilinx.com/support/documentation/user_guides/ug369.pdf § www.xilinx.com/support/documentation/ip_documentation/ds808_xfft.pdf § http://www.xilinx.com/support/documentation/user_guides/ug073.pdf
¡ http://www.cs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html
¡ http://wwwhome.ewi.utwente.nl/~gerezsh/sendfile/sendfile.php/idsp-fft.pdf?sendfile=idsp-fft.pdf
FFT Implementations on FPGAs 42
¡ GPU vs FPGA: A closer look § Algorithmic changes (specifically with respect to FFTs)?
§ Performance advantages § Application differences
¡ FPGA “Burst I/O” FFT Strategy
FFT Implementations on FPGAs 43
¡ Hardware comprises almost entirely multiply-accumulates and fast memory (DSPs + ROM) § Unlike an FPGA – fewer specialized regions (more on this later) and far fewer routing options.
¡ Uses an “Instruction Set” that demands specific drivers, data types, and I/O specifications, and includes background homeostatic processes (“Dark Processes”) § The advantage being, among other things, that it can “run code” (typically C)
FFT Implementations on FPGAs 44
¡ As Pat mentioned – GPU programming is hyper-parallelized, in a way that IS reminiscent of an FPGA. § Computational chains are done by our very fast dedicated DSPs
§ Multiple quasi-independent operations can occur simultaneously (like our butterflies and matrix multiplication)
FFT Implementations on FPGAs 45
¡ However, GPUs adhere to very strict constraints which make them inappropriate for a variety of tasks: § A complete lack of routing flexibility makes many tasks pathological: ▪ Anything involving lots of data shuffling, or different discrete operations happening in different physical areas of the chip, will be SLOW (for instance, because of different types of incoming data, or required pre-processing)
FFT Implementations on FPGAs 46
¡ However, GPUs adhere to very strict constraints which make them inappropriate for a variety of tasks: § They are fixed-computation/fixed-programming acceleration. ▪ Computation ‘chains’ are specified by the number of allowed ‘threads’
▪ Data sharing is done by ‘thread-locking’ or ‘thread-sharing’ ▪ Generally this process of data sharing adds latency
▪ The types of task and the number of simultaneous chains are largely specified by their physical design
FFT Implementations on FPGAs 47
256-PT FFT on a GPU
Notice how it is broken down into 4-PT FFTs per thread, and then each thread exchanges the intermediate data products? That sharing adds latency, and necessitates large buffers. In an FPGA, the pipelining implementation I showed earlier scales fine with FFT length – all you do is add initial latency. If you need additional DSPs at any time in FFT stages (depending on sampling rate, etc.), you can simply branch off – spawning threads like a tree ad infinitum (resources permitting)
FFT Implementations on FPGAs 48
¡ The fact that they “run code” and use an instruction set means that the efficiency/performance of any operation being performed also suffers from, or is mediated by, the efficiency with which the commands can be ‘compiled’ onto the core. § You could make the analogy to writing more or less efficient protocols on an FPGA, but it’s not completely valid – this compile translation adds another middle-man that can, in addition to being less efficient than writing it explicitly in VHDL, result in unpredicted or variable latency that now has dependencies on your compiler, etc.
FFT Implementations on FPGAs 49
¡ GPUs have only 1 very specific supported data I/O mechanism: § PCI-Express
¡ This means that any information streaming to a GPU must be ‘pre-processed’, either with ASICs (Application-Specific ICs) or, in many cases, by the CPU – placing GPU-friendly data into shared memory buffers. § For graphics processing, this is A-OK! § For data capture, this is very bad
FFT Implementations on FPGAs 50
¡ Remember our toy scenario? An 8-bit ADC streaming data samples at 200 MHz. § These byte-streams cannot be routed directly to the GPU. To use a GPU for this, some IC would have to interpret the data to the correct protocol. ▪ Fun fact – this is now often done with small FPGAs, since they can do this with minimal latency, and can be built onto a PCI / PCI-Express card (or similar) to process incoming data into protocols recognizable by GPUs, CPUs, or fit to be streamed out over Ethernet, TCP/IP, Infiniband, etc.
FFT Implementations on FPGAs 51
§ FPGAs don’t care what protocol is being used. Data is data. Bits are bits. It relies on YOU to write the procedure to slap on prefix/suffix identifiers. ▪ They have dedicated, optimized components ON the chip to strip Ethernet, TCP/IP, etc. protocols and feed them to computation chains.
FFT Implementations on FPGAs 52
¡ GPUs have a lower-‐level architecture that runs maintenance protocols, optimizes and monitors on-‐chip resource use, performs checks for corruption/failures/errors in both operation and data output, and in general “Manages and Regulates data flow.” § This provides extra redundancy § On-‐the-‐fly error correction/recognition § Power optimization (in some cases) § and a degree of standardization and user-‐friendliness.
FFT Implementations on FPGAs 53
¡ GPUs have a lower-‐level architecture that runs maintenance protocols, optimizes and monitors on-‐chip resource use, performs checks for corruption/failures/errors in both operation and data output, and in general “Manages and Regulates data flow.” § Intrinsic to these processes is the use of lots of memory buffers and handoffs that would not be used in FPGAs
§ It also dramatically increases latency § Makes latencies dynamic, unpredictable, and dependent on other processes or the state of the chip as a function of time
FFT Implementations on FPGAs 54
¡ An additional algorithmic difference is that GPUs natively work with floating-point numbers – so any captured data will have to undergo conversion in this respect as well. § In addition to adding complication and latency, and perhaps obscuring the question of precision a bit, this complicates notions of dynamic scaling and computational bit growth
FFT Implementations on FPGAs 55
¡ Programmers do not have access to GPU LUTs § This means that things like twiddle factors will have to be calculated and stored in ROM memory at least once (maybe more, depending on the specific application)
¡ Implementing bit-growth control measures isn’t quite as transparent. § Bit-accurate C can be written, but because data is subject to other ‘shadow processes’ it may not always be obvious where bit truncation occurs (if necessary) or how round-off errors propagate.
§ Alternatively, the GPU may perform such truncation automatically to avoid overflows.
FFT Implementations on FPGAs 56
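The truncation/round-off point can be made concrete: a bare right shift always rounds toward minus infinity, so it carries a systematic negative bias that a rounding add largely removes. A generic fixed-point sketch (nothing GPU-specific; the names are mine):

```python
def truncate(x, drop=2):
    """Drop LSBs with an arithmetic shift: rounds toward -infinity."""
    return x >> drop

def round_half_up(x, drop=2):
    """Add half an LSB before shifting: costs one extra add per sample."""
    return (x + (1 << (drop - 1))) >> drop

# Accumulated error over a full cycle of input residues, in output LSBs:
vals = range(16)
trunc_bias = sum(truncate(v) - v / 4 for v in vals)    # systematically negative
round_bias = sum(round_half_up(v) - v / 4 for v in vals)
```

Across a long chain of stages, a bias like this accumulates into a DC offset in the output bins; this is exactly the kind of behavior that is hard to audit when the rescaling happens somewhere you cannot see.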
¡ Clean-cut “speed comparisons” are difficult because it’s tough to compare apples to apples: § Clock Speed ▪ Very misleading. ▪ FPGAs ~200-300 MHz (base), 2+ GHz (DSP, I/O) ▪ GPUs are typically measured in GHz, with a pre-defined number of threads (unlike FPGAs, with no well-defined limit on concurrent operations)
▪ Less clear how background processes dynamically shuffle their load with respect to stealing cycles or freezing resources
▪ Higher clock speed is directly and clearly proportional to max throughput on an FPGA. This is not so on GPUs, by a long shot
FFT Implementations on FPGAs 57
¡ Flops/Second: § A slightly fairer comparison – but we still run into the problem of transparency on a GPU
§ This is where optimization can matter hugely – GPUs can easily out-‐shine FPGAs on specific tasks tailored to their architecture
§ However, GPUs can perform extremely poorly on tasks where data must be moved very inefficiently around the chip, and between threads
§ FPGAs are robust to ‘pathological’ cases because of the flexibility in routing and protocol management
FFT Implementations on FPGAs 58
¡ Flops/Joule § The easy winner is the FPGA ▪ Exact numbers depend on the utility, but the margin can be as much as ~50%
§ Fewer things going on to perform the same task. § A large availability of very efficient logic slices where DSPs would be overkill (like routing), where GPUs rely almost entirely on power-hungry DSP blocks.
§ Heavily optimized routing means fewer slices being utilized, and less power moving around the chip.
FFT Implementations on FPGAs 59
¡ Flops/$, Bang-for-your-buck ▪ (Let’s disregard people-hour costs for writing the code)
§ In general, if your task *can* be performed adequately and reliably on a GPU, it will be far cheaper to do so
§ However, if your task depends on pretty much anything real-time, like data capture or protocol interpretation, GPUs are a no-go
§ Hybrid systems are very popular for this: ▪ LOFAR (does) & CHIME (will) use FPGAs to capture data and control ADCs (& the like), perform on-the-fly FFTs, interpret to PCI-Express or Gigabit Ethernet, and stream to a GPU which acts as the correlator
FFT Implementations on FPGAs 60
¡ GPUs: § Cheap, easy to use, fast* (heavily dependent on your operation) § High (in this context) latency, variable/unpredictable latency ▪ Unsuitable for constant throughput at rates approaching clock speed
§ Extremely limited I/O, can’t talk to most I/Os § Rely heavily on buffers and cached memory: can run into a bottleneck problem where data backs up § Power hungry § Performance fluctuates in non-deterministic ways depending on the internal state of the chip, background processes, and automatic resource allocations / compilers / instruction sets ▪ Not guaranteed to perform exactly the same on different machines, even in controlled environments
FFT Implementations on FPGAs 61
¡ FPGAs: § Can be expensive and difficult to use, but consistent § Latency depends entirely on your utilization of resources and the theoretical limits of the algorithm § Achievable constant throughput scales exactly with clock speed(s) § Extremely flexible I/O § Can be very highly optimized § Power efficient, thermally efficient § No pre-defined threads or thread counts/limits § No shadow processes – what you write is what you get § Precision-transparent data manipulation and flow
FFT Implementations on FPGAs 62
BACKUP SLIDE RADIX-‐2 Burst I/O
Switches shuffle the correct order for each successive butterfly, cyclically permuting intermediate terms within the raw incoming data samples. DSPs can typically clock at 8x the max FPGA clock speed – so for resource-starved projects this can be a good option, since at a minimum 8x fewer resources can be used. The catch is that it can only be written using DSP slice logic (which is typically in high demand), and the LUT/switching logic is considerably more difficult ( figure 6)
FFT Implementations on FPGAs 63
BACKUP SLIDE RADIX-‐2 Burst I/O
To get us from Frame 1 to Frame 19 (as we did in the Pipeline implementation earlier), we need 30 DSP operations (real), or 60 DSP operations (real + imaginary, max). So, in 14 clock cycles [19 − 5 initial loading cycles] we need to do 60 operations. The most we have to do per frame is 5 (real), or 10 (real + imaginary), corresponding to Stage 1 + Stage 2 + (½) Stage 3, plus clever addressing and shuffling. If our DSP can clock 8x over the input rate, we can perform 112 operations in that time.
( figure 6)
FFT Implementations on FPGAs 64
BACKUP SLIDE RADIX-‐2 Burst I/O
This is misleading though, because we need to wait 8 clock cycles before each new piece of data comes in. So for almost every (pipeline) frame our BURST implementation will have idle frames (which can be spent shuffling data to make addressing easier, but don’t need to be). However, for every (pipeline) frame with all three stages doing computations, we will be 2 cycles short (1 real + 1 imaginary). ( figure 6)
FFT Implementations on FPGAs 65
BACKUP SLIDE RADIX-‐2 Burst I/O
For our 8-pt FFT we can get away with this, because only Stage 3 is always working (our half-stage), so most of the time only 6 operations are done. This gives us time to catch up. We can also add another DSP to handle just Stage 3 (or Stage 1, or Stage 2). We want to divide the labor by stages, not by alternating frames, because of the interleaving of intermediate data products.
( figure 6)
FFT Implementations on FPGAs 66
RADIX-‐2 Burst I/O
Example Operations Color Coded by Pipeline Frames:
S1|S1|S1|S1S2|S2|S2S3|S2S3|S3|S1S3|S1S3|S1S3|S1S2S3|S2S3
( figure 6)FFT Implementations on FPGAs 67
BACKUP SLIDE RADIX-‐4 Burst I/O
Switches shuffle the correct order for each successive dragonfly (the radix-4 butterfly), cyclically permuting intermediate terms within the raw incoming data samples. DSPs can typically clock at 8x the max FPGA clock speed – so for resource-starved projects this can be a good option, since at a minimum 16x fewer resources can be used compared to RADIX-2 implementations. The catch is that it can only be written using DSP slice logic (which is typically in high demand), and the LUT/switching logic is considerably more difficult
( figure 7)
FFT Implementations on FPGAs 68