TRANSCRIPT
¡ Introduction to FPGAs § Techniques in Lower-Level Coding, Resources § Methodology, Operation-Flow, Physical Complications § Why You Should Care
¡ FFT Algorithm (Cooley-Tukey) (Re-Cap) ¡ Porting to FPGA By Example:
§ Challenges in Constant-Throughput
¡ (Time Permitting) Alternative Implementation Strategies § Pipeline versus Burst § Higher-Order RADIX
FFT Implementations on FPGAs 2
¡ Field-Programmable Gate Array: § Comparisons: ▪ CPU: One operation at a time (hence the power inefficiency and the need for very high clock speeds) ▪ GPU: One ‘type’ of operation, performed in parallel ▪ IC/ASIC: One specific (set of) operations. Every operation can be performed simultaneously* (some exclusions apply) ▪ FPGA: A “programmable” IC; a user-specified set of operations performed with (nearly) the speed and efficiency of an IC
§ Used For? ▪ Anything an IC could be used for, and more. ▪ FAST → Digital Signal Processing. ▪ Flexible → Can be re-programmed on the PCB
FFT Implementations on FPGAs 3
¡ Logic Cell: § 1 Logic Cell == 1 LUT, 1 flip-flop, connections to adjacent cells
¡ Logic Slice: 2 Logic Cells ¡ Configurable Logic Block (CLB): 4 Logic Slices
§ Configurable because they can use their LUTs as Distributed RAM storage. [Distributed RAM is *slow* storage]
¡ DSP (Extreme) Slice: § Digital Signal Processing blocks integrated into the fabric of the FPGA. § Typically a pre-adder and 1 multiply-and-accumulate with memory registers § Extremely fast. Independent clock (typically 8x clocking)
¡ Block RAM
FFT Implementations on FPGAs 4
¡ Xilinx Virtex-7 Board (top-of-the-line): § 1,139,200 Dedicated Logic Slices § CLBs (Can be Mixed and Matched): ▪ Additional 178,000 logic slices ▪ OR 17,700 Kb Distributed RAM
§ 3,360 DSP Slices § 67,680 Kb Block RAM
¡ Between 4-24 Clock Regions (up to 200 MHz)
FFT Implementations on FPGAs 5
¡ VHDL (Very Hard Description Language) § Several “stages” to get to VHDL synthesis (writing to the chip): ▪ “Behavioral Level” → Uses ‘cores’ of higher-level operations, and describes their relationships and the information flow/operations.
▪ “Explicit Level” → Some of these cores can be treated like chips, with pins in and out, but ultimately the only things you can ‘do’ are flip bits and build chains of basic logic operations.
FFT Implementations on FPGAs 6
¡ VHDL (Very Hard Description Language) § Several “stages” to get to VHDL synthesis (writing to the chip): ▪ “Simulation” → Self-explanatory. Does your ‘coding’ do anything?
▪ “Component Level” → Match your operations and cores to specific chip resources. Constrain their relationships by defining clocking regions.
FFT Implementations on FPGAs 7
¡ VHDL (Very Hard Description Language) § Several “stages” to get to VHDL synthesis (writing to the chip): ▪ “Placement and Routing” → A multi-dimensional minimization problem ▪ Aside from SLLs, routing takes up the same resources you would use for logic ▪ Precision timing is crucial. Nothing is over-write safe. Nothing waits without more logic telling it to do so. ▪ Physical distances come into play. (Simulate clock regions with distances, etc.)
FFT Implementations on FPGAs 8
¡ Not line-by-line execution. § Even object-oriented languages are, technically, line-by-line executions.
§ Though the ‘explicit’ syntax may be reminiscent of C – execution is more like LabView. A logic chain, or block, will do whatever it has been written to do as soon as it gets a stimulus to do so (a bit gets flipped). ▪ This means you can theoretically have branching numbers of parallel tasks.
§ The fastest thing on the board is the clock – so we write in “Frames”
FFT Implementations on FPGAs 10
¡ “Won’t some (Genius) Engineer just do this for me?” ¡ Maybe, but having some understanding of these things helps you: § It makes the “precision” and limitations in your code/operations transparent ▪ Forces you to think about bit-accurate math
§ We’ve learned a lot about the implicit assumptions we make about numbers when using computers – we do that a lot on the algorithmic level too, and this is how:
FFT Implementations on FPGAs 11
¡ Though it doesn’t seem like it – a lot of our higher-level programming is still rife with ‘abstract’ math. This is most true of real-time computation, signal processing, data capture and handling, and control.
¡ The best way to explain this is through example: ▪ Let’s do a step-by-step “toy” FFT problem. ▪ Assume fixed-point 8-bit math.
FFT Implementations on FPGAs 12
¡ 8-Point Cooley-Tukey FFT Algorithm – efficiently compute the DFT (N=8)
¡ Let’s define a Twiddle Factor – a number on the complex unit circle:

W_N = e^{-i 2\pi / N} \;\Rightarrow\; W_N^{nk} = e^{-i 2\pi nk / N}   (1)

F_k = \sum_{n=0}^{N-1} f(n)\, e^{-i 2\pi nk / N}   (2)
FFT Implementations on FPGAs 13
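The twiddle-factor definition above is easy to sanity-check numerically. A minimal sketch in Python (not part of the original slides; `twiddle` is a name of my own choosing):

```python
import cmath

def twiddle(N, k):
    """W_N^k = e^(-i*2*pi*k/N): a point on the complex unit circle (Eq. 1)."""
    return cmath.exp(-2j * cmath.pi * k / N)

# Twiddles only rotate -- they never change magnitude -- and W_N^N wraps to 1.
W8_1 = twiddle(8, 1)
full_turn = twiddle(8, 8)
```

Because |W_N^k| = 1, multiplying by a twiddle never grows a value's magnitude; this matters later when we worry about fixed-point bit growth.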
¡ Two ways to do this – separate the sum by even and odd values (Decimation in Time), or by lowest and highest halves (Decimation in Frequency)
¡ “Decimation” refers to which set of values ends up in bit-reversed order
FFT Implementations on FPGAs 14
¡ Reduce the sum to even and odd components, pull the “Twiddle Factor” out of the odd sums, and repeat until the DFTs are reduced to 1 operation in the sum. § For an X-point FFT, log2(X) twiddle factors. § The smallest operations are E+(W*C)
FFT Implementations on FPGAs 15
F_k = \sum_{n=0}^{N-1} f(n)\, W_N^{nk} \;\to\; \sum_{n=0}^{N/2-1} f(2n)\, W_{N/2}^{nk} + \sum_{n=0}^{N/2-1} f(2n+1)\, W_{N/2}^{nk}\, W_N^{k}   (3)

F_k = \sum_{n=0}^{N/2-1} f(2n)\, W_{N/2}^{nk} + W_N^{k} \left[ \sum_{n=0}^{N/2-1} f(2n+1)\, W_{N/2}^{nk} \right]   (4)

F_k = \sum_{n=0}^{N/4-1} f(4n)\, W_{N/4}^{nk} + W_{N/2}^{k} \left[ \sum_{n=0}^{N/4-1} f(4n+2)\, W_{N/4}^{nk} \right] + W_N^{k} \left( \sum_{n=0}^{N/4-1} f(4n+1)\, W_{N/4}^{nk} + W_{N/2}^{k} \left[ \sum_{n=0}^{N/4-1} f(4n+3)\, W_{N/4}^{nk} \right] \right)   (5)
FFT Implementations on FPGAs 16
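The even/odd reduction of Eqs. (3)-(5) can be sketched as a recursive routine and checked against the direct DFT of Eq. (2). This is a behavioral sketch in Python, not FPGA code; the function names are mine:

```python
import cmath

def dft(f):
    """Direct evaluation of Eq. (2): O(N^2) multiply-accumulates."""
    N = len(f)
    return [sum(f[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dit(f):
    """Radix-2 decimation-in-time: split into even/odd halves (Eqs. 3-5)."""
    N = len(f)
    if N == 1:
        return list(f)
    E = fft_dit(f[0::2])                      # even-indexed samples
    O = fft_dit(f[1::2])                      # odd-indexed samples
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]
    # Smallest operation is E + W*O -- the E+(W*C) form from the slide.
    return ([E[k] + W[k] * O[k] for k in range(N // 2)] +
            [E[k] - W[k] * O[k] for k in range(N // 2)])
```

Each recursion level pulls one twiddle factor out of the odd sum, which is why an X-point FFT has log2(X) stages of twiddles.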
¡ Until, for an 8-point FFT, you can write the equation the following way:
¡ Notice here:

F_k = \left( f_0 + f_4 W_2^k \right) + W_4^k \left( f_2 + f_6 W_2^k \right) + W_8^k \left[ \left( f_1 + f_5 W_2^k \right) + W_4^k \left( f_3 + f_7 W_2^k \right) \right]   (6)

W_2^k = e^{-i\pi k} = \begin{cases} -1 & k \in \{\text{odds}\} \\ +1 & k \in \{\text{evens}\} \end{cases}   (7)
FFT Implementations on FPGAs 17
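Equation (6) is just the fully unrolled 8-point sum; a quick numerical check (my own sketch, with hypothetical function names) confirms the factoring:

```python
import cmath

def W(M, k):
    """Twiddle factor W_M^k = e^(-i*2*pi*k/M)."""
    return cmath.exp(-2j * cmath.pi * k / M)

def fft8_factored(f, k):
    """F_k from the fully factored 8-point expression of Eq. (6)."""
    even = (f[0] + f[4] * W(2, k)) + W(4, k) * (f[2] + f[6] * W(2, k))
    odd  = (f[1] + f[5] * W(2, k)) + W(4, k) * (f[3] + f[7] * W(2, k))
    return even + W(8, k) * odd
```

For a constant input every bin except F_0 vanishes, and an impulse input gives 1 in every bin, exactly as the DFT requires.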
¡ Fundamentally the same equation – just a different order of operations. § In these equations the smallest parts become (E ± C)*W
§ Iterating in this way will eventually give you the equation on the last slide. This is because:

e^{-i\pi k} = W_2^k = (-1)^k   (8)

F_k = \sum_{n=0}^{N-1} f(n)\, W_N^{nk} \;\to\; \sum_{n=0}^{N/2-1} f(n)\, W_N^{nk} + \sum_{n=0}^{N/2-1} f(n + N/2)\, W_N^{nk}\, W_N^{(N/2)k}   (9)

F_k = \sum_{n=0}^{N/2-1} \left[ f(n) + (-1)^k f(n + N/2) \right] W_N^{nk}   (10)

FFT Implementations on FPGAs 18
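The DIF reduction of Eq. (10) can likewise be verified numerically: half-length sums with a (−1)^k sign on the shifted half reproduce the full DFT. A sketch (function name is mine):

```python
import cmath

def dif_first_stage(f):
    """One DIF reduction, Eq. (10):
    F_k = sum_{n<N/2} [f(n) + (-1)^k * f(n + N/2)] * W_N^{nk}."""
    N = len(f)
    out = []
    for k in range(N):
        s = sum((f[n] + (-1) ** k * f[n + N // 2]) *
                cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N // 2))
        out.append(s)
    return out
```

Note the order of operations relative to DIT: the add/subtract happens first, the twiddle multiply afterwards.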
¡ Notice that every twiddle factor within each stage, or layer, differs from every other twiddle factor only by k. § For an 8-point FFT, there are three stages: ▪ 1st stage: there are 4 unique twiddle factors, but 2 of these differ from the other 2 only by a sign. These are ±1 and ±(√2/2)(1 − i)
▪ 2nd stage has only ±1, ±i ▪ 3rd stage has only ±1
§ Every frequency bin calculation can then be made from combinations of the twiddle factors and their signs.
FFT Implementations on FPGAs 19
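The sign symmetry claimed on this slide, that advancing k by N/2 only flips a twiddle's sign, follows from W_N^{N/2} = −1 and is easy to confirm (a sketch; the helper name is mine):

```python
import cmath

def W(M, k):
    """Twiddle factor W_M^k = e^(-i*2*pi*k/M)."""
    return cmath.exp(-2j * cmath.pi * k / M)

# Advancing k by N/2 = 4 flips the sign: W_8^{k+4} = -W_8^k, so the second
# half of each stage's twiddles comes for free (a negation, not a multiply).
stage1 = [W(8, k) for k in range(4)]   # includes 1 and (sqrt(2)/2)(1 - i)
stage2 = [W(4, k) for k in range(2)]   # 1 and -i
stage3 = [W(2, 0)]                     # 1 (with -1 from the sign flip)
```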
BUTTERFLY: INTRO
Single butterfly operation. Image is for Decimation in Time. The second term is multiplied by the twiddle factor, then the two terms are added and subtracted. The E+WC term continues on the top row. The E−WC term continues on the bottom row. For DIF, the multiply follows the add/subtract.
( figure 2)FFT Implementations on FPGAs 20
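The two butterfly orderings described above can be sketched as two tiny functions (hypothetical names; the slides only show them pictorially): DIT multiplies before the add/subtract, DIF multiplies after.

```python
def butterfly_dit(E, C, W):
    """Decimation-in-time: multiply the second input by the twiddle first,
    then add and subtract. Top output E+WC, bottom output E-WC."""
    t = W * C
    return E + t, E - t

def butterfly_dif(E, C, W):
    """Decimation-in-frequency: add/subtract first, then multiply the
    difference by the twiddle."""
    return E + C, (E - C) * W
```

Either way, each butterfly is one complex multiply plus an add and a subtract, which is why the hardware counts it as 2 multiply-accumulate operations.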
BUTTERFLY (DIT)
Notation here is slightly different. Key:

W_N^0 = 1,\quad W_N^1 = W_8^1,\quad W_N^2 = W_8^2,\quad W_N^3 = W_8^3
( figure 3)FFT Implementations on FPGAs 21
BUTTERFLY (DIF)
Notation here is slightly different. Key:

W_N^0 = 1,\quad W_N^1 = W_8^1,\quad W_N^2 = W_8^2,\quad W_N^3 = W_8^3
( figure 4)FFT Implementations on FPGAs 22
¡ Either DIF or DIT can be used, but DIF is generally preferred for the following reason: § Naturally ordered inputs avoid shuffle resources
¡ DIT can be used just as well with relatively few resources, depending on point length § “Burst I/Os” use delay-shuffling to reorder without loss of resources.
FFT Implementations on FPGAs 23
¡ Let’s focus on DIF pipelining for now. § If we have time, I’ll go over Burst I/O
¡ Imagine a 4x4-bit (real/imaginary) 8-point FFT, with incoming data being streamed from an ADC
¡ Say we are interested in frequencies up to 100 MHz – so our samples will be streaming in at a rate of 200 MHz
¡ We want frequency-bin outputs streaming out at the same rate as data streams in
FFT Implementations on FPGAs 24
Some Visual Aids: Pipeline-I/O Layout
“Grouping” is not relevant for now. Note that for true constant throughput each “Radix-2 Butterfly” operation is in fact 2 multiply-accumulate operations, using the same ROM memory and twiddle factors for every FFT stage except the final one. For RADIX-2, loading in N points of data takes N cycles in the first stage, N/2 cycles in each intermediate stage, and N cycles for the final transform. For an 8-point FFT this means a latency of ~10 cycles.
( figure 5)
FFT Implementations on FPGAs 25
¡ F1-‐5: § Load x0-‐x4 into M1
¡ F6: § load x5 into M1 § Stage 1 process [x0,x4]
▪ Output as P0, P4 into M2
¡ F7: § Load x6 into M1 § Stage 1 process [x1,x5]
▪ Output as P1, P5 into M2
¡ F1-‐5: § M1: [x0,x1,x2,x3,x4,-‐-‐,-‐-‐,-‐-‐]
¡ F6: § M1: [x0,x1,x2,x3,x4,x5,-‐-‐,-‐-‐] § M2: [P0,-‐-‐,-‐-‐,-‐-‐, P4,-‐-‐,-‐-‐,-‐-‐]
¡ F7: § M1: [x0,x1,x2,x3,x4,x5,x6,-‐-‐] § M2: [P0,P1,-‐-‐,-‐-‐,P4,P5,-‐-‐,-‐-‐]
FFT Implementations on FPGAs 26
¡ F8: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,-‐-‐,P4,P5,P6,-‐-‐]
¡ F8: § Load x7 into M1 § Stage 1 process [x2,x6]
▪ Output as P2, P6 into M2 ▪ Can now start Stage 2 Processing
FFT Implementations on FPGAs 27
¡ F9: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,-‐-‐,Q2,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐]
¡ F9: § Load x0 (new set!) into M1 § Stage 1 process [x3,x7]
▪ Output as P3, P7 into M2 ▪ (last Stage 1 process for 1st set)
§ Stage 2 process [P0,P2] ▪ Output as Q0, Q2 into M3
FFT Implementations on FPGAs 28
¡ F10: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,-‐-‐,-‐-‐,-‐-‐,-‐-‐]
¡ F10: § Load x1 § Stage 2 process [P1,P3]
▪ Output as Q1, Q3 into M3 ▪ Now have [Q0, Q1] & [Q2, Q3] and
can begin Stage 3 Processing
FFT Implementations on FPGAs 29
¡ F11: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,-‐-‐,Q6,-‐-‐] § Output:
[F0,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐,-‐-‐]
¡ F11: § load x2 into M1 § Stage 2 process [P4,P6]
▪ Output as Q4, Q6 into M3
§ Stage 3 Process [Q0, Q1]+ ▪ Output Fbin [F0]
FFT Implementations on FPGAs 30
¡ F12: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,-‐-‐,-‐-‐,-‐-‐,F4,-‐-‐,-‐-‐,-‐-‐]
¡ F12: § Load x3 into M1 § Stage 2 process [P5,P7]
▪ Output as Q5, Q7 into M3 ▪ Finished Set 1 Stage 2 processing
§ Stage 3 Process [Q0, Q1]− ▪ NOTE: still only 1 Multiply-Accumulate operation performed ▪ Output Fbin [F4]
FFT Implementations on FPGAs 31
¡ F13: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,-‐-‐,F2,-‐-‐,F4,-‐-‐,-‐-‐,-‐-‐]
¡ F13: § Load x4 into M1 § Stage 3 Process [Q2, Q3]+
▪ Output Fbin [F2]
FFT Implementations on FPGAs 32
¡ F14: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,-‐-‐,F2,-‐-‐,F4,-‐-‐,F6,-‐-‐]
¡ F14: § Load x5 § Stage 1 process [x0,x4]
▪ Output as P0, P4 into M2 § Stage 3 Process [Q2, Q3]-‐
▪ Output Fbins [F6]
FFT Implementations on FPGAs 33
¡ F15: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,-‐-‐,F4,-‐-‐,F6,-‐-‐]
¡ F15: § load x6 into M1 § Stage 1 process [x1,x5]
▪ Output as P1, P5 into M2 § Stage 3 Process [Q4, Q5]+
▪ Output Fbins [F1]
FFT Implementations on FPGAs 34
¡ F16: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,-‐-‐,F4,F5,F6,-‐-‐]
¡ F16: § Load x7 into M1 § Stage 1 process [x2,x6]
▪ Output as P2, P6 into M2 ▪ Can now start S2 Processing on Set
2
§ Stage 3 Process [Q4, Q5]-‐ ▪ Output Fbin [F5]
FFT Implementations on FPGAs 35
¡ F17: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,F3,F4,F5,F6,-‐-‐]
¡ F17: § Load x0 (Set3) into M1 § Stage 1 process [x3,x7]
▪ Output as P3, P7 into M2 ▪ (last Stage 1 process for 2nd set)
§ Stage 2 process [P0,P2] ▪ Output as Q0, Q2 into M3
§ Stage 3 Process [Q6, Q7]+ ▪ Output Fbin [F3]
FFT Implementations on FPGAs 36
¡ F18: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,F3,F4,F5,F6,F7]
¡ F18: § Load x1 § Stage 2 process [P1,P3]
▪ Output as Q1, Q3 into M3 ▪ Now have [Q0, Q1] & [Q2, Q3] and
can begin S3 Processing on Set 2
§ Stage 3 Process [Q6, Q7]-‐ ▪ Output Fbins [F7]
FFT Implementations on FPGAs 37
¡ F19: § M1:
[x0,x1,x2,x3,x4,x5,x6,x7] § M2:
[P0,P1,P2,P3,P4,P5,P6,P7] § M3:
[Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7] § Output:
[F0,F1,F2,F3,F4,F5,F6,F7]
¡ F19: § load x2 into M1 § Stage 2 process [P4,P6]
▪ Output as Q4, Q6 into M3
§ Stage 3 Process [Q0, Q1]+ ▪ Output Fbin [F0] (First Fbin for Set
2)
FFT Implementations on FPGAs 38
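The frame-by-frame walkthrough above computes exactly three DIF stages (M1 → M2 → M3 → output), with the F bins emerging in bit-reversed order (F0, F4, F2, F6, F1, F5, F3, F7). A behavioral sketch of those stages in Python (my own reconstruction of the walkthrough, not the original VHDL):

```python
import cmath

def W(M, k):
    """Twiddle factor W_M^k = e^(-i*2*pi*k/M)."""
    return cmath.exp(-2j * cmath.pi * k / M)

def dif_pipeline_8pt(x):
    """The three DIF stages from the walkthrough. Returns the F bins in
    natural order; the hardware emits them bit-reversed."""
    # Stage 1: butterfly [x_n, x_{n+4}] -> P_n, P_{n+4}
    P = [0j] * 8
    for n in range(4):
        P[n]     = x[n] + x[n + 4]
        P[n + 4] = (x[n] - x[n + 4]) * W(8, n)
    # Stage 2: butterfly [P_n, P_{n+2}] within each half -> Q
    Q = [0j] * 8
    for base in (0, 4):
        for n in range(2):
            Q[base + n]     = P[base + n] + P[base + n + 2]
            Q[base + n + 2] = (P[base + n] - P[base + n + 2]) * W(4, n)
    # Stage 3: adjacent sums/differences; outputs land in bit-reversed slots
    order = [0, 4, 2, 6, 1, 5, 3, 7]
    F = [0j] * 8
    for i, base in enumerate((0, 2, 4, 6)):
        F[order[2 * i]]     = Q[base] + Q[base + 1]
        F[order[2 * i + 1]] = Q[base] - Q[base + 1]
    return F
```

The `order` table encodes the bit-reversed positions in which the walkthrough emitted its bins ([Q0,Q1] → F0/F4, [Q2,Q3] → F2/F6, and so on).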
¡ Imagine running this 8-bit, 8-point FFT on an 8-bit machine. Will it work as presented?
FFT Implementations on FPGAs 39
¡ Imagine running this 8-bit, 8-point FFT on an 8-bit machine. Will it work as presented? § Nope! § There is an implicit overflow problem here
¡ Each addition/subtraction threatens, at worst, to add 1 bit to the values we are operating on. § The safest rescale strategy is to rescale (truncate) by 1 bit each stage. For an 8-point FFT, this means truncating by 3 bits. ▪ Bit growth can be a real problem for long calculation chains. Imagine a 65,536-point FFT (16 bits of rescaling).
FFT Implementations on FPGAs 40
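The overflow argument can be illustrated with a toy fixed-point butterfly on real integers (a sketch under my own assumptions: signed 8-bit samples, no twiddle multiply):

```python
def butterfly_with_rescale(a, b, bits=8):
    """Toy fixed-point butterfly on signed `bits`-bit integers (real data,
    no twiddle): the sum of two full-scale values needs bits+1 bits, so we
    shift right by one each stage to stay in range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    s, d = a + b, a - b            # each can need bits+1 bits
    s >>= 1                        # rescale: truncate one bit per stage
    d >>= 1
    assert lo <= s <= hi and lo <= d <= hi
    return s, d

# Without the shift, two full-scale 8-bit samples overflow:
a, b = 127, 127
print(a + b)                          # 254 -- does not fit in signed 8 bits
print(butterfly_with_rescale(a, b))   # (127, 0) -- back in range
```

Truncating one bit per stage keeps every intermediate inside the 8-bit range at the cost of up to half an LSB of error per stage; applying the same policy to a 65,536-point FFT is the 16-bit rescale mentioned above.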
¡ This normalization is independent of the standard FFT normalization we think of, and is related to the maximum precision § For an 8-bit-in-8-bit-out system (like the ones used on the CHIME boards) it is strictly about controlling overflow and keeping resource use to a minimum
§ What if you have an 8-bit ADC (or other data-capture system), but are capable of doing higher-precision math on your FPGA? ▪ Bit scaling becomes more of a dynamic question. ▪ Can you afford not to scale, and keep every bit? ▪ Can you make assumptions about your data?
FFT Implementations on FPGAs 41
¡ Xilinx DSP Spec and FFT LogiCORE Sheet: § http://www.xilinx.com/support/documentation/data_sheets/
ds180_7Series_Overview.pdf § http://www.xilinx.com/support/documentation/user_guides/ug369.pdf § www.xilinx.com/support/documentation/ip_documentation/ds808_xfft.pdf § http://www.xilinx.com/support/documentation/user_guides/ug073.pdf
¡ http://www.cs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html
¡ http://wwwhome.ewi.utwente.nl/~gerezsh/sendfile/sendfile.php/idsp-fft.pdf?sendfile=idsp-fft.pdf
FFT Implementations on FPGAs 42
¡ GPU vs FPGA: A closer look § Algorithmic changes (specifically with respect to FFTs)?
§ Performance advantages § Application differences
¡ FPGA “Burst I/O” FFT Strategy
FFT Implementations on FPGAs 43
¡ Hardware comprises almost entirely multiply-accumulates and fast memory (DSPs + ROM) § Unlike an FPGA – fewer specialized regions (more on this later) and far fewer routing options.
¡ Uses an “Instruction Set” that demands specific drivers, data types, and I/O specifications, and includes background homeostatic processes (“Dark Processes”) § The advantage being, among other things, that it can “run code” (typically C)
FFT Implementations on FPGAs 44
¡ As Pat mentioned – GPU programming is hyper-parallelized, in a way that IS reminiscent of an FPGA. § Computational chains are done by our very fast dedicated DSPs
§ Multiple quasi-independent operations can occur simultaneously (like our butterflies and matrix multiplication)
FFT Implementations on FPGAs 45
¡ However, GPUs adhere to very strict constraints which make them inappropriate for a variety of tasks: § A complete lack of routing flexibility makes many tasks pathological: ▪ Anything involving lots of data shuffling, or different discrete operations happening in different physical areas of the chip, will be SLOW (for instance, because of different types of incoming data, or required pre-processing)
FFT Implementations on FPGAs 46
¡ However, GPUs adhere to very strict constraints which make them inappropriate for a variety of tasks: § They are fixed-computation/fixed-programming acceleration. ▪ Computation ‘chains’ are specified by the number of allowed ‘threads’
▪ Data sharing is done by ‘thread-locking’ or ‘thread-sharing’ ▪ Generally this process of data sharing adds latency
▪ The types of task and the number of simultaneous chains are largely specified by their physical design
FFT Implementations on FPGAs 47
256-PT FFT on a GPU
Notice how it is broken down into 4-PT FFTs per thread, and then each thread exchanges the intermediate data products? That sharing adds latency, and necessitates large buffers. In an FPGA, the pipelining implementation I showed earlier scales fine with FFT length – all you do is add initial latency. If you need additional DSPs at any time in FFT stages (depending on sampling rate, etc.), you can simply branch off – spawning threads like a tree ad infinitum (resources permitting)
FFT Implementations on FPGAs 48
¡ The fact that they “run code” and use an instruction set means that the efficiency/performance of any operation being performed also suffers from, or is mediated by, the efficiency with which the commands can be ‘compiled’ onto the core. § You could make the analogy to writing more or less efficient protocols on an FPGA, but it’s not completely valid – this compile translation adds another middle-man that can, in addition to being less efficient than writing it explicitly in VHDL, result in unpredicted or variable latency that now has dependencies on your compiler, etc.
FFT Implementations on FPGAs 49
¡ GPUs have only 1 very specific supported data I/O mechanism: § PCI-Express
¡ This means that any information streaming to a GPU must be ‘pre-processed’, either with ASICs (Application-Specific ICs) or, in many cases, by the CPU – placing GPU-friendly data into shared memory buffers. § For graphics processing, this is A-OK! § For data capture, this is very bad
FFT Implementations on FPGAs 50
¡ Remember our toy scenario? An 8-bit ADC streaming data samples at 200 MHz. § These byte-streams cannot be routed directly to the GPU. To use a GPU for this, some IC would have to interpret the data to the correct protocol. ▪ Fun fact – this is now often done with small FPGAs, since they can do this with minimal latency, and can be built onto a PCI / PCI-Express card (or similar) to process incoming data into protocols recognizable by GPUs, CPUs, or fit to be streamed out over Ethernet, TCP/IP, Infiniband, etc.
FFT Implementations on FPGAs 51
§ FPGAs don’t care what protocol is being used. Data is data. Bits are bits. It relies on YOU to write the procedure to slap on prefix/suffix identifiers. ▪ They have dedicated, optimized components ON the chip to strip Ethernet, TCP/IP, etc. protocols and feed them to computation chains.
FFT Implementations on FPGAs 52
¡ GPUs have a lower-‐level architecture that runs maintenance protocols, optimizes and monitors on-‐chip resource use, performs checks for corruption/failures/errors in both operation and data output, and in general “Manages and Regulates data flow.” § This provides extra redundancy § On-‐the-‐fly error correction/recognition § Power optimization (in some cases) § and a degree of standardization and user-‐friendliness.
FFT Implementations on FPGAs 53
¡ GPUs have a lower-‐level architecture that runs maintenance protocols, optimizes and monitors on-‐chip resource use, performs checks for corruption/failures/errors in both operation and data output, and in general “Manages and Regulates data flow.” § Intrinsic to these processes is the use of lots of memory buffers and handoffs that would not be used in FPGAs
§ It also dramatically increases latency § Makes latencies dynamic, unpredictable, and dependent on other processes or the state of the chip as a function of time
FFT Implementations on FPGAs 54
¡ An additional algorithmic difference is that GPUs natively work with floating-point numbers – so any captured data will have to undergo conversion in this respect as well. § In addition to adding complication and latency, and perhaps obscuring the question of precision a bit, this complicates notions of dynamic scaling and computational bit growth
FFT Implementations on FPGAs 55
¡ Programmers do not have access to GPU LUTs § This means that things like twiddle factors will have to be calculated and stored in ROM memory at least once (maybe more, depending on the specific application)
¡ Implementing bit-growth control measures isn’t quite as transparent. § Bit-accurate C can be written, but because data is subject to other ‘shadow processes’ it may not always be obvious where bit truncation occurs (if necessary) or how round-off errors propagate.
§ Alternatively, the GPU may perform such truncation automatically to avoid overflows.
FFT Implementations on FPGAs 56
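The truncation/round-off point can be made concrete: a bare right shift always rounds toward minus infinity, so it carries a systematic negative bias that a rounding add largely removes. A generic fixed-point sketch (nothing GPU-specific; the names are mine):

```python
def truncate(x, drop=2):
    """Drop LSBs with an arithmetic shift: rounds toward -infinity."""
    return x >> drop

def round_half_up(x, drop=2):
    """Add half an LSB before shifting: costs one extra add per sample."""
    return (x + (1 << (drop - 1))) >> drop

# Accumulated error over a full cycle of input residues, in output LSBs:
vals = range(16)
trunc_bias = sum(truncate(v) - v / 4 for v in vals)    # systematically negative
round_bias = sum(round_half_up(v) - v / 4 for v in vals)
```

Across a long chain of stages, a bias like this accumulates into a DC offset in the output bins; this is exactly the kind of behavior that is hard to audit when the rescaling happens somewhere you cannot see.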
¡ Clean-cut “speed comparisons” are difficult because it’s tough to compare apples to apples: § Clock Speed ▪ Very misleading. ▪ FPGAs ~200-300 MHz (base), 2+ GHz (DSP, I/O) ▪ GPUs are typically measured in GHz, with a pre-defined number of threads (unlike FPGAs, with no well-defined limit on concurrent operations)
▪ Less clear how background processes dynamically shuffle their load with respect to stealing cycles or freezing resources
▪ Higher clock speed is directly and clearly proportional to max throughput on an FPGA. This is not so on GPUs, by a long shot
FFT Implementations on FPGAs 57
¡ Flops/Second: § A slightly fairer comparison – but we still run into the problem of transparency on a GPU
§ This is where optimization can matter hugely – GPUs can easily out-‐shine FPGAs on specific tasks tailored to their architecture
§ However, GPUs can perform extremely poorly on tasks where data must be moved very inefficiently around the chip, and between threads
§ FPGAs are robust to ‘pathological’ cases because of the flexibility in routing and protocol management
FFT Implementations on FPGAs 58
¡ Flops/Joule § The easy winner is the FPGA ▪ Exact numbers depend on the utility, but the margin can be as much as ~50%
§ Fewer things going on to perform the same task. § A large availability of very efficient logic slices where DSPs would be overkill (like routing), where GPUs rely almost entirely on power-hungry DSP blocks.
§ Heavily optimized routing means fewer slices being utilized, and less power moving around the chip.
FFT Implementations on FPGAs 59
¡ Flops/$, Bang-for-your-buck ▪ (Let’s disregard people-hour costs for writing the code)
§ In general, if your task *can* be performed adequately and reliably on a GPU, it will be far cheaper to do so
§ However, if your task depends on pretty much anything real-time, like data capture or protocol interpretation, GPUs are a no-go
§ Hybrid systems are very popular for this: ▪ LOFAR (does) & CHIME (will) use FPGAs to capture data and control ADCs (& the like), perform on-the-fly FFTs, interpret to PCI-Express or Gigabit Ethernet, and stream to a GPU which acts as the correlator
FFT Implementations on FPGAs 60
¡ GPUs: § Cheap, easy to use, fast* (heavily dependent on your operation) § High (in this context) latency, variable/unpredictable latency ▪ Unsuitable for constant throughput at rates approaching clock speed
§ Extremely limited I/O, can’t talk to most I/Os § Rely heavily on buffers and cached memory: can run into a bottleneck problem where data backs up § Power hungry § Performance fluctuates in non-deterministic ways depending on the internal state of the chip, background processes, and automatic resource allocations / compilers / instruction sets ▪ Not guaranteed to perform exactly the same on different machines, even in controlled environments
FFT Implementations on FPGAs 61
¡ FPGAs: § Can be expensive and difficult to use, but consistent § Latency depends entirely on your utilization of resources and the theoretical limits of the algorithm § Achievable constant throughput scales exactly with clock speed(s) § Extremely flexible I/O § Can be very highly optimized § Power efficient, thermally efficient § No pre-defined threads or thread counts/limits § No shadow processes – what you write is what you get § Precision-transparent data manipulation and flow
FFT Implementations on FPGAs 62
BACKUP SLIDE RADIX-‐2 Burst I/O
Switches shuffle the correct order for each successive butterfly, cyclically permuting intermediate terms within the raw incoming data samples. DSPs can typically clock at 8x the max FPGA clock speed – so for resource-starved projects this can be a good option, since at a minimum 8x fewer resources can be used. The catch is that it can only be written using DSP slice logic (which is typically in high demand), and the LUT/switching logic is considerably more difficult ( figure 6)
FFT Implementations on FPGAs 63
BACKUP SLIDE RADIX-‐2 Burst I/O
To get us from Frame 1 to Frame 19 (as we did in the Pipeline implementation earlier), we need 30 DSP operations (real), or 60 DSP operations (real + imaginary, max). So, in 14 clock cycles [19 − 5 initial loading cycles] we need to do 60 operations. The most we have to do per frame is 5 (real), or 10 (real + imaginary), corresponding to Stage 1 + Stage 2 + (½) Stage 3, plus clever addressing and shuffling. If our DSP can clock 8x over the input rate, we can perform 112 operations in that time.
( figure 6)
FFT Implementations on FPGAs 64
BACKUP SLIDE RADIX-‐2 Burst I/O
This is misleading though, because we need to wait 8 clock cycles before each new piece of data comes in. So for almost every (pipeline) frame our BURST implementation will have idle frames (which can be spent shuffling data to make addressing easier, but don’t need to be). However, for every (pipeline) frame with all three stages doing computations, we will be 2 cycles short (1 real + 1 imaginary). ( figure 6)
FFT Implementations on FPGAs 65
BACKUP SLIDE RADIX-‐2 Burst I/O
For our 8-pt FFT we can get away with this, because only Stage 3 is always working (our half-stage), so most of the time only 6 operations are done. This gives us time to catch up. We can also add another DSP to handle just Stage 3 (or Stage 1, or Stage 2). We want to divide the labor by stages, not by alternating frames, because of the interleaving of intermediate data products.
( figure 6)
FFT Implementations on FPGAs 66
RADIX-‐2 Burst I/O
Example Operations Color Coded by Pipeline Frames:
S1|S1|S1|S1S2|S2|S2S3|S2S3|S3|S1S3|S1S3|S1S3|S1S2S3|S2S3
( figure 6)FFT Implementations on FPGAs 67
BACKUP SLIDE RADIX-‐4 Burst I/O
Switches shuffle the correct order for each successive dragonfly (the radix-4 butterfly), cyclically permuting intermediate terms within the raw incoming data samples. DSPs can typically clock at 8x the max FPGA clock speed – so for resource-starved projects this can be a good option, since at a minimum 16x fewer resources can be used compared to RADIX-2 implementations. The catch is that it can only be written using DSP slice logic (which is typically in high demand), and the LUT/switching logic is considerably more difficult
( figure 7)
FFT Implementations on FPGAs 68