introduction to cerg:cryptographic engineering research group … · 2019-08-28 · asic stratix...
Post on 22-May-2020
9 Views
Preview:
TRANSCRIPT
8/28/19
1
Introduction to Post-Quantum Cryptography
CERG: Cryptographic Engineering Research Group
23 faculty members, 9 Ph.D. students,
6 affiliated scholars
Previous Cryptographic Contests
time97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17
AES
NESSIE
CRYPTREC
eSTREAM
SHA-3
34 stream 4 HW winnersciphers ® + 4 SW winners
51 hash functions ® 1 winner
15 block ciphers ® 1 winnerIX.1997 X.2000
I.2000 XII.2002
IV.2008
X.2007 X.2012
XI.2004
CAESARI.2013
57 authenticated ciphers ®multiple winnersII. 2019
Cryptographic Contests 2007-Present
Year07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
SHA-3
51 hash functions ® 1 winnerX.2007 X.2012
CAESARI.2013
57 authenticated ciphers ®multiple winners
II.2019
Post-Quantum
56 Lightweight authenticated ciphers & hash functions
VIII.2018Lightweight
69 Public-Key Post-Quantum Cryptography Schemes
XII.2016
TBD
TBD
CompletedIn progressX.2012X.2012
8/28/19
2
Why Cryptographic Contests?Avoid back-door theoriesSpeed-up the acceptance of a new standardStimulate non-classified research on methods of designing a specific cryptographic transformationFocus the effort of a relatively small cryptographic community
5
Evaluation Criteria
6
Security
Software Efficiency Hardware Efficiency
Simplicity
FPGAs ASICs
Flexibility Licensing
µProcessors µControllers
AES Contest, 1997-2000: Block Ciphers
7
Straw Poll @ NIST AES 3 conferenceGMU FPGA Results
Rijndael second best in FPGAs,selected as a winner due to much better performance
in software & ASICs
SHA-3 Round 2: 14 candidates
8
Throughput vs. Area: Normalized to Results for SHA-2 and Averaged over 7 FPGA Families
Area
Throughput
Best: Fast & Small
Worst: Slow & Big
8/28/19
3
SHA-3 Contest, 2007-2012: Hash Functions
9
ASIC Stratix III FPGA
Keccak – an early leader in hardware and winnerof the competition 10
CAESAR, 2013-2018: Authenticated Ciphers
Relative (w.r.t. AES-GCM) Throughput in Virtex 6
in red – algorithms qualified to Round 3
14-16x better than the current standard
AEGIS and MORUS – finalists in the use case High-Performance Applications
Post-Quantum Cryptography (PQC)
11
• Public-key cryptographic algorithms for which there are no known attacks using quantum computers
• Capable of being implemented using any traditional methods, including software and hardware
• Running efficiently on any modern computing platforms: PCs, tablets, smartphones, servers with FPGA accelerators, etc.
• Term introduced by Dan Bernstein in 2003• Equivalent terms: quantum-proof, quantum-safe or
quantum-resistant• Based entirely on traditional semiconductor VLSI technology!
Quantum Computers
12
• First perceived by physicists (Richard Feynman, David Deutsch) in 1980s
• First significant quantum algorithms(capable of running on quantum computers only) developed in 1990s
• First practical realization in 1998(2 qubits)
• Significant technological advances during the last 20 years
• One of ten breakthrough technologiesof 2017
Photo: https://www.technologyreview.com
8/28/19
4
13Source: Vandersypen, PQCrypto 2017
Major advances during the last 20 years
Timeline of Quantum Computing: https://en.wikipedia.org/wiki/Timeline_of_quantum_computing
Quantum Computers
14
• Substantial investments by:Google, IBM, Intel, Microsoft,Alcatel-Lucent, NTT
• Quantum computers based on superconducting circuitsoperating in the temperatureclose to absolute 0 (~0.01 K)
Photos: https://www.technologyreview.com
• November 2017: IBM’s 50-qubit chip• January 2018: Intel’s 49-qubit chip,
“Tangle-Lake” • March 2018: Google’s 72-qubit chip
“Bristlecone”
Progress in Quantum Computing
15Photos: https://www.technologyreview.com
Remaining Challenges in Quantum Computing
1. High sensitivity to manufacturing variationsSolution: Best industry cleanrooms, e.g., QuTech-Intel collaboration toward quantum-dot arrays made @ Intel 300mm wafers
2. Scalable control circuits (currently bulky & expensive)Solution: Tailored cryo-CMOS digital control
3. Multitude of interconnects and external pinsSolution: Multiplexing electronics co-integrated with qubits
4. Non-standard architecture & limited programmabilitySolution: System layer approach
Likely to be overcome in the next 10-15 years
Source: Vandersypen, PQCrypto 2017
8/28/19
5
System Layer Approach
17Source: Vandersypen, PQCrypto 2017
What Quantum Computers Can Do?
18Source: Vandersypen, PQCrypto 2017
Quantum Computers & Cryptography
19
1994: Shor’s Algorithm, breaks major public key cryptosystems based on
Factoring: RSA
Discrete logarithm problem (DLP): DSA, Diffie-Hellman
Elliptic Curve DLP: Elliptic Curve Cryptosystems
independently of the key size assuming
a sufficiently powerful and reliable quantum computer available
Underlying Mathematical Problem - RSA
N =P*Q (P, Q random primes) 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240
479274737794080665351419597459856902143413=
334780716989568987860441698482126908177047949837137685689124313889828837938780022876147116525317
43087737814467999489*
36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396
810270092798736308917
Record Using Classical Computers, 232 digits, 768 bits
8/28/19
6
How Real Is the Danger?
Source: Vandersypen, PQCrypto 2017
“There is a 1 in 7 chance that some fundamental public-key crypto will be broken by quantum by 2026, and a 1 in 2 chance of the same by 2031.”Dr. Michele MoscaDeputy Director of the Institute for Quantum Computing, University of WaterlooApril 2015
21
“Theorem” by Mosca
y – Time to Develop & Deploy PQC Standards
x – Time Information Must Remain Protected
z – Time to Build Quantum Computers
If z < y + x, then worry!
Encrypted Data Stored by Powerful Adversaries
No Announcement when Quantum Computer Available to NSA, Foreign Governments,
or Organized Crime
NIST PQC Standardization Process
Feb. 2016: NIST announcement of standardization plans at PQCrypto 2016, Fukuoka, Japan,
Dec. 2016: NIST Call for Proposals and Request for Nominations for Public-Key Post-Quantum Cryptographic Algorithms:
Nov. 30, 2017: Deadline for submitting candidates
Dec. 2017: Announcement of the First Round Candidates
23
Three Types of PQC Schemes
1. Public Key Encryption
2. Digital Signature
3. Key Encapsulation Mechanism (KEM)
8/28/19
7
Five Security Categories
Level Security DescriptionI At least as hard to break as AES-128 using exhaustive
key searchII At least as hard to break as SHA-256 using collision
searchIII At least as hard to break as AES-192 using exhaustive
key searchIV At least as hard to break as SHA-384 using collision
searchV At least as hard to break as AES-256 using exhaustive
key search
Leading PQC Families
26
Family Encryption/KEM
Signature
Hash-based XX
Code-based XX X
Lattice-based XX X
Multivariate X XX
Isogeny-based X
XX – high-confidence candidates, X – medium-confidence candidates
Underlying Mathematical Problem – PQC
Closest Vector Problem
Lattice in dimension n=2:Set of points given by
p = i⋅b1 + j ⋅b2
where i and j are arbitrary integers
Problem: Find the point of latticegiven by the base vectorsb1 and b2 closest to the arbitrary point of an n-dimensional space tImagine n in the range
of 825
Underlying Mathematical Problem – PQC
Solving a system of m quadratic equations with n unknowns
Imagine m and n in the range of 70 and above
8/28/19
8
Round 1 Candidates
Family Signature Encryption/KEM Overall
Lattice-based 5 22 27
Code-based 2 16 18
Multivariate 7 2 9
Hash-based 2 2
Isogeny-based 1 1
Other 3 4 7
Total 19 45 64
69 accepted as complete, 5 withdrawn26 Countries, 260 co-authors
29
Round 1 Submissions
69 Submissions, 26 Countries, 260 co-authors
NIST PQC Standardization Process
Jan. 30, 2019: Announcement of candidates qualified to Round 2
April 10, 2019: Publication of Round 2 submission packages
Aug. 22-24, 2019: Second NIST PQC Conference
2020: Beginning of Round 3 and/or selection of first future standards
2022-2024: Draft standards published
31
Round 2 Candidates
Family Signature Encryption/KEM Overall
Lattice-based 3 9 12
Code-based 7 7
Multivariate 4 4
Symmetric-based
2 2
Isogeny-based 1 1
Total 9 17 26
26 Candidates announced on January 30, 2019
32
8/28/19
9
Round 2 Submissions
Sources: Moody, PQCrypto May 2019
The 2nd Round Candidates• Encryption/KEMs (17)
▪ Digital Signatures (9)
• CRYSTALS-KYBER• FrodoKEM• LAC• NewHope• NTRU (merger of NTRUEncrypt/NTRU-HRSS-KEM)• NTRU Prime• Round5 (merger of Hila5/Round2)• SABER• Three Bears
• BIKE• Classic McEliece• HQC• LEDAcrypt (merger of LEDAkem/pkc)• NTS-KEM• ROLLO (merger of LAKE/LOCKER/Ouroboros-R)• RQC
• SIKE
• CRYSTALS-DILITHIUM• FALCON• qTESLA
• Picnic• SPHINCS+
• GeMSS• LUOV• MQDSS• Rainbow
NIST Report on the 1st Round: https://doi.org/10.6028/NIST.IR.8240
• Lattice-based• Code-based• Isogenies
• Lattice-based• Symmetric-based• Multivariate
Similarities with Previous Contests
Evaluation to be performed in rounds (12-18 months each)A pool of candidates narrowed down after each roundSmall tweaks allowed at the beginning of each roundOptimized software implementation developedfor various platformsNo immediate plans for obligatory hardware implementations
Differences from Previous Contests
Candidates not qualified to the next round (and not withdrawn by the authors) may be considered at a later dateTaking into account quantum attacks, possible only on the platforms that do not exist at the time of the standard developmentSecurity analysis much more challengingand often controversial
Adi Shamir’s Proposal
After the initial evaluation period (e.g., 3 years) the division of all schemes into the following categories:
2 Productions Schemes: recommended for actual wide-scale deployment, Highly Trusted4 Development Schemes: Time-Tested, Trusted;at least 15 years of analysis behind them;Intended for initial R&D by industry.8 Research Schemes: Promising Properties, Good Performance. May contain some high-risk candidates. Main Goal: Concentrate the effort of the research community.
8/28/19
10
PQCRYPTO Consortium
11 universities and companiesFunded by European Commission under the H2020 program
Initial Recommendations published in 2015Co-authors of 22 Submissions
The Most Trusted Schemes – Encryption/KEM
Classical McElieceProposed 40 years ago as an alternative to RSACode-based familyBased on binary Goppa codesNo patentsConservative parameters (Category 5, 256-bit security):a) length n=6960, dimension k= 5413, errors=119b) length n=8192, dimension k= 6528, errors=128
Complexity of the best attack identical after 40 years of analysis, and more than 30 papers devoted to thorough cryptanalysisSizes:Public key: a) 1,047,319 bytes, b) 1,357,824 bytesPrivate key: a) 13,908 bytes, b) 14,080 bytesCiphertext: a) 226 bytes, b) 240 bytes Efficient Software (Haswell, larger parameter set)
295,930 for encryption, 355,152 for decryptionConstant time
Efficient Hardware (Yale)
The Most Trusted Schemes – Signatures
Hash-based Schemes:
Representatives:
SPHINCS-256 => SPHINCS+
Security based on the security of a single underlying primitive
hash function
Features:
Efficient signature generation and verificationRelatively large keys (~ tens of kilobytes)
Likely Development Schemes – Lattice-based
• NTRUEncrypt – proposed in 1996, published in 1998
• Efficient encryption & decryption
• Relatively small key sizes (kilobytes)
• Suitable for constrained environments
• Standards: IEEE P1363.1-2008, ANSI X9.98-2010,
Consortium for Efficient Embedded Security EESS 2015
• No proof of security
8/28/19
11
Likely Development Schemes – Lattice-based
• New lattice-based schemes with extended security proof,
smaller key sizes, and better efficiency
CRYSTALS-KYBER
Ring-LWE (Learning with Errors)
Pilot Hardware Implementations:
• Ruhr University of Bochum, Germany• Technical University Munich, Germany• ESAT/COSIC KU Leuven, Belgium• The University of Texas at Austin, USA• George Mason University, USA, etc.
Binary RLWE
NewHope CRYSTALS-DILITHIUM
PQC Major Challenges
42
Fairness Number of Candidates
Challenges of Post-Quantum CryptographyMathematical complexity
Large amount of man-power
Large keys and internal statesHardware resources required
New types of basic operationsNeed for random sampling not only from uniform but also from discrete Gaussian and/or other distributionsConstant-time implementationsNeed for new SCA (Side-Channel Attack) countermeasures against power and electromagnetic analysisPlug-and-play replacement for current public-key cryptography units
Intermediate use of hybrid systems43
Hardware Benchmarking
44
8/28/19
12
Major Optimization Targets
High-SpeedLightweight
• Parallel processing• Constant-time• Parametric code
• Small area, power,energy per bit
• Resistance to power& electromagnetic analysis
Round 2 Candidates in Hardware#Round 2candidates
5
14
29
26
AES
SHA-3
CAESAR
PQC
Implementedin hardware
5
14
28
?
Percentage
100%
100%
97%
?
46
Software/HardwareCodesign
47
Most time-critical operation
Software
Hardware
Software/Hardware Codesign
48
8/28/19
13
SW/HW Codesign: Motivational Example 1
49
Software Software/Hardware
91% major operation(s) 9% other operations
~1% major operation(s) in HW9% other operations in SW
speed-up ≥ 100
Total Speed-Up ≥ 10
Other9%
Major91%
Major~1%
Other9%
Time saved90%
SW/HW Codesign: Motivational Example 2
50
Software Software/Hardware
99% major operation(s) 1% other operations
~1% major operation(s) in HW1% other operations in SW
speed-up ≥ 100
Total Speed-Up ≥ 50
Other1%
Major99%
Major~1%
Other1%
Time saved98%
SW/HW Codesign: Advantages
51
v Focus on a few major operations, known to be easily parallelizable
§ much shorter development time (at least by a factor of 10)§ guaranteed substantial speed-up§ high-flexibility to changes in other operations (such as candidate
tweaks)v Insight regarding performance of future instruction set extensions of
modern microprocessors
v Possibility of implementing multiple candidates by the same research group, eliminating the influence of different
§ design skills§ operation subset (e.g., including or excluding key generation)§ interface & protocol§ optimization target
§ platform
SW/HW Codesign: Potential Pitfalls
52
v Performance & ranking may strongly depend on
A. features of a particular platform
o Software/hardware interface
o Support for cache coherency
o Differences in max. clock frequency
B. selected hardware/software partitioning
C. optimization of an underlying software implementation
v Limited insight on ranking of purely hardware implementations
First step, not the ultimate solution!
8/28/19
14
Two Major Types of Platforms
53
FPGA Fabric& Hard-core Processors
FPGA Fabric, including Soft-core Processors
Examples:• Xilinx Zynq 7000 System on Chip (SoC)• Xilinx Zynq UltraScale+ MPSoC• Intel Arria 10 SoC FPGAs• Intel Stratix 10 SoC FPGAs
Examples:Xilinx Virtex UltraScale+ FPGAsIntel Stratix 10 FPGAs, including • Xilinx MicroBlaze• Intel Nios II• RISC-V, originally UC Berkeley
Processorw/ Memory
& I/O
FPGA Fabric
FPGA Fabric
Soft-coreProcessor
Two Major Types of Platform
54
Feature FPGA Fabric and Hard-core Processor
FPGA Fabric with Soft-core Processor
Processor ARM MicroBlaze, NIOS II, RISC-V, etc.
Clock frequency >1 GHz max. 200-450 MHzPortability similar FPGA SoCs various FPGAs, FPGA SoCs,
and ASICs
Hardware accelerators Yes YesInstruction set extensions No Yes
Ease of design (methodology, tools, OS support)
Easy Dependent on a particular soft-core processor and
tool chain
Xilinx Zynq UltraScale+ MPSoC1.2 GHz ARM Cortex-A53 + UltraScale+ FPGA logic
Experimental Setup
55
Output FIFOInput FIFO Hardware Accelerator
Zynq Processing System
AXI DMA
FIFO Interface
FIFO Interface
AXI StreamInterface
AXI StreamInterface
AXI L
ite
Inte
rfac
e
AXI F
ull
Inte
rface
AXI L
ite
Inte
rface
IRQ
Clocking wizard
rd_clkwr_clk wr_clk rd_clkclk
UUT_clk
Main Clock
AXI L
ite
Inte
rface
AXI TimerAXI Lite Interface
All elements located on a single chip
Code Release
56
• Full Code & Configuration of the Experimental Setup
• Software/Hardware Codesign of Round 1 NTRUEncrypt
to be made available at
https://cryptography.gmu.edu/athena
under PQC
by August 31, 2019
8/28/19
15
GMU CERGCase Study
57
SW/HW Codesign: Case Study
58
7 IND-CCA*-secure Lattice-Based Key Encapsulation Mechanisms (KEMs)representing
5 NIST PQC Round 2 Submissions
LWE (Learning with Error)-based:FrodoKEM
RLWR (Ring Learning with Rounding)-based:Round5
Module-LWR-based:Saber
* IND-CCA = with Indistinguishability under Chosen Ciphertext Attack
NTRU-based:NTRU
• NTRU-HPS• NTRU-HRSS
NTRU Prime• Streamlined NTRU Prime• NTRU LPRime
SW/HW Partitioning
Top candidates for offloading to hardware
From profiling:
v Large percentage of the execution time
v Small number of function calls
From manual analysis of the code:
v Small size of inputs and outputs
v Potential for combining with neighboring functions
From knowledge of operations and concurrent computing:
v High potential for parallelization59
Example: LightSaber Decapsulation
MatrixVectorMul43.44%
InnerProduct43.52%
GenMatrix5.03%
GenSecret2.30%Hash
3.30%
Other2.40%
60
8/28/19
16
LightSaber Decapsulation
MatrixVectorMul43.44%
InnerProduct43.52%
GenMatrix5.03%
GenSecret2.30%
Hash3.30%
Other2.40%
Execution time saved88.83%
Other2.40%
HardwareAccelerator
8.77%
Total Speed-Up = 100/11.17=9.0
Execution time remaining
11.17%
Accelerator Speed-Up = 97.60/8.77=11.1
Execution time of functions to be moved to hardware
97.60%Execution time of functions
remaining in software2.40% 61
TentativeResults
62
Total Execution Time in Software [𝜇s]Encapsulation
0
500
1,000
1,500
2,000
2,500
3,000
3,500
4,000
4,500
Roun d5 Saber Str NTRUPrime
NTRULP Rime
NTRU-HRSS NTRU-HPS FrodoKEM
Level 1 Level 2 Level 3 Level 4 Level 5
16,19262,07634,609
63
Total Execution Time in Software/Hardware [𝜇s]Encapsulation
0
100
200
300
400
500
600
Round 5 Saber NTRU-HRSS Str NTRUPrime
NTRU-HPS NTRULP Rime
FrodoKEM
Level 1 Level 2 Level 3 Level 4 Level 5
1,2232,1861,642
1
3⇒4
7
5⇒3
4⇒6
2
6⇒5
64
8/28/19
17
Total Speed-ups: Encapsulation
17.8
13.2
8.3 7.9 7.1
3.2 2.4
21.1
10.2 9.3
12.7
3.4 2.53.6 2.6
28.4
10.6
18.1
0.0
5.0
10.0
15.0
20.0
25.0
30.0
NTRU-HRSS FrodoKEM Round5 NTRU-HPS Sa ber NTRU L PRime Str N TRUPrime
Level 1 Level 2 Level 3 Level 4 Level 5
65
Accelerator Speed-ups: Encapsulation
146.4 140.4
43.5
8.7 8.520.4
12.3
192.2
44.3
27.515.3
10.6 14.5
32.517.7
46.1
11.1 21.3
0.020.040.060.080.0
100.0120.0140.0160.0180.0200.0
NTRU-HRSS NTRU-HPS FrodoKEM NTRULP Rime
Str NTRUPrime
Round 5 Saber
Level 1 Level 2 Level 3 Level 4 Level 5
66
SW Part Sped up by HW[%]: Encapsulation
99.36 98.49
95.03 94.62
88.03
70.35
62.53
99.5998.93 97.45
89.71
71.44
63.43
73.26
65.00
99.55 99.14 98.62
50.00
60.00
70.00
80.00
90.00
100.00
Roun d5 Saber NTRU-HRSS FrodoKEM NTRU-HPS NTRULP Rime
Str NTRUPrime
Level 1 Level 2 Level 3 Level 4 Level 5
67
Total Execution Time in Software [𝜇s]Decapsulation
16,19262,37734,649
01,0002,0003,0004,0005,0006,0007,0008,0009,000
10, 00011, 00012, 000
Round 5 Saber NTRULP Rime
Str NTRUPrime
NTRU-HPS NTRU-HRSS FrodoKEM
Level 1 Level 2 Level 3 Level 4 Level 5
Order reversedcompared to encapsulation
3 4
5 6
68
8/28/19
18
0
100
200
300
400
500
600
700
Roun d5 Saber NTRU-HPS Str NTRUPrime
NTRU-HRSS NTRULP Rime
FrodoKEM
Level 1 Level 2 Level 3 Level 4 Level 5
Total Execution Time in Software/Hardware [𝜇s]:Decapsulation
1,3193,1201,866
1
7
5⇒3
3⇒6
26⇒54
69
Total Speed-ups: Decapsulation
77.4 74.1
12.3 9.0 8.1
38.3
3.9
119.3
45.5
18.613.4 9.4 4.1
54.8
4.520.0 17.9
9.8
0.010.020.030.040.050.060.070.080.090.0
100.0110.0120.0130.0
NTRU-HPS NTRU-HRSS Str NTRUPrime
FrodoKEM Saber Round 5 NTRULP Rime
Level 1 Level 2 Level 3 Level 4 Level 5
70
Accelerator Speed-ups: Decapsulation
188.1 182.8
44.4
11.1 8.1
86.2
27.5
235.7
112.0
46.032.8
17.7 9.4
132.7
37.746.0
24.69.8
0.025.050.075.0
100.0125.0150.0175.0200.0225.0250.0
NTRU-HRSS NTRU-HPS Str NTRUPrime
FrodoKEM NTRULP Rime
Saber
Level 1 Level 2 Level 3 Level 4 Level 5
Round5
71
SW Part Sped up by HW[%]: Decapsulation100.00 99.25 99.18 97.60
93.9698.53
76.47
100.00 99.5898.69 98.10
96.78
77.46
98.92
79.22
100.0098.41 97.11
50.00
60.00
70.00
80.00
90.00
100.00
Round 5 NTRU-HPS NTRU-HRSS Str NTRUPrime
Saber FrodoKEM NTRULP Rime
Level 1 Level 2 Level 3 Level 4 Level 5
72
8/28/19
19
Conclusionsv Total speed-ups
§ for encapsulation from 2.4 (Str NTRU Prime) to 28.4 (FrodoKEM)§ for decapsulation from 3.9 (NTRU LPRime) to 119.3 (NTRU-HPS)
v Total speed-up dependent on the percentage of the software execution time taken by functions offloaded to hardware and the amount of acceleration itself
v Hardware accelerators thoroughly optimized using Register-Transfer Level design methodology
v Determining optimal software/hardware partitioning requires more workv Ranking of the investigated candidates affected, but not dramatically
changed, by hardware accelerationv It is possible to complete similar designs for all Round 2 candidates within
the evaluation period (12-18 months) v Additional benefit: Comprehensive library of major operations in hardware
73
Future Work
74
Breadth
Depth
4 RemainingLattice-based
KEMs
3 Lattice-based Digital
Signatures
Current work
7 Code-basedKEMs
7 OtherCandidates
More operations moved to hardware / C code optimized for ARM Cortex-A53*Algorithmic optimizations of software and hardware*
Hardware library of basic operations
of lattice-based candidates
Hardware library of basic operations
of code-based candidates
Hardware library
for PQC
Full hardware implementations
*collaboration with submission teams and other groups very welcome
High-LevelSynthesis
High-Level Synthesis (HLS)
76
High Level Language(C, C++, Java, Python, etc.)
Hardware Description Language(VHDL or Verilog)
High-Level Synthesis(HLS)
8/28/19
20
Popular HLS Tools
77
Commercial (FPGA-oriented):
Academic: • Bambu: Politecnico di Milano, Italy• DWARV: Delft University of Technology, The Netherlands• GAUT: Universite de Bretagne-Sud, France• LegUp: University of Toronto, Canada
• Vivado HLS: Xilinx – selected for this study• FPGA SDK for OpenCL: Intel
Software/Hardware Codesign with HLS
Most time-critical operation
Software
HLS-Generated Hardware
78
Transformation to HLS-ready C/C++ Code
1. Interface mapping
2. Addition of HLS Tool directives (pragmas)
3. Hardware-driven code refactoring
79
Using Existing Code
Function: Polynomial Multiplication in NTRUEncrypt
Source of C code: OnBoard Security
Goal: The same or comparable number of clock cycles as in the Register-Transfer Level (manual) implementation in VHDL.
Attempt 1: Reference implementation based on the grade school algorithm for multiplication (a.k.a. schoolbook, paper-and-pencil, etc.).
Attempt 2: Optimized implementation based on rotation.Multiple attempts at optimization using Vivado HLS directives (pragmas) and minor code changes.
80
8/28/19
21
Outcome & Revised Approach
Outcome 1: Tens of thousands of clock cycles, compared to the expected n=743 clock cycles
Solution: Rewriting the code in C in such a way to match the block diagram used to generate VHDL code
Outcome 2:Expected functionalityAround n clock cycles of the execution time
Similar approach applied and successfully repeated for 4 NTRU-based Round 1 KEMs!
81
Sources of Productivity Gains
• Higher-level of abstraction• Focus on datapath rather than control logic• Debugging in software (C/C++)
• Faster run time• No timing waveforms
82
Achievements: Number of Clock CyclesAlgorithm RTL HLS HLS/RTL
EncapsulationNTRUEncrypt 744 750 1.008NTRU-HRSS 702 708 1.009Str NTRU Prime 762 769 1.009NTRU LPRime 1524 1538 1.009
DecapsulationNTRUEncrypt 1491 1502 1.007NTRU-HRSS 2111 2162 1.024Str NTRU Prime 2291 2346 1.024NTRU LPRime 2287 2307 1.009
83
#Clock Cycles increased by less than 2.5%
Overhead: Clock Frequency & Resource Utilization
84
Algorithm RTL HLS HLS/RTL
Clock Frequency [MHz]NTRUEncrypt 330 270 0.82NTRU-HRSS 300 261 0.87Str NTRU Prime 255 191 0.75NTRU LPRime 255 191 0.75
SlicesNTRUEncrypt 4,431 7,361 1.66NTRU-HRSS 5,476 7,358 1.34Str NTRU Prime 9,699 18,428 1.90NTRU LPRime 8,483 15,585 1.84
Clock Frequency reduced by 25% or less#Slices increased by 90% or less
8/28/19
22
Overhead: Resource Utilization
85
Algorithm RTL HLS HLS/RTLLUTs
NTRUEncrypt 27,912 50,061 1.79NTRU-HRSS 33,230 48,452 1.46Str NTRU Prime 65,207 117,424 1.80NTRU LPRime 52,297 94,062 1.80
Flip-flopsNTRUEncrypt 24,697 24,728 1.00NTRU-HRSS 30,327 33,235 1.10Str NTRU Prime 32,929 51,253 1.56NTRU LPRime 39,730 40,089 1.00
#LUTs increased by 80% or less#Flip-flops increased by 56% or less
Possible Future Work
HLS-Generated HardwareMinimal Hand Optimization
HLS-Generated HardwareThorough Hand Optimization
86
Most time-critical operation
PQC Opportunities & Challenges
The biggest revolution in cryptography, since the invention of
public-key cryptography in 1970s
Efficient hardware implementations in FPGAs and ASICs desperately needed to prove the candidates suitability for
high-performance applications and constrained
environments. Collaboration sought by submission teams!
Likely extensions to Instruction Set Architectures
of multiple major microprocessors
Start-up & new-product opportunities
Once in the lifetime opportunity! Get involved!87
Q&A
88
Suggestions?CERG: http://cryptography.gmu.edu
ATHENa: http://cryptography.gmu.edu/athena
Questions? Comments?
Thank You!
top related