
Page 1: Intel’s New AES Instructions - Indian Statistical Institutedebrup/AEworkshop/slides/03_AES NI.pdf · Intel’s New AES Instructions Enhanced Performance and Security Shay Gueron

Intel’s New AES Instructions

Enhanced Performance and Security

Shay Gueron

- Intel Corporation, Israel Development Center, Haifa, Israel

- University of Haifa, Israel


Overview

• AES basics

• Performance hungry applications

• The security issue

• The AES instructions

• Performance scalability

• Basic usage

• Software flexibility

• Software tools

• Performance and optimizations

• More on software flexibility

• And more…


AES Basics


AES Overview

[Figure: AES data flow from Plain Text to Cipher Text: Add Round Key (with the round key), then 10-14 “rounds” of SubByte (Sbox), Shift Row, Mix Columns and Add Round Key. Diagram labels: “Fast Software Encryption”, “Slow Software Encryption”.]


AES Transformations

• AddRoundKey – 128b xor of State and round key

• SubBytes – nonlinear bytewise substitution (repeated 16x)

• ShiftRows – bytewise permutation

• MixColumns – matrix multiplication in GF(2^8)

• InvSubBytes, InvShiftRows, InvMixColumns

• SubWord – 4 x SubBytes

• RotWord – [a0, a1, a2, a3] → [a1, a2, a3, a0]

• Rcon – in round i equals [{02}^(i-1), {00}, {00}, {00}]
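The GF(2^8) arithmetic behind MixColumns and Rcon can be sketched in portable C. This is only an illustration: the function names (xtime, gf_mul, rcon_byte) are chosen here and do not come from the slides; the reduction polynomial x^8 + x^4 + x^3 + x + 1 is the one AES specifies.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by {02} in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1 (0x11b). */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General GF(2^8) product by shift-and-add; MixColumns is built from this. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;   /* add (xor) the current multiple of a */
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* First byte of Rcon for round i (i >= 1), i.e. {02}^(i-1). */
static uint8_t rcon_byte(int i) {
    assert(i >= 1);
    uint8_t r = 0x01;
    while (--i > 0) r = xtime(r);
    return r;
}
```

For AES-128 the ten round constants produced this way run from 0x01 up to 0x36, which are the immediates passed to AESKEYGENASSIST later in the deck.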


AES Encryption

Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])

For round = 1-9 or 1-11 or 1-13:

Tmp = ShiftRows (Tmp)

Tmp = SubBytes (Tmp)

Tmp = MixColumns (Tmp)

Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])

end loop

Tmp = ShiftRows (Tmp)

Tmp = SubBytes (Tmp)

Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])

Result = Tmp

40/48/56 steps in total (AES-128/192/256)

AES Decryption (Equivalent Inverse Cipher)

Tmp = AddRoundKey (Data, Round_Key_Decrypt [0])

For round = 1-9 or 1-11 or 1-13:

Tmp = InvShiftRows (Tmp)

Tmp = InvSubBytes (Tmp)

Tmp = InvMixColumns (Tmp)

Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [round])

end loop

Tmp = InvShiftRows (Tmp)

Tmp = InvSubBytes (Tmp)

Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [10 or 12 or 14])

Result = Tmp


AES-128 Key Expansion

for (i = 0 .. 3) { w[i] = Cipher Key[i] }

for (i = 4 .. 43) {
    temp = w[i-1]
    if (i mod 4 = 0) {
        temp = SubWord(RotWord(temp)) xor Rcon
    }
    w[i] = w[i-4] xor temp
}
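The AES-128 loop above can be written as a small portable C routine. This is a sketch under stated assumptions: words are held big-endian in uint32_t (the FIPS-197 convention), the S-box is computed from its definition instead of a table, and all names (sbox, sub_word, rot_word, aes128_expand_key) are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* S-box from its definition: inverse in GF(2^8), then the affine map. */
static uint8_t sbox(uint8_t x) {
    uint8_t inv = 0;
    for (int c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    }
    uint8_t s = inv;
    for (int k = 1; k <= 4; k++)
        s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));  /* rotate left by k */
    return (uint8_t)(s ^ 0x63);
}

static uint32_t sub_word(uint32_t w) {
    return ((uint32_t)sbox((uint8_t)(w >> 24)) << 24) |
           ((uint32_t)sbox((uint8_t)(w >> 16)) << 16) |
           ((uint32_t)sbox((uint8_t)(w >> 8)) << 8) |
            (uint32_t)sbox((uint8_t)w);
}

static uint32_t rot_word(uint32_t w) { return (w << 8) | (w >> 24); }

/* Expands a 16-byte cipher key into the 44 words w[0..43]. */
void aes128_expand_key(const uint8_t key[16], uint32_t w[44]) {
    assert(key != NULL && w != NULL);
    uint8_t rc = 0x01;  /* first byte of Rcon, doubled after each use */
    for (int i = 0; i < 4; i++)
        w[i] = ((uint32_t)key[4 * i] << 24) | ((uint32_t)key[4 * i + 1] << 16) |
               ((uint32_t)key[4 * i + 2] << 8) | (uint32_t)key[4 * i + 3];
    for (int i = 4; i < 44; i++) {
        uint32_t temp = w[i - 1];
        if (i % 4 == 0) {
            temp = sub_word(rot_word(temp)) ^ ((uint32_t)rc << 24);
            rc = xtime(rc);
        }
        w[i] = w[i - 4] ^ temp;
    }
}
```

Computing the S-box on the fly keeps the sketch short; a real implementation would use a table or, as the rest of the deck describes, the AES instructions.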

AES-256 Key Expansion Encrypt

for (i = 0 .. 7) { w[i] = Cipher Key[i] }

for (i = 8 .. 59) {
    temp = w[i-1]
    if (i mod 8 = 0) {
        temp = SubWord(RotWord(temp)) xor Rcon
    }
    else if (i mod 8 = 4) {
        temp = SubWord(temp)
    }
    w[i] = w[i-8] xor temp
}

Preparing the decryption key schedule

[Figure: the AES-128 encryption round keys Key0 to Key10 (words K0 to K43, four words per round key); InvMixCols is applied to each of the nine middle keys, Key1 to Key9, to obtain the decryption round keys.]

For the Equivalent Inverse Cipher: apply InvMixCols to the encryption round keys

Performance Hungry Applications

Performance hungry AES usage models

• SSL/TLS for HTTPS

• IPSec

• OS Based Disk Encryption

– E.g., Microsoft Bitlocker

– Similar in Linux

• File encryption utilities

• Storage Encryption

• Voice Over IP Security (VOIP)

Relevant to client and server platforms

The Security Issue


CPU cache

Memory is a tradeoff between capacity and latency (and cost)

Most instructions involve memory accesses (loads and stores)

Cache = small and fast memory

• working close to the CPU’s frequency

• hiding the latency of larger memories

• Speculative: holds the “next” required data

Problem: in a multitasking environment, memory accesses can be made implicitly data-dependent

Cache-based attacks (among others)

Theoretical attacks by Page:

• Time-driven: execution time as function of cache-hit/miss numbers

– 2003: Tsunoo et al. on DES

– 2004: Bernstein on first round of AES

– 2006: Neve et al. on first and second round of AES

• Trace-driven: sequence of cache-hit/miss

– 2005: Bertoni et al. on AES through SimpleScalar

– 2005: Lauradoux et al. on AES

– 2006: Acıiçmez et al. on AES

• Access-driven: cache line accesses of crypto process

– 2005: Percival on RSA with multithreaded processors

– 2005-06: Osvik, Shamir et al. on AES with multithreaded processors

– 2005-06: Neve and Seifert on AES with single-threaded processors and last round attack


Table based AES (e.g., OpenSSL)

Tables make accesses and operations easier on 32-bit processors.

For AES encryption, 5 precomputed tables, each mapping [1-byte] to [4-byte]

Composed from two tables S and S’, each mapping [1-byte] to [1-byte]

T0 = [S’,S,S,SS’]

T1 = [SS’,S’,S,S]

T2 = [S,SS’,S’,S]

T3 = [S,S,SS’,S’]

T4 = [S,S,S,S]
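How one 4-byte entry of T0 is assembled can be sketched as follows. The slide does not define S’; the sketch assumes S’ = {02}·S and SS’ = S xor S’ = {03}·S, which is how OpenSSL's first encryption table is built. The names t0_entry and build_t0 are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by {02} in GF(2^8) (AES reduction polynomial). */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Pack [S', S, S, SS'] for one S-box output s, most significant byte first. */
uint32_t t0_entry(uint8_t s) {
    uint8_t s2 = xtime(s);            /* S'  = {02}*S (assumption, see lead-in) */
    uint8_t s3 = (uint8_t)(s2 ^ s);   /* SS' = {03}*S */
    return ((uint32_t)s2 << 24) | ((uint32_t)s << 16) | ((uint32_t)s << 8) | (uint32_t)s3;
}

/* Fill the 256-entry table T0 from a caller-supplied S-box. */
void build_t0(const uint8_t sbox[256], uint32_t t0[256]) {
    assert(sbox != NULL && t0 != NULL);
    for (int i = 0; i < 256; i++)
        t0[i] = t0_entry(sbox[i]);
}
```

T1, T2 and T3 hold the same four bytes in rotated positions, which is why one table lookup per state byte covers SubBytes and MixColumns together.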

/* round 1: */
t0 = T0[s0 >> 24] ^ T1[(s1 >> 16) & 0xff] ^ T2[(s2 >> 8) & 0xff] ^ T3[s3 & 0xff] ^ rcon[4];
t1 = T0[s1 >> 24] ^ T1[(s2 >> 16) & 0xff] ^ T2[(s3 >> 8) & 0xff] ^ T3[s0 & 0xff] ^ rcon[5];
t2 = T0[s2 >> 24] ^ T1[(s3 >> 16) & 0xff] ^ T2[(s0 >> 8) & 0xff] ^ T3[s1 & 0xff] ^ rcon[6];
t3 = T0[s3 >> 24] ^ T1[(s0 >> 16) & 0xff] ^ T2[(s1 >> 8) & 0xff] ^ T3[s2 & 0xff] ^ rcon[7];
/* round 2: */

Table based AES

T4 is used for the last round (no MixColumns) and for Key Expansion

T4 = the table below, each value repeated 4x (rows: msb nibble, columns: lsb nibble)

msb\lsb  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0    63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
   1    ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
   2    b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
   3    04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
   4    09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
   5    53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
   6    d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
   7    51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
   8    cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
   9    60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
  10    e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
  11    e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
  12    ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
  13    70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
  14    e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
  15    8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16

Exploiting OS scheduling

AES rounds are short compared with the context switch frequency

Preemptive scheduling

• ability for a process to yield the CPU before the end of its OS quantum

2 processes

• spy continuously watches the cache accesses

• crypto runs for small amounts of time

[Figure: timeline of one OS quantum, from “start of OS quantum” to “end of OS quantum”, with phases labeled “(re)loading table and wait” and “accessing tables”.]

Cache sharing leakages

Two processes on the same processor: crypto and spy

1. spy loads a (large) table

2. crypto runs on the processor

3. spy reloads and times each table line (cache line):

• if the loading time is short → line not evicted

• if the loading time is long → line evicted

Mitigation

• There are ways to write AES software that avoid data-dependent memory accesses

– But they severely degrade performance
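One such technique is a table lookup that reads every entry and selects the wanted one with a mask, so the set of cache lines touched does not depend on the secret index. The sketch below is illustrative only (the name ct_lookup256 is made up here), and a compiler is free to transform such code, so it is not a guarantee of constant-time behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns table[idx] while reading all 256 entries in a fixed order. */
uint8_t ct_lookup256(const uint8_t table[256], uint8_t idx) {
    assert(table != NULL);
    uint8_t result = 0;
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t diff = i ^ idx;                      /* 0 only when i == idx */
        uint8_t mask = (uint8_t)((diff - 1u) >> 8);   /* 0xff if diff == 0, else 0x00 */
        result |= (uint8_t)(table[i] & mask);         /* keep only the selected entry */
    }
    return result;
}
```

The cost is visible directly: 256 loads instead of one for every S-box access, which illustrates why such mitigations degrade performance.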


Intel’s AES Instructions


AES New Instructions (AES-NI)

• Will be introduced into the Intel Instruction Set starting from 2009

Four instructions to perform AES encryption and decryption

• AESENC – Perform one round encryption of AES

• AESENCLAST – Perform last round encryption of AES

• AESDEC – Perform one round decryption of AES

• AESDECLAST – Perform last round decryption of AES

Two instructions to perform AES Key Expansion

• AESKEYGENASSIST – Used for round key expansion

• AESIMC – convert encryption round keys to a form usable for decryption

• Intel’s architecture uses the equivalent inverse cipher


AES Data Structure

The State (xmm0) in matrix representation:

S(0,0) S(0,1) S(0,2) S(0,3)
S(1,0) S(1,1) S(1,2) S(1,3)
S(2,0) S(2,1) S(2,2) S(2,3)
S(3,0) S(3,1) S(3,2) S(3,3)

X0 = S(3,0) S(2,0) S(1,0) S(0,0)
X1 = S(3,1) S(2,1) S(1,1) S(0,1)
X2 = S(3,2) S(2,2) S(1,2) S(0,2)
X3 = S(3,3) S(2,3) S(1,3) S(0,3)

State and Round Key in xmm0 and xmm2/m128 (msb on the left, lsb on the right):

xmm0:      X3 [127:96] | X2 [95:64] | X1 [63:32] | X0 [31:0]
xmm2/m128: X7 [127:96] | X6 [95:64] | X5 [63:32] | X4 [31:0]

The 4 AES Round Instructions

AESENC xmm0, xmm2/m128

Tmp:= xmm0;

Round Key:= xmm2/m128;

Tmp:= Shift Rows (Tmp);

Tmp:= Substitute Bytes (Tmp);

Tmp:= Mix Columns (Tmp);

xmm0:= Tmp xor Round Key

AESENCLAST xmm0, xmm2/m128

Tmp:= xmm0;

Round Key:= xmm2/m128;

Tmp:= Shift Rows (Tmp);

Tmp:= Substitute Bytes (Tmp);

xmm0:= Tmp xor Round Key

AESDEC xmm0, xmm2/m128

Tmp:= xmm0;

Round Key:= xmm2/m128;

Tmp:= Inverse Shift Rows (Tmp);

Tmp:= Inverse Substitute Bytes (Tmp);

Tmp:= Inverse Mix Columns (Tmp);

xmm0:= Tmp xor Round Key

AESDECLAST xmm0, xmm2/m128

State := xmm0;

Round Key := xmm2/m128;

Tmp:= Inverse Shift Rows (State);

Tmp:= Inverse Substitute Bytes (Tmp);

xmm0:= Tmp xor Round Key
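The AESENC pseudocode on this slide can be mirrored by a software model for checking results on machines without the instructions. This is a sketch, not Intel's implementation: the state is taken as 16 bytes in column-major order (byte r + 4c is row r, column c), the S-box is computed from its definition, and the name aesenc_model is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* S-box from its definition: inverse in GF(2^8), then the affine map. */
static uint8_t sbox(uint8_t x) {
    uint8_t inv = 0;
    for (int c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    }
    uint8_t s = inv;
    for (int k = 1; k <= 4; k++)
        s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));
    return (uint8_t)(s ^ 0x63);
}

/* One AES encryption round: ShiftRows, SubBytes, MixColumns, xor round key. */
void aesenc_model(uint8_t state[16], const uint8_t rk[16]) {
    assert(state != NULL && rk != NULL);
    uint8_t t[16];
    /* ShiftRows and SubBytes in one pass (the two steps commute). */
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[r + 4 * c] = sbox(state[r + 4 * ((c + r) % 4)]);
    /* MixColumns on each column, then xor with the round key. */
    for (int c = 0; c < 4; c++) {
        uint8_t a0 = t[4 * c], a1 = t[4 * c + 1], a2 = t[4 * c + 2], a3 = t[4 * c + 3];
        state[4 * c + 0] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3 ^ rk[4 * c + 0]);
        state[4 * c + 1] = (uint8_t)(a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3 ^ rk[4 * c + 1]);
        state[4 * c + 2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3 ^ rk[4 * c + 2]);
        state[4 * c + 3] = (uint8_t)(xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3) ^ rk[4 * c + 3]);
    }
}
```

Dropping the MixColumns step from the second loop gives the corresponding model of AESENCLAST.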


Two instructions for Key Expansion

AESIMC xmm0, xmm2/m128

RoundKey := xmm2/m128;

xmm0 := InvMixColumns (RoundKey)
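What AESIMC computes can be modelled in portable C as InvMixColumns over the four columns of a round key. The sketch assumes the key is given as 16 bytes with each column stored contiguously; the name aesimc_model is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) product with the AES reduction polynomial. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* InvMixColumns: each column is multiplied by the matrix with rows
   [0e 0b 0d 09], [09 0e 0b 0d], [0d 09 0e 0b], [0b 0d 09 0e]. */
void aesimc_model(uint8_t out[16], const uint8_t in[16]) {
    assert(out != NULL && in != NULL && out != in);
    for (int c = 0; c < 4; c++) {
        const uint8_t *a = &in[4 * c];
        out[4 * c + 0] = (uint8_t)(gf_mul(a[0], 0x0e) ^ gf_mul(a[1], 0x0b) ^ gf_mul(a[2], 0x0d) ^ gf_mul(a[3], 0x09));
        out[4 * c + 1] = (uint8_t)(gf_mul(a[0], 0x09) ^ gf_mul(a[1], 0x0e) ^ gf_mul(a[2], 0x0b) ^ gf_mul(a[3], 0x0d));
        out[4 * c + 2] = (uint8_t)(gf_mul(a[0], 0x0d) ^ gf_mul(a[1], 0x09) ^ gf_mul(a[2], 0x0e) ^ gf_mul(a[3], 0x0b));
        out[4 * c + 3] = (uint8_t)(gf_mul(a[0], 0x0b) ^ gf_mul(a[1], 0x0d) ^ gf_mul(a[2], 0x09) ^ gf_mul(a[3], 0x0e));
    }
}
```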

AESKEYGENASSIST xmm0, xmm2/m128, imm8

Tmp := xmm2/m128

RCON[31-8] := 0; RCON[7-0] := imm8;

X3[31-0] := Tmp[127-96]; X2[31-0] := Tmp[95-64];

X1[31-0] := Tmp[63-32]; X0[31-0] := Tmp[31-0];

xmm0 := [RotWord (SubWord (X3)) XOR RCON, SubWord (X3), RotWord (SubWord (X1)) XOR RCON, SubWord (X1)]
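A byte-level model of the AESKEYGENASSIST definition above, as a sketch. It assumes the 128-bit operand is given as 16 bytes in memory order (so X1 is bytes 4..7 and X3 is bytes 12..15, least significant byte first); the S-box is computed from its definition and the name aeskeygenassist_model is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* S-box from its definition: inverse in GF(2^8), then the affine map. */
static uint8_t sbox(uint8_t x) {
    uint8_t inv = 0;
    for (int c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    }
    uint8_t s = inv;
    for (int k = 1; k <= 4; k++)
        s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));
    return (uint8_t)(s ^ 0x63);
}

void aeskeygenassist_model(uint8_t out[16], const uint8_t in[16], uint8_t imm8) {
    assert(out != NULL && in != NULL && out != in);
    for (int half = 0; half < 2; half++) {
        const uint8_t *x = &in[8 * half + 4];   /* X1 (half 0) or X3 (half 1) */
        uint8_t s[4];
        for (int k = 0; k < 4; k++) s[k] = sbox(x[k]);
        /* Lower dword of the pair: SubWord(X). */
        for (int k = 0; k < 4; k++) out[8 * half + k] = s[k];
        /* Upper dword: RotWord(SubWord(X)) xor RCON, RCON being imm8 in the low byte. */
        out[8 * half + 4] = (uint8_t)(s[1] ^ imm8);
        out[8 * half + 5] = s[2];
        out[8 * half + 6] = s[3];
        out[8 * half + 7] = s[0];
    }
}
```

Only the top dword of the result is needed for AES-128 key expansion, which is why the assembly later in the deck broadcasts it with pshufd and the selector 0xff.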


AESKEYGENASSIST xmm0, xmm2/m128, imm8

[Figure: AESKEYGENASSIST data flow. Input words X3 X2 X1 X0; Duplicate gives X3 X3 X1 X1; S-box gives X3’ X3’ X1’ X1’; Rotate on one copy of each gives X3’’ X3’ X1’’ X1’; XOR RCON on the rotated words gives the result X3’’’ X3’ X1’’’ X1’.]

Performance Scalability

Design for performance scalability

Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])

For round = 1-9 or 1-11 or 1-13:

Tmp = ShiftRows (Tmp)

Tmp = SubBytes (Tmp)

Tmp = MixColumns (Tmp)

Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])

end loop

Tmp = ShiftRows (Tmp)

Tmp = SubBytes (Tmp)

Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])

Result = Tmp

Can control last round via immediate

Basic Usage


AES-128 Key Expansion

begin
    word temp
    for (i = 0 .. 3) {
        w[i] = Initial Key[i]
    }
    for (i = 4 .. 43) {
        temp = w[i-1]
        if (i mod 4 = 0) {
            temp = SubWord(RotWord(temp)) xor Rcon    ← AESKEYGENASSIST
        }
        w[i] = w[i-4] xor temp
    }
end

AES-256 Key Expansion

word temp
for (i = 0 .. 7) { w[i] = Initial Key[i] }
for (i = 8 .. 59) {
    temp = w[i-1]
    if (i mod 8 = 0) {
        temp = SubWord(RotWord(temp)) xor Rcon    ← AESKEYGENASSIST
    }
    else if (i mod 8 = 4) {
        temp = SubWord(temp)    ← AESKEYGENASSIST
    }
    w[i] = w[i-8] xor temp
}

AESIMC xmm0, xmm2/m128

[Figure: the AES-128 encryption round keys Key0 to Key10 (words K0 to K43, four words per round key); InvMixCols (AESIMC) is applied to each of the nine middle keys, Key1 to Key9, to obtain the decryption round keys.]

The Equivalent Inverse Cipher requires applying InvMixCols to the encryption round keys

AES-128 Key Expansion

AESKEYGENASSIST xmm2, xmm0, 0x1
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x2
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x4
call key_expand_128
; ... likewise for the round constants 0x8, 0x10, 0x20, 0x40, 0x80, 0x1b
AESKEYGENASSIST xmm2, xmm0, 0x36
call key_expand_128

key_expand_128:
    pshufd xmm2, xmm2, 0xff
    vpslldq xmm3, xmm0, 0x4
    pxor xmm0, xmm3
    vpslldq xmm3, xmm0, 0x4
    pxor xmm0, xmm3
    vpslldq xmm3, xmm0, 0x4
    pxor xmm0, xmm3
    pxor xmm0, xmm2
    movdqu XMMWORD PTR [rcx], xmm0
    add rcx, 0x10
    ret

AES-192 Key Expansion

aeskeygenassist xmm2, xmm3, 0x1

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x2

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x4

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x8

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x10

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x20

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x40

call key_expansion_192

aeskeygenassist xmm2, xmm3, 0x80

call key_expansion_192

key_expansion_192:
    pshufd xmm2, xmm2, 0x55
    vpslldq xmm4, xmm0, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pxor xmm0, xmm2
    pshufd xmm2, xmm0, 0xff
    vpslldq xmm4, xmm3, 0x4
    pxor xmm3, xmm4
    pxor xmm3, xmm2
    movdqu XMMWORD PTR [rcx], xmm0
    add rcx, 0x10
    movdqu XMMWORD PTR [rcx], xmm3
    add rcx, 0x8
    ret

AES-256 Key Expansion

aeskeygenassist xmm2, xmm3, 0x1

call key_expansion_256

aeskeygenassist xmm2, xmm3, 0x2

call key_expansion_256

aeskeygenassist xmm2, xmm3, 0x4

call key_expansion_256

aeskeygenassist xmm2, xmm3, 0x8

call key_expansion_256

aeskeygenassist xmm2, xmm3, 0x10

call key_expansion_256

aeskeygenassist xmm2, xmm3, 0x20

call key_expansion_256

aeskeygenassist xmm2, xmm3, 0x40

call key_expansion_256

key_expansion_256:
    pshufd xmm2, xmm2, 0xff
    vpslldq xmm4, xmm0, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pslldq xmm4, 0x4
    pxor xmm0, xmm4
    pxor xmm0, xmm2
    movdqu XMMWORD PTR [rcx], xmm0
    add rcx, 0x10
    aeskeygenassist xmm4, xmm0, 0
    pshufd xmm2, xmm4, 0xaa
    vpslldq xmm4, xmm3, 0x4
    pxor xmm3, xmm4
    pslldq xmm4, 0x4
    pxor xmm3, xmm4
    pslldq xmm4, 0x4

Encrypting with AES round instructions

AES-128 ECB mode example

Round keys already expanded

for i from 1 to N_BLOCKS do
    xmm0 = BLOCK [i] // load the next data block
    xmm0 = xmm0 xor RK [0] // initial AddRoundKey
    for j from 1 to 9 do
        xmm0 = AESENC (xmm0, RK [j])
    end
    xmm0 = AESENCLAST (xmm0, RK [10])
    store xmm0
end

AES-128 assembler (encryption and decryption)

AES-128 encryption

pxor xmm0, xmm2
AESENC xmm0, xmm3
AESENC xmm0, xmm4
AESENC xmm0, xmm5
AESENC xmm0, xmm6
AESENC xmm0, xmm7
AESENC xmm0, xmm8
AESENC xmm0, xmm9
AESENC xmm0, xmm10
AESENC xmm0, xmm11
AESENCLAST xmm0, xmm12

Decryption Round Keys

AESIMC xmm3, xmm3
AESIMC xmm4, xmm4
AESIMC xmm5, xmm5
AESIMC xmm6, xmm6
AESIMC xmm7, xmm7
AESIMC xmm8, xmm8
AESIMC xmm9, xmm9
AESIMC xmm10, xmm10
AESIMC xmm11, xmm11

AES-128 decryption

pxor xmm0, xmm12
AESDEC xmm0, xmm11
AESDEC xmm0, xmm10
AESDEC xmm0, xmm9
AESDEC xmm0, xmm8
AESDEC xmm0, xmm7
AESDEC xmm0, xmm6
AESDEC xmm0, xmm5
AESDEC xmm0, xmm4
AESDEC xmm0, xmm3
AESDECLAST xmm0, xmm2

Software Flexibility: modes of operation

ECB (Encrypt)

1. get next plaintext block

2. AES encrypt (use AES-NI building blocks)

3. store result into memory as ciphertext block

4. more data? YES: back to step 1; NO: DONE

CBC (Encrypt)

1. initialize feedback register with IV

2. get next plaintext block

3. XOR with feedback register

4. AES encrypt (use AES-NI building blocks)

5. store result into feedback register

6. store result into memory as ciphertext block

7. more data? YES: back to step 2; NO: DONE

CTR (Encrypt)

1. initialize counter register with IV

2. get counter register

3. AES encrypt (use AES-NI building blocks)

4. XOR with next plaintext block

5. increment counter register

6. store result into memory as ciphertext block

7. more data? YES: back to step 2; NO: DONE
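The CTR flow can be sketched in C with the block cipher passed in as a callback, so the same loop works with an AES-NI based routine or any other block encryption. Everything named here (block_encrypt_fn, ctr_crypt, toy_block) is hypothetical; the counter is incremented as a 128-bit big-endian integer, which is one common convention but not the only one, and toy_block is an insecure stand-in used only to exercise the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: encrypts one 16-byte block under the key held in ctx. */
typedef void (*block_encrypt_fn)(const void *ctx, const uint8_t in[16], uint8_t out[16]);

/* Increment the counter block as a 128-bit big-endian integer. */
static void ctr_increment(uint8_t ctr[16]) {
    for (int i = 15; i >= 0; i--) {
        if (++ctr[i] != 0) break;   /* stop once there is no carry */
    }
}

/* CTR mode: out = in xor E(counter), counter advanced once per block.
   The same call encrypts and decrypts. */
void ctr_crypt(block_encrypt_fn enc, const void *ctx, uint8_t ctr[16],
               const uint8_t *in, uint8_t *out, size_t len) {
    assert(enc != NULL && ctr != NULL);
    uint8_t ks[16];
    for (size_t off = 0; off < len; off += 16) {
        enc(ctx, ctr, ks);                              /* encrypt the counter */
        size_t n = (len - off < 16) ? (len - off) : 16;
        for (size_t k = 0; k < n; k++)
            out[off + k] = (uint8_t)(in[off + k] ^ ks[k]);  /* xor with the data */
        ctr_increment(ctr);
    }
}

/* Toy stand-in for the block cipher: xor with a 16-byte key. NOT secure. */
void toy_block(const void *ctx, const uint8_t in[16], uint8_t out[16]) {
    const uint8_t *key = (const uint8_t *)ctx;
    for (int i = 0; i < 16; i++) out[i] = (uint8_t)(in[i] ^ key[i]);
}
```

Because each keystream block depends only on the counter value, several counters can be encrypted at once, which is what makes CTR a parallel mode.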


GCM

[Figure: GCM. AES in CTR mode turns data 1, data 2, data 3, etc. into ciphertext 1, ciphertext 2, ciphertext 3, etc. The Galois hash is computed as a chain hash 0, hash 1, hash 2, etc., where each step takes in a ciphertext block and multiplies with the hash key in GF(2^128). Use AES-NI building blocks.]

Software Tools


Software Development tools

C/C++ program → icl /arch:AVX <filename> → Executable binary → Program output

Prior to silicon (today): sde -- <binary name> (Software Development Emulator) → Program output

Running the Basic Emulator

sde -- foo.exe <foo options>

For ease of use

• Special command window where every command is run on the emulator

% sde -help
Usage: sde [args] -- application [application-args]
  -mix           (run mix histogram tool)
  -omix          (set the output file name for mix. Implies -mix. Default is "mix.out")
  -debugtrace    (run mix debugtrace tool)
  -odebugtrace   (set the output file name for debugtrace. Implies -debugtrace. Default is "debugtrace.out")
  -ast           (run the AVX/SSE transition checker)
  -oast          (set the output file name for the AVX/SSE transition checker. Implies -ast. Default is "avx-sse-transition.out")
  -no-avx        (disable AVX emulation, just emulate AES+PCLMULQDQ+SSE4)
  -no-aes        (disable AES+PCLMULQDQ+AVX emulation, just emulate SSE4)
  -pin-runtime   (Use Pin's runtime libraries, required on some Linux* systems)

Compiler support for using AES-NI

Compiler support through

• Inline asm

• Intrinsics

wmmintrin.h:

extern __m128i __cdecl _mm_aesdec_si128(__m128i v, __m128i rkey);
extern __m128i __cdecl _mm_clmulepi64_si128(__m128i v1, __m128i v2, const int imm8);

User program:

#include <ia32intrin.h>
__m128i x, y, z;
z = _mm_aesdec_si128(x, y);

AES-128 CBC Encryption (Intrinsics)

#include <wmmintrin.h>  /* AES-NI intrinsics */

/* Key_Schedule, IV, PLAINTEXT, CIPHERTEXT and NBLOCKS are assumed to be
   defined elsewhere; the buffers must be 16-byte aligned. */
void AES_128_CBC_Encrypt () {
    int i, j, k;
    __m128i tmp, feedback;
    __m128i RKEY [11];

    for (k=0; k<11; k++) {
        RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Schedule [4*k]);
    }
    feedback = _mm_load_si128 ((__m128i*)&IV [0]);
    for (i=0; i < NBLOCKS; i++) {
        tmp = _mm_load_si128 ((__m128i*)&PLAINTEXT[i*4]);
        tmp = _mm_xor_si128 (tmp, feedback);   /* CBC chaining */
        tmp = _mm_xor_si128 (tmp, RKEY[0]);    /* initial AddRoundKey */
        for (j=1; j <10; j++) {
            tmp = _mm_aesenc_si128 (tmp, RKEY [j]);
        }
        tmp = _mm_aesenclast_si128 (tmp, RKEY [10]);
        feedback = tmp;
        _mm_store_si128 ((__m128i*)&CIPHERTEXT[4*i], tmp);
    }
}

Performance and optimizations

Parallelization

• All useful modes (except for CBC encrypt) are parallelizable

• Blocks can be processed independently

– Can apply the loop reversal technique

• The only serial mode in use is CBC encrypt

– Leading usage model is Bitlocker (and disk encryption in general)

– CBC-encrypt throughput is less critical because

• Disk write latency is ok (CBC encrypt)

• Disk read is sensitive (CBC decrypt, which is parallelizable)

• Efficient SW techniques can help squeeze a performance boost out of the existing architecture

• The latency of the AES-NI does not matter too much

– As long as #registers ≥ latency of the instruction

Straightforward AES

for i from 1 to N_BLOCKS do
    xmm0 = BLOCK [i] // load (and xor with RK [0])
    xmm0 = AESENC (xmm0, RK [1])
    xmm0 = AESENC (xmm0, RK [2]) // wait L cycles for the previous result
    xmm0 = AESENC (xmm0, RK [3])
    …
    xmm0 = AESENC (xmm0, RK [9])
    xmm0 = AESENCLAST (xmm0, RK [10])
    store xmm0
end

Each instruction waits L cycles (the instruction latency) for the result of the previous one

Performance: (10 x Latency) cycles / 16B

Efficient Usage of AES-NI: Loop Reversal

for i from 0 to N_BLOCKS/8 - 1 do
    xmm1 = BLOCK [8*i+1]; xmm2 = BLOCK [8*i+2]; … xmm8 = BLOCK [8*i+8] // load (and xor with RK [0])
    xmm1 = AESENC (xmm1, RK [1])
    xmm2 = AESENC (xmm2, RK [1])
    xmm3 = AESENC (xmm3, RK [1])
    …
    xmm8 = AESENC (xmm8, RK [1])
    xmm1 = AESENC (xmm1, RK [2]) // L cycles have elapsed, xmm1 is ready: no need to wait
    xmm2 = AESENC (xmm2, RK [2])
    …
    xmm8 = AESENC (xmm8, RK [2])
    …
    xmm1 = AESENCLAST (xmm1, RK [10])
    xmm2 = AESENCLAST (xmm2, RK [10])
    …
    xmm8 = AESENCLAST (xmm8, RK [10])
    store xmm1; store xmm2; … store xmm8
end

Schedule the flow so that dependent AES-NI instructions are spaced by more than L cycles

Effectively, dispatch an AES-NI instruction every cycle

Throughput: 80 cycles / (8*16B)

Gain: speedup factor of L

Parallel modes of operation and a fully pipelined hardware implementation of the AES-NI allow for re-scheduling the flow so that dependent AES-NI instructions are spaced far enough apart to hide the latency of one instruction
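The throughput figures on this slide follow from a simple cost model, sketched below. The model is a simplification made for illustration (it ignores loads, stores and key handling), and the function name and parameters are not from the slides: `latency` is L, `issue` is the number of cycles between two dispatched AES instructions, and `nblocks` is the number of independent blocks interleaved.

```c
#include <assert.h>

/* Cycles per byte for `rounds` dependent AES round instructions per 16-byte
   block when `nblocks` independent blocks are interleaved. */
double cycles_per_byte(int rounds, int latency, int issue, int nblocks) {
    assert(rounds > 0 && latency > 0 && issue > 0 && nblocks > 0);
    int per_round = nblocks * issue;     /* cycles to dispatch one round for every block */
    if (per_round < latency)
        per_round = latency;             /* too few blocks: still wait for the first result */
    return (double)(rounds * per_round) / (16.0 * nblocks);
}
```

With L = 8 and one instruction dispatched per cycle, a single block costs 80 cycles per 16 bytes, while eight interleaved blocks cost 80 cycles per 128 bytes: the speedup factor of L stated above.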


Parallelizing CBC encryption

#include <wmmintrin.h>  /* AES-NI intrinsics */

/* Key_Sched, IV1..IV4, PLAINTEXT1..4, CIPHERTEXT1..4 and NBLOCKS are assumed
   to be defined elsewhere; the buffers must be 16-byte aligned. */
void AES_128_CBC_Encrypt_Parallel_4_Blocks () {
    int i, j, k;
    __m128i feedback1, feedback2, feedback3, feedback4;
    __m128i tmp1, tmp2, tmp3, tmp4;
    __m128i RKEY [11];

    for (k=0; k<11; k++) {
        RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Sched [4*k]);
    }
    feedback1 = _mm_load_si128 ((__m128i*)&IV1 [0]);
    feedback2 = _mm_load_si128 ((__m128i*)&IV2 [0]);
    feedback3 = _mm_load_si128 ((__m128i*)&IV3 [0]);
    feedback4 = _mm_load_si128 ((__m128i*)&IV4 [0]);
    for (i=0; i < NBLOCKS; i++) {
        tmp1 = _mm_load_si128 ((__m128i*)&PLAINTEXT1[i*4]);
        tmp2 = _mm_load_si128 ((__m128i*)&PLAINTEXT2[i*4]);
        tmp3 = _mm_load_si128 ((__m128i*)&PLAINTEXT3[i*4]);
        tmp4 = _mm_load_si128 ((__m128i*)&PLAINTEXT4[i*4]);
        /* CBC chaining, one feedback register per stream */
        tmp1 = _mm_xor_si128 (tmp1, feedback1);
        tmp2 = _mm_xor_si128 (tmp2, feedback2);
        tmp3 = _mm_xor_si128 (tmp3, feedback3);
        tmp4 = _mm_xor_si128 (tmp4, feedback4);
        /* initial AddRoundKey */
        tmp1 = _mm_xor_si128 (tmp1, RKEY[0]);
        tmp2 = _mm_xor_si128 (tmp2, RKEY[0]);
        tmp3 = _mm_xor_si128 (tmp3, RKEY[0]);
        tmp4 = _mm_xor_si128 (tmp4, RKEY[0]);
        /* the four streams are independent, so their rounds interleave */
        for (j=1; j <10; j++) {
            tmp1 = _mm_aesenc_si128 (tmp1, RKEY [j]);
            tmp2 = _mm_aesenc_si128 (tmp2, RKEY [j]);
            tmp3 = _mm_aesenc_si128 (tmp3, RKEY [j]);
            tmp4 = _mm_aesenc_si128 (tmp4, RKEY [j]);
        }
        tmp1 = _mm_aesenclast_si128 (tmp1, RKEY [10]);
        tmp2 = _mm_aesenclast_si128 (tmp2, RKEY [10]);
        tmp3 = _mm_aesenclast_si128 (tmp3, RKEY [10]);
        tmp4 = _mm_aesenclast_si128 (tmp4, RKEY [10]);
        feedback1 = tmp1;
        feedback2 = tmp2;
        feedback3 = tmp3;
        feedback4 = tmp4;
        _mm_store_si128 ((__m128i*)&CIPHERTEXT1[4*i], tmp1);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT2[4*i], tmp2);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT3[4*i], tmp3);
        _mm_store_si128 ((__m128i*)&CIPHERTEXT4[4*i], tmp4);
    }
}

Parallelization at a higher level: operate on multiple independent data streams in parallel


Performance projections

• Highly optimized software implementations of AES

– On today’s silicon ~15 cycles/byte (OpenSSL)

– 18 cycles/byte from MSFT on 2006 platform

• No side channel mitigation included

• Mitigation is costly (no known real “protected implementation”)

• With AES-NI:

– Side channel mitigation is built-in

– Significant speedup

• 2-3x in CBC encrypt in serial mode

• More than 10x in parallel modes of operation

More on Software Flexibility

Rijndael-256 (256b block size)

VPBLENDVB xmm3, xmm2, xmm0, xmm5

VPBLENDVB xmm4, xmm0, xmm2, xmm5

PSHUFB xmm3, xmm8

PSHUFB xmm4, xmm8

AESENC xmm0, xmm6

AESENC xmm2, xmm7

Callouts:

• “left” half of RIJNDAEL input state (columns 0-3)

• “right” half of RIJNDAEL input state (columns 4-7)

• “right” half of RIJNDAEL round key

• “left” half of RIJNDAEL round key

• Mask: 0x03020d0c0f0e09080b0a050407060100 (accounts for ShiftRows)

• VPBLENDVB mask: selects bytes 1-3, 6-7, 10-11, 15 from the 1st operand and all other bytes from the 2nd operand

Isolating the AES Transformations

• AES-NI perform bundled sequences of AES transformations

– But each one of these transformations can be isolated by a proper combination of the AES instructions and byte shuffling (the PSHUFB instruction).

• Motivation

– Constructing cipher variants

– Supporting possible future modifications in the AES standard

– Using the AES primitives as building blocks for ciphers and for cryptographic hash functions.

• Hashing: some of the new Secure Hash Function submissions to NIST’s SHA-3 competition use AES rounds and/or AES transformations.

– E.g., LANE, SHAMATA, SHAvite-3, and Vortex


Isolating the AES Transformations

Isolating ShiftRows

PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500

Isolating InvShiftRows

PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00

Isolating MixColumns

AESDECLAST xmm0, 0x00000000000000000000000000000000

AESENC xmm0, 0x00000000000000000000000000000000

Isolating InvMixColumns

AESENCLAST xmm0, 0x00000000000000000000000000000000

AESDEC xmm0, 0x00000000000000000000000000000000

Isolating SubBytes

PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00

AESENCLAST xmm0, 0x00000000000000000000000000000000

Isolating InvSubBytes

PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500

AESDECLAST xmm0, 0x00000000000000000000000000000000

Why the MixColumns sequence works:

AESDECLAST xmm0, 0
Tmp:= Inverse Shift Rows (State);
Tmp:= Inverse Substitute Bytes (Tmp);
xmm0:= Tmp xor 0

AESENC xmm0, 0
Round Key:= 0
Tmp:= Shift Rows (Tmp);
Tmp:= Substitute Bytes (Tmp);
Tmp:= Mix Columns (Tmp);
xmm0:= Tmp xor 0
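The two PSHUFB masks on this slide can be checked with a small byte-shuffle model. The sketch assumes mask bytes in the range 0..15 (PSHUFB's zeroing behaviour for bytes with the top bit set is not modelled) and lists each 128-bit constant least significant byte first; the names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of PSHUFB for masks with all bytes in 0..15: dst[i] = src[mask[i]]. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16]) {
    assert(dst != NULL && src != NULL && mask != NULL && dst != src);
    for (int i = 0; i < 16; i++)
        dst[i] = src[mask[i] & 0x0f];
}

/* 0x0b06010c07020d08030e09040f0a0500 (ShiftRows), least significant byte first. */
const uint8_t SHIFT_ROWS_MASK[16] = {
    0x00, 0x05, 0x0a, 0x0f, 0x04, 0x09, 0x0e, 0x03,
    0x08, 0x0d, 0x02, 0x07, 0x0c, 0x01, 0x06, 0x0b
};

/* 0x0306090c0f0205080b0e0104070a0d00 (InvShiftRows), least significant byte first. */
const uint8_t INV_SHIFT_ROWS_MASK[16] = {
    0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b,
    0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03
};
```

Shuffling with one mask and then the other returns the original state, which is why PSHUFB with the InvShiftRows mask followed by AESENCLAST with a zero key leaves only SubBytes.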


AES-NI and PCLMULQDQ: Latency and throughput

• Micro-architectural enhancements:

• Latency and throughput improve across CPU generations

• (Latency and Throughput are measured in cycles)

• The AES-NI: AESENC/AESENCLAST, AESDEC/AESDECLAST

• Latency/Throughput

• WSM: 7/2; SNB: 8/1; HSW: 7/1; BDW: 7/1; SKL: 4/1

• PCLMULQDQ:

• Latency/Throughput

• SNB: 14/8; HSW: 7/2; BDW: 7/1; SKL: 4/1

Architecture codenames: Westmere (WSM), Sandy Bridge (SNB), Haswell (HSW), Broadwell (BDW), Skylake (SKL)

Backup

