1 Vulnerabilities on high-end processors
André Seznec, IRISA/INRIA, CAPS project-team

2 A paradox
§ Microarchitectures are more and more complex
§ Timing side-channel attacks have been presented on implementations of AES (Bernstein) and RSA (Acıiçmez et al.)

3 Many hardware features exist only to improve performance
§ Caches
§ Pipeline
§ Superscalar execution
§ Branch prediction
§ Thread parallelism

4 Execution time of a short instruction sequence is a complex function!
[Figure: components whose state affects the timing: branch predictor (correct/mispredict), ITLB, DTLB, I-cache, D-cache and L2 cache (each hit/miss), execution core]

5 Execution time of a short instruction sequence is a complex function (2)
§ Depends on the precise state of every microarchitecture component:
   → More than 100 speculative instructions in flight at the same time on a Pentium 4
§ Instructions are executed out of order.
   → Strange correlations, almost unpredictable at compile time (even in the back-end compiler)

6 Understanding the AES cache timing attack on high-end microprocessors (follows Bernstein 2005)
§ AES with lookup tables is a 10-round algorithm with the following "vulnerabilities":
   → The number, the types and the order of the instructions are independent of the key K and the message M to be encrypted.
   → The exact locations of the data words read and written by the first round depend only on K xor M:
      – The execution time of the first round depends on K xor M (at least statistically)
→ CAN BE EXPLOITED
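The dependence on K xor M can be sketched in a few lines. This is only an illustration: the table entry size (4 bytes), the cache line size (64 bytes) and the helper names are assumptions, not taken from the slides.

```python
# Illustrative only: table layout and line size are assumptions.
ENTRY_BYTES = 4      # assumed size of one lookup-table entry
LINE_BYTES = 64      # assumed cache line size

def first_round_indices(key: bytes, msg: bytes) -> list[int]:
    """Table index used for each of the 16 state bytes in the first round."""
    return [k ^ m for k, m in zip(key, msg)]

def touched_lines(key: bytes, msg: bytes) -> list[int]:
    """Cache line (within its table) touched by each first-round lookup."""
    per_line = LINE_BYTES // ENTRY_BYTES   # 16 entries share one line
    return [idx // per_line for idx in first_round_indices(key, msg)]

key = bytes(range(16))
msg = bytes([0xA5] * 16)

# XOR the same mask into both K and M: K xor M is unchanged,
# so the first round touches exactly the same cache lines.
mask = bytes([0x3C] * 16)
key2 = bytes(k ^ x for k, x in zip(key, mask))
msg2 = bytes(m ^ x for m, x in zip(msg, mask))
same = touched_lines(key, msg) == touched_lines(key2, msg2)
print(same)
```

Two different (K, M) pairs with the same K xor M are indistinguishable in their first-round memory footprint, which is why timing statistics reveal K xor M rather than K directly.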

7 Bernstein 2005 (empty cache)
§ Plaintext attack
§ Unrealistic hypotheses:
   → Access to cycle-accurate encryption timing
   → Cache is flushed between two encryptions
      • Not explicit in the paper (but see Lauradoux et al.)
§ Byte-by-byte determination of the key, by statistically determining the maximum encryption time for each byte of K xor M
§ Works only on Pentium 3, not on Pentium 4
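The byte-by-byte statistical step can be sketched on a simulated timing source. Everything here is a toy model under stated assumptions: the timing function, the noise level, the penalty and `SLOW_INDEX` (a table index assumed to have been found slowest by profiling a machine with a known key) are hypothetical, and no real AES is run.

```python
import random

SLOW_INDEX = 0x5A   # hypothetical: table index whose lookup is slowest (from profiling)

def simulated_time(key_byte: int, plain_byte: int, rng: random.Random) -> float:
    """Toy timing model: noisy baseline plus a small penalty for one table index."""
    penalty = 8.0 if (key_byte ^ plain_byte) == SLOW_INDEX else 0.0
    return 1000.0 + rng.gauss(0.0, 10.0) + penalty

def recover_key_byte(key_byte: int, samples: int = 100_000, seed: int = 1) -> int:
    """Average the timings per plaintext byte value and pick the slowest one."""
    rng = random.Random(seed)
    total = [0.0] * 256
    count = [0] * 256
    for _ in range(samples):
        p = rng.randrange(256)
        total[p] += simulated_time(key_byte, p, rng)
        count[p] += 1
    mean = [t / c if c else 0.0 for t, c in zip(total, count)]
    slowest_plain = max(range(256), key=mean.__getitem__)
    # The slowest plaintext byte satisfies p xor k == SLOW_INDEX.
    return slowest_plain ^ SLOW_INDEX

guess = recover_key_byte(0x3C)
print(hex(guess))
```

A single measurement is dominated by noise; only the per-value averages over many encryptions make the slow index stand out.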

8 A loaded cache attack (proof-of-concept codes available)
§ Plaintext attack:
   → Timing of a large number of encryptions
§ An unrealistic hypothesis:
   → Access to cycle-accurate encryption timings
§ On a byte basis of K xor M, statistically determine the bit substrings leading to the highest encryption time (+ threshold to get confidence)
§ Depending on the microarchitecture:
   – 0 to 80 bits of the key recovered by this method, depending on the model and stepping of the Pentium 4
   – Banking in the cache is suspected to be what the attack exercises

9 First vulnerability
§ For a given instruction sequence,
   → Timings are erratic:
      • Unlikely to get exactly the same timing twice
   → But statistically correlated:
      • Cache banking and operation chaining show up in the average

10 A possible countermeasure for AES
§ Periodically and randomly change the mapping of the lookup tables:
   → 9000 cycles for this change, using an XOR-based permutation:
      • See Lauradoux et al.
   → HAVEGE can provide the random numbers.
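One way to read "XOR-based permutation" is sketched here, as an assumption rather than the scheme of Lauradoux et al.: logical entry i is stored in slot i xor mask, and the mask is redrawn periodically. The function names are hypothetical, and `secrets.randbelow` merely stands in for the random source (the slide suggests HAVEGE).

```python
import secrets

def remap(table: list[int], old_mask: int, new_mask: int) -> list[int]:
    """Move every entry from slot (i ^ old_mask) to slot (i ^ new_mask).

    Assumes len(table) is a power of two and both masks are below it.
    """
    out = [0] * len(table)
    for slot, value in enumerate(table):
        out[slot ^ old_mask ^ new_mask] = value
    return out

def lookup(table: list[int], mask: int, index: int) -> int:
    """Read logical entry `index` from the permuted table."""
    return table[index ^ mask]

logical = [(i * 7 + 3) % 256 for i in range(256)]   # stand-in for a lookup table
stored, mask = list(logical), 0

# Periodic re-randomisation: draw a new mask and move the entries.
new_mask = secrets.randbelow(256)
stored, mask = remap(stored, mask, new_mask), new_mask

ok = all(lookup(stored, mask, i) == logical[i] for i in range(256))
print(ok)
```

The lookups return the same values as before, but the physical location (and hence the cache line) of each entry changes with every new mask, which decorrelates timings from K xor M across remapping periods.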

11 Indirect timing measures?
§ Hypotheses:
   → The attacker has access to user mode on the system (legal or illegal)
   → The attacker has no access to your data
   → He/she can run his/her process concurrently with the encryption
§ On conventional systems, no access to the microscopic timing of your application:
   → Time slices are in the 1,000s of cycles

12 Simultaneous Multithreading (SMT): parallel processing on a single processor
§ Functional units are underused on superscalar processors
§ SMT:
   → Sharing the functional units of a superscalar processor between several processes
§ Advantages:
   → A single process can use all the resources
   → Dynamic sharing of all structures on parallel/multiprocess workloads
Second vulnerability

13 [Figure: issue slots on a superscalar processor vs. on an SMT processor]

14 Indirect timing measures on an SMT processor (principles)
SPY wants to get information on CRYPT
1. SPY and CRYPT run in parallel
2. SPY tracks a specific event on CRYPT:
   – For instance, the execution of a branch
3. SPY saturates the hardware resources that CRYPT needs to execute this event fast
4. SPY records its own execution time (reading the hardware clock counter):
   – An irregularity in its own execution time signals the event: CRYPT has tried to grab the hardware resource
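Step 4 amounts to spotting outliers in the spy's own loop timings. A minimal sketch, assuming the timings have already been collected (the numbers below are made up, and the median-based threshold is one possible choice, not the method used in the talk):

```python
from statistics import median

def find_irregularities(timings: list[int], factor: float = 1.5) -> list[int]:
    """Indices of spy iterations whose duration exceeds factor * median."""
    threshold = factor * median(timings)
    return [i for i, t in enumerate(timings) if t > threshold]

# Hypothetical cycle counts for ten passes of the spy loop.
timings = [100, 102, 99, 101, 160, 100, 98, 155, 101, 100]
events = find_irregularities(timings)
print(events)
```

On real hardware the timings would come from the cycle counter read around each pass of the spy loop; the flagged indices mark the moments when CRYPT competed for the saturated resource.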

15 Indirect timing measures on an SMT: proof of concept (derived from SBPA)
The skeleton of a naive RSA core:
    For I = 1 to N
        Sequence X          // 1,000s of cycles
        If Key[I] = 1       // spy on this branch B
            Sequence Y      // 1,000s of cycles
    Endfor
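A concrete instance of this skeleton is left-to-right square-and-multiply exponentiation. Mapping Sequence X to the squaring and Sequence Y to the multiplication is an assumption made for illustration; the slide does not say what X and Y are.

```python
def naive_modexp(base: int, key_bits: list[int], modulus: int) -> int:
    """Left-to-right square-and-multiply; key_bits is most significant bit first."""
    result = 1
    for bit in key_bits:
        result = (result * result) % modulus      # Sequence X: always executed
        if bit == 1:                              # branch B: depends on the key bit
            result = (result * base) % modulus    # Sequence Y: only for 1 bits
    return result

bits = [1, 0, 1, 1]                # exponent 11, most significant bit first
value = naive_modexp(7, bits, 1000)
print(value == pow(7, 11, 1000))
```

Each loop iteration is either X-type (bit 0) or XY-type (bit 1), so observing the outcome of branch B once per iteration is enough to read the exponent bit by bit.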

16 Indirect timing measures on an SMT: proof of concept (2)
§ Branch instructions are buffered in a BTB (branch target buffer):
   → On Pentium 4, when a branch misses in the BTB, the penalty is more than 20 cycles
§ SPY: a nearly infinite loop iterating over a set of branches that occupy the possible BTB entries for B
   → Track irregularities in the timing of the loop:
      • When B is executed, a branch of the SPY is ejected from the BTB, thus creating a timing irregularity:
         – This reveals whether the iteration is X-type or XY-type
Able to reproduce this attack on a toy example
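The eviction mechanism can be simulated with a deliberately simplified model. All of it is assumed for illustration: a direct-mapped BTB, a known set index for B, perfect interleaving of one spy pass per CRYPT iteration, B claiming an entry only in XY-type iterations, and a hypothetical loop cost. The real Pentium 4 BTB is undocumented and far less cooperative.

```python
BTB_MISS_PENALTY = 20   # cycles; the slide quotes "more than 20" on Pentium 4
LOOP_COST = 5           # hypothetical cost of a spy pass that hits in the BTB

class ToyBTB:
    """Direct-mapped toy BTB: each set remembers which thread owns its entry."""

    def __init__(self, sets: int = 16):
        self.owner = [None] * sets

    def access(self, set_index: int, tag: str) -> bool:
        """Return True on a hit; on a miss, evict the previous owner."""
        hit = self.owner[set_index] == tag
        self.owner[set_index] = tag
        return hit

def spy_on(key_bits: list[int], target_set: int = 3) -> list[int]:
    """Recover key bits from the spy's own simulated pass timings."""
    btb = ToyBTB()
    btb.access(target_set, "spy")            # spy first occupies the entry B maps to
    recovered = []
    for bit in key_bits:
        if bit == 1:                         # assumption: B claims the entry only in XY iterations
            btb.access(target_set, "crypt")
        hit = btb.access(target_set, "spy")  # spy pass following the CRYPT iteration
        cost = LOOP_COST + (0 if hit else BTB_MISS_PENALTY)
        recovered.append(1 if cost > LOOP_COST else 0)
    return recovered

key = [1, 0, 1, 1, 0, 0, 1, 0]
print(spy_on(key) == key)
```

The spy never reads CRYPT's data: it only observes that some of its own passes take about 20 cycles longer, which in this model happens exactly when the key bit is 1.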

17 Indirect timing measures on an SMT
§ Feasible:
   → On a branch on Pentium 4 HT, information is leaking:
      • I recovered all the bits of a 32-bit key in a single run (on a toy example)
   → The same kind of attack may apply to cache accesses: the memory access sequence could be discovered

18 Feasible, but difficult
§ Technically, very difficult:
   → Lack of documentation on the BTB
      • Strange indexing, unknown associativity, BTB hierarchy
   → Requires relatively infrequent events, one every 1,000s of cycles: the measurement resolution is in the 100s of cycles

19 So what?
§ On Pentium 4 HT:
   → If key bits control branches (or addresses of loads):
      • They might be recovered by a spy thread

20 Countermeasures
§ Just deactivate Hyper-Threading.
   → At present, that is a global OS mode (set at boot time)
§ Rework the implementation:
   → Introduce randomness in the control path at execution?
      • Makes the attack much more complex