High-Speed Cryptography in Java: X25519, Poly1305, and EdDSA
Adam Petcher
Java Platform Group, Oracle
adam.petcher@oracle.com
Copyright © 2018, Oracle and/or its affiliates. All rights reserved.

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.

This Talk: Not All Crypto
[Chart: Composition of Talk, with segments Cryptography, Efficient Java, and Benchmarks]

Overview
• New crypto algorithms implemented for JDK 11 and later
• Modern techniques, more efficient, better security
• 100% Java crypto can be fast

Outline
• Background
  – Side-channel attacks
  – Modern crypto algorithms
  – General implementation techniques
• High-speed cryptography in Java
  – Implementation details
  – Benchmarks
  – Relevant future Java features

Background: Side-Channel Attacks

Side Channels in Crypto: Square-and-Multiply
Non-secret a, secret x, calculate a^x
    r := 1
    for (i := 0 to bit_length(x)) {
        r := r^2
        if (bit_set(x, i)) {
            r := r * a    // this branch is only taken when bit_set(x, i) for some i
        }
    }
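
For reference, the pseudocode above might look like this in Java. This is a sketch of the leaky pattern, not code from the JDK: the class and method names are hypothetical, BigInteger is used only for brevity, and it assumes the bit index counts down from the most significant bit of x.

```java
import java.math.BigInteger;

/** Illustration only: a deliberately leaky square-and-multiply. Not for real secrets. */
final class LeakySquareAndMultiply {

    /** Computes a^x mod m for non-negative x, scanning x from its highest bit. */
    static BigInteger modPow(BigInteger a, BigInteger x, BigInteger m) {
        BigInteger r = BigInteger.ONE;
        for (int i = x.bitLength() - 1; i >= 0; i--) {
            r = r.multiply(r).mod(m);       // always square
            if (x.testBit(i)) {             // secret-dependent branch
                r = r.multiply(a).mod(m);   // extra multiply only for 1 bits
            }
        }
        return r;
    }

    public static void main(String[] args) {
        BigInteger m = BigInteger.valueOf(1_000_003);
        BigInteger a = BigInteger.valueOf(5);
        BigInteger x = BigInteger.valueOf(45);
        // Compare against the library implementation.
        System.out.println(modPow(a, x, m).equals(a.modPow(x, m)));
    }
}
```

The number of extra multiplications equals the number of 1 bits in x, which is what an observer of running time or cache state can pick up.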

Why is this a Problem?
• Side-channel attack
  – Attacker looks at something in addition to function input/output
  – Learns secret keys, cleartext, etc.
• Timing attack
  – Time of multiplication is measurable → trivially leaks number of bits set in secret
  – More sophisticated timing attacks run multiple operations to recover key
• Cache attack (Flush+Reload)
  – Presence/absence of program fragment in cache tells whether it was executed
  – Co-located attacker can flush fragment from cache and see if it is re-loaded
  – Practical in multi-tenant cloud environment
• Any branching (if/else, while, a ? b : c) can leak secrets to side channels

Sources of Branching in Crypto Code
• Efficient multiplication and modular reduction
  – Recursive in general-purpose arithmetic libraries
• Square-and-multiply
  – Also “double-and-add”
• Elliptic curve point arithmetic
  – E.g. add(p, q) doesn’t work when p==q or p==0
  – Check for exceptional conditions and do something else
• Indexing is also a problem
  – Lookup in precomputed tables based on fragment of secret
  – Often used in elliptic curve arithmetic

Background: Modern Crypto Algorithms

Crypto Algorithms in Context: TLS 1.3
[Diagram: TLS 1.3 handshake between Client and Server]
• Client → Server: ClientHello + Key Agreement Share
• Server → Client: ServerHello + Key Agreement Share + Signature
• Client: Verify Signature
• Both sides: Key Agreement, then Authenticated Encryption

Java Cryptography Architecture (JCA)
• Provider architecture for crypto services in JDK
• Application requests algorithm, provider framework chooses implementation
• APIs for different services (Signature, KeyAgreement, Cipher, etc.)
• TLS implementation in JDK uses JCA for crypto services
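
A minimal sketch of the "application requests algorithm" pattern, using the KeyAgreement service with the X25519 algorithm name. It assumes JDK 11 or later with the default providers installed; the class and method names (X25519Demo, agree, sharedSecretsMatch) are made up for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Arrays;
import javax.crypto.KeyAgreement;

final class X25519Demo {

    /** One side of the agreement: our private key combined with the peer's public key. */
    static byte[] agree(KeyPair ours, KeyPair theirs) throws GeneralSecurityException {
        // The application names the algorithm; the provider framework picks the implementation.
        KeyAgreement ka = KeyAgreement.getInstance("X25519");
        ka.init(ours.getPrivate());
        ka.doPhase(theirs.getPublic(), true);
        return ka.generateSecret();
    }

    /** Generates two key pairs and checks that both sides derive the same secret. */
    static boolean sharedSecretsMatch() {
        try {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("X25519");
            KeyPair alice = kpg.generateKeyPair();
            KeyPair bob = kpg.generateKeyPair();
            return Arrays.equals(agree(alice, bob), agree(bob, alice));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("X25519 not available (requires JDK 11+)", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sharedSecretsMatch());
    }
}
```

In a real protocol the raw shared secret would normally be passed through a key derivation function before use.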

New Crypto Algorithms in the JDK
• X25519/X448
  – Key Agreement using Curve25519/Curve448
  – Delivered in JDK 11 under JEP 324 (JCA only)
• EdDSA
  – Signatures using Curve25519/Curve448
  – Currently in development under JEP 339
• Poly1305
  – Authenticator typically combined with ChaCha20 cipher to get authenticated encryption
  – Part of ChaCha20/Poly1305 delivered in JDK 11 under JEP 329 (JCA only, by Jamil Nimeh)
• Others not covered in this talk (RSASSA-PSS, HKDF)
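
As a rough illustration of how Poly1305 is reached through the JCA, here is an encrypt/decrypt round trip with the "ChaCha20-Poly1305" Cipher transformation. It assumes JDK 11 or later; the class and method names are hypothetical, and the random nonce is only for the demo (a real application must ensure a nonce is never reused with the same key).

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

final class ChaCha20Poly1305Demo {

    /** Encrypts and then decrypts the input, returning the recovered plaintext. */
    static byte[] roundTrip(byte[] plaintext) {
        try {
            SecretKey key = KeyGenerator.getInstance("ChaCha20").generateKey();
            byte[] nonce = new byte[12];                  // 96-bit nonce
            new SecureRandom().nextBytes(nonce);

            Cipher enc = Cipher.getInstance("ChaCha20-Poly1305");
            enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(nonce));
            byte[] ciphertext = enc.doFinal(plaintext);   // ciphertext followed by the Poly1305 tag

            Cipher dec = Cipher.getInstance("ChaCha20-Poly1305");
            dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(nonce));
            return dec.doFinal(ciphertext);               // throws if the tag does not verify
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("ChaCha20-Poly1305 not available (requires JDK 11+)", e);
        }
    }

    public static void main(String[] args) {
        byte[] out = roundTrip("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```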

Benefits of New Algorithms
• Resilient against side-channel attacks
• Good performance on multiple platforms
• Trustworthy construction/parameters
• Conservative security against known attacks
• Simple implementation that is hard to get wrong
• No need to validate input
• Not covered by patents
• Important elements of TLS 1.3

Background: Modern Crypto Implementation

Efficient Branchless Elliptic Curve Implementation
• Complete, efficient point arithmetic
  – No need to check for exceptional points
  – Minimize number of field multiplications
• Branchless double-and-add
  – Use branchless conditional assignment/swap, e.g. RFC 7748 (X25519/X448)
• Branchless table lookups
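
The branchless conditional swap mentioned above can be sketched as follows, in the mask-and-XOR style that RFC 7748 describes for cswap. The class name and the long[] limb representation are assumptions for the example, and a JIT compiler is free to transform this code, so it illustrates the technique rather than guaranteeing constant time.

```java
final class ConstantTimeSwap {

    /**
     * Swaps the contents of a and b when swap == 1 and leaves them
     * unchanged when swap == 0, without branching on swap.
     * Both arrays must have the same length.
     */
    static void cswap(int swap, long[] a, long[] b) {
        long mask = -(long) swap;            // 0 -> all zero bits, 1 -> all one bits
        for (int i = 0; i < a.length; i++) {
            long t = mask & (a[i] ^ b[i]);   // either 0 or the bitwise difference
            a[i] ^= t;
            b[i] ^= t;
        }
    }
}
```

The same sequence of instructions runs for both values of swap; only the data differs.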

Efficient Branchless Finite Field Arithmetic
• Problem: addition/multiplication in field of integers modulo p
    a + b (mod p)
    a * b (mod p)
• With these constraints/goals
  – Numbers are large (~100-500 bits)
  – Must be fast
    • Primary bottleneck of these crypto algorithms
    • e.g. X25519 op needs 2040 adds and 2550 multiplies
  – Value of p is fixed and known at compile time
    • Also, p has some useful structure like 2^a - 2^b - k
  – At most 1 or 2 adds before each multiply

Finite Field Arithmetic in Crypto Algorithms
[Diagram: implementation layers, top to bottom]
• Algorithms: X25519/X448, EdDSA, Poly1305 (RFC 7539)
• Curve arithmetic: X Elliptic Curve Arithmetic (RFC 7748), Edwards Elliptic Curve Arithmetic (RFC 8032)
• Specific Finite Field Implementations: 2^255 - 19, 2^448 - 2^224 - 1, 2^130 - 5
• Shared Finite Field Arithmetic Implementation

Traditional Big Number Modular Arithmetic
• Use an array of e.g. 64-bit longs
• Add: Loop through arrays, add and carry
  – Reduce during/after add to keep array length fixed
  – Or reduce after several add/multiply operations
• Multiply: Use appropriate algorithm based on size
  – Grade school, Karatsuba
  – May require reduction during algorithm

Traditional Modular Arithmetic Example
[Diagram: limb arrays passing through add/carry, multiply/carry, and reduce. Legend: each box is a 64-bit integer; high-order limbs on the left]
• Problems with carry:
  – May require branching
  – Cannot be done in parallel
  – Almost never needed after an add

Better Modular Arithmetic
• Use an array of 64-bit longs, but store fewer bits in each
• Add: Loop through arrays and add
  – Numbers get a little bigger, use empty space
  – No carry/reduce necessary, can be done in parallel
• Multiply: Grade school followed by carry/reduce
  – Use 64→128-bit multiply
  – Use remaining space to carry and reduce back to starting state
  – Also works with Karatsuba for larger values (e.g. 400 bits)
• Improves performance because adds are cheap

Modular Arithmetic Example: Curve25519
[Diagram: limb sizes through one add/multiply cycle]
• Start with at most 51 bits in 5 limbs
• add → After add: 52 bits
• multiply → After multiply: 108 bits in 9 limbs
• reduce → After reduce: 60 bits and back to 5 limbs
• Use remaining space for carry/reduce
• carry, reduce, carry → Back to 51 bits in 5 limbs
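
The add and carry steps of this example can be sketched in Java as follows. This is an illustration of the 5 x 51-bit layout only (the multiply step is omitted because it needs 128-bit products); the class and helper names are hypothetical, limbs are assumed non-negative, and BigInteger is used just to check the result.

```java
import java.math.BigInteger;

final class Limbs51 {
    static final int BITS = 51;
    static final long MASK = (1L << BITS) - 1;
    static final BigInteger P = BigInteger.ONE.shiftLeft(255).subtract(BigInteger.valueOf(19));

    /** Limb-wise add with no carry: 51-bit inputs give limbs of at most 52 bits. */
    static long[] add(long[] a, long[] b) {
        long[] r = new long[5];
        for (int i = 0; i < 5; i++) {
            r[i] = a[i] + b[i];
        }
        return r;
    }

    /**
     * One carry pass. The carry out of the top limb has weight 2^255,
     * which is congruent to 19 mod p, so it wraps into limb 0 times 19.
     * Limb 0 may end up slightly above 51 bits; limbs 1..4 fit in 51 bits.
     */
    static void carry(long[] x) {
        for (int i = 0; i < 4; i++) {
            x[i + 1] += x[i] >>> BITS;
            x[i] &= MASK;
        }
        x[0] += 19 * (x[4] >>> BITS);
        x[4] &= MASK;
    }

    /** Interprets the limbs as an integer in radix 2^51 (for checking only). */
    static BigInteger value(long[] x) {
        BigInteger v = BigInteger.ZERO;
        for (int i = x.length - 1; i >= 0; i--) {
            v = v.shiftLeft(BITS).add(BigInteger.valueOf(x[i]));
        }
        return v;
    }
}
```

Neither method branches on limb values: the carry pass always runs, whether or not any limb actually overflowed 51 bits.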

Efficient Modular Reduction
• Modern algorithms use well-structured primes
  – E.g. 2^255 - 19, 2^448 - 2^224 - 1, and 2^130 - 5
• Array representation is aligned with power of first term
• Reduction requires one multiply and add per additional term
• e.g. 2^255 - 19 reduce
  – arr[i - 5] += arr[i] * 19
  – arr[i] := 0
• Very efficient compared to general-purpose reduction
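
A small Java sketch of the 2^255 - 19 reduce step above, assuming 9 limbs in radix 2^51 so that limb 5 has weight 2^255. The names are hypothetical, and the limbs are assumed small enough that multiplying by 19 does not overflow a long (a real 51-bit implementation would need wider products than Java's long provides).

```java
import java.math.BigInteger;

final class Fold25519 {
    static final BigInteger P = BigInteger.ONE.shiftLeft(255).subtract(BigInteger.valueOf(19));

    /**
     * Folds limbs 5..8 into limbs 0..3. Limb i has weight 2^(51*i), and
     * 2^255 is congruent to 19 mod p, so limb i is worth 19 * 2^(51*(i-5)).
     */
    static void reduce(long[] arr) {
        for (int i = 8; i >= 5; i--) {
            arr[i - 5] += arr[i] * 19;
            arr[i] = 0;
        }
    }

    /** Interprets the limbs as an integer in radix 2^51 (for checking only). */
    static BigInteger value(long[] arr) {
        BigInteger v = BigInteger.ZERO;
        for (int i = arr.length - 1; i >= 0; i--) {
            v = v.shiftLeft(51).add(BigInteger.valueOf(arr[i]));
        }
        return v;
    }

    public static void main(String[] args) {
        long[] arr = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        BigInteger before = value(arr).mod(P);
        reduce(arr);
        System.out.println(before.equals(value(arr).mod(P)));
    }
}
```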

Implementation Notes
• Necessary to specialize implementation for each field
  – Number of limbs, bits per limb
  – How to reduce using structure of prime
  – Carry/reduce sequence
• Parts of implementation can be shared
  – Add, multiply, carry, input/output conversion, etc.
• Implementation does not branch
  – Always carry/reduce after multiplication regardless of value

High-Speed Cryptography in Java

Pure Java Challenges
• Java lacks some things that are typically used in native implementations
  – 64→128-bit multiply
  – API for AVX and other SIMD instructions
  – Precise register management
  – Full support for unsigned arithmetic
• Still, performance is good
  – Fast enough for typical applications, but slower than fastest native implementations
  – Modern crypto algorithms allow for faster implementations

Overview of Solution
• Implement EC operations as described in RFCs
• Use mutable objects to avoid garbage and copying
• Use 32-bit finite field implementation
  – Only need 32→64-bit multiply
  – Use signed arithmetic
• Hand-optimize critical parts of code
  – Reuse mutable objects in EC operations
  – Use individual longs instead of arrays in field arithmetic
• Solution balances performance with maintainability

Finite Field Implementation
• Representations
  – Curve25519: 10 limbs, 26 bits each
  – Curve448: 16 limbs, 28 bits each
  – Poly1305: 5 limbs, 26 bits each
• Curve25519 representation is unusual, not aligned exactly
  – Slightly slower, but more general
• Use signed long values in representation (e.g. 9 (mod 11) = -2)
  – Carry is more complicated, subtract is simpler
  – Java has better support for signed operations (e.g. Math.multiplyHigh)
• Branch-free and 10x faster than BigInteger
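
One way a carry can work with signed limbs is to round to the nearest multiple of the limb size, so that the remainder is centered around zero. The sketch below assumes 26-bit limbs stored in longs; the class and method names are hypothetical and the code is not taken from the JDK.

```java
final class SignedCarry {
    static final int BITS = 26;

    /**
     * Moves the excess of limb i into limb i + 1, leaving limb i in the
     * range [-2^25, 2^25). The represented value x[i] + 2^26 * x[i+1]
     * is unchanged. Works for negative limbs because >> is an
     * arithmetic (sign-preserving) shift.
     */
    static void carry(long[] x, int i) {
        long c = (x[i] + (1L << (BITS - 1))) >> BITS;  // x[i] / 2^26, rounded to nearest
        x[i] -= c << BITS;
        x[i + 1] += c;
    }
}
```

With signed limbs, subtraction is just a limb-wise subtract: a negative limb is a valid representation and gets normalized by the next carry.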

Optimization: Manual Loop Unrolling
Start with nested for loop on arrays in multiplication:
    long[] c = new long[2 * NUM_LIMBS - 1];
    for (int i = 0; i < NUM_LIMBS; i++) {
        for (int j = 0; j < NUM_LIMBS; j++) {
            c[i + j] += a[i] * b[j];
        }
    }
    // reduce c into smaller output array
Problem: compiler doesn’t optimize away array, unroll loop, etc.

Optimization: Manual Loop Unrolling
Replace with primitives and unrolled loop:
    long c0 = (a[0] * b[0]);
    long c1 = (a[0] * b[1]) + (a[1] * b[0]);
    long c2 = (a[0] * b[2]) + (a[1] * b[1]) + (a[2] * b[0]);
    …
Doubles performance of X25519 operations.
Found by Sergey Kuksenko (Java Performance Team)
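
To make the equivalence concrete, here is a complete toy version of both forms for a hypothetical 3-limb field (the real fields use 5 to 16 limbs). The class and method names are made up, and the unrolled version returns an array only so the two results can be compared; production code would feed the locals straight into carry/reduce.

```java
import java.util.Arrays;

final class UnrolledMultiply {
    static final int NUM_LIMBS = 3;

    /** Schoolbook product using nested loops over arrays. */
    static long[] multLoop(long[] a, long[] b) {
        long[] c = new long[2 * NUM_LIMBS - 1];
        for (int i = 0; i < NUM_LIMBS; i++) {
            for (int j = 0; j < NUM_LIMBS; j++) {
                c[i + j] += a[i] * b[j];
            }
        }
        return c;
    }

    /** The same product with the loops unrolled into local primitives. */
    static long[] multUnrolled(long[] a, long[] b) {
        long c0 = (a[0] * b[0]);
        long c1 = (a[0] * b[1]) + (a[1] * b[0]);
        long c2 = (a[0] * b[2]) + (a[1] * b[1]) + (a[2] * b[0]);
        long c3 = (a[1] * b[2]) + (a[2] * b[1]);
        long c4 = (a[2] * b[2]);
        return new long[] {c0, c1, c2, c3, c4};
    }

    public static void main(String[] args) {
        long[] a = {3, 1, 4};
        long[] b = {1, 5, 9};
        System.out.println(Arrays.equals(multLoop(a, b), multUnrolled(a, b)));
    }
}
```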

Benchmarks

Benchmarking Overview
• Test environment
  – Laptop with Core i5-6300U (Skylake) at 2.4 GHz
  – Tests run within Ubuntu VM on Windows host
  – Useful hardware acceleration enabled (aes, pclmulqdq, avx2)
• Only testing/reporting implementations in JDK
• Approximate relative performance only

Benchmarks: X25519/X448
[Bar chart: Operations per Second (more is better), scale 0 to 3000, for Generate Key Pair and Key Agreement; series P-256, P-384, P-521, X25519, X448]
Not pictured: fastest native X25519 implementations around 15k ops/second

Benchmarks: EdDSA (1 KB message)
[Bar chart: Operations per Second (more is better), scale 0 to 1200, for Generate Key Pair, Sign, and Verify; series P-256, P-384, P-521, Ed25519, Ed448]

Benchmarks: Poly1305
[Bar chart: MB per Second (more is better), scale 0 to 800, for 1 MB Message and 1 KB Message; series HmacSHA256, Poly1305, AES/GCM* Encrypt, ChaCha20/Poly1305 Encrypt]
*AES/GCM uses hardware acceleration

Java Crypto Performance in the Future

Better Code Generation
• Is it necessary to hand-unroll loops?
• Compiler could figure out that
  – Array does not escape
  – Array length and indices are constant
  – Loop can be unrolled
• GraalVM (Oracle Labs) is able to do this
  – Experiment by Eric Caspole, Java Performance Team
• Allows us to use simple loop in code

Better Low-level Instructions
• 64-to-128-bit scalar multiply
  – Intrinsic for Math.multiplyHigh (JDK 10)
  – Compiler could combine multiplyHigh and multiply into single operation
• SIMD (e.g. Intel AVX)
  – Allows us to do several 32→64-bit multiplies in parallel
  – Panama project (OpenJDK) working on Vector API for SIMD
  – API is expressive enough for vectorized X25519 (working prototype)
  – Needs more work for performance to match scalar implementation
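
For illustration, a full signed 64→128-bit product can be assembled today from Math.multiplyHigh (available since Java 9) and an ordinary multiply, which is the pair of operations the slide suggests a compiler could fuse. The class and method names here are hypothetical.

```java
final class Mul128 {

    /**
     * Returns the signed 128-bit product of a and b as {high, low},
     * where the product equals high * 2^64 + (low read as unsigned).
     */
    static long[] mul(long a, long b) {
        long high = Math.multiplyHigh(a, b);  // upper 64 bits of the product
        long low = a * b;                     // lower 64 bits (wraps on overflow)
        return new long[] {high, low};
    }
}
```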

Better Register/Memory Management
• Modern crypto algorithms minimize register use
• E.g. Poly1305 register management
  – 3-5 registers for accumulator
  – 3-5 registers for fixed “r” value (part of secret key)
  – Message is read from memory, result is written out at end, no other memory IO
  – Optionally, r is in memory, accumulator in FP registers, all GP registers for cipher
• No way to tell Java “keep this in register”
  – Not in C either, fastest native implementations use assembly

Conclusion
• Modern crypto algorithms implemented in JDK, and more on the way
  – Performance is good now, and can be improved in the future
• More information
  – X25519/X448 JEP: http://openjdk.java.net/jeps/324
  – EdDSA JEP: http://openjdk.java.net/jeps/339
  – ChaCha20/Poly1305 JEP: http://openjdk.java.net/jeps/329
• Try it out!
  – JDK 11 Release: http://jdk.java.net/11/
  – JEPs have example code

Other Sessions
• Transport Layer Security (TLS) v1.3 support in Java
  – Bradford Wetmore and Xue-Lei Fan
  – Monday, 11:30 to 12:15, Moscone West 2004
• Sergey Kuksenko talks about Performance
  – HTTP/2 Client: Wednesday, 2:30-3:15, Moscone West 2005
  – Garbage Collection: Wednesday, 4:00-4:45, Moscone West 2004
• Java Crypto Hackergarten
  – Tuesday 10:00-12:00, Developer Lounge Loft, Moscone West, Level 2