
Design and Implementation of a Multiple-Precision Integer Library for GPUs
Name: Kaiyong Zhao
Supervisor: Dr. X.-W. Chu

Background & Related Work
• Multiple-Precision Integer
• GPU Computing & CUDA
Multiple-Precision Arithmetic for CUDA
• Multiple-Precision Arithmetic
Implementation on GPUs
• Data Structure
Experimental Result

Multiple-Precision Integer
• 32-bit & 64-bit systems
• Multiple-precision integer
GPU Computing & CUDA
• GPGPU
• CUDA

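Base-10 Integer vs. Big Integer in the System
• A big integer is stored as a sequence of digits in base b, where b is 2^32, so each digit fits in one 32-bit word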
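The link between the base-10 form and the base-2^32 form can be sketched in a few lines of host-side C++. This is an illustration only, not the library's own code: the function name `from_decimal` and the little-endian limb order are assumptions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Convert a decimal string into little-endian limbs in base b = 2^32.
// Hypothetical helper for illustration; assumes the string holds digits only.
std::vector<std::uint32_t> from_decimal(const std::string& s) {
    std::vector<std::uint32_t> limbs;  // empty vector represents zero
    for (char c : s) {
        // value = value * 10 + digit, carried limb by limb
        std::uint64_t carry = static_cast<std::uint64_t>(c - '0');
        for (auto& limb : limbs) {
            std::uint64_t t = static_cast<std::uint64_t>(limb) * 10 + carry;
            limb = static_cast<std::uint32_t>(t);  // low 32 bits stay in this limb
            carry = t >> 32;                       // high bits move to the next limb
        }
        if (carry) limbs.push_back(static_cast<std::uint32_t>(carry));
    }
    return limbs;
}
```

For example, the decimal string "4294967296" (that is, 2^32) becomes the two limbs {0, 1}.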

GPU Computing: The Power of the GPU
• Computing capability
• Memory bandwidth

[Figure: G80 GPU block diagram. Host, input assembler, setup/raster/ZCull, vertex, geometry and pixel thread issue units, streaming processors (SP) grouped into streaming multiprocessors (SM), TF units with L1 caches, L2 caches with frame buffer (FB) partitions, and the thread processor.]

CUDA: Compute Unified Device Architecture
• CUDA: a C parallel computing model for CPU + GPU
• Single Instruction, Multiple Threads (SIMT)
• All threads run the same function (thousands of threads in flight)
• Each core deals with different data
• Hide I/O latency with multithreading (more than 1000 threads)
• Speed up computing/IO translation
• Coalesce the I/O into one transaction when the threads of a half-warp access neighboring data
• 1 cycle on the GPU vs. ~1000 cycles on the CPU

Background & Related Work
Multiple-Precision Arithmetic for CUDA
• Multiple-Precision Arithmetic
Implementation on GPUs
• Data Structure
• Optimization of Data on CUDA
Experimental Result

1. Multiple-precision Comparison
2. Multiple-precision Addition
3. Multiple-precision Subtraction
4. Multiple-precision Modular Addition
5. Multiple-precision Modular Subtraction

6. Multiple-precision Multiplication
7. Multiple-precision Division
8. Multiple-precision Montgomery Reduction
9. Multiple-precision Montgomery Multiplication
10. Barrett Modular Reduction Algorithm

11. Multiple-precision Multiplicative Inversion
12. Multiple-precision Montgomery Exponentiation
13. Montgomery Multi. Exponentiation
14. Multiple-precision Modular Addition …
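The first four operations in the list can be sketched as a sequential CPU reference in C++. This is a minimal sketch, not the library's GPU code: the names `Limbs`, `mp_cmp`, `mp_add`, `mp_sub` and `mp_mod_add` are hypothetical, operands are assumed to have the same number of limbs, and limbs are little-endian in base 2^32.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Limbs = std::vector<std::uint32_t>;  // little-endian, base 2^32 (assumed layout)

// Comparison: -1, 0 or 1 for a < b, a == b, a > b. Scans from the top limb down.
int mp_cmp(const Limbs& a, const Limbs& b) {
    for (std::size_t i = a.size(); i-- > 0;) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}

// Addition: r = a + b modulo 2^(32n); returns the carry out of the top limb.
std::uint32_t mp_add(Limbs& r, const Limbs& a, const Limbs& b) {
    std::uint64_t carry = 0;
    r.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t s = static_cast<std::uint64_t>(a[i]) + b[i] + carry;
        r[i] = static_cast<std::uint32_t>(s);
        carry = s >> 32;
    }
    return static_cast<std::uint32_t>(carry);
}

// Subtraction: r = a - b modulo 2^(32n); returns the borrow out of the top limb.
std::uint32_t mp_sub(Limbs& r, const Limbs& a, const Limbs& b) {
    std::uint64_t borrow = 0;
    r.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t d = static_cast<std::uint64_t>(a[i]) - b[i] - borrow;
        r[i] = static_cast<std::uint32_t>(d);
        borrow = (d >> 32) & 1;  // set when the 64-bit difference wrapped below zero
    }
    return static_cast<std::uint32_t>(borrow);
}

// Modular addition: r = (a + b) mod m, assuming a < m and b < m.
void mp_mod_add(Limbs& r, const Limbs& a, const Limbs& b, const Limbs& m) {
    std::uint32_t carry = mp_add(r, a, b);
    // a + b < 2m, so at most one subtraction of m is needed.
    if (carry || mp_cmp(r, m) >= 0) mp_sub(r, r, m);
}
```

The carry chain in `mp_add` is inherently sequential within one integer, which is one reason a GPU version would more naturally assign one integer per thread; that mapping is an assumption here, not something the slides state.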

Background & Related Work
Multiple-Precision Arithmetic for CUDA
Implementation on GPUs
• Data Structure
• Optimization of Data on CUDA
Experimental Result

Data Structure
• Two types of data structure
Constant Value
• Use cached (constant) memory for constant values
Temp Value
• Use shared memory for temporary values
Balance Resource
• Balance threads and memory
Example
• Data encoding

Example
C = (vector A * matrix B) % prime
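The encoding example can be sketched on the CPU as follows. To stay short, this sketch assumes every element and the prime fit in one 32-bit word, whereas the presentation works with multiple-precision values; the function name `vec_mat_mod` and the row-major layout of B are also assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// c[j] = (sum over i of a[i] * B[i][j]) mod p.
// B is stored row-major with `cols` columns and a.size() rows (assumed layout).
std::vector<std::uint32_t> vec_mat_mod(const std::vector<std::uint32_t>& a,
                                       const std::vector<std::uint32_t>& B,
                                       std::size_t cols, std::uint32_t p) {
    std::vector<std::uint32_t> c(cols, 0);
    for (std::size_t j = 0; j < cols; ++j) {  // each column is independent of the others
        std::uint64_t acc = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            // Reduce after every step so the 64-bit accumulator cannot overflow.
            std::uint64_t prod = static_cast<std::uint64_t>(a[i] % p) * (B[i * cols + j] % p);
            acc = (acc + prod) % p;
        }
        c[j] = static_cast<std::uint32_t>(acc);
    }
    return c;
}
```

Because the columns are independent, a GPU version could plausibly give each thread one column j. With a row-major B, neighboring threads would then read neighboring words of the same row, which is the access pattern that coalescing favors.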

Global Memory
• There is no cache for global memory on G80/GT200
• Constant memory and texture memory have only small caches
• I/O latency is 400-600 clock cycles
• This is the bottleneck, and the key to optimization!

Coalesced Global Memory Accesses

Non-Coalesced Global Memory Accesses

Coalescing on Devices of Compute Capability 1.2 and Higher
• Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment of size equal to:
• 32 bytes if all threads access 8-bit words
• 64 bytes if all threads access 16-bit words
• 128 bytes if all threads access 32-bit or 64-bit words
• Any pattern of addresses requested by the half-warp is allowed, including patterns where multiple threads access the same address

Example of New Coalescing Rules
[Figure: threads 0 to 15 of a half-warp accessing addresses in Segment 0 (128 B) and Segment 1 (128 B); one transaction is reduced to 32 B.]
Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64- and 128-bit data.
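The segment rules can be turned into a small host-side model that lists the memory transactions one half-warp would issue. This sketch follows the procedure as it is usually described for compute capability 1.2/1.3 devices, which is an assumption here: take the segment of the lowest-numbered unserved thread, serve every thread that falls in it, and halve the transaction while only one half is used. The names `Transaction` and `coalesce_half_warp` are hypothetical, and addresses are assumed to be aligned to the word size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One memory transaction: start address and size in bytes.
struct Transaction {
    std::uint64_t base;
    std::uint32_t size;
};

// addrs: byte address requested by each thread of the half-warp (up to 16 entries).
// word_bytes: size of the word each thread accesses (1, 2, 4, 8 or 16).
std::vector<Transaction> coalesce_half_warp(const std::vector<std::uint64_t>& addrs,
                                            std::uint32_t word_bytes) {
    const std::uint32_t seg = word_bytes == 1 ? 32 : word_bytes == 2 ? 64 : 128;
    std::vector<bool> done(addrs.size(), false);
    std::vector<Transaction> out;
    for (std::size_t i = 0; i < addrs.size(); ++i) {
        if (done[i]) continue;
        // Segment that contains the address of the lowest-numbered unserved thread.
        std::uint64_t base = addrs[i] / seg * seg;
        std::uint64_t lo = addrs[i];               // used byte range is [lo, hi)
        std::uint64_t hi = addrs[i] + word_bytes;
        for (std::size_t j = i; j < addrs.size(); ++j) {
            if (!done[j] && addrs[j] / seg * seg == base) {
                done[j] = true;  // thread j is served by this transaction
                if (addrs[j] < lo) lo = addrs[j];
                if (addrs[j] + word_bytes > hi) hi = addrs[j] + word_bytes;
            }
        }
        // Halve the transaction while only its lower or upper half is used (minimum 32 B).
        std::uint32_t size = seg;
        while (size > 32) {
            const std::uint32_t half = size / 2;
            if (hi <= base + half) {
                size = half;
            } else if (lo >= base + half) {
                base += half;
                size = half;
            } else {
                break;
            }
        }
        out.push_back({base, size});
    }
    return out;
}
```

Under this model, sixteen threads reading 32-bit words at addresses 116, 120, ..., 176 produce two transactions: 32 B at address 96 and 64 B at address 128. Sixteen threads reading the same address produce a single 32 B transaction.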


Background & Related Work
Multiple-Precision Arithmetic for CUDA
Implementation on GPUs
Experimental Result

Multiple-Precision Addition (GPU GTX 280)
[Figure: add speed (MB/s) versus number of integers (0 to 70000) for four series: Alignment(512), Alignment(1024), Normal(512) and Normal(1024). Labeled data points at 65536 numbers: 2054.699669 MB/s and 2045.259925 MB/s.]

Multiple-Precision Addition: GPU vs CPU
[Figure: add speed (MB/s) versus number of integers (256 to 65536) for four series: GPU Alignment(1024), GPU Alignment(512), CPU 512 bits and CPU 1024 bits.]
CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)
GPU: XFX GTX 280, 1.24 GHz


Result of the Example: GPU vs CPU
[Figure: encoding speed (MB/s) versus matrix width (256 to 524288) for two series: GPU and CPU.]
CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)
GPU: XFX GTX 280, 1.24 GHz

Summary
• 1 Multiple-Precision
• 2 Arithmetic
• 3 GPU Computing & Optimization
• 4 Example & Result

Thank you!