Department of Computer Science
Efficient Execution of Memory Access Phases Using Dataflow Specialization
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam

Memory Access Phase
A dynamic portion of a program where the instruction stream is predominantly memory accesses and address generation.
Examples: aggregation, matrix multiply, image processing…

    for (f=0; f<FSIZE; f+=4) {
      __m128 xmm_in_r = _mm_loadu_ps(in_r+p+f);
      __m128 xmm_in_i = _mm_loadu_ps(in_i+p+f);
      __m128 xmm_mul_r = _mm_mul_ps(xmm_in_r, xmm_coef);
      __m128 xmm_mul_i = _mm_mul_ps(xmm_in_i, xmm_coef);
      accum_r = _mm_add_ps(xmm_accum_r, _mm_sub_ps(xmm_mul_r, xmm_mul_i));
      accum_i = _mm_add_ps(xmm_accum_i, _mm_add_ps(xmm_mul_r, xmm_mul_i));
    }

    for (i=0; i<v_size; ++i) {
      A[K[i]] += V[i];
    }

    for (i=0; i<8; ++i) {
      for (j=0; j<8; ++j) {
        float sum=0;
        for (k=0; k<8; ++k) {
          sum += matAT[i*matAcol+k] * matB[j*matBrow+k];
        }
        matCT[i*matBcol+j] += sum;
      }
    }

    for (int y = 0; y < srcImg.height; ++y)
      for (int x = 0; x < srcImg.width; ++x) {
        p = srcImg.build3x3Window(x, y);
        NPU_SEND(p[0][0]); NPU_SEND(p[0][1]); NPU_SEND(p[0][2]);
        NPU_SEND(p[1][0]); NPU_SEND(p[1][1]); NPU_SEND(p[1][2]);
        NPU_SEND(p[2][0]); NPU_SEND(p[2][1]); NPU_SEND(p[2][2]);
        NPU_RECEIVE(pixel);
        dstImg.setPixel(x, y, pixel);
      }

Execution Model
Natural: read D$, little computation, write D$

    Speedup   In-order   OOO2   OOO4
              1.0        1.5    2.2

With an accelerator: read D$, send to accelerator, write D$

    Speedup   In-order   OOO2   OOO4
    DySER     1.0        1.5    2.7
    SSE       1.0        1.7    2.9
    NPU       1.0        1.6    2.2

Core becomes bottleneck (Power)
[Figure: power chart with labeled values ~2.3 W and ~6.0 W]
Address computation + data access < 40%

Goal: access memory more efficiently, obtaining OOO's performance without its power overheads

Memory Access Dataflow
A specialized dataflow architecture to access memory (processor pipeline turned off)
Big idea: exposing the concept of triggering events & actions
[Figure labels: Processor, Core (off), MAD, Cache, Accelerator]

What does the core do?
– Events: address ready, control variable resolved, value returned from cache
– React: access memory with loads and stores
– Create events: compute patterns compute the address and control variables

MAD ISA Primitives
– Dataflow graph nodes: analogous to compute instructions & register state
– Actions: analogous to ld/st and move instructions
– Events: analogous to program counter sequencing
Arch. primer! Conventional RISC/CISC ISA:
1) Register state
2) Compute instructions
3) LD/ST instructions
4) Program counter and control flow

Transforming ISA
Pseudo program:

    for (i=0; i<n; ++i) {
      a[i] = accel(a[i], b[i])
    }

RISC ISA:

    .L0:
      ld   $r0+$r1 -> $acc0
      ld   $r2+$r1 -> $acc1
      st   $acc2 -> $r0+$r1
      addi $r1, 1 -> $r1
      ble  $r4, $r1, .L0

Slide callouts: Computation, Ports, Named registers, Branch/PC, Data movement
[Figure: dataflow graph with inputs base A, base B, 1, n, and i feeding three + nodes and one < node]
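The pseudo program on this slide can be written out as plain C to make the access pattern concrete. This is only a sketch: the slide does not say what `accel` computes, so the body below is a hypothetical stand-in, and the function name `access_phase` and the `int` element type are likewise assumptions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the accelerator call on the slide; the real
 * computation is not specified there. */
static int accel(int a, int b) { return a + b; }

/* The slide's loop: per iteration the core only forms two addresses,
 * issues two loads and one store, increments i, and tests i < n. */
void access_phase(int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = accel(a[i], b[i]);
    }
}
```

Everything the core itself does here is address generation, data movement, and loop control, which is the work the MAD ISA takes over.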

Transforming ISA
[Figure: the same dataflow graph, now annotated with named event queues]

    # Dataflow-Graph Nodes
    N0: $eq7 + base A -> $eq0, $eq2   #Addr A
    N1: $eq7 + base B -> $eq4         #Addr B
    N2: $eq7 + 1      -> $eq6         #i++
    N3: $eq7 < n      -> $eq8         #i<n
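As a rough software illustration of the four graph nodes, the sketch below computes what each node would push into its output event queues when one value of i arrives on $eq7. The struct, field, and function names are assumptions made for this example. As on the slide, i is added to the base directly, so it is treated as an offset with no element-size scaling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values the four nodes produce for one value arriving on $eq7.
 * Names are illustrative, not taken from the MAD ISA. */
typedef struct {
    uintptr_t eq0;  /* N0: address in A, consumed by the load  */
    uintptr_t eq2;  /* N0: same address, consumed by the store */
    uintptr_t eq4;  /* N1: address in B                        */
    uintptr_t eq6;  /* N2: i + 1                               */
    bool      eq8;  /* N3: i < n                               */
} NodeOutputs;

NodeOutputs fire_nodes(uintptr_t eq7, uintptr_t base_a,
                       uintptr_t base_b, uintptr_t n) {
    NodeOutputs out;
    out.eq0 = eq7 + base_a;  /* N0 */
    out.eq2 = out.eq0;       /* N0 fans out to two queues */
    out.eq4 = eq7 + base_b;  /* N1 */
    out.eq6 = eq7 + 1;       /* N2 */
    out.eq8 = eq7 < n;       /* N3 */
    return out;
}
```

For example, `fire_nodes(3, 100, 200, 8)` yields addresses 103 and 203, the next index 4, and a true loop condition.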

MAD ISA (cont'd)
Static dataflow: computations
Dynamic dataflow: Event-Condition-Action (ECA) rules
on Event if Condition do Action
– Event: a combination of primitive dataflow events (the arrival of data)
– Condition: data states
– Action: load, store, or moves

Transforming ISA
MAD ISA, using named event queues

    # ECA Rules (data movement; EQ states act as conditions)
    On $eq0,       if ,            do A0: ld, $eq0 -> $eq1
    On $eq2∧$eq3,  if ,            do A1: st, $eq3 -> $eq2
    On $eq4,       if ,            do A2: ld, $eq4 -> $eq5
    On $eq8∧$eq6,  if $eq8(true),  do A3: mv, $eq6 -> $eq7, $eq8 ->

    # Dataflow-Graph Nodes (computation)
    N0: $eq7 + base A -> $eq0, $eq2   #Addr A
    N1: $eq7 + base B -> $eq4         #Addr B
    N2: $eq7 + 1      -> $eq6         #i++
    N3: $eq7 < n      -> $eq8         #i<n
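One way to read the ECA rules is as a small table that is matched against queue occupancy. The sketch below encodes the four rules from this slide and checks whether a rule may fire. The encoding (`EcaRule`, `event_mask`, `cond_eq`, `rule_fires`) is an assumption made for illustration, not the actual MAD hardware format. The second movement of A3, whose destination is left blank on the slide, is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_EQ = 9 };  /* $eq0 .. $eq8 as used on the slide */

typedef enum { ACT_LD, ACT_ST, ACT_MV } ActionKind;

/* Illustrative encoding of one "on Event if Condition do Action" rule. */
typedef struct {
    uint16_t   event_mask;  /* bit q set: rule waits for data in $eq<q> */
    int        cond_eq;     /* queue whose head must be true, or -1     */
    ActionKind action;
    int        src, dst;    /* data movement: $eq<src> -> $eq<dst>      */
} EcaRule;

#define EQ(q) ((uint16_t)(1u << (q)))

static const EcaRule rules[4] = {
    { EQ(0),         -1, ACT_LD, 0, 1 },  /* A0: ld, $eq0 -> $eq1 */
    { EQ(2) | EQ(3), -1, ACT_ST, 3, 2 },  /* A1: st, $eq3 -> $eq2 */
    { EQ(4),         -1, ACT_LD, 4, 5 },  /* A2: ld, $eq4 -> $eq5 */
    { EQ(8) | EQ(6),  8, ACT_MV, 6, 7 },  /* A3: mv, $eq6 -> $eq7 */
};

/* ready: bit q set when $eq<q> holds data; head[q]: value at its head.
 * A rule fires when all of its events are present and its condition,
 * if it has one, holds. */
bool rule_fires(const EcaRule *r, uint16_t ready,
                const intptr_t head[NUM_EQ]) {
    if ((ready & r->event_mask) != r->event_mask) return false;
    if (r->cond_eq >= 0 && head[r->cond_eq] == 0) return false;
    return true;
}
```

With data only in $eq0 and $eq2, A0 can fire but A1 cannot, because the store also needs the accelerator's result in $eq3. A3 fires only while the head of $eq8 is true, which is what ends the loop.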

Microarchitecture
– Matching events
– Move data
– Data-driven computation

MAD Execution
[Figure labels: Code Gen., Accelerator, Processor (off), MAD ISA, MAD (Access)]

Evaluation Methodology
Baseline: in-order, OOO2, and OOO4
MAD integration:
– 256 dataflow nodes, 64 event queues
– Integrated into OOO2's/OOO4's LSU
Natural and induced memory access phases
– Accelerators: DySER, SIMD, NPU, C-Cores
Reproduce/reuse benchmarks relevant to each accelerator

Evaluation & Analysis
Performance
– Explicit static & dynamic dataflow, larger instruction window, less speculative
– Can MAD match OOO2/OOO4?
MAD should consume less energy/power

Summary - Performance
MAD's performance is similar to OOO4
– MAD can utilize OOO2's LSU better: MAD+OOO2 > OOO2, and MAD+OOO4 can be better than OOO4
– In DySER programs, there are more opportunities for OOO4 to speculatively execute memory instructions

Summary - Energy
~Half the energy compared to OOO2
– Compared to in-order, OOO2 delivers better performance but does not save energy
~30% energy compared to OOO4

Power: Natural Phases
[Figure: power of OOO2, OOO4, MAD2, MAD4]
MAD < sum(Fetch, Decode, Dispatch, Issue, Execute, WriteBack)
– LSU: more than OOO2, similar to OOO4

Summary
– MAD is a novel and useful customization for memory access phases
– Performance improvement and power reduction
– Flexible & effective for accelerators

Questions