Efficient Execution of Augmented Reality Applications on Mobile Programmable Accelerators
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
December 10, 2013
Augmented Reality
• Physical world + computer-generated inputs
  – Commerce, information, games
• Compared to multimedia applications:
  1. User interactive
  2. Computationally intensive
Application Characteristics
[Chart: execution time (%) breakdown across disparity, localization, stitch, svm, tracking, 3D, AR, average: data-parallel loops, software-pipelinable loops, remaining]
1. 69% of execution time in data-parallel loops (DLP loops)
   => SIMD / Coarse-Grained Reconfigurable Architecture (CGRA)
2. 15% in software-pipelinable loops (SWP loops)
   => CGRA
Example kernels: feature extraction, virtual object rendering, video conferencing with virtual object manipulation
SIMD vs. CGRA
• SIMD
  – Identical lanes
  – Shared instruction fetch (same schedule across PEs)
  – SIMD memory access
[Diagram: 4x4 array of SIMD lanes, PE 0 through PE 15]
SIMD vs. CGRA
• Homogeneous CGRA
  – Identical units
  – Mesh-like interconnects
  – Software pipelining
• Heterogeneous CGRA
  – More energy efficient than homogeneous CGRA
  – Less performance compared to homogeneous CGRA
[Diagram: 4x4 PE array; PEs with all units, with multipliers, with memory units, or without complex units]
SIMD vs. CGRA
[Charts: normalized execution time and normalized energy for SIMD, homogeneous CGRA, and heterogeneous CGRA over SWP loops, DLP loops, and total]
1. In DLP loops, SIMD outperforms CGRA.
2. In total execution time and energy, CGRA beats SIMD (due to SWP loops).
3. In energy, heterogeneous CGRA beats homogeneous CGRA (20% less energy with only 4% performance loss).
Adding SIMD Support for CGRA
• Heterogeneous CGRA
  – Group multiple PEs to form identical SIMD cores
[Diagram: 4x4 PE array partitioned into SIMD cores]
1. How do we obtain the efficiency of single instruction fetch?
2. How do we achieve the efficiency of SIMD memory access?
Efficient Instruction Fetch
• Fetch each instruction only once from memory
  – Pass the instruction around a ring to the next SIMD core
  – The last SIMD core stores the instruction in a recycle buffer
[Diagram: SIMD cores 0-3 connected in a ring, executing iterations 0-4]
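As a rough sketch (not from the talk), the fetch-once scheme can be modeled in a few lines: one memory fetch per instruction, copies circulating the ring, and a 16-entry recycle buffer serving later iterations. The class and field names are illustrative assumptions.

```python
from collections import deque

RECYCLE_DEPTH = 16  # 16-entry recycle buffer, per the experimental setup

class RingFetch:
    """Toy model: fetch each instruction from memory once; reuse it
    from the recycle buffer on subsequent loop iterations."""
    def __init__(self):
        self.recycle = deque(maxlen=RECYCLE_DEPTH)
        self.mem_fetches = 0

    def fetch(self, pc, imem):
        # Reuse a recycled instruction when one is available.
        for cached_pc, inst in self.recycle:
            if cached_pc == pc:
                return inst
        # Otherwise fetch once from instruction memory; in hardware the
        # copy is passed around the ring (core 0 -> 1 -> 2 -> 3) and the
        # last core deposits it in the recycle buffer.
        self.mem_fetches += 1
        inst = imem[pc]
        self.recycle.append((pc, inst))
        return inst

imem = {0: "add", 1: "mul", 2: "ld"}
ring = RingFetch()
for _ in range(8):          # 8 iterations of a 3-instruction loop body
    for pc in range(3):
        ring.fetch(pc, imem)
print(ring.mem_fetches)     # 3: each instruction hits memory exactly once
```

The point of the model: memory traffic scales with loop-body size, not with iteration count times core count.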
SIMD Memory Access
• Single memory request, multiple responses
  – Split transaction
    • Enables forwarding (request ID)
  – SIMD mode flag, stride information
[Diagram: memory units 0-3 connected to banks 0-3]
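A minimal sketch of the idea, assuming a simple flat memory: one request carrying base address, stride, and lane count expands into one response per lane, instead of each lane issuing its own request. Names and signature are illustrative.

```python
def simd_load(memory, base, stride, num_lanes):
    """Toy single-request / multiple-response load: the request header
    carries (base, stride, SIMD flag), and the memory generates one
    response per lane from that single transaction."""
    return [memory[base + lane * stride] for lane in range(num_lanes)]

memory = list(range(100))   # stand-in for a scratchpad bank
print(simd_load(memory, base=8, stride=4, num_lanes=4))  # [8, 12, 16, 20]
```

In a split-transaction bus, the request ID lets the responses return out of order and still be forwarded to the correct lanes; this sketch only models the address expansion.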
Experimental Setup
• Baseline
  – Heterogeneous CGRA with 16 PEs
    • 4 PEs with memory units, 4 PEs with multipliers
• Our solution
  – Baseline + SIMD support
    • 1-cycle-latency ring network, 16-entry recycle buffer
• Compiler
  – IMPACT frontend compiler
  – Edge-centric modulo scheduler
  – ADRES framework
• Power
  – 65 nm technology @ 200 MHz / 1 V
  – CACTI
Evaluation for DLP Loops
• ILP within the loops (baseline) vs. 16-core SIMD vs. 4 SIMD cores (our solution)
[Charts: normalized execution time and normalized energy for baseline, SIMD, and proposed across disparity, localization, stitch, svm, tracking, 3D, AR, average]
- Our solution is 14.1% slower than SIMD.
- Our solution achieves nearly the same energy efficiency as SIMD.
Evaluation for Total Execution
[Charts: normalized execution time and normalized energy for baseline, SIMD, and proposed across disparity, localization, stitch, svm, tracking, 3D, AR, average]
Our solution achieves a 17.6% speedup with 16.9% less energy compared to the baseline heterogeneous CGRA.
Conclusion
• Best performing / most energy-efficient solution
  – DLP loops: SIMD
  – Whole application: CGRA
• Two techniques to implement SIMD support efficiently on CGRAs
  – Efficient instruction fetch: ring network + recycle buffer
  – SIMD memory access: split transaction + stride information in the header
  – Results in a 3.4% power saving
• A CGRA with SIMD support improves overall performance by 17.6% with 16.9% less energy.
Questions?
For more information: http://cccp.eecs.umich.edu
jasonjk@umich.edu
CGRA Memory Access
• Resolve bank conflicts through buffering
  – Compiler accounts for the additional buffering delay
[Diagram: memory units 0-3 connected to banks 0-3]
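A toy model of conflict buffering (my sketch, not the talk's implementation): requests to the same bank are queued and served one per cycle, so conflicts show up as a fixed, compiler-visible extra delay.

```python
from collections import defaultdict

NUM_BANKS = 4  # matches the 4-bank diagram on this slide

def service_cycles(addresses):
    """Assign each request the cycle in which its bank can serve it:
    conflict-free requests complete in cycle 0; requests that collide
    on a bank are buffered and serialized, one per cycle."""
    next_free = defaultdict(int)        # next free cycle per bank
    done = []
    for addr in addresses:
        bank = addr % NUM_BANKS         # simple low-order interleaving
        done.append(next_free[bank])
        next_free[bank] += 1
    return done

print(service_cycles([0, 1, 2, 3]))     # [0, 0, 0, 0]  no conflicts
print(service_cycles([0, 4, 1, 2]))     # [0, 1, 0, 0]  0 and 4 share bank 0
```

Because the worst-case delay is bounded by the queue depth, the compiler can fold it into the schedule rather than stalling dynamically.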
Compilation Flow
• Program → Loop Classification (DLP / SWP)
• ILP matching: low-ILP loops → acyclic scheduling; high-ILP loops → modulo scheduling
• Code Generation → Executable
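One plausible reading of this flow, sketched as a dispatch function. The ILP threshold of 4 is an illustrative assumption, not a number from the talk, and the function name is hypothetical.

```python
def choose_scheduler(is_dlp, is_swp, ilp):
    """Toy version of the compilation flow: pick a mapping strategy per
    loop class. Threshold (ilp >= 4) is an illustrative assumption."""
    if is_dlp:
        # DLP loops are matched to the SIMD cores formed from PE groups.
        return "SIMD mapping"
    if is_swp or ilp >= 4:
        # SWP and high-ILP loops use the edge-centric modulo scheduler.
        return "modulo scheduling"
    # Remaining low-ILP loops fall back to acyclic scheduling.
    return "acyclic scheduling"

print(choose_scheduler(True, False, 1))   # SIMD mapping
print(choose_scheduler(False, True, 1))   # modulo scheduling
print(choose_scheduler(False, False, 1))  # acyclic scheduling
```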
Power Analysis
[Chart: normalized power breakdown (FU, RF, control, memory) for CGRA mode (baseline) vs. SIMD mode]
(-) Savings from memory
(+) Overheads from ring network, recycle buffer, and SIMD memory access
- SIMD mode further saves power by 3.4%.
Resource Utilization in DLP Loops
[Chart: resource utilization in CGRA mode vs. SIMD mode across disparity, localization, stitch, svm, tracking, 3D, AR, average]
- SIMD mode can utilize 13.6% more resources in DLP loops.
- The compiler generates a more efficient schedule with fewer resources (less routing, less exploration).
[Backup diagram: 4x4 PE array, PE 0 through PE 15]