Design and Modeling of Specialized Architectures PHD DISSERTATION
- Slides: 53
Design and Modeling of Specialized Architectures
PhD Dissertation Defense
Yakun Sophia Shao
May 9th, 2016
Harvard University
Moore’s Law
CMOS Scaling is Slowing Down
180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 14 nm, 10 nm
http://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake
CMOS Technology Scaling: Technological Fallow Period
Potential for Specialized Architectures
[Figure labels: 16 Encryption; 17 Hearing Aid; 18 FIR for disk read; 19 MPEG Encoder; 20 802.11 Baseband. Source: Zhang and Brodersen]
Cores, GPUs, and Accelerators: Apple A8 SoC
Out-of-Core Accelerators
[Figure labels: Maltiel Consulting estimates; our estimates]
Challenges in Accelerators
§ Flexibility
– Fixed-function accelerators are designed only for their target applications.
§ Programmability
– Today’s accelerators are explicitly managed by programmers.
Today’s SoC: OMAP 4 SoC
[Figure labels: ARM cores, audio DSP, video DSP, face, imaging, GPU, DMA, USB, SD; system bus, secondary bus, tertiary bus]
Challenges in Accelerators
§ Flexibility
– Fixed-function accelerators are designed only for their target applications.
§ Programmability
– Today’s accelerators are explicitly managed by programmers.
§ Design Cost
– Accelerator (and RTL) implementation is inherently tedious and time-consuming.
Today’s SoC
[Figure labels: CPU, GPU/DSP, buses, memory interface, accelerators (Acc)]
Future Accelerator-Centric Architectures
[Figure labels: big cores, small cores, GPU/DSP, shared resources, memory interface, sea of fine-grained accelerators]
§ Flexibility: How to decompose applications into accelerators?
§ Design Cost: How to rapidly design lots of accelerators?
§ Programmability: How to design and manage the shared resources?
Contributions
[Figure labels: big cores, small cores, GPU/DSP, shared resources, memory interface, sea of fine-grained accelerators]
§ WIICA: Accelerator Workload Characterization [ISPASS’13]
§ MachSuite: Accelerator Benchmark Suite [IISWC’14]
§ Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA’14, TopPicks’15]
§ Accelerator Design w/ High-Level Synthesis [ISLPED’13_1]
§ Research Infrastructures for Hardware Accelerators [Synthesis Lecture’15]
§ Accelerator-System Co-Design [Under Review]
§ Instruction-Level Energy Model for Xeon Phi [ISLPED’13_2]
Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
§ Inputs: unmodified C code; accelerator design parameters (e.g., # FU, mem. BW)
§ Outputs: power/area; performance
§ “Accelerator Simulator”: design accelerator-rich SoC fabrics and memory systems
§ “Design Assistant”: understand the algorithmic-HW design space before RTL
[Figure labels: accelerator-specific datapath; private L1/scratchpad; shared memory/interconnect models; Flexibility, Programmability, Design Cost]
Future Accelerator-Centric Architecture
[Figure labels: big cores, small cores, GPU/DSP, shared resources, memory interface, sea of fine-grained accelerators]
Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.
Aladdin Overview
§ Inputs: C code; accelerator design parameters
§ Optimization Phase: optimistic IR, initial DDDG (Dynamic Data Dependence Graph), idealistic DDDG
§ Realization Phase: program-constrained DDDG, resource-constrained DDDG, activity, power/area models
§ Outputs: performance; power/area
From C to Design Space
C Code:
    for(i=0; i<N; ++i)
        c[i] = a[i] + b[i];
IR Dynamic Trace:
    0. r0=0 //i = 0
    1. r4=load(r0 + r1) //load a[i]
    2. r5=load(r0 + r2) //load b[i]
    3. r6=r4 + r5
    4. store(r0 + r3, r6) //store c[i]
    5. r0=r0 + 1 //++i
    6. r4=load(r0 + r1) //load a[i]
    7. r5=load(r0 + r2) //load b[i]
    8. r6=r4 + r5
    9. store(r0 + r3, r6) //store c[i]
    10. r0=r0 + 1 //++i
    …
Initial DDDG
[Figure: initial DDDG over trace nodes 0 to 14 (i=0, i++, ld a, ld b, +, st c)]
Idealistic DDDG
[Figure: idealistic DDDG over the same trace nodes]
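The step from trace to DDDG can be sketched in C. This is a minimal illustration, not Aladdin's implementation: the names `TraceOp`, `vadd_trace`, and `dddg_depth` are hypothetical, only read-after-write register dependences are tracked, memory dependences are ignored, and every operation is assumed to take one time step. Setting `ignore_index` drops every edge out of the index register r0, a simplification standing in for the removal of loop-index dependences that separates the idealistic DDDG from the initial one.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS  8      /* r0..r7 covers the vector-add trace */
#define NO_REG    (-1)
#define NO_NODE   (-1)
#define INDEX_REG 0      /* r0 holds the loop index i in the slide's trace */

typedef struct {
    int dst;     /* register written, or NO_REG (a store writes memory only) */
    int src[3];  /* registers read, NO_REG when unused */
} TraceOp;

/* Emit the slide's trace for `iters` iterations of c[i] = a[i] + b[i].
 * Returns the number of operations written (1 + 5 * iters). */
size_t vadd_trace(TraceOp *out, int iters) {
    size_t n = 0;
    out[n++] = (TraceOp){ 0, { NO_REG, NO_REG, NO_REG } };  /* r0 = 0 */
    for (int k = 0; k < iters; ++k) {
        out[n++] = (TraceOp){ 4, { 0, 1, NO_REG } };         /* r4 = load(r0 + r1) */
        out[n++] = (TraceOp){ 5, { 0, 2, NO_REG } };         /* r5 = load(r0 + r2) */
        out[n++] = (TraceOp){ 6, { 4, 5, NO_REG } };         /* r6 = r4 + r5 */
        out[n++] = (TraceOp){ NO_REG, { 0, 3, 6 } };         /* store(r0 + r3, r6) */
        out[n++] = (TraceOp){ 0, { 0, NO_REG, NO_REG } };    /* r0 = r0 + 1 */
    }
    return n;
}

/* Length, in nodes, of the longest chain of read-after-write register
 * dependences. depth[i] receives the earliest level node i can occupy.
 * If ignore_index is nonzero, reads of INDEX_REG create no edge. */
int dddg_depth(const TraceOp *trace, size_t n, int *depth, int ignore_index) {
    int last_writer[NUM_REGS];
    for (int r = 0; r < NUM_REGS; ++r) last_writer[r] = NO_NODE;

    int longest = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = 1;
        for (int s = 0; s < 3; ++s) {
            int r = trace[i].src[s];
            if (r == NO_REG || (ignore_index && r == INDEX_REG)) continue;
            assert(r >= 0 && r < NUM_REGS);
            int w = last_writer[r];   /* NO_NODE means a live-in value */
            if (w != NO_NODE && depth[w] + 1 > d) d = depth[w] + 1;
        }
        depth[i] = d;
        if (trace[i].dst != NO_REG) {
            assert(trace[i].dst >= 0 && trace[i].dst < NUM_REGS);
            last_writer[trace[i].dst] = (int)i;
        }
        if (d > longest) longest = d;
    }
    return longest;
}
```

With the index edges kept, the critical path grows with the iteration count because each iteration waits on the previous `++i`. With them dropped, every iteration sits at the same depth, which is the parallelism the idealistic DDDG exposes.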
From C to Design Space
Optimization Phase: C->IR->DDDG
§ Include application-specific customization strategies.
§ Node-Level:
– Bit-width Analysis
– Strength Reduction
– Tree-height Reduction
§ Loop-Level:
– Remove dependences between loop index variables
§ Memory Optimization:
– Memory-to-Register Conversion
– Store-Load Forwarding
– Store Buffer
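As one concrete instance of the node-level items, tree-height reduction can be illustrated with a sum of n operands. The sketch below is an assumption-laden illustration rather than Aladdin's pass: the function names are hypothetical, and integer addition is used because reassociation is exact for it (floating-point addition would not be reassociated so freely).

```c
#include <assert.h>

/* Dependent additions on the critical path when n operands are summed
 * left to right: ((x0 + x1) + x2) + ... */
unsigned chain_height(unsigned n) {
    assert(n >= 1);
    return n - 1;
}

/* Critical path after rebalancing the same sum into a binary tree:
 * ceil(log2(n)) levels of additions. */
unsigned balanced_height(unsigned n) {
    assert(n >= 1);
    unsigned h = 0;
    while (n > 1) {
        n = (n + 1) / 2;   /* each level halves the number of partial sums */
        ++h;
    }
    return h;
}

/* Sum evaluated in tree order; the two halves carry no dependence on each
 * other, so hardware could compute them in parallel. */
int sum_tree(const int *x, unsigned n) {
    assert(n >= 1);
    if (n == 1) return x[0];
    unsigned half = n / 2;
    return sum_tree(x, half) + sum_tree(x + half, n - half);
}
```

For eight operands the chain needs seven dependent additions while the balanced tree needs three levels, and both orders produce the same integer result.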
From C to Design Space: One Design
Acc Design Parameters:
✓ Memory BW <= 2
✓ 1 Adder
[Figure: idealistic DDDG (trace nodes 0 to 19) scheduled cycle by cycle; resource activity shows MEM and + usage per cycle]
From C to Design Space: Another Design
Acc Design Parameters:
✓ Memory BW <= 4
✓ 2 Adders
[Figure: idealistic DDDG (trace nodes 0 to 19) scheduled cycle by cycle; resource activity shows MEM and + usage per cycle]
From C to Design Space
Realization Phase: DDDG->Estimates
§ Constrain the DDDG with program and user-defined resource constraints
§ Program Constraints
– Control Dependence
– Memory Ambiguation
§ Resource Constraints
– Loop-level Parallelism
– Loop Pipelining
– Memory Ports
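The effect of resource constraints on cycle count can be sketched with a small greedy scheduler for the vector-add example. This is a simplified illustration under stated assumptions, not Aladdin's scheduler: `schedule_vadd` is a hypothetical name, iterations are treated as independent (as in the idealistic DDDG), index updates are assumed free, every operation takes one cycle, and older operations are issued first.

```c
#include <assert.h>

#define MAX_ITERS    64
#define OPS_PER_ITER 4   /* ld a, ld b, +, st c */

/* Cycles needed to issue every operation of `iters` independent iterations
 * of c[i] = a[i] + b[i], with at most `mem_bw` memory operations and
 * `adders` additions issued per cycle. */
int schedule_vadd(int iters, int mem_bw, int adders) {
    assert(iters > 0 && iters <= MAX_ITERS);
    assert(mem_bw > 0 && adders > 0);

    int issued[OPS_PER_ITER * MAX_ITERS];  /* issue cycle, or -1 if pending */
    int total = OPS_PER_ITER * iters;
    for (int v = 0; v < total; ++v) issued[v] = -1;

    int remaining = total;
    int cycle = 0;
    while (remaining > 0) {
        int mem_left = mem_bw;
        int add_left = adders;
        for (int v = 0; v < total; ++v) {
            if (issued[v] >= 0) continue;
            int slot = v % OPS_PER_ITER;  /* 0: ld a, 1: ld b, 2: +, 3: st c */
            int ready = 1;
            if (slot == 2) {              /* + waits for both loads */
                ready = issued[v - 2] >= 0 && issued[v - 2] < cycle &&
                        issued[v - 1] >= 0 && issued[v - 1] < cycle;
            } else if (slot == 3) {       /* st c waits for the + */
                ready = issued[v - 1] >= 0 && issued[v - 1] < cycle;
            }
            if (!ready) continue;
            int *budget = (slot == 2) ? &add_left : &mem_left;
            if (*budget == 0) continue;   /* resource exhausted this cycle */
            --*budget;
            issued[v] = cycle;
            --remaining;
        }
        ++cycle;
    }
    return cycle;
}
```

Calling `schedule_vadd(4, 2, 1)` and `schedule_vadd(4, 4, 2)` corresponds to the two parameter sets on the slides; the wider design finishes in fewer cycles at the cost of more hardware. Sweeping both parameters yields one cycle count per design point.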
From C to Design Space: Power-Performance per Design
[Figure: power vs. cycle, with labeled designs "Memory BW <= 4, 2 Adders" and "Memory BW <= 2, 1 Adder"]
From C to Design Space of an Algorithm
[Figure: power vs. cycle]
Aladdin Validation
§ Aladdin path: C Code -> Aladdin -> Power/Area, Performance
§ RTL path: RTL Designer, or HLS C Tuning with Vivado HLS -> Verilog -> ModelSim (Activity), Design Compiler -> Power/Area, Performance
Algorithm-to-Solution Time
§ Programming Effort: Hand-Coded RTL: High; C-to-RTL: Medium; Aladdin: N/A
§ RTL Generation: Hand-Coded RTL: Designer Dependent; C-to-RTL: 37 mins
§ RTL Simulation: 5 mins
§ RTL Synthesis: 45 mins
§ Time to Solution per Design: RTL flow: 87 mins; Aladdin: 1 min
§ Time to Solution (36 Designs): RTL flow: 52 hours; Aladdin: 7 min
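The totals on this slide follow from the per-step times. A short worked check, assuming the three C-to-RTL steps run back to back and all 36 designs are processed sequentially (the slide does not say either explicitly):

```latex
\begin{align*}
t_{\text{design}} &= 37 + 5 + 45 = 87 \ \text{min} \\
t_{36} &= 36 \times 87 = 3132 \ \text{min} \approx 52.2 \ \text{h}
\end{align*}
```

This matches the reported 52 hours. Taking the reported 7 minutes for Aladdin at face value, the ratio is 3132 / 7, or roughly 450x, for the same 36 designs.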
Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
§ Big cores, small cores: gem5
§ Shared resources: Cacti/Orion2
§ GPU: GPGPU-Sim
§ Memory interface: DRAMSim2
[Figure labels: sea of fine-grained accelerators]
Accelerator Integration
[Figure labels: CPU 0 and CPU 1 with L1 $; L2 $; system bus; MC; DRAM; ACC with Lane 0 to Lane 4; MEM with SPAD interface, ARR 0, ARR 1, BUF 0, BUF 1; DMA with channel selection (CHAN 0, CHAN 3) and transfer descriptors (SRC ADDR, DEST ADDR, LENGTH)]
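The transfer-descriptor fields named in the figure (SRC ADDR, DEST ADDR, LENGTH) suggest a data structure along the following lines. This is a software sketch under assumptions, not the modeled hardware: the type and function names are hypothetical, four channels are assumed from the CHAN 0 and CHAN 3 labels, the `valid` flag is an addition for bookkeeping, and a transfer is modeled as a plain memory copy with no timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_CHANNELS 4   /* assumption: CHAN 0 through CHAN 3 */

typedef struct {
    const uint8_t *src_addr;   /* SRC ADDR */
    uint8_t       *dest_addr;  /* DEST ADDR */
    size_t         length;     /* LENGTH, in bytes */
    int            valid;      /* nonzero while a transfer is pending */
} TransferDescriptor;

typedef struct {
    TransferDescriptor chan[NUM_CHANNELS];
} DmaEngine;

/* Mark every channel idle. */
void dma_init(DmaEngine *dma) {
    assert(dma != NULL);
    for (int c = 0; c < NUM_CHANNELS; ++c) {
        dma->chan[c] = (TransferDescriptor){ NULL, NULL, 0, 0 };
    }
}

/* Write a descriptor into a channel.
 * Returns 0 on success, -1 if the channel is out of range or still busy. */
int dma_program(DmaEngine *dma, int channel,
                const void *src, void *dest, size_t length) {
    assert(dma != NULL);
    if (channel < 0 || channel >= NUM_CHANNELS) return -1;
    if (dma->chan[channel].valid) return -1;
    dma->chan[channel] = (TransferDescriptor){ src, dest, length, 1 };
    return 0;
}

/* Service every pending channel in order; returns the bytes moved. */
size_t dma_run(DmaEngine *dma) {
    assert(dma != NULL);
    size_t moved = 0;
    for (int c = 0; c < NUM_CHANNELS; ++c) {
        TransferDescriptor *d = &dma->chan[c];
        if (!d->valid) continue;
        memmove(d->dest_addr, d->src_addr, d->length);
        moved += d->length;
        d->valid = 0;
    }
    return moved;
}
```

In this sketch the CPU side would call `dma_program` to fill a scratchpad array such as ARR 0 from DRAM, then `dma_run` stands in for the engine draining its channels before the accelerator lanes start.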
Compute is only a part of the story
Accelerator-System Co-Design
Accelerator Integration
[Figure labels: CPU 0 and CPU 1 with L1 $; L2 $; system bus; MC; DRAM; ACC with Lane 0 to Lane 4 and MEM with SPAD interface, ARR 0, ARR 1, BUF 0, BUF 1; DMA with channel selection (CHAN 0, CHAN 3) and transfer descriptors (SRC ADDR, DEST ADDR, LENGTH); ACC with Lane 0 to Lane 3 and MEM with cache interface, TLB, cache]
gem5-Aladdin: An SoC Simulator
[Figure: the same SoC diagram]
gem5-Aladdin Validation
§ gem5-Aladdin: flush latency, DMA latency, accelerator execution latency
§ Xilinx Zynq SoC: application on ARM core; DMA IP block; kernel through Vivado HLS to Verilog on FPGA
To DMA or To Cache?
§ Accelerator local memory
DMA or Cache
Conclusions
§ Architectures with 1000s of accelerators will be radically different; new design tools are needed.
§ We built Aladdin, an architectural-level power, performance, and area simulator for accelerators.
§ We integrated Aladdin with gem5 to model the interactions between accelerators and the rest of the SoC.
§ These accelerator infrastructures open up opportunities for innovation in heterogeneous architecture design.
Publications
1. Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “An Holistic Approach to Accelerator-System Co-Design,” Under Review.
2. Y. S. Shao and D. Brooks, “Research Infrastructures for Hardware Accelerators,” Synthesis Lectures on Computer Architecture, Nov 2015.
3. Y. S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “The Aladdin Approach to Accelerator Design and Modeling,” IEEE Micro TopPicks, May-June 2015.
4. Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Toward Cache-Friendly Hardware Accelerators,” SCAW’15.
5. B. Reagen, B. Adolf, Y. S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC’14.
6. Y. S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA’14.
7. B. Reagen, Y. S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED’13.
8. Y. S. Shao and D. Brooks, “Energy Characterization and Instruction-Level Energy Model of Intel’s Xeon Phi Processor,” ISLPED’13.
9. Y. S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS’13.
Acknowledgement
Thanks!