General Purpose Processors as Processor Arrays Peter Cappello UC, Santa Barbara
VLSI Design Forces in 1986
“Nature, to be commanded, must be obeyed.” – Sir Francis Bacon
• High performance → parallelism
• Power is scarce → limit resistive delay → limit long communication
• Area is scarce → limit wire crossing
• $$ are scarce: design is expensive → reuse components
• In 2D systolic arrays, clock skew is an issue → wavefront arrays
– Islands of synchrony in an ocean of asynchrony
Processor Array Properties
1. Have multiple processors
2. Neighbors abut (no long wires)
3. Only neighbors communicate directly
4. Have a constant # of processor types
5. Scale: larger problems → larger arrays
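As an illustration of these five properties, here is a minimal Python sketch that simulates a 2D systolic array multiplying two n×n matrices. The slides do not prescribe this algorithm. The function name `systolic_matmul` and the output-stationary design, in which each processor keeps one entry of the result, are assumptions made for the example. Every processor is of the same type, reads only from its left and upper neighbors, and the array grows with the problem size.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array computing A @ B.

    Hypothetical sketch: PE (i, j) accumulates C[i][j]. Values of A flow
    left-to-right, values of B flow top-to-bottom, one hop per time step.
    """
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # value each PE last passed rightward
    b_reg = [[0] * n for _ in range(n)]  # value each PE last passed downward
    c = [[0] * n for _ in range(n)]      # per-PE accumulator

    for t in range(3 * n - 2):           # last useful step is t = 3n - 3
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    # Boundary PE: row i of A enters skewed by i steps.
                    k = t - i
                    a_in = A[i][k] if 0 <= k < n else 0
                else:
                    a_in = a_reg[i][j - 1]   # from left neighbor only
                if i == 0:
                    # Boundary PE: column j of B enters skewed by j steps.
                    k = t - j
                    b_in = B[k][j] if 0 <= k < n else 0
                else:
                    b_in = b_reg[i - 1][j]   # from upper neighbor only
                c[i][j] += a_in * b_in
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return c


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The input skew makes the operands A[i][k] and B[k][j] arrive at processor (i, j) in the same step, so no processor ever needs a wire longer than one hop.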
No 3D PA Has Properties 1-5
Enclose the 3D PA in a minimal sphere of radius r, and scale the PA in all 3 dimensions.
1. Power consumption = Θ(r^3).
2. Heat dissipation via surface = Θ(r^2).
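The step that the two growth rates imply can be written out as follows. This is a sketch under two stated assumptions that the slides leave implicit: processors fill the volume at constant density with constant power each, and each unit of surface area can shed only a bounded amount of heat. The symbols P and A are introduced here for illustration.

```latex
\begin{align*}
P(r) &= \Theta(r^3) && \text{(power consumed inside the sphere)} \\
A(r) &= \Theta(r^2) && \text{(surface area available to dissipate heat)} \\
\frac{P(r)}{A(r)} &= \Theta(r) \to \infty \quad \text{as } r \to \infty
\end{align*}
```

The heat that each unit of surface must carry grows without bound, so a 3D array cannot keep scaling with problem size, and property 5 fails.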
VLSI Design Forces in 2006
“Nature, to be commanded, must be obeyed.” – Sir Francis Bacon
• Power is scarce → limit clock frequency → parallelism
• Power is scarce → limit resistive delay → limit long communication
Trends in GPP in 2006 • Chip multiprocessors (CMP) • Vector IRAM • Cell • TRIPS • RAW
Trends in GPP in 2006 Chip Multiprocessors (CMP) – Parallel processors – Crossbar
Trends in GPP in 2006 Vector IRAM – Vector Intelligent RAM • For mobile multimedia devices Stream data processing • Combine GPP and DSP – Parallel – linear array – Crossbar
Trends in GPP in 2006
Cell processor
“The Department of Energy said Wednesday that it had awarded I.B.M. a contract to build a supercomputer capable of 1,000 trillion calculations a second, using an array of 16,000 Cell processor chips that I.B.M. designed for the coming PlayStation 3 video game machine.” Sept. 7, 2006. NY Times.
• Parallel processors
– BIU – Bus interface unit
– RMT – Replacement management table
– SL1 – 1st-level cache
– PPE – PowerPC Element
– SPE – Synergistic Processor Element
– Interconnect bus
Trends in GPP in 2006
• TRIPS: Tera-op, Reliable, Intelligently adaptive Processing System
The following slides are taken from a talk: “The Design and Implementation of the TRIPS Prototype Chip,” Hot Chips 17, Palo Alto, CA, August 2005.
• E – execution tile • R – register bank • D – 8 KB data cache • I – instruction cache • G – global control
• Instructions execute as a data flow graph – An instruction’s output is another instruction’s input. – Minimize use of register/cache for intermediate values • Register reads/writes access the register banks • Loads/stores access the data cache banks
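The dataflow firing rule described above can be sketched in a few lines of Python. This is an illustration only, not the TRIPS implementation. The function `run_dataflow` and its instruction encoding are hypothetical. An instruction fires as soon as all of its operands have arrived, and its result is delivered straight to the instructions that consume it.

```python
from collections import deque
import operator


def run_dataflow(instrs, inputs):
    """Execute a dataflow graph (hypothetical encoding, for illustration).

    instrs: dict mapping an instruction name to (op, [source names]);
            every instruction is assumed to have at least one source.
    inputs: dict mapping an input name to its value.
    Returns (values, firing_order).
    """
    consumers = {}   # producer name -> list of consumer instruction names
    pending = {}     # instruction name -> operands still missing
    for name, (_, srcs) in instrs.items():
        pending[name] = len(srcs)
        for s in srcs:
            consumers.setdefault(s, []).append(name)

    values = dict(inputs)
    ready = deque()

    def deliver(src):
        # Forward a produced value directly to each consumer's operand slot.
        for c in consumers.get(src, []):
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)

    for name in inputs:
        deliver(name)

    order = []
    while ready:
        name = ready.popleft()
        op, srcs = instrs[name]
        values[name] = op(*(values[s] for s in srcs))
        order.append(name)
        deliver(name)
    return values, order


if __name__ == "__main__":
    # (a + b) * (a - b)
    program = {
        "sum": (operator.add, ["a", "b"]),
        "diff": (operator.sub, ["a", "b"]),
        "prod": (operator.mul, ["sum", "diff"]),
    }
    values, order = run_dataflow(program, {"a": 7, "b": 3})
    print(values["prod"], order)
```

In this example `sum` and `diff` have no dependence on each other, so a real array could run them in parallel on different execution tiles; the sketch simply fires them one after the other.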
Trends in GPP in 2006
RAW (MIT)
The following slides are taken from a RAW talk: “Evaluating the Raw Microprocessor: Scalability and Versatility,” presented at the International Symposium on Computer Architecture, June 21, 2004.
Replace the crossbar with a point-to-point, pipelined, routed network.
[Figure: ALUs and a register file (RF)]
Distribute the Register File
[Figure: each ALU paired with its own register file (RF)]
Distribute the rest. [ISCA 99]
[Figure: wide fetch (16 inst), control, and unified load/store queue distributed so that each ALU has its own RF, PC, I$, and D$]
Tiles!
[Figure: array of identical tiles, each with PC, I$, D$, RF, and ALU]
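To make the idea of a point-to-point routed network between tiles concrete, the following Python sketch routes a message across a 2D mesh one neighbor hop at a time. It assumes dimension-ordered (X then Y) routing, which these slides do not specify. The names `xy_route` and `mesh_diameter` are hypothetical.

```python
def xy_route(src, dst):
    """Return the list of tiles visited from src to dst on a 2D mesh.

    Assumes dimension-ordered routing: travel along X first, then along Y.
    Each step moves to an adjacent tile, so no wire is longer than one tile.
    """
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def mesh_diameter(side):
    """Worst-case hop count between two tiles of a side x side mesh."""
    return 2 * (side - 1)


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route, "hops:", len(route) - 1)
    print("diameter of a 4 x 4 mesh:", mesh_diameter(4))
```

The hop count equals the Manhattan distance between the two tiles, so latency grows with distance instead of every transfer paying the cost of a chip-wide crossbar.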
Conclusions
• VLSI-scalable microprocessors are possible. Constant factors are beginning to give way to asymptotics:
– 16 ALU Raw – Oct 2002
– 64 ALU Raw – Now
– 1,024 ALU Raw – 2010
– 32,768 ALU Raw – If Moore’s Law makes it to 2 nm
• There is an opportunity to make processors more “versatile”, i.e., steal applications from custom chips.
• Tiled Processor Architectures are a promising approach and merit further research.
GPP Predictions: In 10 Years
• Encapsulate registers/cache/processors into an array.
• Partition off-chip memory:
– Encapsulate memory & processor.
– Safely increase parallel access (concurrent programming).
• For non-recursive applications GPP (mobile multimedia):
– no bus; quasi-nearest neighbor networks.
• For recursive applications GPP (gaming, control):
– replace bus w/ lean on-chip short-diameter communication network.
– 1 network-on-chip routes register/cache/instruction/control.
– Need >= 1K processors/chip to justify network-on-chip.
Predictions
• Increasing complexity of:
– Applications
– Technology
→ Increasing specialization of labor
• Complexity is increasing at an accelerating rate
→ Increasing adaptability is important!
Yet another taxonomy!
[Figure: design space with axes architectural specificity (general to specific), reconfigurability (static to dynamic), and communication latency; labeled points: GPP, CCM, ASIC prototype, ASIC, DP, TP]
DP Communication Topology
[Figure: FPGAs with cores (FFT, RISC); EDGE ISA (2D VLIW)]
• High throughput (iterative) communication topology
TP Communication Topology
[Figure: FPGAs with cores (RAM, RISC); EDGE ISA (2D VLIW)]
• Low latency (recursive) communication topology
DISCIPLINE CS, DE DE CS, DE CS CS, CE CE, EE PROCESS Language design General Purpose Language Domain Specific Language Application program Compute model design Computational Model Compiling Processor architecting Compute Substrate Communicate Substrate Configurable Hardware Static Hardware Processor layout CE, EE FPGA/Circuit design EE Circuit layout EE, ME Fabrication process Fabrication Technology
Conclusion • Last 20 years witnessed dramatic advances • Next 20 years will witness even more dramatic advances.
Spare slides follow
Recursive Computation via a Tree of Meshes Network?
Quasi-Scalable
[Figure labels: RF, D$, global address, local]
Interleave Memory & Processor Tiles • Slightly more chips • Compiler localizes memory accesses • EDGE ISA deals with variable access times (TRIPS).
Cell architecture
Specialization of Labor
[Figure: layers from high-level / domain-specific language, through a computational model that exposes communication topology, ISA and network, and FPGA, down to fabrication; roles shown: application programmer, compiler, computer architect, computer engineer, electrical & computer engineer]