NC STATE UNIVERSITY
FabScalar
Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Sandeep S. Navada, Hashem H. Najaf-abadi, Eric Rotenberg
Center for Efficient, Scalable and Reliable Computing (CESR)
Department of Electrical and Computer Engineering
North Carolina State University
Eric Rotenberg © 2009, WARP'09, 6/20/09

High-Performance Superscalar Processor
• Generic pipeline configuration
  ↑ Good performance on a wide range of applications
  ↓ Not the highest-performing for any given application
  ↓ Power inefficient

Application-Specific Superscalar Processor
[Figure: App. X on a generic superscalar processor vs. App. X on an application-specific superscalar processor]

Propagation Delay
2-way to 4-way:
  – Increase sizes of ILP-extracting units to expose and exploit more ILP
  – Hide the increase in propagation delays with deeper pipelining
  – Except: worsened propagation delays are not hidden for inter-instruction dependences
[Figure: propagation delay (ns) for 2-way and 4-way superscalar, labeled by dependencies and independencies; execution time of App. 1 and App. 2 on 2-way and 4-way]

Heterogeneous Multi-core
[Figure: cores for App. 1, App. 2, ..., App. N]
Customize each core to an application, class of applications, or class of application behavior.

Challenge
• Customization captures the interplay between program, microarchitecture, and technology
• Need real superscalar designs ...
• ... and need many of them
Need to try out many real superscalar designs.
Need a tool for automatically composing physical designs of arbitrary superscalar processors.

Target Both R&D
• Research: high-fidelity designs improve discovery
• Development: designs should be product strength

Canonical Superscalar Processor
• Different superscalar processors have the same canonical pipeline stages
• Their canonical stages differ in terms of:
  – Complexity
      – Width, i.e., number of superscalar "ways"
      – Sizes of stage-specific structures
  – Sub-pipelining
      – How deeply pipelined a canonical stage is

FabScalar
1) Define composable interfaces for the canonical pipeline stages, so that they can be stitched together to compose an overall superscalar processor.
2) Pre-design multiple versions of each canonical pipeline stage that differ in their width and stage-specific structure sizes (complexity) and in their depth (sub-pipelining).
3) Develop a high-level superscalar synthesis tool that can automatically compose an arbitrary superscalar processor based on processor-level and stage-level constraints (frequency, power, and area), and output multiple representations of the processor (Verilog, cycle-accurate C++, netlist, and physical design).
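The three steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `StageDesign`, `build_library`, and `compose`, the three-stage subset, and all delay numbers are invented and are not taken from FabScalar. The sketch also handles only a frequency constraint, whereas the slide lists power and area as well.

```python
from dataclasses import dataclass

# Hypothetical subset of canonical stages; all delay values are made up.
CANONICAL_STAGES = ["fetch", "decode", "rename"]

@dataclass(frozen=True)
class StageDesign:
    stage: str       # canonical stage name
    width: int       # number of superscalar ways
    depth: int       # number of sub-pipeline stages
    delay_ns: float  # worst-case delay of one sub-stage (invented)

def build_library():
    """Enumerate a toy library: widths 1 and 2, depths 1 to 3 per stage."""
    library = []
    for stage in CANONICAL_STAGES:
        for width in (1, 2):
            total = 1.0 + 0.4 * width  # assumed unpipelined logic delay
            for depth in (1, 2, 3):
                # assumed 0.1 ns of latch overhead per sub-stage
                library.append(StageDesign(stage, width, depth, total / depth + 0.1))
    return library

def compose(library, width, clock_period_ns):
    """Pick, for each canonical stage, the shallowest design of the requested
    width whose sub-stage delay fits within the clock period."""
    chosen = []
    for stage in CANONICAL_STAGES:
        fits = [d for d in library
                if d.stage == stage and d.width == width
                and d.delay_ns <= clock_period_ns]
        if not fits:
            raise ValueError(f"no {width}-wide {stage} design meets {clock_period_ns} ns")
        chosen.append(min(fits, key=lambda d: d.depth))
    return chosen
```

With these invented numbers, `compose(build_library(), width=2, clock_period_ns=1.2)` selects a 2-deep version of every stage, while a 0.5 ns target raises `ValueError` because no 2-wide design in the toy library is fast enough.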

SSL and Composability
[Figure: library versions of the fetch, decode, and rename stages, e.g. scalar with 1 to 3 stages and 2-way superscalar with 1 to 3 stages]

Status
• Designed synthesizable Verilog for a baseline superscalar processor
  – Starting point for populating SSL with pipeline stage designs
(Niket)

Stage            Description
Fetch            4-wide, 512-entry BTB, 128-entry bimodal branch predictor, 8-entry RAS, 16-instruction fetch buffer
Decode           4-wide, ISA = PISA (MIPS-like)
Rename           4-wide, 32-entry rename map table with 8 read and 4 write ports, 4 shadow map tables (checkpoints)
Dispatch         4-wide
Issue            4-wide issue, 32-entry issue queue
Register Read    4-wide, 128-entry physical register file with 8 read ports and 4 write ports
Execute          1 simple ALU, 1 complex ALU, 1 branch ALU, 1 AGEN + 1 port to load-store unit
Load-Store Unit  16-entry load queue, 16-entry store queue
Writeback        4-wide
Retire           4-wide, 128-entry active list with 4 read and 4 write ports, arch. map table with 4 read and 4 write ports
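As an illustration of how such a baseline could be captured as data, here is a small Python sketch. The dictionary key names and the `port_mismatches` helper are hypothetical; only the numeric values come from the table. The port rule it checks (two source operands per instruction, so two read ports and one write port per way) is an assumption that happens to match the rename and register-read rows, not something the slide states.

```python
# Key names are hypothetical; the values are the ones listed in the table.
BASELINE = {
    "width": 4,
    "fetch": {"btb_entries": 512, "bimodal_entries": 128,
              "ras_entries": 8, "fetch_buffer_insts": 16},
    "rename": {"map_table_entries": 32, "read_ports": 8,
               "write_ports": 4, "checkpoints": 4},
    "issue": {"issue_queue_entries": 32},
    "regread": {"phys_regs": 128, "read_ports": 8, "write_ports": 4},
    "lsu": {"load_queue_entries": 16, "store_queue_entries": 16},
    "retire": {"active_list_entries": 128, "read_ports": 4, "write_ports": 4},
}

def port_mismatches(cfg, sources_per_inst=2):
    """Return the structures whose port counts do not follow the assumed rule:
    `sources_per_inst` read ports and one write port per superscalar way."""
    width = cfg["width"]
    bad = []
    for name in ("rename", "regread"):
        ports = cfg[name]
        if (ports["read_ports"] != sources_per_inst * width
                or ports["write_ports"] != width):
            bad.append(name)
    return bad
```

A check like this is one way a composition tool might reject stage combinations whose structures are sized for different widths.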

Status (cont.)
• Developed cycle-accurate C++ simulator and Verilog/C++ co-simulation environment
  – Cycle-accurate at pipeline stage level
[Table: IPC; gap, gcc, gzip, twolf, vortex, vpr; 0.45, 0.54, 0.44, 0.52, 0.48]
(Salil)

Status (cont.)
• Developed register file compiler
  – 16R8W bitcell layout
A superscalar processor has many specialized and highly ported RAM-based structures.
(Tanmay)

Status (cont.)
• Begun sub-pipelining key stages: fetch and issue
• Block-ahead pipelining [Seznec et al.]
[Figure: fetch timing for blocks A, B, C, D]
  – Unpipelined fetch: throughput = 1
  – Pipelined fetch (no block-ahead): throughput = 1
  – Pipelined fetch (with block-ahead): throughput = 2
(Jayneel)
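The throughput figures on the block-ahead slide can be reproduced with a toy timing model. This is a sketch under stated assumptions, not the deck's model: it assumes that splitting fetch into `depth` sub-stages shortens the clock period by exactly that factor, that without block-ahead the next block's address is known only when the current block leaves the last fetch sub-stage, and that block-ahead prediction looks far enough ahead to start one block per cycle. The function names are hypothetical.

```python
def fetch_start_cycles(num_blocks, depth, block_ahead):
    """Cycle in which each fetch block enters a `depth`-stage fetch pipeline."""
    # Without block-ahead, block i+1 cannot start until block i has finished
    # all `depth` sub-stages. With block-ahead (assumed to predict `depth`
    # blocks ahead), a new block can start every cycle.
    interval = 1 if block_ahead else depth
    return [i * interval for i in range(num_blocks)]

def relative_throughput(depth, block_ahead, num_blocks=1000):
    """Blocks fetched per unpipelined fetch time (equal to `depth` fast cycles)."""
    starts = fetch_start_cycles(num_blocks, depth, block_ahead)
    cycles_per_block = (starts[-1] - starts[0]) / (num_blocks - 1)
    return depth / cycles_per_block
```

Under these assumptions, `relative_throughput(1, False)` and `relative_throughput(2, False)` both give 1.0, and `relative_throughput(2, True)` gives 2.0: pipelining alone adds no fetch throughput because each fetch still waits on the previous one.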

Example Applications
• Superscalar customization, fast design-space exploration
(Sandeep)

Example Applications (cont.)
• Core-selectability in chip multiprocessors
Configure the parallel processor for the parallel workload at hand.
[Figure: tiled heterogeneous multi-cores]
(Hashem)

Example Applications (cont.)
• Revisit microarchitecture techniques
• Techniques discarded for limited applicability may be valuable in workload-customized cores

Example Applications (cont.)
• Conventional methodology is flawed
  – Arbitrarily pick a baseline (perhaps by rules of thumb)
  – Add gadget to baseline
  – Speedup: (baseline+gadget) / (baseline)
  – Influence of gadget depends on the choice of baseline
  – Example: value prediction is more important with an undersized IQ
• OK methodology
  – Baseline = custom core for each benchmark
  – Add gadget to this baseline, per benchmark
  – Speedup: (baseline+gadget) / (baseline)
• Better methodology
  – Baseline = custom core for each benchmark
  – Recustomize core with gadget in place (new global optimum)
  – Speedup: (recustomized core) / (customized core)
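The three methodologies can be compared on a toy performance model. The Python sketch below is illustrative only: the design space (`IQ_SIZES`), the per-benchmark ILP limits, the frequency formula, and the way the "gadget" adds one unit of IPC are all invented assumptions, chosen so that the gadget helps most when the issue queue is undersized.

```python
IQ_SIZES = (16, 32, 64)             # hypothetical design space: issue-queue sizes
ILP_DEMAND = {"a": 2.0, "b": 4.0}   # made-up per-benchmark ILP limits

def perf(bench, iq_size, gadget):
    """Toy performance = IPC x frequency (higher is better)."""
    ipc = min(ILP_DEMAND[bench], iq_size / 16)    # bigger IQ exposes more ILP
    if gadget:
        ipc = min(ILP_DEMAND[bench], ipc + 1.0)   # gadget recovers some ILP
    freq = 1.0 / (1.0 + iq_size / 128)            # bigger IQ slows the clock
    return ipc * freq

def best_iq(bench, gadget):
    """Customize the core: pick the IQ size that maximizes performance."""
    return max(IQ_SIZES, key=lambda iq: perf(bench, iq, gadget))

def conventional(bench, baseline_iq):
    """Add the gadget to an arbitrarily chosen baseline."""
    return perf(bench, baseline_iq, True) / perf(bench, baseline_iq, False)

def ok_methodology(bench):
    """Add the gadget to the benchmark's custom core without retuning it."""
    iq = best_iq(bench, False)
    return perf(bench, iq, True) / perf(bench, iq, False)

def better_methodology(bench):
    """Recustomize the core with the gadget in place, then compare optima."""
    return (perf(bench, best_iq(bench, True), True)
            / perf(bench, best_iq(bench, False), False))
```

For benchmark "a" in this toy model, an arbitrary 16-entry baseline reports a 2x speedup, the custom core without retuning reports none, and recustomizing moves the optimum from a 32-entry to a 16-entry queue for a speedup of about 1.11. The spread between the three numbers is the point of the slide; the specific values mean nothing beyond the invented model.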

Summary
• Customizing superscalar cores has value in application-specific designs and heterogeneous multi-core chips
• Customization captures the interplay among program, microarchitecture, and technology
• FabScalar enables the composition of arbitrary superscalar processors, inclusive of technology
• Enabled by a canonical view of the superscalar pipeline, and a lot of "pre-fab" by students who aren't paid enough
Supported by NSF and IBM. Accepting donations.
http://www.tinker.ncsu.edu/ericro/research/fabscalar.htm