Baring It All to Software: Raw Machines
E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal
(Presented by Linda Deng)
Hitting a wall
• Already in 1997?
• As # of transistors increases, so does wire delay
• New, complex hardware raises verification costs
• Emerging stream-based multimedia workloads

The radical Raw idea
• Lots of simple interconnected tiles
• Each tile contains:
– Instruction/data memories
– ALU
– Registers
– Configurable logic
– Programmable switch for routing
• Complex operations synthesized into HW

A Raw processor [figure]
The programmer’s job
• Software deals with wire delay
• Wire delay = hops in mesh network
– One cycle to move from a tile to its neighbor
• Compiler knows # of cycles needed to move
– Statically schedules operations
• Register renaming, instruction scheduling, dependency checking…
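The hop-count model above can be sketched in a few lines. This is only an illustration, not the Raw compiler's algorithm: the function names `hop_cycles` and `earliest_start` are hypothetical, and the sketch assumes a 2D mesh, one cycle per hop, and no contention on the links.

```python
def hop_cycles(src, dst):
    """Cycles to move a word between two tiles on a 2D mesh.

    Assumption: one cycle per hop and no link contention, so the
    delay is the Manhattan distance between (row, col) coordinates.
    """
    (r0, c0), (r1, c1) = src, dst
    return abs(r0 - r1) + abs(c0 - c1)


def earliest_start(producer_tile, producer_done, consumer_tile):
    """Earliest cycle at which the consumer can use the producer's value.

    Because the delay is known at compile time, a static scheduler can
    compute this instead of checking the dependency in hardware.
    """
    return producer_done + hop_cycles(producer_tile, consumer_tile)


# A value ready at cycle 4 on tile (0, 0) reaches tile (1, 1) two hops later.
print(earliest_start((0, 0), 4, (1, 1)))
```

Since every delay is a compile-time constant in this model, the schedule can be fixed before the program runs.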
What’s the big deal?
• Distributed registers
– Bigger register namespace → higher ILP
• Distributed static RAM
– Shorter memory latency
• No specialized logic structures in HW
– Smaller tiles → more tiles → greater parallelism
– More chip area for memory/logic
– Faster clock
– Less complexity → easier verification
The hard-working compiler
• Parallelism vs. communication/synchronization?
– But the latter’s overhead is low
– So partitioning can be fine-grained
• Tile placement to minimize latency/bandwidth
• Programs for tiles/switches (scheduling/routing)
• Logic synthesis tool for configurable logic
– Pattern-matching algorithms to find candidate instructions
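The tile-placement step can be illustrated with a toy cost model. This is a hedged sketch rather than the placement algorithm used by the Raw compiler: it assumes the program is already split into partitions, that `traffic` (a hypothetical input) gives the number of words exchanged between pairs of partitions, and that cost is words times mesh hops. It searches exhaustively, which is only feasible for a handful of tiles.

```python
from itertools import permutations


def hops(a, b):
    """Manhattan distance between two (row, col) tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def placement_cost(placement, traffic):
    """Total words moved times hops travelled, over all communicating pairs."""
    return sum(w * hops(placement[u], placement[v])
               for (u, v), w in traffic.items())


def best_placement(partitions, tiles, traffic):
    """Try every assignment of partitions to tiles and keep the cheapest.

    Exhaustive search is an assumption made for clarity; a real compiler
    would need a heuristic for larger meshes.
    """
    best = None
    for perm in permutations(tiles, len(partitions)):
        placement = dict(zip(partitions, perm))
        cost = placement_cost(placement, traffic)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


# Four partitions communicating in a ring, placed on a 2x2 mesh.
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
traffic = {("A", "B"): 10, ("B", "C"): 10, ("C", "D"): 10, ("A", "D"): 1}
cost, placement = best_placement(["A", "B", "C", "D"], tiles, traffic)
print(cost, placement)
```

In this example the ring fits around the 2x2 mesh so that every communicating pair is one hop apart.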
Some remaining dynamic events…
• What happens when compiler can’t resolve?
• Reserve bandwidth between potential communicators
• Conservative estimates for dynamic routing
• Assign dependency checking to tiles
• Predict tile for offset, even though base is unknown
Prototype time: RawLogic
• Implemented with FPGAs
• Limited feature support
– Static sequences converted into state machines
– Hardwired into RawLogic
– Inflexible, with amazingly long compilation times
• Framework in C/Verilog for compilation
– Produced binary code for state machines
• But larger benchmarks were emulated
• And Raw machine has faster clock than FPGA
The numbers
Looking ahead
• “In 10 to 15 years, we believe that billion-transistor chip densities, faster switching speeds, and growing compiler sophistication will allow a Raw machine’s performance-to-cost ratio to surpass that of traditional architectures for future, general-purpose workloads.”
• Agarwal’s Tilera started shipping 64-core TILE64 in 2007, working on 36- and 120-core?