A Result Forwarding Unit for a Synthesisable Asynchronous Processor

Luis Tarazona and Doug Edwards
Advanced Processor Technologies Group, School of Computer Science

Result Forwarding
• Method to reduce the performance penalty of inter-instruction data dependencies
• Can even be used to allow out-of-order execution
• Hard to implement in asynchronous processors
• Earlier proposed solutions to resolve data dependencies in asynchronous processors:
– Register locking (AMULET 1)
– Last-result register (AMULET 2)
– Asynchronous ROB (AMULET 3)
– Counterflow pipelines
Full-custom solutions!

Potential Benefits

Synthesisable Result Forwarding Unit
• Synthesisable description advantages:
– Faster development
– Design-space exploration
– Technology mapping transparency
• The description serves to:
– Evaluate the capabilities of the Balsa language to describe performance-demanding systems
– Highlight performance-oriented description techniques

The Target Processor: nanoSpa
• Experimental new SPA specification
• Same 3-stage SPA pipeline architecture
• Main target: performance
• No support yet for:
– Thumb instructions
– Interrupts
– Memory aborts
– Coprocessors

Related Work: AMULET 3 ROB
• D. A. Gilbert & J. D. Garside, 1997
• Asynchronous reorder buffer that provides forwarding and precise exception handling
• Implemented in single-rail
• Five-process reference model for the synthesisable FU

nanoFU Architecture
• Parameterised queue sizes: 4, 5, 6 & 8
• Dual-rail, performance-oriented description style

Implementation Issues
• Synchronisation between processes:
– Use data tokens instead of sync channels to increase performance
– Speculative buffer reads to decouple arrival and forwarding
– Buffer cell locking to decouple forwarding and allocation
– Drawbacks: power and area penalty

Implementation Issues
• CAM implementation based on comparators
– Relatively simple but still slow
• Register bank operation:
– Potential hazards in dual-rail if speculatively reading while writing
• Register read must wait for Lookup to provide “default” forwarding value
– Number of tokens in pipeline guarantees that writeout never conflicts with reading
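The comparator-based lookup and the "default" value from the register bank can be sketched behaviourally as follows. This is only an illustrative Python model under stated assumptions, not the Balsa description from the talk: the names `Entry`, `ForwardingQueue` and `read_operand`, the "newest match wins" priority, and the reading of "default" as "use the register bank value when no queue entry matches" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """One queue cell (hypothetical layout): a tag plus an optional result."""
    dest: int                    # destination register number, used as the CAM tag
    value: Optional[int] = None  # None until the result has arrived


class ForwardingQueue:
    """Behavioural model of a small forwarding queue (assumed, not from the slides)."""

    def __init__(self, size: int = 4):
        self.size = size
        self.entries: List[Entry] = []  # oldest entry first

    def allocate(self, dest: int) -> Entry:
        """Reserve a cell for an instruction that will write register `dest`."""
        if len(self.entries) == self.size:
            raise RuntimeError("queue full")
        entry = Entry(dest)
        self.entries.append(entry)
        return entry

    def lookup(self, src: int) -> Optional[Entry]:
        """Compare `src` against every tag (one comparator per cell).

        The newest matching cell wins, so a reader sees the latest producer.
        """
        for entry in reversed(self.entries):
            if entry.dest == src:
                return entry
        return None

    def writeout(self, regs: List[int]) -> None:
        """Retire the oldest cell into the register bank."""
        entry = self.entries.pop(0)
        regs[entry.dest] = entry.value


def read_operand(queue: ForwardingQueue, regs: List[int], src: int) -> Optional[int]:
    """Return the forwarded value if a cell matches, else the register bank value.

    A result of None means the matching result has not arrived yet.
    """
    hit = queue.lookup(src)
    if hit is None:
        return regs[src]  # "default" case: no forwarding needed
    return hit.value
```

In this model the register bank is consulted only after the lookup has reported a miss, which is one way to read the ordering constraint on the slide; the hardware may well overlap these steps differently.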

Simulation Results
Pre-layout, transistor-level simulations, 180 nm technology

Balsa limitations highlights
• Need for:
– Efficient ways of describing and synthesising associative arrays
– Deadlock-safe implementation that allows concurrent writes and reads in variables (for speculative reading)
– Signal-level manipulation to avoid excessive synchronisation
• Some peephole optimisations (next talk)

Conclusions

Future work
• To extend the nanoSpa pipeline by including a memory stage and evaluate the performance of the forwarding unit within this architecture
• To implement and explore the effects of suggested optimisations and components

Thank you very much! Questions?

Acknowledgement
• Thanks to Luis Plana, Andrew, Charlie and Will for their suggestions and comments.
• This work and PhD are supported by EPSRC and UoM School of Computer Science scholarships.