Clockless Logic: Asynchronous Pipelines MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines

MOUSETRAP Pipelines

Simple asynchronous implementation style, uses…
  • transparent D-latches
  • simple control: 1 gate/pipeline stage
Target = static logic blocks
“MOUSETRAP”: uses a “capture protocol”. Latches…
  • are normally transparent: before new data arrives
  • become opaque: after data arrives (“capture” data)
Control Signaling: transition-signaling = 2-phase
  • simple protocol: req/ack = only 2 events per handshake (not 4)
  • no “return-to-zero”
  • each transition (up/down) signals a distinct operation
Our Goal: very fast cycle time
  • simple inter-stage communication
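To make the "2 events per handshake (not 4)" point concrete, here is a minimal sketch that lists the wire events for a few data items under each protocol. The function names and the event-tuple format are illustrative assumptions, not part of the slides.

```python
def two_phase(n_items):
    """Transition signaling: each item costs one req toggle and one ack toggle."""
    events, req, ack = [], 0, 0
    for _ in range(n_items):
        req ^= 1                      # any transition (up or down) is a request
        events.append(("req", req))
        ack ^= 1                      # any transition (up or down) is an acknowledge
        events.append(("ack", ack))
    return events


def four_phase(n_items):
    """Return-to-zero signaling: both wires must go high and then back low."""
    events = []
    for _ in range(n_items):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


if __name__ == "__main__":
    print(len(two_phase(3)), "events for 3 items with 2-phase signaling")
    print(len(four_phase(3)), "events for 3 items with 4-phase signaling")
```

In the 2-phase case the second item is signaled by the falling edges, which is why no return-to-zero phase is needed.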

MOUSETRAP: A Basic FIFO

Stages communicate using transition-signaling: 1 transition per data item!
[Figure: Stage N between Stage N-1 and Stage N+1. A Latch Controller drives the enable (En) of the Data Latch (Data in, Data out). Signals: req N, done N, req N+1, ack N-1, ack N. Animation labels: 1st data item and 2nd data item flowing through the pipeline.]

MOUSETRAP: A Basic FIFO (contd.)

Latch controller (XNOR) acts as “phase converter”:
  • 2 distinct transitions (up or down) are converted to a pulsed latch enable
  • 2 transitions per latch cycle
[Figure: Stage N between Stage N-1 and Stage N+1. A Latch Controller drives the enable (En) of the Data Latch (Data in, Data out). Signals: req N, done N, req N+1, ack N-1, ack N.]
  • Latch is disabled when current stage is “done”
  • Latch is re-enabled when next stage is “done”
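The phase-converter behavior can be sketched in a few lines: the enable is the XNOR of stage N's own "done" and the "done" of stage N+1 (which arrives as ack N). This is a behavioral sketch only; the variable names and the zero initial state are assumptions, and gate delays are ignored.

```python
def xnor(a, b):
    """Two-input XNOR: 1 when both inputs are in the same phase."""
    return 1 if a == b else 0


def latch_enable(done_n, ack_n):
    """En of stage N: transparent (1) while done N and ack N agree."""
    return xnor(done_n, ack_n)


def pulse_for_one_item(done_n, ack_n):
    """Return (trace of En, new done_n, new ack_n) for one data item."""
    trace = [latch_enable(done_n, ack_n)]       # waiting: latch transparent
    done_n ^= 1                                 # data passed latch N: done N toggles
    trace.append(latch_enable(done_n, ack_n))   # latch N closes, capturing the data
    ack_n ^= 1                                  # stage N+1 captured it: ack N toggles
    trace.append(latch_enable(done_n, ack_n))   # latch N reopens
    return trace, done_n, ack_n


if __name__ == "__main__":
    first, d, a = pulse_for_one_item(0, 0)      # item signaled by rising edges
    second, d, a = pulse_for_one_item(d, a)     # item signaled by falling edges
    print(first, second)
```

Both items produce the same enable pulse even though the first is carried by rising edges and the second by falling edges, which is the sense in which the XNOR converts phases.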

MOUSETRAP: FIFO Cycle Time

[Figure: Stage N between Stage N-1 and Stage N+1, with Latch Controller, Data Latch (Data in, Data out) and signals req N, done N, req N+1, ack N-1, ack N. The cycle is marked in three numbered steps.]
  1. N computes
  2. N+1 computes
  3. N re-enabled to compute
Fast self-loop: N disables itself
Cycle Time = [formula not preserved in this transcript]
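The slide's formula is missing here, but the three numbered steps suggest how a cycle-time expression can be assembled: one delay per step around the loop. The following is a reconstruction under that reading, not the authors' stated formula; the symbols (latch delay t_L, rising XNOR delay) are assumed names.

```latex
% Assumed symbols, not taken from the slides:
%   t_{L_N}                  delay through stage N's data latch        (step 1: N computes)
%   t_{L_{N+1}}              delay through stage N+1's data latch      (step 2: N+1 computes)
%   t_{XNOR_N}^{\uparrow}    rising delay of stage N's XNOR controller (step 3: N re-enabled)
T_{\mathrm{FIFO}} \;=\; t_{L_N} + t_{L_{N+1}} + t_{XNOR_N}^{\uparrow}
\;\approx\; 2\,t_{Latch} + t_{XNOR}^{\uparrow}
\quad \text{(if all stages are identical)}
```

The self-loop that disables stage N runs in parallel with step 2, so under this reading it does not add to the cycle.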

Detailed Controller Operation

[Figure: Stage N’s Latch Controller. Inputs: ack from N+1, done from N. Output: to Latch.]
One pulse per data item flowing through:
  • down transition: caused by “done” of N
  • up transition: caused by “done” of N+1
No minimum pulse width constraint!
  • simply, down transition should start “early enough”
  • can be “negative width” (no pulse!)

MOUSETRAP: Pipeline With Logic

Simple Extension to FIFO: insert logic block + matching delay in each stage
[Figure: Stage N between Stage N-1 and Stage N+1. Each stage has a Latch Controller, a Data Latch followed by a logic block, and a matching delay on its outgoing req line. Signals: req N, done N, req N+1, ack N-1, ack N.]
Logic Blocks: can use standard single-rail (non-hazard-free)
“Bundled Data” Requirement:
  • each “req” must arrive after data inputs valid and stable

Special Case: Using “Clocked Logic”

Clocked-CMOS = C²MOS: eliminate explicit latches
  • latch folded into logic itself
[Figure: left, a general C²MOS gate (logic inputs, pull-up network, pull-down network, En and its complement, “keeper”, logic output); right, a C²MOS AND-gate (inputs A and B, En and its complement, logic output).]

Gate-Level MOUSETRAP: with C²MOS

Use C²MOS: eliminate explicit latches
New Control Optimization = “Dual-Rail XNOR”
  • eliminate 2 inverters from critical path
[Figure: Stage N between Stage N-1 and Stage N+1, with C²MOS logic, a Latch Controller, and a pair of bit latches. Control signals are dual-rail (2 wires each): (ack, ack’), (done, done’), (En, En’); req N, done N, req N+1, ack N-1, ack N.]

Complex Pipelining: Forks & Joins

Problems with Linear Pipelining:
  • handles limited applications; real systems are more complex
Non-Linear Pipelining: has forks/joins
[Figure: a pipeline with a fork and a join.]
Contribution: introduce efficient circuit structures
  • Forks: distribute data + control to multiple destinations
  • Joins: merge data + control from multiple sources
Enabling technology for building complex async systems

Forks and Joins: Implementation

[Figure: two Stage N diagrams. In the join, a C-element (C) combines req 1 and req 2 into the stage's req. In the fork, a C-element (C) combines ack 1 and ack 2 into the stage's ack.]
  • Join: merge multiple requests
  • Fork: merge multiple acknowledges
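The "C" blocks in the figure are Muller C-elements: the output changes only when all inputs agree, and holds its value otherwise. A minimal behavioral sketch follows; the class name, the zero initial state and the two-input usage are assumptions for illustration.

```python
class CElement:
    """Behavioral Muller C-element: output follows the inputs only when they all agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(v == 1 for v in inputs):
            self.out = 1          # every input high: output goes high
        elif all(v == 0 for v in inputs):
            self.out = 0          # every input low: output goes low
        return self.out           # inputs disagree: hold previous value


if __name__ == "__main__":
    join_req = CElement()
    # req 1 toggles first; the merged req waits until req 2 has toggled too.
    print(join_req.update(1, 0))  # still waiting
    print(join_req.update(1, 1))  # both requests arrived
```

With transition signaling this gives the merge behavior in both phases: the merged signal toggles once per item, after every branch has toggled.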

Related Protocols

Day/Woods (’97), and Charlie Boxes (’00)
Similarities: all use…
  • transition signaling for handshakes
  • phase conversion for latch signals
Differences: MOUSETRAP has…
  • higher throughput
  • ability to handle fork/join datapaths
  • more aggressive timing, less insensitivity to delays

Performance, Timing and Optzn.

MOUSETRAP with Logic:
  • Stage Latency = [formula not preserved in this transcript]
  • Cycle Time = [formula not preserved in this transcript]
MOUSETRAP Using C²MOS Gates:
  • Stage Latency = [formula not preserved in this transcript]
  • Cycle Time = [formula not preserved in this transcript]
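Since the formulas on this slide did not survive, the sketch below shows one plausible way to estimate the "with logic" case: take the FIFO loop (two latch delays plus one XNOR transition) and add one logic-block delay. These expressions are a reconstruction under that assumption, not the slide's own formulas, and the delay values in the usage are made-up illustrative numbers. The C²MOS case is not modeled.

```python
def stage_latency(t_latch, t_logic):
    """Assumed forward latency of one stage: through the latch, then the logic block."""
    return t_latch + t_logic


def cycle_time(t_latch, t_logic, t_xnor_rise):
    """Assumed cycle: latch N, logic N, latch N+1, then XNOR of N re-enables latch N."""
    return 2 * t_latch + t_logic + t_xnor_rise


def throughput_ghz(cycle_ps):
    """Convert a cycle time in picoseconds to items per nanosecond (GHz)."""
    return 1000.0 / cycle_ps


if __name__ == "__main__":
    # Hypothetical delays in picoseconds, chosen only to exercise the functions.
    t = cycle_time(t_latch=100, t_logic=300, t_xnor_rise=100)
    print(stage_latency(100, 300), t, round(throughput_ghz(t), 2))
```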

Timing Analysis

Main Timing Constraint: avoid “data overrun”
Data must be safely “captured” by Stage N before new inputs arrive from Stage N-1
  • simple 1-sided timing constraint: fast latch disable
  • Stage N’s “self-loop” faster than entire path through previous stage
[Figure: Stage N-1 and Stage N, each with a Latch Controller, a Data Latch, a logic block and a matching delay. Signals: req N, done N, req N+1, ack N-1, ack N.]
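The one-sided constraint can be written as a simple margin check: the time for stage N to close its own latch must be shorter than the time for new data to travel through stage N-1. The decomposition of each path into delays below is an assumption made for illustration (the slide names the two paths but not their terms), as are the field names and the hold-time parameter.

```python
from dataclasses import dataclass


@dataclass
class StageDelays:
    xnor_rise: float    # controller delay to re-enable the latch
    xnor_fall: float    # controller delay to disable the latch
    latch: float        # delay through the transparent data latch
    logic: float = 0.0  # logic-block delay (0 for a plain FIFO)


def overrun_margin(prev, cur, hold=0.0):
    """Positive result means stage `cur` captures its data before new data arrives."""
    self_loop = cur.xnor_fall + hold                       # N disables its own latch
    prev_path = prev.xnor_rise + prev.latch + prev.logic   # N-1 reopens, new data flows
    return prev_path - self_loop


if __name__ == "__main__":
    # Hypothetical delays in picoseconds.
    s = StageDelays(xnor_rise=60.0, xnor_fall=55.0, latch=80.0, logic=200.0)
    print(overrun_margin(s, s, hold=10.0))
```

Because the check only requires the margin to be positive, it is one-sided: making the self-loop faster never violates it.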

Timing Optzn: Reducing Cycle Time

Analytical Cycle Time = [formula not preserved in this transcript]
Goal: shorten [term not preserved in this transcript] (in steady-state operation)
Steady-state = no undue pipeline congestion
Observation:
  • XNOR switches twice per data item
  • only 2nd (up) transition critical for performance
Solution: reduce XNOR output swing
  • degrade “slew” for start of pulse
  • allows quick pulse completion: faster rise time
Still safe when congested: pulse starts on time
  • pulse maintained until congestion clears

Timing Optzn (contd.)

[Figure: timing waveforms between two events, N “done” (N’s latch disabled) and N+1 “done” (N’s latch re-enabled).]
  • “unoptimized” XNOR output (no pulse width requirement)
  • “optimized” XNOR output: latch only partly disabled; recovers quicker!

Comparison with Wave Pipelining

Two Scenarios:
  • Steady State:
      - both MOUSETRAP and wave pipelines act like transparent “flow through” combinational pipelines
  • Congestion:
      - right environment stalls: each MOUSETRAP stage safely captures data
      - internal stage slow: MOUSETRAP stages to its left safely capture data
    congestion properly handled in MOUSETRAP
Conclusion: MOUSETRAP has potential of…
  • speed of wave pipelining
  • greater robustness and flexibility

Timing Issues: Handling Wide Datapaths

Buffers inserted to amplify latch signals (En)
Reducing Impact of Buffers:
  • control uses unbuffered signals
  • buffer delay off of critical path!
datapath skewed w.r.t. control
[Figure: Stage N-1 and Stage N with buffered En; signals req N, done N, req N+1.]
Timing assumption: buffer delays roughly equal

Preliminary Results

Pre-Layout Simulations of FIFOs:
  • do not account for wire delays, parasitics, etc.
  • careful transistor sizing/verification of timing constraints

Conclusions and Future Work

Introduced a new asynchronous pipeline style:
  • Static logic blocks
  • Simple latches and control:
      - transparent latches, or C²MOS gates
      - single gate control = 1 XNOR gate/stage
  • Highly concurrent event-driven protocol
  • High throughputs obtained:
      - 3.5 GHz in 0.25 µm, 1.9 GHz in 0.6 µm
      - comparable to wave pipelines; yet more robust/less design effort
  • Correctly handle forks and joins in datapaths
  • Timing constraints: local, 1-sided, easily met
Ongoing Work:
  • more realistic performance measurement (incl. parasitics)
  • layout and fabrication