332:578 Deep Submicron VLSI Design, Lecture 14

Lecture 14: Design for Clock Skew
David Harris
Harvey Mudd College, Spring 2005
1/26/2022

Outline
• Clock Distribution
• Clock Skew
• Skew-Tolerant Static Circuits
• Traditional Domino Circuits
• Skew-Tolerant Domino Circuits
• Summary

Material from: CMOS VLSI Design, by Weste and Harris, Addison-Wesley, 2005

Clocking
• Synchronous systems use a clock to keep operations in sequence
  – Distinguish the current operation from the previous or next one
  – Determine the speed at which the machine operates
• Clock must be distributed to all the sequencing elements
  – Flip-flops and latches
• Also distribute clock to other elements
  – Domino circuits and memories

Clock Distribution
• On a small chip, the clock distribution network is just a wire
  – And possibly an inverter for clkb
• On practical chips, the RC delay from the wire resistance and gate load is very long
  – Variations in this delay cause the clock to reach different elements at different times
  – This is called clock skew
• Most chips use repeaters to buffer the clock and equalize the delay
  – Reduces but doesn’t eliminate skew
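
The RC delay mentioned above can be estimated with the Elmore model. The following is a minimal sketch, assuming a single driver, a distributed RC wire, and a lumped gate load; the function name and all component values are hypothetical and chosen only for illustration. Skew here is simply the difference in estimated delay between two branches.

```python
def wire_delay(r_drv, r_wire, c_wire, c_load):
    """Elmore delay (seconds) of a driver, a distributed RC wire, and a lumped load.

    The driver resistance sees all the capacitance; the wire resistance sees
    half of its own (distributed) capacitance plus the load at the far end.
    """
    return r_drv * (c_wire + c_load) + r_wire * (c_wire / 2 + c_load)


# Hypothetical branches: branch B is longer, so it has more wire R and C.
branch_a = wire_delay(r_drv=100.0, r_wire=200.0, c_wire=400e-15, c_load=100e-15)
branch_b = wire_delay(r_drv=100.0, r_wire=300.0, c_wire=600e-15, c_load=100e-15)

# Skew between the two clock arrival times.
skew = abs(branch_a - branch_b)
print(f"A: {branch_a * 1e12:.0f} ps, B: {branch_b * 1e12:.0f} ps, skew: {skew * 1e12:.0f} ps")
```

Because the wire term grows with both wire resistance and wire capacitance, unequal route lengths quickly turn into unequal arrival times, which is why the delays must be equalized.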

Example
• Skew comes from differences in gate and wire delay
  – With the right buffer sizing, clk1 and clk2 could ideally arrive at the same time
  – But power supply noise changes buffer delays
  – clk2 and clk3 will always see RC skew

Review: Skew Impact
• Ideally, the full cycle is available for work
• Skew adds sequencing overhead
• Skew increases the effective hold time too
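
The effect of skew on a flip-flop pipeline can be written as two budget checks. This is a sketch under stated assumptions: the symbol names (t_pcq, t_ccq, t_setup, t_hold, t_skew) follow the usual textbook notation and do not appear on the slides, and the numbers are hypothetical, in picoseconds.

```python
def max_logic_delay(t_c, t_pcq, t_setup, t_skew):
    """Setup constraint: longest logic delay that still fits in one cycle.

    Data launches after clock-to-Q delay and must arrive one setup time before
    the next edge, which may come early by t_skew.
    """
    return t_c - (t_pcq + t_setup + t_skew)


def min_contamination_delay(t_hold, t_ccq, t_skew):
    """Hold constraint: shortest logic delay that avoids a race.

    The receiving flop may be clocked late by t_skew, so skew adds directly
    to the hold requirement.
    """
    return t_hold - t_ccq + t_skew


# Hypothetical 1000 ps cycle, with and without 50 ps of skew.
with_skew = max_logic_delay(1000, t_pcq=50, t_setup=60, t_skew=50)
no_skew = max_logic_delay(1000, t_pcq=50, t_setup=60, t_skew=0)
min_cd = min_contamination_delay(t_hold=20, t_ccq=30, t_skew=50)
print(with_skew, no_skew, min_cd)
```

In this example the skew removes 50 ps from the time available for logic and turns a hold requirement that was already satisfied (negative minimum delay) into one that needs 40 ps of contamination delay.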

Review: Skew Impact (continued)

Cycle Time Trends
• Much of CPU performance comes from higher clock frequency f
  – f is improving faster than simple process shrinks
  – Sequencing overhead is a bigger part of the cycle

Cycle Time Trends (continued)

Solutions
• Reduce clock skew
  – Careful clock distribution network design
  – Plenty of metal wiring resources
• Analyze clock skew
  – Budget only actual skews, not worst-case skews
  – Local vs. global skew budgets
• Tolerate clock skew
  – Choose circuit structures insensitive to skew

Clock Distribution Networks
• Ad hoc
• Grids
• H-tree
• Hybrid

Clock Grids
• Use grid on two or more levels to carry clock
• Make wires wide to reduce RC delay
• Ensures low skew between nearby points
• But possibly large skew across die

Alpha Clock Grids

H-Trees
• Fractal structure
  – Gets clock arbitrarily close to any point
  – Matched delay along all paths
• Delay variations cause skew
• A and B might see big skew
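
The matched-delay property of the fractal structure can be checked with a small geometric model. This is a simplified sketch, not a layout tool: the function name, the coordinate units, and the choice to halve the segment length after each vertical step are assumptions made for illustration.

```python
def h_tree_leaves(x, y, length, levels, horizontal=True, path=0.0):
    """Return (leaf_point, wire_length_from_root) for every leaf of an H-tree.

    Each level splits into two branches, alternating between horizontal and
    vertical segments; segment length halves after each vertical step.
    """
    if levels == 0:
        return [((x, y), path)]
    leaves = []
    for sign in (-1, 1):
        nx = x + sign * length if horizontal else x
        ny = y if horizontal else y + sign * length
        next_length = length if horizontal else length / 2
        leaves += h_tree_leaves(nx, ny, next_length, levels - 1,
                                not horizontal, path + length)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, 4.0, levels=4)
path_lengths = {path for _, path in leaves}
print(len(leaves), path_lengths)
```

Every leaf sees the same nominal wire length from the root, so skew in this idealized model comes only from delay variations along the branches. Two leaves that are physically close but sit on opposite sides of the root share no wire, so their variations do not cancel.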

Itanium 2 H-Tree
• Four levels of buffering:
  – Primary driver
  – Repeater
  – Second-level clock buffer (SLCB)
  – Gater
• Route around obstructions

Itanium 2 H-Tree (continued)

Hybrid Networks
• Use H-tree to distribute clock to many points
• Tie these points together with a grid
• Ex: IBM Power4, PowerPC
  – H-tree drives 16-64 sector buffers
  – Buffers drive total of 1024 points
  – All points shorted together with grid

Skew Tolerance
• Flip-flops are sensitive to skew because of hard edges
  – Data launches at latest rising edge of clock
  – Must set up before earliest next rising edge of clock
  – Overhead would shrink if we could soften the edge
• Latches tolerate moderate amounts of skew
  – Data can arrive anytime the latch is transparent

Skew: Latches
2-Phase Latches

Skew: Latches
Pulsed Latches
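
For comparison with hard-edged flip-flops, the sequencing overhead of the two latch styles can be sketched as below. The formulas follow the usual textbook accounting and are assumptions here, since the slides show only figures: two-phase latches cost two D-to-Q delays per cycle when data arrives while the latch is transparent, and a pulsed latch costs one D-to-Q delay unless the pulse is too narrow to absorb setup time and skew. Symbol names and values (in picoseconds) are hypothetical.

```python
def two_phase_latch_overhead(t_pdq):
    """Overhead per cycle for two-phase transparent latches.

    Assumes data arrives while each latch is transparent, so moderate skew
    does not appear in the budget.
    """
    return 2 * t_pdq


def pulsed_latch_overhead(t_pdq, t_pcq, t_setup, t_pw, t_skew):
    """Overhead per cycle for a pulsed latch with pulse width t_pw.

    A wide pulse hides setup time and skew; a narrow pulse exposes them.
    """
    return max(t_pdq, t_pcq + t_setup - t_pw + t_skew)


two_phase = two_phase_latch_overhead(t_pdq=40)
wide_pulse = pulsed_latch_overhead(t_pdq=40, t_pcq=50, t_setup=60, t_pw=150, t_skew=50)
narrow_pulse = pulsed_latch_overhead(t_pdq=40, t_pcq=50, t_setup=60, t_pw=60, t_skew=50)
print(two_phase, wide_pulse, narrow_pulse)
```

With the wide pulse, the 50 ps of skew does not show up in the overhead at all; with the narrow pulse it does, which illustrates why the amount of tolerated skew depends on how long the latch stays transparent.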

Dynamic Circuit Review
• Static circuits are slow because fat pMOS transistors load the inputs
• Dynamic gates use precharge to remove pMOS transistors from the inputs
  – Precharge: φ = 0, output forced high
  – Evaluate: φ = 1, output may pull low

Domino Circuits
• Dynamic inputs must monotonically rise during evaluation
  – Place inverting stage between each dynamic gate
  – Dynamic / static pair called domino gate
• Domino gates can be safely cascaded
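
The monotonicity requirement can be illustrated with a small behavioral model. This is a sketch only: the class and method names are hypothetical, and the model ignores leakage, charge sharing, and timing. It captures one fact, that a dynamic node that has discharged cannot recover until the next precharge.

```python
class DominoGate:
    """Behavioral model of a dynamic gate followed by a static inverter."""

    def __init__(self, pulldown):
        self.pulldown = pulldown      # function: inputs -> True if the nMOS network conducts
        self.dynamic_node = True      # precharged high

    def precharge(self):
        """Clock low: the dynamic node is forced high, so the output is low."""
        self.dynamic_node = True

    def evaluate(self, *inputs):
        """Clock high: the node may discharge, but it can never be pulled back up."""
        if self.pulldown(*inputs):
            self.dynamic_node = False
        return not self.dynamic_node  # output of the static inverter


gate = DominoGate(lambda a, b: a and b)   # domino AND

gate.precharge()
before = gate.evaluate(False, True)       # inputs not yet arrived: output stays low
after = gate.evaluate(True, True)         # input rises: output rises
stuck = gate.evaluate(False, True)        # input falls again: output cannot fall
print(before, after, stuck)
```

The output only ever rises during evaluation, which is what makes it a safe input for the next domino gate. The last call also shows why a falling (non-monotonic) input is a problem: the gate keeps reporting the stale high value until it is precharged.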

Domino Timing
• Domino gates are 1.5-2x faster than static CMOS
  – Lower logical effort because of reduced Cin
• Challenge is to keep precharge off critical path
• Look at clocking schemes for precharge and eval
  – Traditional schemes have severe overhead
  – Skew-tolerant domino hides this overhead

Traditional Domino Circuits
• Hide precharge time by ping-ponging between half-cycles
  – One evaluates while the other precharges
  – Latches hold results during precharge

Clock Skew
• Skew increases sequencing overhead
  – Traditional domino has hard edges
  – Evaluate at latest rising edge
  – Setup at latch by earliest falling edge

Time Borrowing
• Logic may not exactly fit a half-cycle
  – No flexibility to borrow time to balance logic between half-cycles
• Traditional domino sequencing overhead is about 25% of cycle time in fast systems!
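
A rough budget shows how hard edges can add up to a figure of this size. The sketch below assumes that each half-cycle loses one latch setup time plus one clock skew (evaluation starts at the latest rising edge and must set up by the earliest falling edge), and it ignores the additional loss from logic that does not fit a half-cycle exactly. The numbers are hypothetical, in picoseconds, and were picked to land near the 25% quoted above.

```python
def traditional_domino_overhead(t_setup, t_skew):
    """Sequencing overhead per cycle: setup time and skew are paid in both half-cycles."""
    return 2 * (t_setup + t_skew)


t_c = 1000                                    # hypothetical cycle time (ps)
overhead = traditional_domino_overhead(t_setup=60, t_skew=65)
fraction = overhead / t_c
print(f"{overhead} ps of {t_c} ps = {fraction:.0%}")
```

Since neither setup time nor skew shrinks as fast as the cycle time, the same absolute overhead becomes a larger fraction of the cycle as clock frequency rises.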

Relaxing the Timing
• Sequencing overhead is caused by hard edges
  – Data departs dynamic gate on late rising edge
  – Must set up at latch on early falling edge
• Latch functions
  – Prevents glitches on inputs of domino gates
  – Holds results during precharge
• Is the latch really necessary?
  – No glitches if inputs come from other domino gates
  – Can we hold the results in another way?

Skew-Tolerant Domino
• Use overlapping clocks to eliminate latches at phase boundaries
  – Second phase evaluates using results of first

Skew-Tolerant Domino (continued)

Full Keeper
• After second phase evaluates, first phase precharges
• Input to second phase falls
  – Violates monotonicity?
• But we no longer need the value
• Now the second gate has a floating output
  – Need full keeper to hold it either high or low

Time Borrowing
• Overlap can be used to
  – Tolerate clock skew
  – Permit time borrowing
• No sequencing overhead

Multiple Phases
• With more clock phases, each phase overlaps more
  – Permits more skew tolerance and time borrowing
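
The relationship between the number of phases and the available overlap can be sketched numerically. The model below is an assumption, not something stated on the slides: N phases are spaced evenly at T_c/N, each is high for a fraction `duty` of the cycle, and the overlap between consecutive phases must cover the hold time, the clock skew, and any time borrowing. Function names and values (in picoseconds) are hypothetical.

```python
def phase_overlap(t_c, n_phases, duty=0.5):
    """Time during which two consecutive evaluation phases are both high."""
    return duty * t_c - t_c / n_phases


def max_time_borrowing(t_c, n_phases, t_hold, t_skew, duty=0.5):
    """Overlap left for time borrowing after covering hold time and skew.

    A negative result means the overlap cannot even cover hold time and skew.
    """
    return phase_overlap(t_c, n_phases, duty) - t_hold - t_skew


two_phase_50 = phase_overlap(1000, 2)              # 50% duty cycle: no overlap
two_phase_65 = phase_overlap(1000, 2, duty=0.65)   # stretched duty cycle
four_phase = phase_overlap(1000, 4)
borrow = max_time_borrowing(1000, 4, t_hold=0, t_skew=50)
print(two_phase_50, two_phase_65, four_phase, borrow)
```

Under these assumptions, two phases need a duty cycle above 50% to overlap at all, while four phases overlap by a quarter cycle with ordinary 50% clocks, leaving room for both skew and borrowing.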

Clock Generation

Methods to Avoid Skew

Summary
• Clock skew effectively increases setup and hold times in systems with hard edges
• Managing skew
  – Reduce: good clock distribution network
  – Analyze: local vs. global skew
  – Tolerate: use systems with soft edges
• Flip-flops and traditional domino are costly because their hard edges expose skew
• Latches and skew-tolerant domino perform at full speed even with moderate clock skews