Chapter 12: Arithmetic Building Blocks
Boonchuay Supmonchai
Integrated Design Application Research (IDAR) Laboratory
August 20, 2004; Revised July 5, 2005
2102-545 Digital ICs

Goals of This Chapter
- Designing for performance, area, or power
  - Adders
  - Multipliers
  - Shifters
- Logic and system optimizations for datapath modules
- Power-delay trade-offs in datapaths

Review: A Generic Processor
[Figure: block diagram of a generic processor]
- Memory: RAM, ROM, shift register
- Input/Output
- Switches, arbiters, bus drivers
- Control: FSM, PLA, counter, random logic
- Datapath: adder, multiplier, shifter, comparator, etc.

Bit-Sliced Architecture
[Figure: datapath unit built from identical processing elements (register, adder, shifter, multiplexer), one slice per bit from bit 0 to bit n-1, with n-bit data in, n-bit data out, and shared control]
- Modular
  - Easy to design and verify
- Potential to be fast
  - Easy to expand

Example: Itanium Bit-Sliced Design

Example: Itanium Integer Datapath
Itanium has 6 integer execution units (ALU)

One-Bit Binary Full Adder (FA)
[Figure: 1-bit full adder block with inputs A, B, Cin and outputs S, Cout]

  A  B  Cin | Cout  S | Carry status
  0  0  0   |  0    0 | kill
  0  0  1   |  0    1 | kill
  0  1  0   |  0    1 | propagate
  0  1  1   |  1    0 | propagate
  1  0  0   |  0    1 | propagate
  1  0  1   |  1    0 | propagate
  1  1  0   |  1    0 | generate
  1  1  1   |  1    1 | generate

  S = A ⊕ B ⊕ Cin
  Cout = AB + ACin + BCin

- A VERY common operation, so it is worth spending some time trying to optimize it
  - Often in the critical path, so we need to look at both logic-level and circuit-level optimizations
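The truth table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `full_adder` and `carry_status` are made up for this example and are not part of the lecture material.

```python
from itertools import product

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (S, Cout) for one-bit inputs, using the slide's equations."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def carry_status(a: int, b: int) -> str:
    """Classify the bit position using only A and B."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

# Print the rows in the same order as the table: A B Cin Cout S status.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    # The two output bits must encode the arithmetic sum of the inputs.
    assert 2 * cout + s == a + b + cin
    print(a, b, cin, cout, s, carry_status(a, b))
```

The assertion inside the loop is the useful part: it ties the Boolean equations back to ordinary addition.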

Propagate, Generate, and Delete (Kill)
- Define 3 new variables which depend ONLY on A and B
    Generate (G) = AB         (FA itself generates a carry)
    Propagate (P) = A ⊕ B     (FA passes along the carry)
    Delete (D) = !A !B        (FA stops propagation of the carry)
- Then we can write S and Cout in terms of G, P, and Cin
    S(G, P, Cin) = P ⊕ Cin
    Cout(G, P, Cin) = G + P Cin
- We can also write S and Cout in terms of D, P, and Cin
- Sometimes an alternative definition for P can be used
    Propagate (P) = A + B
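A quick exhaustive check of these identities, written as a sketch. The function names are hypothetical, and the D-based carry expression `Cout = !D (!P + Cin)` is one possible form worked out for this example; the slide does not spell it out.

```python
from itertools import product

def pgd(a: int, b: int) -> tuple[int, int, int]:
    """Return (G, P, D) with P defined as XOR."""
    return a & b, a ^ b, (1 - a) & (1 - b)

def check_identities() -> int:
    """Verify the G/P/D identities; return how often P = A + B breaks S."""
    or_sum_mismatches = 0
    for a, b, cin in product((0, 1), repeat=3):
        g, p, d = pgd(a, b)
        s_ref = (a + b + cin) & 1
        cout_ref = (a + b + cin) >> 1
        assert g + p + d == 1                  # exactly one status holds
        assert (p ^ cin) == s_ref              # S = P xor Cin
        assert (g | (p & cin)) == cout_ref     # Cout = G + P Cin
        # Assumed D-based form: Cout = !D (!P + Cin)
        assert ((1 - d) & ((1 - p) | cin)) == cout_ref
        # Alternative P = A + B still gives the right carry ...
        p_or = a | b
        assert (g | (p_or & cin)) == cout_ref
        # ... but not the right sum when A = B = 1.
        if (p_or ^ cin) != s_ref:
            or_sum_mismatches += 1
    return or_sum_mismatches

print(check_identities())
```

The check illustrates why the OR form of P is acceptable in a carry chain but the XOR form is still needed to compute S.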

FA CMOS Implementation: First Try
[Figure: static CMOS schematic of the full adder producing S and Cout from A, B, and Cin]
- 32 transistors
- Cout is the majority function: Maj(A, B, C) outputs 0 or 1, whichever value occurs more often at the inputs

Improved CMOS Implementation
- A more compact design is based on the observation that S can be factored to reuse the Cout term
    S = A B Cin + (A + B + Cin) !Cout
[Figure: block diagram with inputs A, B, Cin; a minority-function block produces !Cout, which is reused to form S]

Improved CMOS Implementation II
28 Transistors

Notes on Improved CMOS FA
- Note that the PMOS network is identical to the NMOS network rather than being its complement.
  - This is possible because of the inversion property, which says that the function of the complemented inputs equals the complement of the function.
  - This simplification reduces the number of series transistors and makes the layout more uniform.
- This design has a greater delay to compute S than Cout.
  - Most of the time the extra delay in computing S has little effect on the critical path, because the carry is the signal that propagates.
  - With proper sizing, this delay on S can be minimized.

Inversion Property
- The function must be symmetric
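The inversion property, f(!A, !B, !Cin) = !f(A, B, Cin), can be checked exhaustively for both full-adder outputs. The sketch below uses made-up helper names; the 3-input AND is included only as a contrasting function that does not have the property.

```python
from itertools import product

def maj(a: int, b: int, c: int) -> int:
    """Carry output of the full adder (majority function)."""
    return (a & b) | (a & c) | (b & c)

def xor3(a: int, b: int, c: int) -> int:
    """Sum output of the full adder."""
    return a ^ b ^ c

def and3(a: int, b: int, c: int) -> int:
    """Contrasting example: a function without the inversion property."""
    return a & b & c

def has_inversion_property(f) -> bool:
    """True if complementing all inputs complements the output."""
    return all(
        f(1 - a, 1 - b, 1 - c) == 1 - f(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )

for f in (maj, xor3, and3):
    print(f.__name__, has_inversion_property(f))
```

Both S and Cout pass the check, which is what allows the PMOS and NMOS networks of the improved adder to share the same topology.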

TG-Based FA
[Figure: transmission-gate full adder with inputs A, B, Cin; labelled blocks: XOR (forming P), XOR (forming S), and a 2-to-1 MUX (forming Cout)]
- 16 transistors
- Extra delay: slower

Complementary PT Logic (CPL) FA
[Figure: complementary pass-transistor full adder with dual-rail inputs and outputs]
- 28 transistors, dual rail
- Voltage drop problems
- Faster, lower power, and smaller area than full static CMOS

Mirror Adder
[Figure: mirror adder schematic with transistor widths annotated; the branches of the carry stage are labelled generate, kill, 0-propagate, and 1-propagate; the internal output nodes are !Cout and !S]
- PUN and PDN are symmetrical, not complemented
- 24 + 4 transistors
- Node !Cout is the complement of Cout = AB + ACin + BCin
- Node !S is the complement of S = A B Cin + (A + B + Cin) !Cout

Mirror Adder Features
- The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors in the carry circuitry. This guarantees identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
- When laying out the cell, the most critical issue is minimizing the capacitance at node !Cout (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances).
  - Shared diffusions can reduce the stack node capacitances.
- The transistors connected to Cin are placed closest to the output.

Mirror Adder Sizing Issues
- Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimum size.
- Assume a PMOS/NMOS ratio of 2. Each input in the carry circuit has a logical effort of 2, so the optimal fanout for each is also 2.
- Since !Cout drives 2 internal and 2 inverter transistor gates (to form Cout for the bit adder), the carry circuit should be oversized.

Mirror Adder Stick Diagram

Ripple Carry Adder (RCA)
[Figure: four FAs in a chain; bit i takes Ai, Bi and carry Ci and produces Si and C(i+1), with C0 = Cin and Cout = C4]
  t_ripple = t_FA(A,B→Cout) + (N - 2) t_FA(Cin→Cout) + t_FA(Cin→S)
- Worst-case delay: t_ripple = O(N). Slow!
- Make the carry path as fast as possible
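A behavioural sketch of the ripple-carry adder and its worst-case delay expression follows. All function names and the unit delay defaults are assumptions chosen for illustration; bit lists are LSB first.

```python
def to_bits(x: int, n: int) -> list[int]:
    """Unsigned integer to an LSB-first list of n bits."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits: list[int]) -> int:
    """LSB-first bit list back to an unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain one full adder per bit; the carry ripples from LSB to MSB."""
    s_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s_bits.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return s_bits, carry

def ripple_delay(n, t_ab_cout=1.0, t_cin_cout=1.0, t_cin_s=1.0):
    """Worst-case delay from the slide; the unit delays are placeholders."""
    return t_ab_cout + (n - 2) * t_cin_cout + t_cin_s

s, cout = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s), cout, ripple_delay(4))
```

With all three FA delays set to one unit, the delay grows as N, which is the O(N) behaviour the slide warns about.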

Exploiting the Inversion Property
[Figure: 4-bit ripple-carry adder in which inverted cells and regular cells alternate along the carry chain]
- Now need two "flavors" of FAs
- Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
  - Requires increasing the transistor sizes on the carry chain portion of the mirror adder

Fast Carry Chain Design
- The key to fast addition is a low-latency carry network
- What matters is whether, in a given position, a carry is
  - Generated: Gi = Ai Bi
  - Propagated: Pi = Ai ⊕ Bi (sometimes use Ai | Bi)
  - Annihilated (killed): Ki = !Ai !Bi
- Giving a carry recurrence of
    C_{i+1} = G_i + P_i C_i

    C1 = G0 + P0 C0
    C2 = G1 + P1 G0 + P1 P0 C0
    C3 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 C0
    C4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 C0
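The expanded carry equations can be checked against the recurrence by brute force. The sketch below uses hypothetical function names and LSB-first lists of G and P bits; it also confirms that either definition of P (XOR or OR) yields the same carries.

```python
def carries_recurrence(g, p, c0):
    """C[i+1] = G[i] + P[i] C[i], evaluated one position at a time."""
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c

def carries_expanded(g, p, c0):
    """C[i+1] = G[i] + P[i]G[i-1] + ... + P[i]...P[0] C0, fully expanded."""
    c = [c0]
    for i in range(len(g)):
        prod = 1          # running product P[i] P[i-1] ... P[j+1]
        ci1 = 0
        for j in range(i, -1, -1):
            ci1 |= prod & g[j]
            prod &= p[j]
        ci1 |= prod & c0  # last term: all propagates times C0
        c.append(ci1)
    return c

def check_expansion(n=4):
    """Compare both forms for every n-bit operand pair and carry-in."""
    for a in range(2 ** n):
        for b in range(2 ** n):
            abits = [(a >> i) & 1 for i in range(n)]
            bbits = [(b >> i) & 1 for i in range(n)]
            g = [x & y for x, y in zip(abits, bbits)]
            p_xor = [x ^ y for x, y in zip(abits, bbits)]
            p_or = [x | y for x, y in zip(abits, bbits)]
            for c0 in (0, 1):
                for p in (p_xor, p_or):
                    rec = carries_recurrence(g, p, c0)
                    assert rec == carries_expanded(g, p, c0)
                    assert rec[n] == (a + b + c0) >> n
    return True

print(check_expansion())
```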

Manchester Carry Chain
[Figure: static and domino implementations of one carry-chain stage]
- Switches controlled by Gi and Pi
- Components of total delay
  - Time to form the switch control signals Gi and Pi
  - Setup time for the switches
  - Signal propagation delay through N switches in the worst case

4-bit Sliced MCC Adder
[Figure: four bit slices; each slice forms G and P from Ai and Bi and drives one stage of a clocked Manchester carry chain running from !C0 to !C4; the sums S0 to S3 are formed per slice]

Domino MCC Circuit
[Figure: domino Manchester carry chain from Ci,0 to Ci,4, with pass transistors controlled by P0 to P3, pull-down transistors controlled by G0 to G3, and clk-controlled precharge and evaluate devices; transistor sizes annotated]
Carry nodes along the chain:
    G0 + P0 C0
    G1 + P1 G0 + P1 P0 C0
    G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 C0
    G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 C0

MCC Stick Diagram

Notes on MCC Adder
- When the clock is low, the carry nodes precharge; when the clock goes high, if Gi is high, C(i+1) is asserted (goes low)
- To prevent Gi from affecting Ci, the signal Pi must be computed as the XOR (rather than the OR), which is not a problem since we need the XOR of Ai and Bi for computing the sum anyway
- Delay is roughly proportional to n² (as n pass transistors are connected in series)
  - We usually limit each group to 4 stages, then buffer the carry chain with an inverter between groups

Binary Adder Landscape
[Figure: taxonomy of binary adders; t = delay, A = area]
- Bit-serial adders
- Asynchronous adders
- Synchronous word-parallel adders
  - Ripple carry adders (RCA): t = O(N), A = O(N)
  - Carry-propagation-minimizing adders: t = O(1), A = O(N)
    - Signed-digit adders
    - Residue adders
  - Fast carry-propagation adders
    - Manchester carry chain: t = O(N), A = O(N)
    - Carry select, parallel prefix, conditional sum: t = O(log N), A = O(N log N)
    - Carry skip: t = O(√N), A = O(N)

Carry-Skip (Carry-Bypass) Adder
[Figure: four FAs in a ripple chain from Ci,0 to Co,3, with a 2-to-1 multiplexer at the block output selected by BP]
- "Block propagate": BP = P0 P1 P2 P3
- If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0; otherwise the block itself kills or generates the carry internally

Carry-Skip Chain Implementation
[Figure: carry chain between block carry-in (Cin) and block carry-out (Cout), with switches controlled by P0 to P3, generate inputs G0 to G3, and a bypass (BP) switch around the block]
- Only 10% to 20% area overhead
- Only two "gate delays" to produce Cout if a skip occurs

4-bit Block Carry-Skip Adder
[Figure: four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, carry propagation, and sum stages; the delays t_setup, t_carry, t_skip, and t_sum are marked along the path from Ci,0]
- Worst-case delay, carry from bit 0 to bit 15: a carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), and ripples in the last group from bit 12 to bit 15
    t_add = t_setup + B t_carry + ((N/B) - 1) t_skip + B t_carry + t_sum

Optimal Block Size and Time
- Assuming one stage of ripple (t_carry) has the same delay as one skip logic stage (t_skip), and both are 1
    t_CSkA = 1 + B + (N/B - 1) + B + 1
             (t_setup, ripple in block 0, skips, ripple in last block, t_sum)
           = 2B + N/B + 1
- So the optimal block size, B, is
    d t_CSkA / dB = 0   →   B_opt = √(N/2)
- And the optimal time is
    Optimal t_CSkA = 2√(2N) + 1
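The optimum can be cross-checked numerically. In this sketch the function names are made up, and restricting the search to block sizes that divide N evenly is an assumption made for simplicity.

```python
import math

def t_cska(n: int, b: int) -> float:
    """Carry-skip delay in unit delays: setup + ripple + skips + ripple + sum."""
    return 1 + b + (n / b - 1) + b + 1

def optimal_block(n: int) -> float:
    """Closed-form optimum B_opt = sqrt(N/2)."""
    return math.sqrt(n / 2)

def optimal_time(n: int) -> float:
    """Closed-form optimum delay 2*sqrt(2N) + 1."""
    return 2 * math.sqrt(2 * n) + 1

def best_integer_block(n: int) -> int:
    """Search all block sizes that divide N evenly (assumed constraint)."""
    divisors = [b for b in range(1, n + 1) if n % b == 0]
    return min(divisors, key=lambda b: t_cska(n, b))

for n in (32, 128):
    b = best_integer_block(n)
    print(n, optimal_block(n), b, t_cska(n, b), optimal_time(n))
```

For N = 32 the closed form gives B_opt = 4 and a delay of 17 unit delays, and the brute-force search lands on the same block size.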

Variations of Carry-Skip Adders I
- Variable block-sized carry-skip adders
  - A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks
  - Hence a carry-skip adder can have bigger blocks for the inner carries without increasing the overall delay
[Figure: carry-skip adder from Cin to Cout with N_B blocks of varying size]
    t_CSkA = 2B + O(N_B)

Variations of Carry-Skip Adders II
- Multiple levels of skip logic
  - Carry-skip adders with a large number of bits suffer from linear carry propagation delay
  - By adding higher levels of skip logic, a carry-skip adder can skip more blocks at a time
[Figure: carry-skip adder from Cin to Cout with skip level 1 and skip level 2; the level-2 skip signal is the AND of the first-level skip signals (BPs)]
    t_CSkA = 2B + O(log_B N)

Carry-Skip Adder Comparisons
[Figure: carry-skip adder comparison curves for block sizes B = 2, 3, 4, 5, and 6]

Carry Select Adders
- Idea: precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel), then select the correct one
- More cost effective than the ripple carry adder
[Figure: 4-bit block with a setup stage (A's and B's in, P's and G's out), a "0" carry propagation chain, a "1" carry propagation chain, a multiplexer controlled by Cin that produces Cout and the C's, and a sum generation stage producing the S's]

Carry Select Adder: Critical Path
[Figure: four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, "0" carry, "1" carry, mux, and sum generation stages, connected from Cin to Cout through the multiplexers]
    t_add = t_setup + B t_carry + (N/B) t_mux + t_sum
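A behavioural sketch of the carry-select idea and of the delay expression above. The function names, the unit delay defaults, and the requirement that N be a multiple of the block size are all assumptions of this example.

```python
def add_block(a: int, b: int, cin: int, width: int) -> tuple[int, int]:
    """Add two width-bit values; return (sum bits, carry out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a: int, b: int, n: int = 16, block: int = 4, cin: int = 0):
    """n-bit carry-select addition; n is assumed to be a multiple of block."""
    mask = (1 << block) - 1
    result, carry = 0, cin
    for lo in range(0, n, block):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        # Both candidates depend only on the operands, so in hardware
        # they can be computed for all blocks in parallel.
        candidates = (add_block(a_blk, b_blk, 0, block),
                      add_block(a_blk, b_blk, 1, block))
        s_blk, carry = candidates[carry]   # the multiplexer
        result |= s_blk << lo
    return result, carry

def t_carry_select(n, block, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Delay expression from the slide with placeholder unit delays."""
    return t_setup + block * t_carry + (n // block) * t_mux + t_sum

print(carry_select_add(0xFFFF, 1), t_carry_select(16, 4))
```

Only the multiplexer select ripples from block to block, which is why the delay has a single B t_carry term instead of one per block.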

Square Root Carry Select Adders
- Balance the delay by making later blocks bigger
    t_add = t_setup + 2 t_carry + √N t_mux + t_sum

Adder Delays - Comparison

Look-Ahead - Basic Idea
[Figure: carry network]
    Co,k = f(Ak, Bk, Co,k-1) = Gk + Pk Co,k-1

Look-Ahead: Topology
By expanding carry generation all the way:
    C1 = G0 + P0 C0
    C2 = G1 + P1 G0 + P1 P0 C0
    C3 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 C0
    C4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 C0
    …

Logarithmic Look-Ahead Adder

Parallel Prefix Adders (PPAs)
- Define the carry operator ∘ on (G, P) signal pairs
    (G, P) = (G'', P'') ∘ (G', P')
    where G = G'' + P'' G'
          P = P'' P'
[Figure: gate-level implementation of one ∘ cell]
  - ∘ is associative, i.e.,
    [(g''', p''') ∘ (g'', p'')] ∘ (g', p') = (g''', p''') ∘ [(g'', p'') ∘ (g', p')]

PPA General Structure
- Given the P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel
    (G0, P0) ∘ (G1, P1) ∘ (G2, P2) ∘ … ∘ (GN-2, PN-2) ∘ (GN-1, PN-1)
- Since the carry operator ∘ is associative, we can group the terms in any order
  - But note that it is not commutative
[Figure: three-stage structure: Pi, Gi logic (1 unit delay); Ci parallel prefix logic tree (1 unit delay per level); Si logic (1 unit delay)]
- Measures to consider
  - Number of ∘ cells
  - Tree cell depth (time)
  - Tree cell area
  - Cell fan-in and fan-out
  - Max wiring length
  - Wiring congestion
  - Delay path variation (glitching)

Brent-Kung PPA
[Figure: 16-bit Brent-Kung parallel prefix computation; inputs (G15, P15) down to (G0, P0) and Cin, outputs C16 down to C1]
Figure annotations:
- T = log2 N
- T = log2 N - 2
- A = 2 log2 N
- A = N/2

Kogge-Stone PPF Adder
[Figure: 16-bit Kogge-Stone parallel prefix computation; inputs (G15, P15) down to (G0, P0) and Cin, outputs C16 down to C1]
Figure annotations:
- T = log2 N
- A = log2 N
- A = N
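The carry operator and a Kogge-Stone style prefix computation can be sketched in a few lines. The function names are hypothetical, the carry-in is folded in after the prefix tree (an implementation choice made here so that the tree has exactly log2 N levels), and N is assumed to be a power of two.

```python
def carry_op(hi, lo):
    """(G'', P'') op (G', P'); hi is the more significant pair."""
    g2, p2 = hi
    g1, p1 = lo
    return (g2 | (p2 & g1), p2 & p1)

def kogge_stone_carries(a: int, b: int, n: int, cin: int = 0):
    """Return ([C1, ..., Cn], number of prefix levels)."""
    pairs = [(((a >> i) & 1) & ((b >> i) & 1),
              ((a >> i) & 1) ^ ((b >> i) & 1)) for i in range(n)]
    levels, d = 0, 1
    while d < n:
        # Each position combines with the position d places to its right;
        # the comprehension reads the previous level's list throughout.
        pairs = [carry_op(pairs[k], pairs[k - d]) if k >= d else pairs[k]
                 for k in range(n)]
        d *= 2
        levels += 1
    # pairs[i] now holds the group (G, P) over bits i..0.
    carries = [g | (p & cin) for g, p in pairs]
    return carries, levels

print(kogge_stone_carries(0b1011, 0b0110, 4))
```

Because every position is updated at every level, this structure uses many more operator cells than Brent-Kung but keeps the depth at log2 N with low fan-out.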

More Adder Comparisons

Adder Speed Comparisons

Adder Average Power Comparisons

Binary Multiplication - Basics
- Given two unsigned binary numbers X (M bits) and Y (N bits)
    X = Σ_{i=0}^{M-1} Xi 2^i,   Y = Σ_{j=0}^{N-1} Yj 2^j,   where Xi, Yj ∈ {0, 1}
- The multiplication operation Z = X × Y is
    Z = Σ_{k=0}^{M+N-1} Zk 2^k = Σ_{i=0}^{M-1} Σ_{j=0}^{N-1} Xi Yj 2^(i+j)

Binary Multiplication Operation
- Binary multiplication as repeated additions

        101010    multiplicand
      ×   1011    multiplier
      --------
        101010
       101010     partial product array
      000000      (can be formed in parallel)
     101010
     ---------
     111001110    double-precision product (2N)

Shift-and-Add Multiplication
- Right shift and add (N bits × N bits)
[Figure: the N-bit multiplicand or "0" is selected by the current multiplier bit and fed to an N-bit adder; the (N+1)-bit result and the multiplier register are shifted right, one bit out per step]
  * Left shift requires a 2N-bit adder
    t_shift&add_mult = O(N · t_adder) = O(N²) for an RCA
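A behavioural sketch of right-shift-and-add multiplication using only an N-bit adder. The register names `acc` and `low` and the function name are assumptions made for this example, not names from the slide.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Multiply two unsigned n-bit numbers with an n-bit adder and right shifts."""
    mask = (1 << n) - 1
    acc = 0          # upper half of the product register
    low = y & mask   # lower half, initially the multiplier
    for _ in range(n):
        carry = 0
        if low & 1:  # current multiplier bit selects multiplicand or "0"
            total = acc + (x & mask)           # n-bit add
            acc, carry = total & mask, total >> n
        # Shift {carry, acc, low} right by one position.
        low = (low >> 1) | ((acc & 1) << (n - 1))
        acc = (acc >> 1) | (carry << (n - 1))
    return (acc << n) | low

# The worked example from the earlier slide: 101010 x 1011.
print(bin(shift_add_multiply(0b101010, 0b1011, 6)))
```

Each of the N iterations costs one adder delay, which gives the O(N · t_adder) total on the slide.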

Improving Multipliers
- Making them faster (therefore, bigger area)
  - Use faster adders
  - Use higher radix (e.g., base 4) multiplication
    - Use multiplier recoding to simplify multiple formation
  - Form the partial product array in parallel and add it in parallel
- Making them smaller (i.e., slower)
  - Use array multipliers
    - Very regular structure with only short wires to nearest-neighbor cells, thus a very simple and efficient layout in VLSI
    - Can be easily and efficiently pipelined

Array (or Tree) Multiplier Structure
[Figure: multiplier ('ier) and multiplicand ('icand) registers feed the multiple forming circuits; their outputs go through the partial product array reduction tree and a fast carry propagate adder (CPA) to give P (product)]
- PP generation: multiple forming circuits
- PP accumulation: partial product array reduction tree
- Final addition: fast carry propagate adder (CPA)
- Delay: mux + reduction tree (log N) + CPA (log N)

Partial Product (PP) Generation
- Each row in the partial-product array is either a copy of the multiplicand or a row of zeros
[Figure: multiplicand bits X7 to X0 gated by multiplier bit Yi to form PP7 to PP0]
- Careful optimization of the PP generation can lead to substantial delay and area reduction
  - Booth's and modified Booth's recoding

Array Multiplier Implementation
[Figure: array of half adders (HA) and full adders (FA); the hardware for one partial product forms one row; two critical paths, CP1 and CP2, are marked]
    t_array_mult = [(M - 1) + (N - 2)] t_carry + (N - 1) t_sum + t_and = O(N)
    * Assume t_add = t_carry

Carry-Save Multiplier
- The idea is to "save" the (PP) carry and add it in the next adder stage
- In the final addition, a fast carry-propagate (e.g., carry-lookahead) adder is used
[Figure: carry-save array with 6 HAs and 6 FAs; the critical path is unique and shorter]
    t_CSM = (N - 1) t_carry + t_merge + t_and = O(N)

CSM Floorplan
Regularity makes the generation of structure amenable to automation

Wallace-Tree Multiplier
[Figure: dot diagrams of the partial products over bit positions 6 to 0, shown as the original array, after rearranging the PPs, after the first stage, after the second stage, and at the final adder; groups reduced by an HA or an FA are marked]
- Cover the tree with HAs and FAs, starting from the densest part
- Any type of adder can be used for the final adder
- GOAL: minimize depth (number of stages) with the minimum number of adder elements

Wallace-Tree Multiplier Implementation
[Figure: Wallace-tree multiplier built from HAs and FAs]
- 3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
- Any type of adder can be used for the final adder

Notes on Wallace-Tree Multiplier
- The Wallace tree substantially saves hardware for large multipliers
  - The number of partial products is reduced to two-thirds per stage (each group of three rows becomes two)
- The propagation delay is bounded by t_WTM = O(log_{3/2} N)
- Although substantially faster than the CSM, the WTM structure is very irregular
  - Difficulty in finding an efficient VLSI layout
- Many of today's high-performance multipliers use higher-order (e.g., 4-2) compressors instead of 3-2 compressors (FAs)
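The logarithmic stage count can be illustrated with a simple row-count model: in every stage each group of three partial-product rows is compressed to two by a row of full adders, and leftover rows pass through unchanged. This model and the function name are assumptions of the sketch; it ignores the per-column use of half adders.

```python
def wallace_stages(rows: int) -> int:
    """Number of 3:2 reduction stages needed to reach two rows."""
    stages = 0
    while rows > 2:
        # Every full group of three rows becomes two; leftovers pass through.
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages

for n in (4, 8, 16, 32, 64):
    print(n, wallace_stages(n))
```

Doubling the operand width adds only about two stages, in line with the O(log_{3/2} N) bound, and the 4-row case needs the two stages shown on the earlier slides.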

Parallel Programmable Shifters
- Shifting a data word left or right by a constant amount is a trivial hardware operation and is implemented by the appropriate signal wiring
- Shifters are used in multipliers and floating-point units
[Figure: shifter block with Data In, Data Out, and Control = shift amount, shift direction, shift type (logical, arithmetic, circular)]
- Consume lots of area if done in random logic gates

A Programmable Binary Shifter
[Figure: one-bit programmable shifter with inputs Ai, Ai-1, outputs Bi, Bi-1, and control signals right, nop, left]
- Exactly one control signal is active

  A1 A0 | right  nop  left | B1 B0
  A1 A0 |   0     1    0   | A1 A0
  A1 A0 |   1     0    0   | 0  A1
  A1 A0 |   0     0    1   | A0 0

4-bit Barrel Shifter
[Figure: 4 × 4 array of pass transistors connecting inputs A3 to A0 to outputs B3 to B0, controlled by Sh0 to Sh3]
Example (arithmetic shift):
    Sh0 = 1:  B3 B2 B1 B0 = A3 A2 A1 A0
    Sh1 = 1:  B3 B2 B1 B0 = A3 A3 A2 A1
    Sh2 = 1:  B3 B2 B1 B0 = A3 A3 A3 A2
    Sh3 = 1:  B3 B2 B1 B0 = A3 A3 A3 A3
- Area dominated by wiring

Notes on Barrel Shifter
- Note that the signal goes through at most one FET, so the propagation delay is constant (in theory)
- Also note that the FET diffusion capacitance on an output wire increases linearly with the shift width, but the FET diffusion capacitance on the input data lines increases quadratically (i.e., N² for a circular shifter)
- The size of the cell is bounded by the pitch of the metal wires
- A decoder is usually needed for the shift control signals, since the amount of shift is normally given as an (encoded) binary number

4-bit Barrel Shifter Layout
    Width_barrel ≈ 2 pm N
    N = max shift distance, pm = metal pitch

8-bit Logarithmic Shifter
[Figure: logarithmic shifter with three stages controlled by Sh1/!Sh1, Sh2/!Sh2, and Sh3/!Sh3; inputs A3 to A0 and outputs B3 to B0 are shown]
- log N stages

8-bit Logarithmic Shifter Layout Slice
[Figure: layout slice with stages shifting by 1, 2, and 4; inputs A3 to A0 and outputs B3 to B0 are shown]
    Width_log ≈ pm (2K + (1 + 2 + … + 2^(K-1))) = pm (2K + 2^K - 1)
    K = log2 N

Shifter Implementation Comparisons

              Barrel                     Logarithmic
   N   K   Width     Speed            Width                 Speed
           2N pm     1 + N diffs      pm(2K + 2^K - 1)      K + 2 diffs
   8   3   16 pm     1 + 8            13 pm                 3 + 2
  16   4   32 pm     1 + 16           23 pm                 4 + 2
  32   5   64 pm     1 + 32           41 pm                 5 + 2
  64   6   128 pm    1 + 64           75 pm                 6 + 2

- The barrel shifter is better for small shifters (faster, not much bigger), while the log shifter is preferred for larger shifters
  - Log shifters are always smaller
- For large shifters we may have to start worrying about the number of pass transistors in series
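The width columns of the table, and the staged operation of a logarithmic shifter, can be reproduced with a short sketch. Widths are in units of the metal pitch pm; the function names are made up for this example, N is assumed to be a power of two, and only a logical right shift is modelled.

```python
def barrel_width(n: int) -> int:
    """Barrel shifter width in metal pitches: 2N."""
    return 2 * n

def log_width(n: int) -> int:
    """Logarithmic shifter width in metal pitches: 2K + 2^K - 1, K = log2 N."""
    k = n.bit_length() - 1   # exact log2 for powers of two
    return 2 * k + (2 ** k - 1)

def log_shift_right(word: int, amount: int, n: int) -> int:
    """Logical right shift built from K stages that shift by 1, 2, 4, ..."""
    k = n.bit_length() - 1
    mask = (1 << n) - 1
    word &= mask
    for stage in range(k):
        if (amount >> stage) & 1:      # stage enabled by one bit of the amount
            word >>= 1 << stage
    return word

for n in (8, 16, 32, 64):
    print(n, barrel_width(n), log_width(n))
print(bin(log_shift_right(0b10110000, 5, 8)))
```

The shift amount is used directly in binary, one bit per stage, which is why the logarithmic shifter needs no decoder while the barrel shifter does.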

Decoders
- Decodes inputs to activate one of many outputs
[Figure: 2-to-4 decoder with inputs In0, In1, Enable]
    Out0 = !In0 !In1
    Out1 = In0 !In1
    Out2 = !In0 In1
    Out3 = In0 In1
- Cost of 2-to-4 decoder
  - Two inverters, four 2-input NAND gates, four inverters, plus enable logic
  - How about the cost for a 3-to-8, 4-to-16, etc. decoder?
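A behavioural sketch of an n-to-2^n decoder, together with a gate count that simply extends the 2-to-4 pattern on the slide. Both function names are hypothetical, In0 is taken as the least significant input, and the generalized count (n input inverters, 2^n n-input NANDs, 2^n output inverters, enable logic ignored) is an assumption rather than a figure from the lecture.

```python
def decode(inputs: list[int], enable: int = 1) -> list[int]:
    """Active-high decoder; inputs[0] is In0 (least significant)."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return [int(bool(enable) and k == index) for k in range(2 ** len(inputs))]

def static_decoder_gates(n: int) -> dict[str, int]:
    """Assumed gate count following the 2-to-4 pattern, without enable logic."""
    return {
        "input_inverters": n,
        "nand_gates": 2 ** n,       # each NAND has n inputs
        "output_inverters": 2 ** n,
    }

print(decode([1, 0]))                # In0 = 1, In1 = 0 selects Out1
for n in (2, 3, 4):
    print(n, static_decoder_gates(n))
```

Under this assumed pattern the gate count doubles with every added input and the NAND fan-in grows as well, which motivates the dynamic and hierarchical decoders on the following slides.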

Dynamic NOR Decoder
[Figure: 2-to-4 dynamic NOR decoder with precharge devices to Vdd, pull-down transistors to GND controlled by the true and complemented address inputs A0 and A1, and output lines B0 to B3; example signal values annotated]
- Active HIGH outputs
- Capacitance of the output wires increases linearly with the decoder size

Dynamic NAND Decoder
[Figure: 2-to-4 dynamic NAND decoder with precharge devices, series transistors to GND controlled by the true and complemented address inputs A0 and A1, and output lines B0 to B3; example signal values annotated]
- Active LOW outputs

Notes on Dynamic Decoders
- In the dynamic NOR decoder, the signal goes through at most one FET
  - So the propagation delay is constant (in theory)
  - However, some output wires may have two or more parallel paths to GND, effectively shortening the transition time
- In contrast, the signal in the dynamic NAND decoder passes through a series of FETs
  - The number of FETs rises linearly with the decoder size
  - Thus it will be slower than the NOR implementation if the gate capacitance dominates the diffusion capacitance
- For the NAND decoder, all the input signals must be low during precharge, or else Vdd and GND will be connected!

Building Bigger Decoders
[Figure: larger decoder built from 1x2 and 2x4 decoder blocks on address bits A4 to A0, with decoder outputs driving the enable inputs of other decoders; example signal values annotated]
- Active-low enable, active-low output
- Need to catch the output that goes to zero before it precharges again

Layout of Bit-Sliced Datapaths
[Figure: layout of a bit-sliced datapath with the following annotations]
- Must have enough drive capacity to handle large fan-out
- Sized for peak current
- Horizontal gap for feeding signals to the cells downstream

Optimizing Bit-sliced Datapaths
[Figure: three layouts of the same datapath]
- Without feedthroughs or pitch matching (4.2 mm²)
- With feedthroughs (3.2 mm²)
- With feedthroughs and pitch matching (2.2 mm²)