RTL Level Power Optimization Techniques Pr Amara AMARA


RTL Level Power Optimization Techniques
Pr. Amara AMARA, Institut Supérieur d’Electronique de Paris, France
Dr. Philippe Royannez, Texas Instruments, Villeneuve-Loubet, France
Low-Power Electronics Design, CRC Press, USA, 2004
6/13/2021 Amara AMARA DA-IICT, India 2005

Outline
• Introduction
• Dynamic, static and leakage components
• Low Power design Techniques
• Clock Gating
• Data paths
• Buses

Hottest chips published in ISSCC
[Figure: power per chip (W, log scale from 0.01 to 1000) versus year (1980 to 2000) for MPU and DSP chips, with trend lines of x4 / 3 years and x1.4 / 3 years (Sakurai 2003).]
Amara AMARA, NEWCAS 2004 Montreal Quebec

Dynamic Power
P = α · CL · VS · VDD · FCLK
• The switching activity α is the average fraction of the nodes in the whole chip that actually toggle 0 -> 1
• The switching activity includes glitches (spurious activity)
• The switching activity increases dramatically with pipelining
• CL is the total equivalent capacitance
• CL includes both gate and wire capacitance
• CL is an average capacitance (capacitances vary with biasing, crosstalk, …)
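
As a rough numerical illustration of the formula above, the following Python sketch evaluates P = α · CL · VS · VDD · FCLK. The function name and all sample values (activity, capacitance, voltages, frequency) are assumptions chosen for illustration, not figures from the slides.

```python
def dynamic_power(alpha, c_load, v_swing, v_dd, f_clk):
    """Dynamic power in watts: P = alpha * CL * VS * VDD * FCLK."""
    return alpha * c_load * v_swing * v_dd * f_clk

# Hypothetical chip: 10% activity, 1 nF total capacitance, full swing at 1.2 V, 100 MHz.
p_full = dynamic_power(0.10, 1e-9, 1.2, 1.2, 100e6)

# Halving the switching activity (for instance with clock gating) halves P.
p_gated = dynamic_power(0.05, 1e-9, 1.2, 1.2, 100e6)

print(f"{p_full * 1e3:.2f} mW -> {p_gated * 1e3:.2f} mW")
```

Because P is linear in α, CL and FCLK but depends on the product VS · VDD, the same function can be used to compare the relative effect of each knob.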

Leakage Power Issues
Pleakage: 4 dominant mechanisms
• Subthreshold leakage
• Gate tunneling leakage
• Reverse-bias diffusion leakage
• Gate-induced drain leakage (GIDL)
[Figure: MOS transistor cross-section (gate, N+ source, N+ drain, P bulk) annotated with the leakage currents ISTH, IG, Iii, IGIDL, IPT and IR.]

Active leakage may dominate
[Figure: leakage and dynamic power per chip (W, log scale) versus year (1980 to 2015) for MPU and DSP processors published in ISSCC, with trend lines (x4 / 3 years, x1.4 / 3 years, x1.1 / 3 years) and the ITRS requirement (Sakurai 2003).]

Principle for Dynamic Power Reduction
P = α · CL · VS · VDD · FCLK
• Lowering switching probability (α)
    • Gated clock, Conditional F/F
    • Low transition coding
• Lowering load capacitance (CL)
    • Embedded memory, Gate sizing
    • Low-k
• Lowering supply voltage (VS, VDD)
    • Most effective (∝ VDD²) and popular, but at the cost of speed degradation
    • VTH should also be lowered for high-speed circuit operation
• Lowering operating frequency (fCLK)
    • Better algorithm, parallelism
    • Never employed in PC, but will be important for portable devices


Clock Gating
• Most effective power optimization technique
• Supported by most of the EDA tools
• Effective at register level as well as at clock network level
• Different approaches:
    • Functional approach
    • Activity-driven
    • Observability Don’t Care-driven

Clock Gating Principle
• Goal: disable or suppress transitions from propagating to parts of the clock path (FFs, clock network and logic) under a given IDLE condition.
• Principle: each sequential functional unit is associated with a block CG that inhibits the clock signal when the IDLE condition is true. The IDLE condition is computed by the function Fcg.

Clock Gating Implementation
Flip-Flop-Based Design
• Simplest way to implement block CG, but subject to spikes.
• With a latch and an AND gate: when CLK is low, spikes are filtered by the AND; when CLK is high, spikes are filtered by the latch.
• With a latch and a NOR gate: when CLK is high, spikes are filtered by the NOR; when CLK is low, spikes are filtered by the latch.
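
The spike-filtering argument can be checked with a small behavioral model. The sketch below is a discrete-time Python model, not a circuit description: it compares a plain AND gate with a latch followed by an AND gate. The waveforms, the function names and the active-high enable are assumptions made for illustration.

```python
def and_gated_clock(clk, en):
    """Plain AND gating: a spike on en while clk is high reaches the gated clock."""
    return [c & e for c, e in zip(clk, en)]

def latch_and_gated_clock(clk, en):
    """Latch + AND gating: the latch is transparent only while clk is low."""
    q, out = 0, []
    for c, e in zip(clk, en):
        if c == 0:
            q = e              # latch follows en while clk is low (AND output is 0 anyway)
        out.append(c & q)      # while clk is high, q is held, so spikes on en are ignored
    return out

def rising_edges(wave):
    """Number of 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if (a, b) == (0, 1))

clk = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
en  = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # one-sample spike while clk is high

spurious_plain = rising_edges(and_gated_clock(clk, en))
spurious_latch = rising_edges(latch_and_gated_clock(clk, en))
print(spurious_plain, spurious_latch)
```

In this model the plain AND gate lets the spike through as a spurious clock edge, while the latch-based version holds the enable value sampled during the low phase.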

Flip-Flop-Based Design: Physical Constraint
• Two separate processes in the same hierarchy
• Physically close, to reduce the impact on the skew and to prevent unwanted optimizations
• Use tool-specific attributes

Example

    CG : process (CLK, CTRL)
      variable qint : std_logic;
    begin
      if CLK = '0' then
        qint := CTRL;                -- latch is transparent while CLK is low
      end if;
      CLKG <= (not qint) and CLK;    -- clock is inhibited when the latched CTRL is '1'
    end process;

    process (CLKG)
    begin
      if CLKG = '1' then
        if WR = '1' then
          RF(conv_integer(adrA)) <= datain;
        else
          A <= RF(conv_integer(adrA));
          B <= RF(conv_integer(adrB));
        end if;
      end if;
    end process;

Automatic GC insertion: functional approach
• Detect conditional behavior in the VHDL description: if-then-else statement, case statement, …
• Identify the idle conditions (under which the clock of the element can be shut off)
• Insert clock gating circuits if the user constraints are met (delay, power)
• Generate the modified VHDL description

Example

P0: without CG

    P0 : process (clk)
    begin
      if (clk'event and clk = '1') then
        X <= A + B;
        D <= E;
        if (load = '1') then
          Y <= C;
        end if;
      end if;
    end process;

P1 and P2: with CG

    P1 : process (clk)
    begin
      if clk'event and clk = '1' then
        X <= A + B;
        D <= E;
      end if;
    end process;

    P2 : process (Gclk)
    begin
      if Gclk'event and Gclk = '1' then
        if (load = '1') then
          Y <= C;
        end if;
      end if;
    end process;

P3: CG circuit

    P3 : process (clk, load)
      variable c_load : std_logic;
    begin
      if clk = '0' then
        c_load := load;          -- latch is transparent while clk is low
      end if;
      Gclk <= clk and c_load;
    end process;

Automatic GC insertion: Activity-Driven
• Most of the clock toggles are not needed
• Power is wasted in the clock driver and in the register
• Good candidate for Clock Gating

Automatic GC insertion: Activity-Driven
Given a realistic test bench:
• Sort the flip-flops according to increasing switching activity
• For a predefined threshold, select a subset of low-frequency flip-flops (SLF-FFs)
• Locate or create an enable signal
• Apply Clock Gating to the selected subset of flip-flops if the user constraints are met
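
The first two steps of this flow can be sketched in Python. The trace format (one sampled output value per clock cycle), the flip-flop names and the threshold are hypothetical; a real flow would read the activity from a simulation dump produced by the test bench.

```python
def toggle_rates(trace):
    """trace maps a flip-flop name to its sampled output values, one per clock cycle."""
    rates = {}
    for name, values in trace.items():
        toggles = sum(1 for a, b in zip(values, values[1:]) if a != b)
        rates[name] = toggles / (len(values) - 1)
    return rates

def select_low_activity(trace, threshold):
    """Sort flip-flops by increasing activity and keep those at or below the threshold."""
    rates = toggle_rates(trace)
    ordered = sorted(rates, key=rates.get)
    return [name for name in ordered if rates[name] <= threshold]

# Hypothetical activity trace over 9 sampled cycles.
trace = {
    "ff_cfg":  [0, 0, 0, 0, 0, 0, 1, 1, 1],   # one toggle
    "ff_cnt0": [0, 1, 0, 1, 0, 1, 0, 1, 0],   # toggles every cycle
    "ff_mode": [1, 1, 1, 1, 1, 1, 1, 1, 1],   # never toggles
}
candidates = select_low_activity(trace, threshold=0.25)
print(candidates)
```

The selected names are only candidates: locating an enable signal and checking the delay and power constraints remain separate steps.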

Automatic GC insertion: Observability Don’t Care-Driven
• If out_bus is not active during a given clock cycle, CG can be applied to R1 and R2
• An ODC Boolean function is evaluated at each clock cycle to properly set the CG control signal for the next clock cycle
• This function is derived by backward traversal of the circuit using the ODC method

ODC Method: Basics
[Figure: three basic cells with their observability don't-care conditions.]
• Multiplexer (inputs Data0 and Data1, select Sel): ODC(Data0) = Sel, ODC(Data1) = Sel’
• Enabled buffer (input Data, enable En, output Out): ODC(Data) = En’
• Enabled register (input Data, enable En, clock CK, output Q): ODC(Data) = En’@T-1
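
The multiplexer conditions can be verified exhaustively. The Python sketch below assumes that Sel = 1 selects Data1, which is what ODC(Data0) = Sel implies; the helper name `is_odc` is hypothetical.

```python
from itertools import product

def mux(data0, data1, sel):
    """2-to-1 multiplexer; assumption: sel = 1 selects data1."""
    return data1 if sel else data0

def is_odc(func, n_inputs, index, condition):
    """True if flipping input `index` never changes func whenever `condition` holds."""
    for bits in product((0, 1), repeat=n_inputs):
        if not condition(*bits):
            continue
        flipped = list(bits)
        flipped[index] ^= 1
        if func(*bits) != func(*flipped):
            return False
    return True

odc_data0 = is_odc(mux, 3, 0, lambda d0, d1, sel: sel == 1)    # ODC(Data0) = Sel
odc_data1 = is_odc(mux, 3, 1, lambda d0, d1, sel: sel == 0)    # ODC(Data1) = Sel'
observable = not is_odc(mux, 3, 0, lambda d0, d1, sel: sel == 0)
print(odc_data0, odc_data1, observable)
```

The same helper can be applied to any small combinational function to check a candidate don't-care condition before using it as a gating signal.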

Automatic GC insertion: ODC-Driven
[Figure: the ODC Boolean function, computed from R1_en, R2_en, Mux_sel and ENB, drives the CG cell placed on the clock CK.]

Design Issues: Timing
• In most power design flows, CG is inserted before clock tree synthesis
• To avoid setup and/or hold time violations:
    • Evaluate these critical times
    • Set the appropriate tool-dependent variables to specify these times

Design Issues: Testability
• CG introduces multiple clock domains in the design
• Insert a control point (OR gate) controlled by an additional signal: Scan_mode
• This signal eliminates the function of the clock gate during the test phase

How effective is Clock-gating?
• Without clock gating: 30.6 mW
• With clock gating: 8.5 mW
• 90% of F/Fs were clock-gated.
• 70% power reduction by clock-gating alone.
M. Ohashi, Matsushita, ISSCC 2002
[Figure: bar chart of power (mW) with and without clock gating, and floorplan of the MPEG4 decoder (VDE, MIF, 896 Kb SRAM, DEU, DSP/HIF).]


Data Paths
• A significant amount of energy may be wasted in the data path.
• Many techniques have been proposed:
    • Computational kernel
    • Pre-computation logic
    • Guarded evaluation (operand isolation)
    • Control signal gating
    • Glitch reduction…

Computation Kernels
• A sequential circuit may have a large number of reachable states.
• Only a subset is visited when the circuit is in its steady state.
• Definition: the computational kernel (CK) of a sequential circuit is a logic block that mimics the typical behavior of the original network.

Computation Kernels
IDEA:
• Extract the computational kernel (K) from the original circuit description (OC)
• K is usually:
    • Small
    • Fast
    • Low power
• Realize a parallel implementation of K and OC that:
    • Uses K as often as possible
    • Uses OC otherwise

Computation Kernels
[Figure: state transition graph of the original circuit OC (states S0 to S7), split into the kernel K and the non-kernel part NK; the kernel is selected using the condition p0, p1, p4 > pthreshold.]
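
The thresholding idea in the figure can be approximated from a simulation trace. The Python sketch below estimates the occupation probability of each state and keeps those above a threshold; it is a trace-based illustration under assumed data (the trace and the threshold are hypothetical), not the extraction algorithm itself.

```python
from collections import Counter

def kernel_states(state_trace, p_threshold):
    """Return the states whose observed occupation probability exceeds p_threshold."""
    counts = Counter(state_trace)
    total = len(state_trace)
    return {state for state, count in counts.items() if count / total > p_threshold}

# Hypothetical steady-state trace: S0, S1 and S4 dominate, S2 is visited rarely.
trace = ["S0", "S1", "S4", "S0", "S1", "S4", "S0", "S1", "S2", "S0", "S1", "S4"]
kernel = kernel_states(trace, p_threshold=0.2)
print(sorted(kernel))
```

States outside the returned set would be handled by the original circuit, which stays available in parallel with the kernel.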

Computation Kernels
• SEL = 1: K can compute the next state and the outputs
• SEL = 0: OC must compute the next state and the outputs
[Figure: the kernel and the original combinational logic share the inputs X and the state flip-flops; multiplexers controlled by SEL select the next state S and the outputs O.]

Data Path: Pre-Computation
Principle:
• Partition the inputs into pre-computed and gated inputs
• If output Y is independent of the gated inputs, then the predictor G generates a signal that freezes the outputs of R2
• Function G is not unique: the best trade-off has to be found
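
A magnitude comparator is a convenient way to see the principle at work. In the Python sketch below, which is an illustration and not taken from the slides, the most significant bits play the role of the pre-computed inputs and the remaining bits are the gated inputs; all names are hypothetical.

```python
def compare_full(a, b):
    """Reference n-bit comparator: Y = 1 when A > B."""
    return int(a > b)

def compare_precomputed(a, b, n_bits, stats):
    """Comparator with a predictor G working on the most significant bits only."""
    msb_a = (a >> (n_bits - 1)) & 1
    msb_b = (b >> (n_bits - 1)) & 1
    if msb_a != msb_b:
        # G fires: Y depends only on the pre-computed inputs (the MSBs),
        # so the register holding the gated inputs (lower bits) can be frozen.
        stats["frozen"] += 1
        return msb_a
    stats["evaluated"] += 1
    return compare_full(a, b)

n = 4
stats = {"frozen": 0, "evaluated": 0}
equivalent = all(
    compare_precomputed(a, b, n, stats) == compare_full(a, b)
    for a in range(2 ** n)
    for b in range(2 ** n)
)
print(equivalent, stats)
```

With uniformly distributed operands the predictor fires for half of the input pairs; a richer G would fire more often but would cost more logic, which is the trade-off mentioned above.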

Data Path: Guarded Evaluation
• Applicable to combinational blocks embedded within logic
• If a block Y can be idle, transparent latches are inserted on all of its inputs
• Control circuitry is added to determine the IDLE condition
• The IDLE condition is used to disable the latches

Data Path: Control-Signal Gating
• The control-signal gating technique takes advantage of a fine-granularity analysis to reduce the switching activity in the data path buses
• Principle: detect when a bus is not used and stop the propagation of the switching activity through the module(s) driving the bus
• ODC-based technique
• A Power Management Unit generates the gated signals that control the steering modules

Data Path: Control-Signal Gating
(1) R1_en_gated = reg1_en AND (not(mux_sel OR (not enb)))@(T+1)
(2) R2_en_gated = reg2_en AND (not((not mux_sel) OR (not enb)))@(T+1)
(3) (mux_sel_gated)@T = (mux_sel_gated)@(T-1) if ((not enb)@(T+1) == True)
The suffix @T means the value of a variable or a function at the current clock cycle, @T-1 the value one clock cycle before, and @T+1 the value at the next clock cycle.

Data Path: Control-Signal Gating

Equation 1

    R1EG : process (Clk)
    begin
      if (Clk'Event and Clk = '1') then
        R1_en_tmp <= NOT(mux_sel OR (NOT Enb));
      end if;
      R1_en_gated <= R1_en_tmp AND reg1_en;
    end process;

Equation 2

    R2EG : process (Clk)
    begin
      if (Clk'Event and Clk = '1') then
        R2_en_tmp <= NOT((NOT mux_sel) OR (NOT Enb));
      end if;
      R2_en_gated <= R2_en_tmp AND reg2_en;
    end process;

Equation 3

    MSG : process (Clk)
    begin
      if (Clk'Event and Clk = '1') then
        Enb_int <= NOT Enb;
        if (Enb_int = '0') then
          mux_sel_gated <= mux_sel;   -- otherwise the previous value is held
        end if;
      end if;
    end process;


Bus Coding
• Advanced SoCs are characterized by long buses with high capacitance and significant switching activity.
• Techniques proposed:
    • Low-swing bus
    • Charge-recycling bus
    • Bus pipelining
    • Bus multiplexing
    • Bus encoding

Bus Coding
[Figure: without coding, the sender drives the source word b(t) directly to the receiver; with coding, an encoder converts b(t) into the code word B(t), which causes less switching activity on the bus, and a decoder restores b(t) for the receiver.]
b(t): source word; B(t): code word

Bus Coding
Different approaches:
• Bus-Invert Coding and its variants (four)
• Transition Signaling Code
• Offset Code
• T0 Code and its variants (four)
• Limited-Weight Code (e.g. one-hot code)
• Etc.

Bus Invert Coding
• The encoding depends on the Hamming distance between the present bus value B(t) and the next bus value B(t+1)

    (B(t), INV(t)) = (b(t), 0)     if H <= N/2
                     (b’(t), 1)    otherwise

N: number of bus lines; H: Hamming distance
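
A minimal Python sketch of this rule follows. It assumes that H is measured between the value currently on the data lines and the next source word, that the bus starts at all zeros, and that the INV line is counted separately; the word values are hypothetical.

```python
def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")

def bus_invert_encode(words, n_bits):
    mask = (1 << n_bits) - 1
    bus, encoded = 0, []               # assumption: data lines start at all zeros
    for b in words:
        if hamming(bus, b) <= n_bits / 2:
            bus, inv = b, 0            # send the word unchanged
        else:
            bus, inv = ~b & mask, 1    # send the complemented word and raise INV
        encoded.append((bus, inv))
    return encoded

def bus_invert_decode(encoded, n_bits):
    mask = (1 << n_bits) - 1
    return [(~bus & mask) if inv else bus for bus, inv in encoded]

def transitions(values, start=0):
    """Total number of line toggles when `values` are sent one after another."""
    total, prev = 0, start
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total

words = [0b00001111, 0b11110001, 0b11110000, 0b00000111]   # hypothetical 8-bit stream
encoded = bus_invert_encode(words, 8)
raw_toggles = transitions(words)
coded_toggles = transitions([bus for bus, _ in encoded])
inv_toggles = transitions([inv for _, inv in encoded])
print(raw_toggles, coded_toggles, inv_toggles)
```

The INV line toggles too, so the net saving is the difference between the raw count and the sum of the coded and INV counts.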

Bus Invert Coding

    Binary (31 transitions)    BIC (19 transitions)
    00101010                   00101010
    00111011                   00111011
    11010100                   00101011
    11110100                   11110100
    00001101                   00001101
    01110110                   10001001
    0001                       0001
    10000100                   10000100

Transitions between consecutive binary words: 2, 7, 1, 6, 6, 5, 4
INV bits and per-step BIC transition counts, as extracted from the slide: 0 1 1 0 0 2 2 1 3 3 4 4

Bus Invert Coding
Characteristics:
• The redundant bit consumes power
• Switching activity on highly capacitive buses is reduced at the expense of additional switching activity in the decoder/encoder
• Effective when the data to be transmitted is randomly distributed in time (µP cache)
• Not efficient for address bus encoding

Bus Invert Coding (Variants)
• Partial BIC:
    • Breaks source words into 2 parts and applies BIC to one part only
    • Effective if certain bits of the data stream exhibit a strong spatio-temporal correlation
• Interleaving BIC:
    • Similar to PBIC, but partitioning and encoding are dynamically changed
• M-bits BIC:
    • Breaks source words into M parts and encodes each one separately
    • Extra cost due to the increasing number of INV signals

Bus Invert Coding (Variants)
• BIC in Time:
    • The decoder decodes the last n code words based on the INV vector received as the (n+1)th code word

    Code words:    B1, B2, …, Bi, …, Bn, Bn+1
    INV vector:    INV1, INV2, …, INVi, …, INVn
    Source words:  b1, b2, …, bi, …, bn
    bi = (Bi, INVi)

T0 Code
• Exploits the sequentiality of the address buses
• A redundant line INC is added to the bus
• When two addresses to be transmitted are sequential, the address bus is frozen and INC is set to 1
• Zero transitions for an ideal stream of consecutive addresses

T0 Code: Principle
Encoder:

    (B(t), INC(t)) = (B(t-1), 1)    if b(t) = b(t-1) + S
                     (b(t), 0)      otherwise

Decoder:

    b(t) = b(t-1) + S    if INC = 1
           B(t)          if INC = 0

S may be known by the encoder and the decoder, or sent on the bus.
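
The encoder and decoder can be modelled in a few lines of Python. This is a behavioral sketch under stated assumptions: the stride S is known to both sides, the bus starts at 0, and the decoder adds S to the previously decoded address when INC = 1. The address stream is hypothetical.

```python
def t0_encode(addresses, stride=1):
    encoded, bus, prev = [], 0, None     # assumption: bus lines start at 0
    for b in addresses:
        if prev is not None and b == prev + stride:
            inc = 1                      # sequential address: freeze the bus, set INC
        else:
            bus, inc = b, 0              # non-sequential: drive the address itself
        encoded.append((bus, inc))
        prev = b
    return encoded

def t0_decode(encoded, stride=1):
    decoded, prev = [], None
    for bus, inc in encoded:
        b = prev + stride if inc else bus
        decoded.append(b)
        prev = b
    return decoded

addresses = [4, 5, 6, 7, 8, 20, 21]      # hypothetical stream with one jump
encoded = t0_encode(addresses)
decoded = t0_decode(encoded)
print(encoded)
```

During the run 4 to 8 the address lines never change; only the jump to 20 causes activity on them, plus the toggles of the INC line.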

T0 Code: example
[Example table, partly lost in extraction: an address stream including 4, 5, 6, 7, 8 (00000100, 00000101, 00000110, 00000111, 00001000).]
• Binary encoding: per-step transitions 1, 2, 1, 4, 3, 1, 4, i.e. 16 transitions
• T0 encoding: 4 transitions

T0 Code: Implementation
[Figure: block diagrams of the encoder and the decoder.]

T0 Code
• Suitable for address bus encoding when sequential addresses transmitted on the bus dominate
• The encoder inserts a one-clock-cycle delay
• Extra area and delay
• Power saving is achieved if the probability of sequential addresses appearing on the bus is higher than a technology-dependent threshold

T0 Code: Variants
T0-BI Code:
• Suitable when the address bus is used to transmit instructions and address values
• If the addresses are sequential, the T0 code is applied and the bus is frozen; otherwise Bus-Invert coding is applied
• Two redundant lines are necessary

Conclusions
• Guidelines for power optimization at RTL level have been presented
• It is the responsibility of the designer to find a good tradeoff between power, performance, area and complexity
• Tools implementing some of these techniques are available from:
    • Synopsys
    • Atrenta
    • BullDAST
    • Sequence Design


Finite State Machine
[Figure: IN -> Next State Logic -> state register (CLK) -> Output Logic -> OUT.]
• Very common parts of digital systems
• Basic structure: state register + combinational blocks
• Power is burnt either in the state register or in the combinational blocks
• Various techniques for power optimization:
    • Gated clock FSM
    • State encoding
    • FSM partitioning

Gated Clock FSM
[Figure: FSM block diagram (IN, Next State Logic, Output Logic, OUT, CLK).]
• Most FSMs remain most of the time in a few states (wait states) and leave them on simple conditions:
    • Interrupt request
    • Time-out counter
• The clock can be stopped in wait states
• The IDLE condition can be easily computed from interrupt or counter status bits

Gated-Clock FSM
• The gain is larger if the FSM controls a large data path whose inputs are not used during the WAIT state
• Freeze the inputs and/or gate the clock of the data path

FSM State Encoding
[Figure: FSM block diagram and two encodings, M1 and M2, of the same state transition graph.]
E(M1) = 2(0.3 + 0.4) + 1(0.1 + 0.1) = 1.6
E(M2) = 1(0.3 + 0.4) + 2 x 0.1 + 1 x 0.1 = 1.0
• Intuitively: reduce the Hamming distance between states whose transition is most likely to occur
• Most of the power is burnt in the Next State Logic activity
• Practical solution: encode for low power the subset of states that spans the high-probability edges; the remaining states are encoded for area efficiency
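
The cost E(M) is the sum, over all edges, of the transition probability times the Hamming distance between the two state codes. The Python sketch below evaluates it on a hypothetical 3-state graph; the graph and both encodings are assumptions, chosen only so that the totals match the 1.6 and 1.0 quoted above. The slide's actual graph is not recoverable from the text.

```python
def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")

def expected_switching(edges, encoding):
    """Sum over edges of probability * Hamming distance between the state codes."""
    return sum(p * hamming(encoding[src], encoding[dst]) for src, dst, p in edges)

# Hypothetical graph; the remaining 0.1 of probability is a self-loop, which costs nothing.
edges = [
    ("S0", "S1", 0.3),
    ("S1", "S0", 0.4),
    ("S1", "S2", 0.1),
    ("S2", "S0", 0.1),
]
m1 = {"S0": 0b00, "S1": 0b11, "S2": 0b01}   # likely edges are 2 bits apart
m2 = {"S0": 0b00, "S1": 0b01, "S2": 0b10}   # likely edges are 1 bit apart

e_m1 = expected_switching(edges, m1)
e_m2 = expected_switching(edges, m2)
print(round(e_m1, 3), round(e_m2, 3))
```

Swapping which pair of states sits at Hamming distance 1 is enough to move the cost, which is why the high-probability edges are encoded first.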

FSM State Encoding: Example 1
• States from RESET to S29: sequentially chained with 100% probability of transition
• Gray encoding is the best choice
• If C0 has a much lower probability than C1, the Gray code should not be incremented from S30 and S31

FSM State Encoding: Example 2
Output power reduction:
• Good practice is to group states that generate the same output and assign them codes with minimum Hamming distance
• BEEF sequence detection and flag Y generation
• Gray-like encoding + orthogonal encoding (MSB = Y)

FSM Partitioning
• Partition the FSM into:
    • One large FSM, mostly inactive
    • One small FSM, often active
• Similar to software: a main body and several subroutines
• Gate the clock and data of the large one when inactive (current state is in a small sub-FSM)