Clockless Logic
or: How do I make hardware fast, power-efficient, less noisy, and easy to design?
Montek Singh
Tue, Jan 14, 2003

Course Information (1)
Course Number: COMP 290-084
Time and Place:
  • Tue/Thu 3:30-4:45 pm, Sitterson Hall 325
Instructor:
  • Montek Singh
  • montek@cs.unc.edu (not singh@cs!)
  • SN 245, 962-1832
  • Office hours: most afternoons/by appointment
Teaching Assistant:
  • None
Course Web Page:
  • http://www.cs.unc.edu/~montek

Course Information (2)
Prerequisites:
  • undergraduate knowledge of: digital logic, algorithms, discrete math (sets and graphs)
  • no knowledge of advanced circuit design or of VLSI is assumed
      ◦ relevant topics will be covered in class as needed
  • you are assumed to know the following topics:
      ◦ digital logic: Boolean algebra, logic gates, and latches and registers
      ◦ algorithms: search techniques, enumeration, divide and conquer, and time complexity
      ◦ discrete math: elementary set theory and graph theory

Course Information (3)
Reading Material:
  • Papers and technical reports supplied by instructor
Course Content:
  • The following topics will be covered:
      ◦ Introduction to clockless logic
      ◦ Graphical representation of asynchronous systems
      ◦ Algorithms for logic synthesis
          – Combinational
          – Sequential
      ◦ Design techniques
          – High-performance
          – Low-power
      ◦ Formal methods (performance analysis and verification)
      ◦ Case studies of real-world asynchronous processors

Course Information (4)
Grading:
  • 30% homework assignments
  • 35% class project
      ◦ your choice of topic: from pure algorithms to VLSI design
  • 30% exams
  • 5% class participation
Honor Code is in effect:
  • encouraged to discuss ideas/concepts
  • work handed in must be your own

Lecture 1: Introduction
  • What is asynchronous design?
  • Why do we want to study it?
  • How
      ◦ is data represented in an asynchronous system?
      ◦ is information exchanged?

Introduction: Clocked Digital Design
Most current digital systems are synchronous:
  • Clock: a global signal that paces operation of all components
[Figure: clock]
Benefit of clocking: enables discrete-time representation
  • all components operate exactly once per clock tick
  • component outputs need to be ready by next clock tick
      ◦ allows “glitchy” or incorrect outputs between clock ticks

Microelectronics Trends
Current and Future Trends: Significant Challenges
  • Large-Scale “Systems-on-a-Chip” (SoC)
      ◦ 100 Million ~ 1 Billion transistors/chip
  • Very High Speeds
      ◦ multiple GigaHertz clock rates
  • Explosive Growth in Consumer Electronics
      ◦ demand for ever-increasing functionality …
      ◦ … with very low power consumption (limited battery life)
  • Higher Portability/Modularity/Reusability
      ◦ “plug ’n play” components, robust interfaces

Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
  • Chip will be partitioned into multiple timing domains
      ◦ challenge: gluing together multiple timing domains
          – glue logic is susceptible to “metastability” (= incorrect values transferred) and latency overheads
Increasing Difficulties with Clocked Design:
  • Clock distribution: requires significant designer effort
  • Performance bottleneck: a single slow component
  • Clock burns large fraction of chip power (~40-70%)
  • Fixed clock rate: poor match for
      ◦ designing reusable components
      ◦ interfacing with mixed-timing environments

What is Asynchronous Design?
  • Digital design with no centralized clock
  • Synchronization using local “handshaking”
[Figure: Synchronous System (Centralized Control), driven by a clock, next to Asynchronous System (Distributed Control), connected by handshaking interfaces]

Why Asynchronous Design? (1)
  • Higher Performance
      ◦ May obtain “average-case” operation (not “worst-case”)
          – not limited by slowest component
      ◦ Avoids overheads of multi-GHz clock distribution
  • Lower Power
      ◦ No clock power expended
      ◦ Inactive components consume negligible power
  • Better Electromagnetic Compatibility
      ◦ Smooth radiation spectra: no clock spikes
      ◦ Much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
  • Greater Flexibility/Modularity
      ◦ Naturally adapt to variable-speed environments
      ◦ Supports reusable components

Why Asynchronous Design? (2)
  • The world already is mostly asynchronous!
      ◦ Events at the level of (or in between) large-scale systems are asynchronous
          – several seconds to several milliseconds
          – e.g., PC-printer communication, keyboard inputs, network comm.
      ◦ Events at the board level (or between chips) are often asynchronous
          – milliseconds to 100 nanoseconds
          – e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
      ◦ Events within a chip, at the level of functional units (e.g., adders, control logic) are currently synchronous
          – several nanoseconds to 100 picoseconds
      ◦ Events at the level of a single logic gate are asynchronous
          – 10 picoseconds
      ◦ Events at the quantum level are asynchronous
          – picoseconds to femtoseconds
  • So, why bother with clocks at all?!
      ◦ make everything asynchronous → greater elegance and robustness

Challenges of Asynchronous Design
  • Hazards: potential “glitches” on wire
      [Figure: clean signals vs. hazardous signals around a clock tick; hazards are no problem for clocked systems]
      ◦ communication must be hazard-free!
      ◦ special design challenge = “hazard-free synthesis”
  • Testability Issues:
      ◦ absence of clock means no “single-stepping”
  • Lack of Commercial CAD Tools:
      ◦ chicken-and-egg problem

Asynchronous Design: Past & Present
Async Design: In existence for 50 years, but … many recent technical advances:
  • Hazard-Free Circuit Design:
      ◦ several practical techniques for controllers [Stanford/Columbia]
  • Design for Testability:
      ◦ several test solutions, e.g. Philips Research
  • Maturing Computer-Aided-Design (“CAD”) Tools:
      ◦ software tools for automated design [Philips, Columbia, Manchester]
  • Successful Fabricated Chips:
      ◦ embedded processors, high-speed pipelines, consumer electronics…

Recent Commercial Interest
Several commercial asynchronous chips:
  • Philips: asynchronous 80c51 microcontrollers
      ◦ used in commercial pagers [1998] and smartcards [2001]
  • Univ. of Manchester: async ARM processor [2000]
  • Motorola: async divider in PowerPC chip [2000]
  • HAL: async floating-point divider
      ◦ in HAL-I and II processors [early 1990’s]
Recent experimental chips:
  • IBM, Sun and Intel:
      ◦ fast pipelines, arbiters, instruction-length decoder…
  • IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
  • Theseus Logic, Fulcrum, Self-Timed Solutions…

A 5-minute Homework Problem
Alice and Bob live on opposite sides of a wide river:
[Figure: Alice and Bob on opposite banks]
Alice is supposed to send a message (say, a “Yes”/“No”) across to Bob around midnight. Both have flashlights, but neither owns a watch.
What should they do? Suggest several strategies, and discuss pros and cons of each.

Solution 1
Alice uses 2 lamps:
  • 1 to indicate that she is ready with the message, and
  • 1 for the message itself
Bob uses 1 lamp:
  • to indicate that he has received the message
[Figure: Alice signals “ready” and “yes/no” to Bob; Bob signals “got it” back]

Solution 2
Alice uses 2 lamps:
  • Green lamp to indicate “yes”
  • Red lamp to indicate “no”
Bob uses 1 lamp:
  • to indicate that he has received the message
[Figure: Alice signals “yes” or “no” to Bob; Bob signals “got it” back]

Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
  • At 12 midnight: turns on lamp if message = “yes”
  • At 12:01: turns lamp off
Bob needs no lamps!
  • Takes down the message between 12 and 12:01
Pros: Fewer signals, less processing needed
Cons: Alice and Bob must keep their clocks closely synchronized
  • If Bob’s watch is off by a minute, incorrect communication is possible

Data Representation Styles: “Bundled Data”
Single-rail “Bundled Datapath”: simplest approach
  • widely used
Features:
  • datapath: 1 wire per bit (e.g. standard sync blocks)
  • matched delay: produces delayed “done” signal
      ◦ worst-case delay: longer than slowest path
[Figure: function block with inputs bit 1 … bit n and outputs bit 1 … bit m; “request” passes through a matched delay to produce “done”; “done” indicates valid data]
+ Practical style: can reuse sync components; small area
– Fixed (worst-case) completion time
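The matched-delay rule above can be illustrated with a small sketch. The path delays, the 10% margin, and the function names are hypothetical and are not taken from the slides; the only point shown is that the "done" delay is sized from the slowest path, so every operation pays that worst-case time.

```python
def matched_delay_ps(path_delays_ps, margin_pct=10):
    """Size the 'done' delay: slowest datapath path plus a safety margin (assumed 10%)."""
    worst = max(path_delays_ps)
    return worst + worst * margin_pct // 100


def done_is_safe(path_delays_ps, delay_ps):
    """'done' may only rise after every datapath output has settled."""
    return delay_ps > max(path_delays_ps)


# Hypothetical delays (in picoseconds) of three paths through a function block.
delays = [300, 450, 900]
delay = matched_delay_ps(delays)
print(delay, done_is_safe(delays, delay))
```

Even when only the 300 ps path is exercised, completion is still signalled after the full matched delay, which is the "fixed (worst-case) completion time" drawback listed on the slide.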

Data Representation Styles: Dual-Rail
Dual-rail: uses 2 wires per data bit
[Figure: dual-rail wire pairs labeled bit 1, bit n, bit m]
Each Dual-Rail Pair: provides both data value and validity
+ provides robust data-dependent completion
– needs completion detectors
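A minimal sketch of how one wire pair can carry both value and validity. The slide does not give the code assignment, so the mapping used here (a "true" rail and a "false" rail, with both low meaning "no data yet") is an assumption based on a common convention, and the function names are illustrative.

```python
SPACER = (0, 0)  # assumed convention: neither rail raised means "no valid data"


def encode_bit(value):
    """Return (true_rail, false_rail) for a Boolean value."""
    return (1, 0) if value else (0, 1)


def decode_bit(pair):
    """Return 0 or 1 for a valid pair, None for the spacer."""
    if pair == (1, 1):
        raise ValueError("both rails high is not a legal code in this convention")
    if pair == SPACER:
        return None
    return 1 if pair == (1, 0) else 0


def encode_word(bits):
    """Encode a list of bits as a list of dual-rail pairs."""
    return [encode_bit(b) for b in bits]


print(encode_word([1, 0, 1]))
print(decode_bit(SPACER))
```

Because validity is visible on the wires themselves, a receiver can tell when each bit has arrived without any timing assumption.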

Dual-Rail (contd.)
Dual-Rail Completion Detector:
  • combines dual-rail signals
  • indicates when all bits are valid (or reset)
C-element:
  • if all inputs = 1, output 1
  • if all inputs = 0, output 0
  • else, maintain output value
[Figure: the two rails of each bit (bit 0, bit 1, …, bit n) feed an OR gate; the OR outputs feed a C-element (C) that produces Done]
  • OR together 2 rails per bit
  • Merge results using a Muller “C-element”
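The C-element rules and the OR-then-merge structure can be sketched as a small behavioral model. This is an illustration only: the class and function names are made up, and each dual-rail bit is assumed to be a pair of 0/1 values in which (0, 0) means "reset".

```python
class CElement:
    """Behavioral model of a Muller C-element: the output changes only
    when all inputs agree, and otherwise holds its previous value."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        inputs = list(inputs)
        if all(v == 1 for v in inputs):
            self.output = 1
        elif all(v == 0 for v in inputs):
            self.output = 0
        # mixed inputs: keep the stored output
        return self.output


def completion_detector(c_element, dual_rail_word):
    """OR the two rails of every bit, then merge the results with a C-element."""
    per_bit_valid = [rail_a | rail_b for (rail_a, rail_b) in dual_rail_word]
    return c_element.update(per_bit_valid)


c = CElement()
for word in ([(0, 0), (0, 0)],   # all bits reset
             [(1, 0), (0, 0)],   # one bit arrived
             [(1, 0), (0, 1)],   # all bits valid
             [(0, 0), (0, 1)],   # one bit reset
             [(0, 0), (0, 0)]):  # all bits reset
    print(word, completion_detector(c, word))
```

The held state matters: Done rises only once every bit is valid and falls only once every bit has been reset, so a partially arrived word is never reported as complete.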

Handshaking Styles: 4-phase
4-Phase: requires 4 events per handshake
[Figure: Request and Acknowledge waveforms, with events labeled “start event”, “event done”, “get ready for next event”, “ready for next event”]
+ “Level-sensitive” → simpler logic implementation
– Overhead of “return-to-zero” (RTZ or resetting)
  • extra events which do no useful computation
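A sketch of one 4-phase cycle as a list of signal events. The assignment of the figure's labels to particular edges (request rising = "start event", and so on) is an assumption read off the label order, and the function name is illustrative.

```python
def four_phase_handshake():
    """Return the four events of one handshake as (label, req, ack) tuples."""
    req, ack = 0, 0
    trace = []

    req = 1  # sender raises request
    trace.append(("start event", req, ack))
    ack = 1  # receiver raises acknowledge
    trace.append(("event done", req, ack))
    req = 0  # sender returns request to zero
    trace.append(("get ready for next event", req, ack))
    ack = 0  # receiver returns acknowledge to zero
    trace.append(("ready for next event", req, ack))
    return trace


for event in four_phase_handshake():
    print(event)
```

Only the first two events carry the transfer; the last two are the return-to-zero overhead, after which both wires are back at 0 and the next cycle looks identical.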

Handshaking Styles: 2-phase
2-Phase: requires 2 events per handshake
[Figure: Request and Acknowledge waveforms, with events labeled “start event”, “event done”, “start next event”, “next event done”]
+ Elegant: no return-to-zero
– Slower logic implementation:
  • logic primitives are inherently level-sensitive, not event-based (at least in CMOS)
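A sketch of 2-phase signalling, in which every transition on a wire (rising or falling) counts as an event. The function name and the trace format are illustrative, not from the slides.

```python
def two_phase_transfers(n):
    """Return the events for n handshakes as (wire, req, ack) tuples.
    Each handshake is one toggle of req followed by one toggle of ack."""
    req, ack = 0, 0
    trace = []
    for _ in range(n):
        req ^= 1  # any transition on req starts an event
        trace.append(("req", req, ack))
        ack ^= 1  # any transition on ack reports it done
        trace.append(("ack", req, ack))
    return trace


for event in two_phase_transfers(3):
    print(event)
```

Three handshakes take six events with no return-to-zero, but the wire levels differ from one handshake to the next, which is why level-sensitive logic needs extra circuitry to detect the events.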

Handshaking + Data Representation
Several combinations possible:
  • dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
[Figure: A sends dual-rail bit 1 … bit m to B; B returns ack to A]
  • dual-rail data: functions as an implicit “request”
  • 4-phase cycle: between acknowledge and implicit request

Other Data Representation Styles
  • Level-Encoded Dual-Rail (LEDR)
      ◦ 2 wires per bit: “data” and “phase”
      ◦ exactly one wire per bit changes value
          – if new value is different, “data” wire changes value
          – else “phase” wire changes value
      [Figure: data and phase waveforms]
  • M-of-N Codes
      ◦ N wires used for a data word
      ◦ M wires (M <= N) change value
      ◦ Values of N and M: have impact on…
          – information transmitted, power consumed and logic complexity
  • Knuth codes, Huffman codes, …
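The LEDR rule stated above (exactly one of the two wires changes per new value) can be sketched as follows. The wire names follow the slide; the initial state (0, 0) and the function names are assumptions made for the illustration.

```python
def ledr_send(state, new_bit):
    """Advance one (data, phase) wire pair to carry new_bit.
    Exactly one of the two wires changes."""
    data, phase = state
    if new_bit != data:
        data = new_bit   # value differs: the data wire changes
    else:
        phase ^= 1       # value repeats: the phase wire changes instead
    return (data, phase)


def ledr_stream(bits, state=(0, 0)):
    """Return the wire states after sending each bit in turn."""
    states = []
    for b in bits:
        state = ledr_send(state, b)
        states.append(state)
    return states


print(ledr_stream([1, 1, 0, 0]))
```

In this sketch the parity of the two wires flips with every new value, so a receiver can detect arrival from the wires alone, with no return-to-zero between values.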

Which to use?
Depends on several performance parameters:
  • speed
      ◦ single-rail vs. dual-rail
          – single-rail may be faster (if designed aggressively)
          – dual-rail may be faster (if completion times vary widely)
      ◦ 2-phase vs. 4-phase
          – 2-phase may be faster (if logic overhead is small)
          – 4-phase may be faster (if overhead of return-to-zero is small)
  • power consumption
      ◦ 2-phase typically has fewer gate transitions (→ lower power)
  • amount of logic used (#gates/wires/pins → chip area)
      ◦ single-rail needs fewer gates/wires/pins
  • design and verification effort
      ◦ dual-rail, 1-of-N, M-of-N, Knuth codes…:
          – delay-insensitive: robust in the presence of arbitrary delays
      ◦ single-rail: requires greater timing verification effort

Sutherland’s Micropipelines
Seminal Paper

Focus of Sutherland’s Turing Award Lecture: Pipelining
Motivation: Pipelining is at the heart of nearly all high-performance digital systems
Additional Benefits:
  • Low power
  • Interfacing with mixed systems
  • Modular and scalable design

Background: Pipelining
What is Pipelining?: Breaking up a complex operation on a stream of data into simpler sequential operations
[Figure: a “coarse-grain” pipeline (e.g. simple processor) with stages fetch, decode, execute separated by storage elements (latches/registers); a “fine-grain” pipeline (e.g. pipelined adder)]
Throughput = #data items processed/second
+ Throughput: significantly increased
– Latency: somewhat degraded
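A worked example of the throughput/latency trade-off. All numbers are hypothetical, and the model is deliberately simple (an assumption, not the slides' analysis): every stage is paced by the slowest stage plus a fixed storage-element overhead.

```python
def pipeline_metrics(stage_delays_ps, latch_overhead_ps):
    """Return (cycle time in ps, latency in ps, throughput in items/second)."""
    cycle_ps = max(stage_delays_ps) + latch_overhead_ps
    latency_ps = cycle_ps * len(stage_delays_ps)
    throughput_per_s = 1e12 / cycle_ps  # 1e12 picoseconds per second
    return cycle_ps, latency_ps, throughput_per_s


# One 3000 ps block vs. the same work split into three 1000 ps stages,
# each followed by a storage element assumed to cost 100 ps.
unpipelined = pipeline_metrics([3000], 0)
pipelined = pipeline_metrics([1000, 1000, 1000], 100)
print(unpipelined)
print(pipelined)
```

Under these assumptions the pipelined version accepts a new item every 1100 ps instead of every 3000 ps, while each individual item takes 3300 ps instead of 3000 ps: throughput rises substantially and latency degrades slightly.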

Focus of Async Community
Our Focus: Extremely fine-grain pipelines
  • “gate-level” pipelining = use narrowest possible stages
  • each stage consists of only a single level of logic gates
→ some of the fastest existing digital pipelines to date
Application areas:
  • multimedia hardware (graphics accelerators, video DSP’s, …)
      ◦ naturally pipelined systems, throughput is critical
      ◦ input is often “bursty”
  • optical networking
      ◦ serializing/deserializing FIFO’s
  • genomic string matching?
      ◦ KMP style string matching: variable skip lengths