14:332:331 Computer Architecture and Assembly Language, Spring 2005
Week 12: Buses and the I/O System
[Adapted from Dave Patterson's UCB CS 152 slides and Mary Jane Irwin's PSU CSE 331 slides]
331 Week 12.1 Spring 2005
Head's Up
- This week's material
  - Buses: connecting I/O devices (reading assignment: PH 8.4)
  - Memory hierarchies (reading assignment: PH 7.1 and B.8)
- Reminders
  - Next week's material: basics of caches (reading assignment: PH 7.2)
Review: Major Components of a Computer
[Diagram: the processor (control and datapath), memory (cache, main memory, and secondary memory/disk), and input and output devices]
Input and Output Devices
- I/O devices are incredibly diverse with respect to behavior, partner, and data rate

  Device            Behavior         Partner   Data rate (KB/sec)
  Keyboard          input            human                0.01
  Mouse             input            human                0.02
  Laser printer     output           human              200.00
  Graphics display  output           human           60,000.00
  Network/LAN       input or output  machine  500.00-6,000.00
  Floppy disk       storage          machine            100.00
  Magnetic disk     storage          machine  2,000.00-10,000.00
Magnetic Disk
- Purpose
  - Long-term, nonvolatile storage
  - Lowest level in the memory hierarchy: slow, large, inexpensive
- General structure
  - A rotating platter coated with a magnetic surface
  - A moveable read/write head accesses the disk
- Advantages of hard disks over floppy disks
  - Platters are more rigid (metal or glass), so they can be larger
  - Higher density, because the head position can be controlled more precisely
  - Higher data rate, because the disk spins faster
  - Can incorporate more than one platter
Organization of a Magnetic Disk
[Diagram: platters, tracks, and sectors]
- Typical numbers (depending on the disk size)
  - 1 to 15 platters (2 surfaces each) per disk, with 1" to 8" diameter
  - 1,000 to 5,000 tracks per surface
  - 63 to 256 sectors per track, the smallest unit that can be read or written (typically 512 to 1,024 B)
  - Traditionally all tracks have the same number of sectors
    - Newer disks with smart controllers can record more sectors on the outer tracks (constant bit density)
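A disk's capacity follows directly from this geometry: multiply the surfaces, tracks, sectors, and sector size together. A minimal sketch in Python, using hypothetical values drawn from the ranges on this slide (the geometry is illustrative, not any particular drive):

```python
# Hypothetical disk geometry, chosen from the ranges on the slide.
platters = 4
surfaces_per_platter = 2
tracks_per_surface = 5000
sectors_per_track = 128      # assumes the traditional constant-sectors layout
bytes_per_sector = 512

capacity_bytes = (platters * surfaces_per_platter * tracks_per_surface
                  * sectors_per_track * bytes_per_sector)
print(f"Capacity: {capacity_bytes / 10**9:.2f} GB")  # about 2.62 GB
```

Note that with constant bit density (more sectors on outer tracks), the sectors-per-track term becomes an average rather than a constant.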
Magnetic Disk Characteristics
[Diagram: track, sector, cylinder, head, and platter]
- Cylinder: all the tracks under the heads at a given point on all surfaces
- Reading or writing data is a three-stage process, plus controller overhead:
  - Seek time: position the arm over the proper track (6 to 14 ms average)
    - Due to the locality of disk references, the actual average seek time may be only 25% to 33% of the advertised number
  - Rotational latency: wait for the desired sector to rotate under the read/write head (on average, half a rotation, i.e., ½ of 1/RPM)
  - Transfer time: transfer a block of bits (a sector) under the read/write head (2 to 20 MB/sec typical)
  - Controller time: the overhead the disk controller imposes in performing a disk I/O access (typically < 2 ms)
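These stages simply add up to the total access time. A small sketch, with hypothetical numbers chosen from the ranges above:

```python
def disk_access_time_ms(seek_ms, rpm, sector_bytes, transfer_mb_per_s, controller_ms):
    # Average rotational latency is half a revolution: 0.5 * (60 / RPM) seconds.
    rotation_ms = 0.5 * (60 / rpm) * 1000
    # Time to transfer one sector at the given rate.
    transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1000
    return seek_ms + rotation_ms + transfer_ms + controller_ms

# Hypothetical 10,000 RPM disk: 6 ms seek, 512 B sector at 10 MB/sec, 2 ms controller.
t = disk_access_time_ms(6, 10000, 512, 10, 2)
print(f"{t:.2f} ms")  # about 11.05 ms
```

Notice that seek and rotation dominate: the actual data transfer is a tiny fraction of the total, which is why sequential access patterns are so much faster than random ones.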
Magnetic Disk Examples

  Characteristic              Sun X6713A    Toshiba MK2016
  Disk diameter (inches)      3.5           2.5
  Capacity                    73 GB         20 GB
  MTTF (k hrs)                1,200         300
  # of platters / heads       -             2 / 4
  # cylinders                 16,383        -
  # B/sector                  -             512
  # sectors/track             -             63
  Rotation speed (RPM)        10,000        4,200
  Max. / avg. seek time (ms)  ? / 6.6       24 / 13
  Avg. rot. latency (ms)      3             7.14
  Transfer rate (PIO)         35 MB/sec     16.6 MB/sec
  Power (watts)               -             < 2.5
  Volume (in³)                -             4.01
  Weight (oz)                 -             3.49
I/O System Interconnect Issues
[Diagram: the processor and main memory connected over a bus to devices such as a receiver and a keyboard]
- A bus is a shared communication link (a set of wires used to connect multiple subsystems)
- Issues:
  - Performance
  - Expandability
  - Resilience in the face of failure (fault tolerance)
Performance Measures
- Latency (execution time, response time): the total time from the start to the finish of one instruction or action
  - Usually used to measure processor performance
- Throughput: the total amount of work done in a given amount of time
  - aka execution bandwidth
  - The number of operations performed per second
- Bandwidth: the amount of information communicated across an interconnect (e.g., a bus) per unit time
  - The bit width of the operation * the rate of the operation
  - Usually used to measure I/O performance
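The bandwidth definition (bit width * rate) is easy to put into code. The 32-bit / 33 million-transfers-per-second numbers below are hypothetical, chosen to resemble a PCI-class bus:

```python
# Bandwidth = bit width of each operation * rate of operations.
def bandwidth_mb_per_s(bus_width_bits, ops_per_second):
    return bus_width_bits / 8 * ops_per_second / 1e6

# Hypothetical 32-bit bus doing 33 million transfers per second.
bw = bandwidth_mb_per_s(32, 33e6)
print(f"{bw:.0f} MB/sec")  # 132 MB/sec
```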
I/O System Expandability
- A system usually has more than one I/O device
  - Each I/O device is controlled by an I/O controller
[Diagram: the processor and cache connect to main memory; the I/O bus connects main memory to I/O controllers for disks, a terminal, and a network; the controllers raise interrupt signals to the processor]
Quiz
- What is disk seek time, and what is rotational latency?
Bus Characteristics
[Diagram: control lines and data lines]
- Control lines
  - Signal requests and acknowledgments
  - Indicate what type of information is on the data lines
- Data lines
  - Carry data, complex commands, and addresses
- A bus transaction consists of
  - Sending the address
  - Receiving (or sending) the data
Output (Read) Bus Transaction
- Bus transactions are defined by what they do to memory
  - read = output: transfers data from memory (read) to an I/O device (write)
- Step 1: The processor sends the read request and read address to memory
- Step 2: Memory accesses the data
- Step 3: Memory transfers the data to the disk
[Diagram for each step: the processor and main memory connected by control and data lines]
Input (Write) Bus Transaction
- write = input: transfers data from an I/O device (read) to memory (write)
- Step 1: The processor sends the write request and write address to memory
- Step 2: The disk transfers the data to memory
[Diagram for each step: the processor and main memory connected by control and data lines]
Advantages and Disadvantages of Buses
- Advantages
  - Versatility:
    - New devices can be added easily
    - Peripherals can be moved between computer systems that use the same bus standard
  - Low cost:
    - A single set of wires is shared in multiple ways
- Disadvantages
  - It creates a communication bottleneck
    - The bus bandwidth limits the maximum I/O throughput
  - The maximum bus speed is largely limited by
    - The length of the bus
    - The number of devices on the bus
  - It needs to support a range of devices with widely varying latencies and data transfer rates
Types of Buses
- Processor-memory bus (proprietary)
  - Short and high speed
  - Matched to the memory system to maximize the memory-processor bandwidth
  - Optimized for cache block transfers
- I/O bus (industry standard, e.g., SCSI, USB, ISA, IDE)
  - Usually lengthy and slower
  - Needs to accommodate a wide range of I/O devices
  - Connects to the processor-memory bus or backplane bus
- Backplane bus (industry standard, e.g., PCI)
  - The backplane is an interconnection structure within the chassis
  - Used as an intermediary bus connecting I/O buses to the processor-memory bus
A Two-Bus System
[Diagram: the processor-memory bus, with bus adaptors connecting two I/O buses]
- I/O buses tap into the processor-memory bus via bus adaptors (which do speed matching between the buses)
  - Processor-memory bus: mainly for processor-memory traffic
  - I/O buses: provide expansion slots for I/O devices
A Three-Bus System
[Diagram: the processor-memory bus, a bus adaptor to a backplane bus, and further adaptors from the backplane bus to I/O buses]
- A small number of backplane buses tap into the processor-memory bus
  - The processor-memory bus is used for processor-memory traffic
  - I/O buses are connected to the backplane bus
- Advantage: loading on the processor-memory bus is greatly reduced
I/O System Example (Apple Mac 7200)
- Typical of midrange to high-end desktop systems in 1997
[Diagram: the processor with cache, main memory, and a PCI interface/memory controller on the processor-memory bus; the PCI bus connects audio I/O, serial ports, an I/O controller for a SCSI bus (CD-ROM, disk, tape), a graphics controller for the terminal, and an I/O controller for the network]
Example: Pentium System Organization
[Diagram: the processor-memory bus, the memory controller ("Northbridge"), the PCI bus, and I/O buses]
http://developer.intel.com/design/chipsets/850/animate.htm?iid=PCG+devside&
Synchronous and Asynchronous Buses
- Synchronous bus
  - Includes a clock in the control lines
  - A fixed protocol for communication that is relative to the clock
  - Advantage: involves very little logic and can run very fast
  - Disadvantages:
    - Every device on the bus must run at the same clock rate
    - To avoid clock skew, fast buses cannot be long
- Asynchronous bus
  - Not clocked, so it requires a handshaking protocol (req, ack)
    - Implemented with additional control lines
  - Advantages:
    - Can accommodate a wide range of devices
    - Can be lengthened without worrying about clock skew or synchronization problems
  - Disadvantage: slow(er)
Asynchronous Handshaking Protocol
- Output (read) data from memory to an I/O device
[Waveform diagram: ReadReq, Data (addr, then data), Ack, and DataRdy lines, with transitions numbered 1 through 7]
The I/O device signals a request by raising ReadReq and putting the address on the data lines; then:
1. Memory sees ReadReq, reads the address from the data lines, and raises Ack
2. The I/O device sees Ack and releases the ReadReq and data lines
3. Memory sees ReadReq go low and drops Ack
4. When memory has the data ready, it places it on the data lines and raises DataRdy
5. The I/O device sees DataRdy, reads the data from the data lines, and raises Ack
6. Memory sees Ack, releases the data lines, and drops DataRdy
7. The I/O device sees DataRdy go low and drops Ack
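The handshake steps can be traced with a toy simulation. The signal names match the slide; everything else (the dictionary representation, the `step` helper) is just illustrative scaffolding, not a real bus model:

```python
# Signals: 1 = asserted, 0 = deasserted.
signals = {"ReadReq": 0, "DataRdy": 0, "Ack": 0}
trace = []

def step(n, actor, changes):
    # Apply the signal changes for one protocol step and record a snapshot.
    signals.update(changes)
    trace.append((n, actor, dict(signals)))

step(1, "I/O device", {"ReadReq": 1})   # request raised, address on data lines
step(2, "memory",     {"Ack": 1})       # memory latches the address
step(3, "I/O device", {"ReadReq": 0})   # device releases request and data lines
step(4, "memory",     {"Ack": 0})       # memory sees ReadReq low
step(5, "memory",     {"DataRdy": 1})   # data placed on data lines
step(6, "I/O device", {"Ack": 1})       # device reads the data
step(7, "memory",     {"DataRdy": 0})   # memory releases the data lines
signals["Ack"] = 0                      # finally the device drops Ack

for n, actor, snap in trace:
    print(n, actor, snap)
```

At the end, all three signals are back at 0, which is what lets the next transaction start cleanly — each side only acts on a transition it has observed, so no clock is needed.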
Key Characteristics of Two Bus Standards

  Characteristic              Firewire (1394)              USB 2.0
  Type                        I/O                          I/O
  Data bus width (signals)    4                            2
  Clocking                    asynchronous                 asynchronous
  Theoretical peak bandwidth  50 MB/sec (Firewire 400)     0.2 MB/sec (low speed),
                              or 100 MB/sec (Firewire      1.5 MB/sec (full), or
                              800)                         60 MB/sec (high)
  Hot pluggable               yes                          yes
  Max. devices                63                           127
  Max. length (copper wire)   4.5 meters                   5 meters
Review: Major Components of a Computer
[Diagram: the processor (control and datapath), memory, and input and output devices]
A Typical Memory Hierarchy
- By taking advantage of the principle of locality, we can:
  - Present the user with as much memory as is available in the cheapest technology
  - Provide access at the speed offered by the fastest technology
[Diagram: on-chip components (control, datapath, register file, instruction and data caches, ITLB and DTLB, eDRAM), then a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk); moving away from the processor, speed (ns) grows from .1's to 1,000's, size (bytes) grows from 100's to T's, and cost per byte falls from highest to lowest]
Characteristics of the Memory Hierarchy
[Diagram: the processor exchanges 4-8 bytes (a word) with the L1 cache; the L1 cache exchanges 8-32 bytes (a block) with the L2 cache; the L2 cache exchanges 1 block with main memory; main memory exchanges 1,024+ bytes (a disk sector = a page) with secondary memory; both access time and (relative) size increase with distance from the processor]
- Inclusive: what is in the L1 cache is a subset of what is in the L2 cache, which is a subset of what is in main memory, which is a subset of what is in secondary memory
Memory Hierarchy Technologies
- Random access
  - "Random" is good: the access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density (1-transistor cells), low power, cheap, slow
    - Dynamic: needs to be "refreshed" regularly (~every 8 ms)
  - SRAM: Static Random Access Memory
    - Low density (6-transistor cells), high power, expensive, fast
    - Static: content lasts "forever" (until the power is turned off)
  - Size ratio, DRAM/SRAM: 4 to 8
  - Cost and cycle-time ratio, SRAM/DRAM: 8 to 16
- "Not-so-random" access technology
  - Access time varies from location to location and from time to time (e.g., disk, CD-ROM)
Classical SRAM Organization (~Square)
[Diagram: the row decoder, driven by the row address, asserts one word (row) select line of the RAM cell array; the bit (data) lines feed a column selector and I/O circuits, driven by the column address, producing the data word; each intersection represents a 6-T SRAM cell]
- One memory row holds a block of data, so the column address selects the requested word from that block
Classical DRAM Organization (~Square Planes)
[Diagram: like the SRAM organization but replicated across planes; the row decoder and column selector apply to every plane, each plane produces one data bit of the output word, and each intersection represents a 1-T DRAM cell]
- The column address selects the requested bit from the row in each plane
RAM Memory Definitions
- Caches use SRAM for speed
- Main memory is DRAM for density
  - Addresses are divided into 2 halves (row and column)
    - RAS, or Row Access Strobe, triggers the row decoder
    - CAS, or Column Access Strobe, triggers the column selector
- Performance of main memory DRAMs
  - Latency: the time to access one word
    - Access time: the time between the request and when the word arrives
    - Cycle time: the time between requests
    - Usually cycle time > access time
  - Bandwidth: how much data can be supplied per unit time
    - The width of the data channel * the rate at which it can be used
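Splitting an address into its row (RAS) and column (CAS) halves is simple bit manipulation. The 12/12 split below assumes a hypothetical 4,096 x 4,096 cell array; real parts differ:

```python
# Hypothetical 4096 x 4096 DRAM array: 12 row bits, 12 column bits.
ROW_BITS, COL_BITS = 12, 12

def split_address(addr):
    # Row half is sent with RAS, column half with CAS.
    row = addr >> COL_BITS
    col = addr & ((1 << COL_BITS) - 1)
    return row, col

row, col = split_address(0xABC123)
print(hex(row), hex(col))  # 0xabc 0x123
```

Multiplexing the address this way halves the number of address pins on the chip, which is exactly why the RAS/CAS two-step exists.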
Classical DRAM Operation
- DRAM organization:
  - N rows x N columns x M bits
  - Reads or writes M bits at a time
  - Each M-bit access requires a RAS/CAS cycle
[Diagram: the N x N DRAM array with row address, column address, and an M-bit output; the timing diagram shows a row address and column address strobed by RAS and CAS for the 1st M-bit access, then again for the 2nd M-bit access, separated by the cycle time]
Ways to Improve DRAM Performance
- Memory interleaving
- Fast Page Mode DRAMs (FPM DRAMs)
  - www.usa.samsungsemi.com/products/newsummary/asyncdram/K4F661612D.htm
- Extended Data Out DRAMs (EDO DRAMs)
  - www.chips.ibm.com/products/memory/88H2011.pdf
- Synchronous DRAMs (SDRAMs)
  - www.usa.samsungsemi.com/products/newsummary/sdramcomp/K4S641632D.htm
- Rambus DRAMs
  - www.rambus.com/developer/quickfind_documents.html
  - www.usa.samsungsemi.com/products/newsummary/rambuscomp/K4R271669B.htm
- Double Data Rate DRAMs (DDR DRAMs)
  - www.usa.samsungsemi.com/products/newsummary/ddrsyncdram/K4D62323HA.htm
Increasing Bandwidth Through Interleaving
[Diagram: without interleaving, the CPU issues an access to a single memory bank, waits out the access time until D1 is available, then must wait out the remainder of the cycle time before starting the access for D2. With 4-way interleaving, the CPU accesses banks 0, 1, 2, and 3 in turn; by the time bank 3 has been started, bank 0 can be accessed again, so the bank accesses overlap]
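The benefit of interleaving can be estimated with a back-of-the-envelope model. The cycle counts below (1 to send the address, 15 for a bank access, 1 to transfer a word) are hypothetical textbook-style numbers, not taken from this slide:

```python
SEND_ADDR = 1   # cycles to send the address
ACCESS = 15     # cycles of DRAM access time within a bank
XFER = 1        # cycles to transfer one word on the bus

def one_bank(words):
    # Every word pays the full send + access + transfer cost in sequence.
    return words * (SEND_ADDR + ACCESS + XFER)

def four_way_interleaved(words):
    # Sequential words fall in different banks, so the accesses overlap;
    # after the first access completes, one word arrives per transfer cycle.
    return SEND_ADDR + ACCESS + words * XFER

print(one_bank(4), four_way_interleaved(4))  # 68 vs. 20 cycles
```

The sketch assumes the access pattern is sequential and that there are enough banks to hide the full cycle time, which is exactly the limitation the next slide discusses.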
Problems with Interleaving
- How many banks?
  - Ideally, the number of banks should be at least the number of clocks we have to wait to access the next word in the same bank
  - Interleaving only helps sequential accesses (i.e., the first word requested is in the first bank, the second word in the second bank, etc.)
- Increasing DRAM sizes => fewer chips => harder to have banks
  - Growth in bits/chip for DRAM: 50% to 60%/yr
  - Banks are really only usable in very large memory systems (e.g., those encountered in supercomputer systems)
Fast Page Mode DRAM Operation
- Fast Page Mode DRAM adds an N x M "SRAM" register to save a row
- After a row is read into the SRAM "register":
  - Only CAS is needed to access other M-bit blocks on that row
  - RAS remains asserted while CAS is toggled
[Diagram: the N x N DRAM array feeds the N x M "SRAM" row register and the M-bit output; the timing diagram shows one row address strobed by RAS, followed by four column addresses strobed by CAS for the 1st through 4th M-bit accesses]
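The payoff of fast page mode is that only the first access to a row pays the full RAS + CAS cost. A sketch with hypothetical timings (the 60 ns / 25 ns figures are illustrative, not from a datasheet):

```python
RAS_CAS = 60   # ns: full row + column access (first access to a row)
CAS_ONLY = 25  # ns: column-only access within the open row

def page_mode_time(accesses_in_row):
    # First access opens the row; the rest reuse the row register.
    return RAS_CAS + (accesses_in_row - 1) * CAS_ONLY

def conventional_time(accesses):
    # A classical DRAM pays the full RAS/CAS cycle every time.
    return accesses * RAS_CAS

print(page_mode_time(4), conventional_time(4))  # 135 vs. 240 ns
```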
Why Care About the Memory Hierarchy?
[Chart: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") grows at 60%/year (2x every 1.5 years); DRAM performance grows at 9%/year (2x every 10 years); the processor-memory performance gap grows at about 50%/year]
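These growth rates compound quickly. A quick sketch of how large the gap becomes, starting both curves at a normalized performance of 1 (a simplifying assumption):

```python
# Relative performance after n years of compound growth.
def perf(growth_per_year, years):
    return (1 + growth_per_year) ** years

# Slide's rates: processors ~60%/year, DRAM ~9%/year.
gap_10yr = perf(0.60, 10) / perf(0.09, 10)
print(f"Gap after 10 years: {gap_10yr:.1f}x")
```

After a decade, the processor is roughly 46 times further ahead of DRAM than it started, which is why caches sit between the two.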
Memory Hierarchy: Goals
- Fact: large memories are slow; fast memories are small
- How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)? By taking advantage of:
- The principle of locality: programs access a relatively small portion of the address space at any instant of time
[Diagram: the probability of reference plotted over the address space from 0 to 2^n - 1, peaking over a small region]
Memory Hierarchy: Why Does it Work?
- Temporal locality (locality in time):
  => Keep the most recently accessed data items closer to the processor
- Spatial locality (locality in space):
  => Move blocks consisting of contiguous words to the upper levels
[Diagram: the processor exchanges block X with the upper-level memory, which exchanges block Y with the lower-level memory]
Memory Hierarchy: Terminology
- Hit: the data appears in some block in the upper level (block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: the time to access the upper level, which consists of the RAM access time + the time to determine hit/miss
- Miss: the data needs to be retrieved from a block in the lower level (block Y)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: the time to replace a block in the upper level + the time to deliver the block to the processor
  - Hit time << miss penalty
[Diagram: the processor exchanges block X with the upper-level memory, which exchanges block Y with the lower-level memory]
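These terms combine into the standard average memory access time formula, AMAT = hit time + miss rate * miss penalty. The formula is standard; the specific numbers below are hypothetical:

```python
# Average memory access time in cycles.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical cache: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
t = amat(1, 0.05, 100)
print(f"AMAT = {t} cycles")  # AMAT = 6.0 cycles
```

Even with a 95% hit rate, the average access costs six times the hit time, which shows why hit time << miss penalty makes the miss rate so important.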
How is the Hierarchy Managed?
- registers <-> memory
  - By the compiler (and the programmer?)
- cache <-> main memory
  - By the hardware
- main memory <-> disks
  - By the hardware and the operating system (virtual memory)
  - By the programmer (files)
Summary
- DRAM is slow but cheap and dense
  - A good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense
  - A good choice for providing the user with FAST access time
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology
  - Provide access at the speed offered by the fastest technology