STORAGE AND I/O Jehan-François Pâris jfparis@uh.edu

Chapter Organization • Availability and Reliability • Technology review – Solid-state storage devices • I/O Operations • Reliable Arrays of Inexpensive Disks

DEPENDABILITY

Reliability and Availability • Reliability – Probability R(t) that system will be up at time t if it was up at time t = 0 • Availability – Fraction of time the system is up • Reliability and availability do not measure the same thing!

Which matters? • It depends: – Reliability for real-time systems • Flight control • Process control, … – Availability for many other applications • DSL service • File server, web server, …

MTTF, MTTR and MTBF • MTTF is the mean time to failure • MTTR is the mean time to repair • 1/MTTF is the failure rate λ • MTBF, the mean time between failures, is MTBF = MTTF + MTTR

Reliability • As a first approximation R(t) = exp(–t/MTTF) – Not true if failure rate varies over time

Availability • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF – MTTR is very important • A good MTTR requires that we detect failures quickly

The nine notation • Availability is often expressed in "nines" – 99 percent is two nines – 99.9 percent is three nines – … • Formula is –log10(1 – A) • Example: –log10(1 – 0.999) = –log10(10⁻³) = 3
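
A quick check of the formula in C (a minimal sketch, not part of the original slides):

    #include <math.h>
    #include <stdio.h>

    /* Number of "nines" of an availability A: -log10(1 - A) */
    double nines(double a) {
        return -log10(1.0 - a);
    }

    int main(void) {
        printf("%.0f\n", nines(0.99));   /* prints 2 */
        printf("%.0f\n", nines(0.999));  /* prints 3 */
        return 0;
    }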

Example • A server crashes on the average once a month • When this happens, it takes 12 hours to reboot it • What is the server availability?

Solution • MTBF = 30 days • MTTR = 12 hours = ½ day • MTTF = 29½ days • Availability is 29.5/30 = 98.3%
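
The same computation as a small C sketch (illustrative only):

    #include <stdio.h>

    /* Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF */
    double availability(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    int main(void) {
        /* Server example: MTTF = 29.5 days, MTTR = 0.5 day */
        printf("%.1f%%\n", 100.0 * availability(29.5, 0.5));  /* 98.3% */
        return 0;
    }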

Keep in mind • A 99 percent availability is not as great as we might think – One hour down every 100 hours – Fifteen minutes down every 24 hours

Example • A disk drive has an MTTF of 20 years. • What is the probability that the data it contains will not be lost over a period of five years?
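
A worked solution (not spelled out in the slides): applying R(t) = exp(–t/MTTF) gives R(5) = exp(–5/20) = exp(–0.25) ≈ 0.78, i.e., roughly a 78 percent chance that the data survives the five years.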

Example • A disk farm contains 100 disks whose MTTF is 20 years. • What is the probability that no data will be lost over a period of five years?

Solution • The aggregate failure rate of the disk farm is 100 × 1/20 = 5 failures/year • The mean time to failure of the farm is 1/5 year • We apply the formula R(t) = exp(–t/MTTF) = exp(–5/(1/5)) = exp(–25) ≈ 1.4 × 10⁻¹¹
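
Both examples can be checked with a short C sketch (illustrative only):

    #include <math.h>
    #include <stdio.h>

    /* R(t) = exp(-t/MTTF), assuming a constant failure rate */
    double reliability(double t, double mttf) {
        return exp(-t / mttf);
    }

    int main(void) {
        printf("%.2f\n", reliability(5.0, 20.0));       /* one disk: ~0.78 */
        printf("%.1e\n", reliability(5.0, 1.0 / 5.0));  /* disk farm: ~1.4e-11 */
        return 0;
    }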

TECHNOLOGY OVERVIEW

Disk drives • See previous chapter • Recall that the disk access time is the sum of – The disk seek time (to get to the right track) – The disk rotational latency – The actual transfer time

Flash drives • Widely used in flash drives, most MP3 players and some small portable computers • Similar technology to EEPROM • Two technologies – NOR flash – NAND flash

What about flash? • Widely used in flash drives, most MP3 players and some small portable computers • Several important limitations – Limited write bandwidth • Must erase a whole block of data before overwriting it – Limited endurance • 10,000 to 100,000 write cycles

Storage Class Memories • Solid-state storage – Non-volatile – Much faster than conventional disks • Numerous proposals: – Ferro-electric RAM (FRAM) – Magneto-resistive RAM (MRAM) – Phase-Change Memories (PCM)

Phase-Change Memories • No moving parts • Crossbar organization [figure: the crossbar and a data cell]

Phase-Change Memories • Cells contain a chalcogenide material that has two states – Amorphous with high electrical resistivity – Crystalline with low electrical resistivity • Quickly cooling the material from above its fusion point leaves it in the amorphous state • Slowly cooling it from above its fusion point leaves it in the crystalline state

Projections • Target date: 2012 • Access time: 100 ns • Data rate: 200–1000 MB/s • Write endurance: 10⁹ write cycles • Read endurance: no upper limit • Capacity: 16 GB • Capacity growth: > 40% per year • MTTF: 10–50 million hours • Cost: < $2/GB

Interesting Issues (I) • Disks will remain much cheaper than SCMs for some time • Could use SCMs as an intermediate level between main memory and disks [figure: Main memory → SCM → Disk hierarchy]

A last comment • The technology is still experimental • Not sure when it will come to the market • Might never come to the market at all

Interesting Issues (II) • Rather narrow gap between SCM access times and main memory access times • Main memory and SCM will interact – As the L3 cache interacts with the main memory – Not as the main memory now interacts with the disk

RAID Arrays

Today’s Motivation • We use RAID today for – Increasing disk throughput by allowing parallel access – Eliminating the need to make disk backups • Disks are too big to be backed up in an efficient fashion

RAID LEVEL 0 • No replication • Advantages: – Simple to implement – No overhead • Disadvantage: – If the array has n disks, its failure rate is n times the failure rate of a single disk

RAID levels 0 and 1 [figure: a RAID level 0 array and a RAID level 1 array with its mirror disks]

RAID LEVEL 1 • Mirroring: – Two copies of each disk block • Advantages: – Simple to implement – Fault-tolerant • Disadvantage: – Requires twice the disk capacity of normal file systems

RAID LEVEL 2 • Instead of duplicating the data blocks we use an error correction code • Very bad idea because disk drives either work correctly or do not work at all – Only possible errors are omission errors – We need an omission correction code

RAID levels 2 and 3 [figure: a RAID level 2 array with its check disks and a RAID level 3 array with its parity disk]

RAID LEVEL 3 • Requires N+1 disk drives – N drives contain data (1/N of each data block) • Block b[k] now partitioned into N fragments b[k,1], b[k,2], . . ., b[k,N] – Parity drive contains the exclusive or of these N fragments p[k] = b[k,1] ⊕ b[k,2] ⊕ . . . ⊕ b[k,N]

How parity works? • Truth table for XOR (same as parity):

    A  B  A ⊕ B
    0  0    0
    0  1    1
    1  0    1
    1  1    0

Recovering from a disk failure • A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1:

    D0  D1  P    D1 ⊕ P = D0    D0 ⊕ P = D1
    0   0   0         0              0
    0   1   1         0              1
    1   0   1         1              0
    1   1   0         1              1
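
A small C sketch of both operations (illustrative; the array and chunk sizes are made up): computing the parity chunk, and rebuilding a lost chunk by XOR-ing the surviving chunks with the parity.

    #define NDATA 4   /* data disks */
    #define CHUNK 8   /* bytes per chunk */

    /* Parity chunk = XOR of all data chunks */
    void compute_parity(unsigned char data[NDATA][CHUNK],
                        unsigned char parity[CHUNK]) {
        for (int j = 0; j < CHUNK; j++) {
            parity[j] = 0;
            for (int i = 0; i < NDATA; i++)
                parity[j] ^= data[i][j];
        }
    }

    /* Rebuild the chunk of a failed disk: XOR the surviving
       data chunks with the parity chunk */
    void rebuild(unsigned char data[NDATA][CHUNK],
                 unsigned char parity[CHUNK], int failed) {
        for (int j = 0; j < CHUNK; j++) {
            unsigned char b = parity[j];
            for (int i = 0; i < NDATA; i++)
                if (i != failed)
                    b ^= data[i][j];
            data[failed][j] = b;   /* the lost byte, recovered */
        }
    }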

How RAID level 3 works (I) • Assume we have N + 1 disks • Each block is partitioned into N equal chunks [figure: a block split into chunks; N = 4 in the example]

How RAID level 3 works (II) • XOR the data chunks to compute the parity chunk • Each chunk is written to a separate disk [figure: four data chunks and their parity chunk, one per disk]

How RAID level 3 works (III) • Each read/write involves all disks in RAID array – Cannot do two or more reads/writes in parallel – Performance of array not better than that of a single disk

RAID LEVEL 4 (I) • Requires N+1 disk drives – N drives contain data • Individual blocks, not chunks – Blocks with the same disk address form a stripe [figure: a stripe of blocks across the disks]

RAID LEVEL 4 (II) • Parity drive contains the exclusive or of the N blocks in the stripe p[k] = b[k] ⊕ b[k+1] ⊕ . . . ⊕ b[k+N-1] • Parity block now reflects contents of several blocks! • Can now do parallel reads/writes

RAID levels 4 and 5 [figure: a RAID level 4 array, whose single parity disk is a bottleneck, and a RAID level 5 array with distributed parity]

RAID LEVEL 5 • The single parity drive of RAID level 4 is involved in every write – Will limit parallelism • RAID-5 distributes the parity blocks among the N+1 drives – Much better

The small write problem • Specific to RAID 5 • Happens when we want to update a single block – Block belongs to a stripe – How can we compute the new value of the parity block p[k] = b[k] ⊕ b[k+1] ⊕ b[k+2] ⊕ . . . ?

First solution • Read values of the N-1 other blocks in the stripe • Recompute p[k] = b[k] ⊕ b[k+1] ⊕ . . . ⊕ b[k+N-1] • Solution requires – N-1 reads – 2 writes (new block and new parity block)

Second solution • Assume we want to update block b[m] • Read old values of b[m] and parity block p[k] • Compute new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k] • Solution requires – 2 reads (old values of block and parity block) – 2 writes (new block and new parity block)
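
A C sketch of the second solution (illustrative only): since XOR-ing a value in twice cancels it out, the new parity is the old parity XOR the old data XOR the new data, and only the data disk and the parity disk are touched.

    #define BSIZE 512  /* bytes per block, chosen for illustration */

    /* RAID-5 small write: update one data block and its parity block.
       new parity = old parity XOR old data XOR new data */
    void small_write(unsigned char block[BSIZE],
                     const unsigned char new_data[BSIZE],
                     unsigned char parity[BSIZE]) {
        for (int j = 0; j < BSIZE; j++) {
            parity[j] ^= block[j] ^ new_data[j];  /* fold old data out, new data in */
            block[j] = new_data[j];               /* write the new data */
        }
    }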

RAID level 6 (I) • Not part of the original proposal – Two check disks – Tolerates two disk failures – More complex updates

RAID level 6 (II) • Has become more popular as disks become – Bigger – More vulnerable to irrecoverable read errors • Most frequent cause of RAID level 5 array failures is – An irrecoverable read error occurring while the array reconstructs a failed disk

RAID level 6 (III) • Typical array size is 12 disks • Space overhead is 2/12 = 16.7% • Sole real issue is the cost of small writes – Three reads and three writes: • Read old value of the block being updated, old parity block P, old parity block Q • Write new value of the block being updated, new parity block P, new parity block Q

CONCLUSION (II) • Low cost of disk drives made RAID level 1 attractive for small installations • Otherwise pick – RAID level 5 for higher parallelism – RAID level 6 for higher protection • Can tolerate one disk failure and irrecoverable read errors

A review question • Consider an array consisting of four 750 GB disks • What is the storage capacity of the array if we organize it – As a RAID level 0 array? – As a RAID level 1 array? – As a RAID level 5 array?

The answers • Consider an array consisting of four 750 GB disks • What is the storage capacity of the array if we organize it – As a RAID level 0 array? 3 TB – As a RAID level 1 array? 1.5 TB – As a RAID level 5 array? 2.25 TB
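
The arithmetic behind the answers, as a small C sketch (illustrative only):

    #include <stdio.h>

    /* Usable capacity of n disks of d TB each */
    double raid0(int n, double d) { return n * d; }        /* no redundancy */
    double raid1(int n, double d) { return n * d / 2.0; }  /* mirrored pairs */
    double raid5(int n, double d) { return (n - 1) * d; }  /* one disk's worth of parity */

    int main(void) {
        printf("%.2f %.2f %.2f\n",
               raid0(4, 0.75), raid1(4, 0.75), raid5(4, 0.75));
        /* prints 3.00 1.50 2.25 */
        return 0;
    }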

CONNECTING I/O DEVICES

Busses • Connecting computer subsystems with each other was traditionally done through busses • A bus is a shared communication link connecting multiple devices • Busses transmit several bits at a time – Parallel busses

Busses

Examples • Processor-memory busses – Connect CPU with memory modules – Short and high-speed • I/O busses – Longer – Wide range of data bandwidths – Connect to memory through the processor-memory bus or the backplane

Standards • Firewire – For external use – 63 devices per channel – 4 signal lines – 400 Mb/s or 800 Mb/s – Up to 4.5 m

Standards • USB 2.0 – For external use – 127 devices per channel – 2 signal lines – 1.5 Mb/s (Low Speed), 12 Mb/s (Full Speed) and 480 Mb/s (Hi-Speed) – Up to 5 m

Standards • USB 3.0 – For external use – Adds a 5 Gb/s transfer rate (SuperSpeed) – Maximum distance is still 5 m

Standards • PCI Express – For internal use – 1 device per channel – 2 signal lines per "lane" – Multiples of 250 MB/s: • 1x, 2x, 4x, 8x, 16x and 32x – Up to 0.5 m

Standards • Serial ATA – For internal use – Connects cheap disks to computer – 1 device per channel – 4 data lines – 300 MB/s – Up to 1 m

Standards • Serial Attached SCSI (SAS) – For external use – 4 devices per channel – 4 data lines – 300 MB/s – Up to 8 m

Synchronous busses • Include a clock in the control lines • Bus protocols expressed in actions to be taken at each clock pulse • Have very simple protocols • Disadvantages – All bus devices must run at same clock rate – Due to clock skew issues, cannot be both fast and long

Asynchronous busses • Have no clock • Can accommodate a wide variety of devices • Have no clock skew issues • Require a handshaking protocol before any transmission – Implemented with extra control lines

Advantages of busses • Cheap – One bus can link many devices • Flexible – Can add devices

Disadvantages of busses • Shared by all attached devices – Can become bottlenecks • Hard to run many parallel lines at high clock speeds

New trend • Away from parallel shared buses • Towards serial point-to-point switched interconnections – Serial • One bit at a time – Point-to-point • Each line links a specific device to another specific device

x86 bus organization • Processor connects to peripherals through two chips (bridges) – North Bridge – South Bridge

x86 bus organization [figure: the CPU linked to the North Bridge, which links to the South Bridge]

North bridge • Essentially a DMA controller – Lets disk controller access main memory w/o any intervention of the CPU • Connects CPU to – Main memory – Optional graphics card – South Bridge

South Bridge • Connects North bridge to a wide variety of I/O busses

Communicating with I/O devices • Two solutions – Memory-mapped I/O – Special I/O instructions

Memory mapped I/O • A portion of the address space is reserved for I/O operations – Writes to any of these addresses are interpreted as I/O commands – Reading from these addresses gives access to • Error bit • I/O completion bit • Data being read
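
A hypothetical C sketch of the idea; the register addresses and bit layout below are invented for illustration, since a real device defines them in its documentation:

    #include <stdint.h>

    /* Hypothetical device registers mapped into the address space */
    #define DEV_STATUS (*(volatile uint32_t *)0xFFFF0000u)
    #define DEV_DATA   (*(volatile uint32_t *)0xFFFF0004u)

    #define STATUS_DONE  0x1u   /* I/O completion bit */
    #define STATUS_ERROR 0x2u   /* error bit */

    /* A load or store to these addresses is an I/O command, not a memory
       access; volatile keeps the compiler from optimizing the accesses away */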

Memory mapped I/O • User processes cannot access these addresses – Only the kernel • Prevents user processes from accessing the disk in an uncontrolled fashion

Dedicated I/O instructions • Privileged instructions that cannot be executed by user processes – Only the kernel • Prevents user processes from accessing the disk in an uncontrolled fashion

Polling • Simplest way for an I/O device to communicate with the CPU • CPU periodically checks the status of pending I/O operations – High CPU overhead
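
A C sketch of a polling loop, reusing the hypothetical device registers from the memory-mapped I/O sketch above:

    #include <stdint.h>

    #define DEV_STATUS  (*(volatile uint32_t *)0xFFFF0000u)  /* hypothetical */
    #define DEV_DATA    (*(volatile uint32_t *)0xFFFF0004u)  /* hypothetical */
    #define STATUS_DONE 0x1u

    /* Busy-wait until the device sets its completion bit, then read the
       data; the cycles burned spinning are the polling overhead */
    uint32_t poll_read(void) {
        while (!(DEV_STATUS & STATUS_DONE))
            ;   /* spin */
        return DEV_DATA;
    }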

I/O completion interrupts • Notify the CPU that an I/O operation has completed • Allows the CPU to do something else while waiting for the completion of an I/O operation – Multiprogramming • I/O completion interrupts are processed by CPU between instructions – No internal instruction state to save

Interrupt levels • See previous chapter

Direct memory access • DMA • Lets disk controller access main memory w/o any intervention of the CPU

DMA and virtual memory • A single DMA transfer may cross page boundaries with – One page present in main memory – The other page missing

Solutions • Make DMA work with virtual addresses – Issue is then dealt with by the virtual memory subsystem • Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries

An Example [figure: a DMA transfer spanning two pages is broken into two single-page DMA transfers]

DMA and cache hierarchy • Three approaches for handling temporary inconsistencies between caches and main memory

Solutions 1. Run all DMA accesses through the cache – Bad solution 2. Have the OS selectively – Invalidate affected cache entries when performing a read – Force an immediate flush of dirty cache entries when performing a write
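
A C sketch of the second approach; cache_invalidate_range and cache_flush_range are hypothetical stand-ins for the cache-maintenance primitives a real kernel would provide:

    /* Hypothetical cache-maintenance primitives (assumed, not standard C) */
    void cache_invalidate_range(void *addr, unsigned long len);
    void cache_flush_range(void *addr, unsigned long len);

    /* Device-to-memory DMA (a read): invalidate stale cached copies of the
       buffer so the CPU will see what the device wrote */
    void dma_read_setup(void *buf, unsigned long len) {
        cache_invalidate_range(buf, len);
        /* ... then program the DMA controller to write into buf ... */
    }

    /* Memory-to-device DMA (a write): flush dirty cache lines to main
       memory so the device reads up-to-date data */
    void dma_write_setup(void *buf, unsigned long len) {
        cache_flush_range(buf, len);
        /* ... then program the DMA controller to read from buf ... */
    }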

Benchmarking I/O

Benchmarks • Specific benchmarks for – Transaction processing • Emphasis on speed and graceful recovery from failures – Atomic transactions: • All-or-nothing behavior

An important observation • Very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput – Unless we sequentially access very large ranges of data • 512 KB and more

Major fallacies • Since rated MTTFs of disk drives exceed one million hours, disks can last more than 100 years – The MTTF only expresses the failure rate during the disk's actual lifetime • Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature – They are actually up to ten times higher

Major fallacies • Neglecting to do end-to-end checks –… • Using magnetic tapes to back up disks – Tape formats can quickly become obsolete – Disk bit densities have grown much faster than tape data densities

Can you read these? [figure: three obsolete storage media] – No – No – Only on an old PC

But you can still read this