Storage: Alternate Futures
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
IBM Almaden, 1 December 1999

Acknowledgments: Thank You!!
• Dave Patterson:
  – Convinced me that processors are moving to the devices.
• Kim Keeton and Erik Riedel:
  – Showed that many useful subtasks can be done by disk-processors, and quantified the execution intervals.
• Remzi Arpaci-Dusseau:
  – Re-validated Amdahl's laws.

Outline
• The Surprise-Free Future (5 years)
  – 500 mips cpus for 10$
  – 1 Gb RAM chips
  – MAD at 50 Gbpsi
  – 10 GBps SANs are ubiquitous
  – 1 Gbps WANs are ubiquitous
• Some consequences
  – Absurd(?) consequences
  – Auto-manage storage
  – RAID 10 replaces RAID 5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”.

The Surprise-Free Storage Future
• 1 Gb RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• Standard IO
• 10 GBps SANs are ubiquitous
• 1 Gbps WANs are ubiquitous
• 5 bips cpus for 1 k$, and 500 mips cpus for 10$

1 Gb RAM Chips
• Moving to 256 Mb chips now.
• 1 Gb will be “standard” in 5 years; 4 Gb will be the premium product.
• Note:
  – 256 Mb = 32 MB: the smallest memory
  – 1 Gb = 128 MB: the smallest memory

System On A Chip
• Integrate processing with memory on one chip:
  – chip is 75% memory now
  – 1 MB cache >> 1960 supercomputers
  – 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM, … projects abound
• Integrate networking with processing on one chip:
  – the system bus is a kind of network
  – ATM, FiberChannel, Ethernet, … logic on chip
  – direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.

500 mips System On A Chip for 10$
• 486 is 7$ now.
• 233 MHz ARM for 10$, system on a chip:
  http://www.cirrus.com/news/products99/news-product14.html
• AMD/Celeron 266 ~ 30$.
• In 5 years, today’s leading edge will be:
  – system on chip (cpu, cache, memory ctlr, multiple IO)
  – low cost, low power
  – integrated IO
• The high end is 5 bips cpus.

Standard IO in 5 Years
• Probably replace PCI with something better; will still need a mezzanine bus standard.
• Multiple serial links directly from the processor.
• Fast (10 GBps/link) for a few meters.
• System Area Networks (SANs) ubiquitous (VIA morphs to SIO?).

Ubiquitous 10 GBps SANs in 5 Years
• 1 Gbps Ethernet is a reality now.
  – Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM, …
• 10 Gbps x 4 WDM deployed now (OC-192).
  – 3 Tbps WDM working in the lab.
• In 5 years, expect 10 x; progress is astonishing.
• Gilder’s law: bandwidth grows 3 x/year.
  http://www.forbes.com/asap/97/0407/090.htm
[Chart: link speeds stepping from 20 MBps to 40 MBps, 80 MBps, 120 MBps (1 Gbps), toward 1 GBps]

Thin Clients Mean HUGE Servers
• AOL is hosting customer pictures.
• Hotmail allows 5 MB/user, 50 M users (see the arithmetic below).
• Web sites offer electronic vaulting for SOHO.
• IntelliMirror: replicate client state on the server.
• Terminal server: timesharing returns.
• … Many more.
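Multiplying the Hotmail bullet out shows the scale (a back-of-envelope sketch; it assumes every user fills the full 5 MB quota, which overstates real usage):

```python
# Hotmail's storage footprint if all 50 M users use their 5 MB quota.
users = 50_000_000
quota_MB = 5
total_TB = users * quota_MB / 1_000_000
print(f"{total_TB:.0f} TB")   # 250 TB for a single free-mail service
```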

Remember Your Roots?

MAD at 50 Gbpsi
• MAD: Magnetic Areal Density.
  – 3-10 Gbpsi in products, 28 Gbpsi in the lab.
  – 50 Gbpsi is the superparamagnetic limit, but… people have ideas.
• Capacity: rises 10 x in 5 years (conservative).
• Bandwidth: rises 4 x in 5 years (density + rpm).
• Disk: 50 GB goes to 500 GB, 60-80 MBps, 1 k$/TB, 15-minute to 3-hour scan time.

The “Absurd” Disk
• 1 TB, 100 MB/s, 200 Kaps.
• 2.5 hr scan time (poor sequential access).
• 1 aps / 5 GB (VERY cold data).
• It’s a tape!
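Both headline numbers follow from the specs, as this quick sketch shows (recall that Kaps counts kilobyte objects served per second, per the metrics slide later in the deck):

```python
# The "absurd" 1 TB disk: scan time and access density fall out directly.
capacity_GB = 1_000
bandwidth_MBps = 100
kaps = 200                 # ~200 random 1 KB reads/sec for a single arm

scan_hr = capacity_GB * 1_000 / bandwidth_MBps / 3600
gb_per_access = capacity_GB / kaps
print(f"scan = {scan_hr:.1f} hr, 1 aps per {gb_per_access:.0f} GB")
# scan = 2.8 hr (the slide rounds to 2.5), 1 access/sec per 5 GB: tape-like
```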

Disk vs Tape
• Disk:
  – 47 GB, 15 MBps
  – 5 ms seek time, 3 ms rotate latency
  – 9$/GB for the drive, 3$/GB for ctlrs/cabinet
  – 4 TB/rack
• Tape:
  – 40 GB, 5 MBps
  – 30 sec pick time, many-minute seek time
  – 5$/GB for media, 10$/GB for drive+library
  – 10 TB/rack
The price advantage of tape is narrowing, and the performance advantage of disk is growing.
(Guesstimates. CERN: 200 TB, 3480 tapes; 2 col = 50 GB; rack = 1 TB = 20 drives.)
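Summing the table's own numbers shows how narrow the gap already is (a sketch; it assumes media and infrastructure costs simply add, and ignores robot and slot counts):

```python
# Totals from the Disk-vs-Tape table: $/GB and time to read one unit.
media = {
    "disk": {"GB": 47, "MBps": 15, "dollars_per_GB": 9 + 3},   # drive + ctlr/cabinet
    "tape": {"GB": 40, "MBps": 5,  "dollars_per_GB": 5 + 10},  # media + drive/library
}
for name, m in media.items():
    hours = m["GB"] * 1024 / m["MBps"] / 3600
    print(f'{name}: {m["dollars_per_GB"]}$/GB, {hours:.1f} h to read it all')
# disk: 12$/GB, ~0.9 h; tape: 15$/GB, ~2.3 h (before pick and seek time)
```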

Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB; today at 512 MB and 3$/MB
  – Disk: GB and $/GB; today at 50 GB and 10$/GB
  – Tape: TB and $/TB; today at 50 GB and 12 k$/TB (nearline)
• Access time (latency):
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate:
  – RAM: 1 GB/s
  – Disk: 15 MB/s (arrays can go to 1 GB/s)
  – Tape: 5 MB/s (striping is problematic, but “works”)

New Storage Metrics: Kaps, Maps, SCAN?
• Kaps: how many kilobyte objects served per second.
  – The file server, transaction processing metric.
  – This is the OLD metric.
• Maps: how many megabyte objects served per second.
  – The multimedia metric.
• SCAN: how long to scan all the data.
  – The data mining and utility metric.
• And: Kaps/$, Maps/$, TBscan/$ (see the sketch below).
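A minimal sketch of the three metrics for one drive; the average access time is an illustrative assumption, while capacity and transfer rate are taken from the Disk-vs-Tape slide:

```python
# Kaps, Maps, and SCAN for one disk (illustrative 1999-ish parameters).
access_s    = 0.0085      # average seek + rotational latency, assumed
xfer_MBps   = 15.0        # media rate, per the Disk-vs-Tape slide
capacity_MB = 47_000      # per the Disk-vs-Tape slide

kaps     = 1 / (access_s + (1 / 1024) / xfer_MBps)   # 1 KB objects/sec
maps     = 1 / (access_s + 1 / xfer_MBps)            # 1 MB objects/sec
scan_min = capacity_MB / xfer_MBps / 60              # full-disk scan

print(f"{kaps:.0f} Kaps, {maps:.1f} Maps, SCAN = {scan_min:.0f} min")
# ~117 Kaps, ~13 Maps, SCAN ~ 52 min; divide by $ for the /$ variants
```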

The Access Time Myth
• The myth: seek or pick time dominates.
• The reality:
  (1) Queueing dominates.
  (2) Transfer dominates BLOBs.
  (3) Disk seeks are often short.
• Implication: many cheap servers are better than one fast, expensive server.
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is obvious for disk arrays.
• This is even more obvious for tape arrays.
[Pie charts: time split among Wait, Transfer, Rotate, Seek]
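The queueing claim is easy to check with a textbook M/M/1 model (a sketch under simplifying assumptions: Poisson arrivals and an exponential 10 ms service time, not a measurement from the talk):

```python
# Queueing delay vs raw access time for a disk with 10 ms service time.
service_ms = 10.0
for utilization in (0.3, 0.6, 0.9):
    wait_ms = service_ms * utilization / (1 - utilization)   # M/M/1 queue wait
    print(f"{utilization:.0%} busy: {wait_ms:4.0f} ms waiting + {service_ms:.0f} ms service")
# 30% busy: 4 ms; 60% busy: 15 ms; 90% busy: 90 ms. The queue, not the
# seek, dominates a busy server, which favors many cheap, lightly loaded arms.
```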

Storage Ratios Changed
• 10 x better access time.
• 10 x more bandwidth.
• 4,000 x lower media price.
• The DRAM/disk media price ratio changed:
  – 1970-1990: 100:1
  – 1990-1995: 10:1
  – 1995-1997: 50:1
  – today: ~0.1 $/MB disk vs 3 $/MB DRAM, i.e. 30:1

Data on Disk Can Move to RAM in 8 Years
[Chart: DRAM and disk $/MB price trends; at today’s 30:1 ratio, the lines cross in roughly 6 years]
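The arithmetic behind the title, sketched; the 18-month DRAM price-halving rate is an assumed rule of thumb, not a number from the chart:

```python
# How long until RAM is as cheap per MB as disk is today?
import math
ratio = 30             # today's DRAM:disk price ratio per MB
halving_years = 1.5    # assumed DRAM $/MB halving time
years = math.log2(ratio) * halving_years
print(f"{years:.1f} years")   # ~7.4, between the chart's 6 and the title's 8
```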

Outline
• The Surprise-Free Future (5 years)
  – 500 mips cpus for 10$
  – 1 Gb RAM chips
  – MAD at 50 Gbpsi
  – 10 GBps SANs are ubiquitous
  – 1 Gbps WANs are ubiquitous
• Some consequences
  – Absurd(?) consequences
  – Auto-manage storage
  – RAID 10 replaces RAID 5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”.

The (Absurd?) Consequences
Given the surprise-free future (1 Gb RAM chips, MAD at 50 Gbpsi, drives shrink one quantum, 10 GBps SANs, 500 mips cpus for 10$, 5 bips cpus at the high end):
• 256-way NUMA?
• Huge main memories:
  – now: 500 MB - 64 GB memories
  – then: 10 GB - 1 TB memories
• Huge disks:
  – now: 5-50 GB 3.5” disks
  – then: 50-500 GB disks
• Petabyte storage farms (that you can’t back up or restore).
• Disks >> tapes.
  – “Small” disks: one platter, one inch, 10 GB.
• SAN convergence: 1 GBps point-to-point is easy.

The Absurd? Consequences
• Further segregate processing from storage.
• Poor locality.
• Much useless data movement.
• Amdahl’s laws: bus: 10 B/ips; io: 1 b/ips.
[Diagram: Processors (~1 Tips), 10 TBps to RAM (~1 TB), 100 GBps to Disks (~100 TB)]
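The diagram's bandwidths are Amdahl's balanced-system rules applied to a 1 Tips machine, as this sketch shows (the slide rounds 125 GBps down to 100):

```python
# Amdahl's rules of thumb: 10 bytes of memory bandwidth per instruction,
# and 1 bit of IO per instruction per second.
ips = 1e12                    # 1 Tips of aggregate processing
mem_Bps = 10 * ips            # 10 B/ips -> 10 TBps memory bandwidth
io_Bps  = ips / 8             # 1 b/ips  -> 125 GBps of IO (slide: ~100 GBps)
print(f"{mem_Bps/1e12:.0f} TBps memory, {io_Bps/1e9:.0f} GBps IO")
```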

Storage Latency: How Far Away is the Data?
(Scaled so that one clock tick is one minute.)
  Registers             1 clock        1 min         “My Head”
  On-chip cache         2 clocks       ~2 min        “This Room”
  On-board cache        10 clocks      10 min        “This Hotel”
  Memory                100 clocks     1.5 hr        “Olympia”
  Disk                  10^6 clocks    2 years       “Pluto”
  Tape/optical robot    10^9 clocks    2,000 years   “Andromeda”
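The scaling is just the clock counts re-expressed in minutes, as this sketch verifies (the two-minute figure for on-chip cache is interpolated, not on the slide):

```python
# One clock tick = one minute: the slide's human-scale latencies fall out.
YEAR_MIN = 60 * 24 * 365
for level, clocks in [("registers", 1), ("on-chip cache", 2),
                      ("on-board cache", 10), ("memory", 100),
                      ("disk", 10**6), ("tape robot", 10**9)]:
    if clocks < 60:
        t = f"{clocks} min"
    elif clocks < YEAR_MIN:
        t = f"{clocks/60:.1f} hr"
    else:
        t = f"{clocks/YEAR_MIN:,.0f} years"
    print(f"{level:>14}: {t}")
# memory -> 1.7 hr, disk -> 2 years, tape robot -> 1,903 years (slide rounds)
```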

Consequences
• AutoManage storage.
• Sixpacks (for arm-limited apps).
• RAID 5 -> RAID 10.
• Disk-to-disk backup.
• Smart disks.

Auto Manage Storage
• 1980 rule of thumb:
  – A DataAdmin per 10 GB; a SysAdmin per mips.
• 2000 rule of thumb:
  – A DataAdmin per 5 TB.
  – A SysAdmin per 100 clones (varies with the app).
• Problem:
  – 5 TB is 60 k$ today, 10 k$ in a few years.
  – Admin cost >> storage cost??? (See the sketch below.)
• Challenge:
  – Automate ALL storage admin tasks.
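A rough comparison makes the problem concrete; the admin salary is an assumed illustrative figure, not from the slide:

```python
# Admin cost vs the hardware being administered (1999-ish guesses).
admin_salary  = 100_000   # $/yr, assumed
storage_today = 60_000    # 5 TB at ~12 $/GB (slide's figure)
storage_soon  = 10_000    # the same 5 TB in a few years (slide's figure)
print(f"{admin_salary / storage_today:.1f}")  # ~1.7: admin already out-costs the disks
print(f"{admin_salary / storage_soon:.0f}")   # 10: and the ratio gets 6x worse
```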

The “Absurd” Disk
• 1 TB, 100 MB/s, 200 Kaps.
• 2.5 hr scan time (poor sequential access).
• 1 aps / 5 GB (VERY cold data).
• It’s a tape!

Extreme Case: 1 TB Disk: Alternatives
• Use all the heads in parallel:
  – 500 MB/s, 200 Kaps, 1 TB.
  – Scan in 30 minutes.
  – Still one Kaps per 5 GB.
• Use one platter per arm:
  – Share power/sheet metal.
  – 500 MB/s, 1,000 Kaps, 200 GB each.
  – Scan in 30 minutes.
  – One Kaps per GB.
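A quick check of both alternatives (a sketch; Kaps counts 1 KB objects served per second, and the 200-Kaps-per-arm figure comes from the absurd-disk slide):

```python
# Two ways to package 1 TB, same bandwidth, very different access density.
capacity_GB, bw_MBps = 1_000, 500
print(capacity_GB * 1_000 / bw_MBps / 60)    # 33 min scan either way

heads_in_parallel_kaps = 200                 # one arm assembly: ~200 Kaps
platter_per_arm_kaps   = 5 * 200             # five independent arms: 1,000 Kaps
print(heads_in_parallel_kaps / capacity_GB)  # 0.2 Kaps/GB = one per 5 GB
print(platter_per_arm_kaps / capacity_GB)    # 1.0 Kaps/GB
```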

Drives Shrink (1.8”, 1”)
• 150 Kaps for 500 GB is VERY cold data.
• 3 GB/platter today, 30 GB/platter in 5 years.
• Most disks are ½ full.
• TPC benchmarks use 9 GB drives (need arms or bandwidth).
• One solution: a smaller form factor.
  – More arms per GB.
  – More arms per rack.
  – More arms per Watt.

Prediction: 6-Packs
• One way or another, when disks get huge:
  – they will be packaged as multiple arms
  – parallel heads give bandwidth
  – independent arms give bandwidth & aps
• The package shares power, packaging, interfaces, …

Stripes, Mirrors, Parity (RAID 0, 1, 5)
• RAID 0: stripes (bandwidth).
  Layout: [0, 3, 6, …] [1, 4, 7, …] [2, 5, 8, …]
• RAID 1: mirrors, shadows, … (fault tolerance; reads faster, writes 2 x slower).
  Layout: [0, 1, 2, …] [0, 1, 2, …]
• RAID 5: parity (fault tolerance; reads faster; writes 4 x or 6 x slower; see the sketch below).
  Layout: [0, 2, P2, …] [1, P1, 4, …] [P0, 3, 5, …]
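A minimal sketch of why a RAID 5 small write costs 4 IOs while a mirrored write costs 2 (hypothetical in-memory "disks"; real arrays add caching and logging):

```python
# RAID 5 small-write penalty: read old data and old parity, then write
# new data and new parity, where P' = P xor D_old xor D_new.
def raid5_small_write(disks, stripe, data_disk, parity_disk, new_block):
    old_data   = disks[data_disk][stripe]        # IO 1: read old data
    old_parity = disks[parity_disk][stripe]      # IO 2: read old parity
    new_parity = bytes(p ^ od ^ nd
                       for p, od, nd in zip(old_parity, old_data, new_block))
    disks[data_disk][stripe]   = new_block       # IO 3: write new data
    disks[parity_disk][stripe] = new_parity      # IO 4: write new parity

def raid1_write(mirror_a, mirror_b, stripe, new_block):
    mirror_a[stripe] = new_block                 # IO 1: write primary
    mirror_b[stripe] = new_block                 # IO 2: write mirror
```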

RAID 10 (Stripes of Mirrors) Wins
“Wastes space, saves arms.”
• RAID 5:
  – 225 reads/sec, 70 writes/sec.
  – Write = 4 logical IOs: 2 seeks + 1.7 rotates.
  – SAVES SPACE.
  – Performance degrades on failure.
• RAID 1 (10):
  – 250 reads/sec, 100 writes/sec.
  – Write = 2 logical IOs: 2 seeks + 0.7 rotate.
  – SAVES ARMS.
  – Performance improves on failure.

The Storage Rack Today
• 140 arms, 4 TB.
• 24 racks, 24 storage processors, 6+1 in each rack.
• Disks = 2.5 GBps of IO.
• Controllers = 1.2 GBps of IO.
• Ports = 500 MBps of IO.

Storage Rack in 5 Years?
• 140 arms, 50 TB.
• 24 racks, 24 storage processors, 6+1 in each rack.
• Disks = 14 GBps of IO.
• Controllers = 5 GBps of IO.
• Ports = 1 GBps of IO.
• My suggestion: move the processors into the storage racks.
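Comparing the two rack slides shows why the suggestion follows (a sketch using the slides' own figures):

```python
# The bandwidth funnel inside the rack: what the disks can deliver vs
# what escapes through the ports.
today  = {"disks": 2.5, "controllers": 1.2, "ports": 0.5}   # GBps
future = {"disks": 14,  "controllers": 5,   "ports": 1}
for name, rack in [("today", today), ("in 5 years", future)]:
    print(f"{name}: {rack['disks']/rack['ports']:.0f}x more disk bandwidth "
          f"than port bandwidth")
# today: 5x; in 5 years: 14x. Hence: move the processing into the rack
# rather than moving the data out of it.
```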

It’s Hard to Archive a PetaByte
(It takes a LONG time to restore it.)
• Store it in two (or more) places online (on disk?).
• Scrub it continuously (look for errors); see the sketch below.
• On failure, refresh the lost copy from the safe copy.
• The two copies can be organized differently (e.g., one by time, one by space).
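A minimal sketch of the scrub loop (hypothetical code, not from the talk; `zlib.crc32` stands in for a real end-to-end checksum, and a production scrubber would keep per-block checksums to decide which copy is the bad one):

```python
import zlib

def scrub(copy_a, copy_b, block_size=1 << 20):
    """One continuously-repeated pass: compare two online replicas block by block."""
    block_no = 0
    while True:
        a = copy_a.read(block_size)
        b = copy_b.read(block_size)
        if not a and not b:
            break                                  # both replicas exhausted
        if zlib.crc32(a) != zlib.crc32(b):
            # With stored per-block checksums we would know which copy is
            # safe; here we just flag the block for refresh.
            print(f"block {block_no}: mismatch -> refresh lost copy")
        block_no += 1
```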

Crazy Disk Ideas
• Disk farm on a card: surface-mount disks.
• Disk (magnetic store) on a chip: micro-machines in silicon.
• Full apps (e.g., SAP, Exchange/Notes, …) in the disk controller (a processor with 128 MB of DRAM).
(See The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen, ISBN 0875845851.)

The Disk Farm On a Card
• The 500 GB disc card: an array of discs on a 14” card.
• Can be used as:
  – 100 discs
  – 1 striped disc
  – 50 fault-tolerant discs
  – … etc.
• LOTS of accesses/second and bandwidth.

Functionally Specialized Cards
• Storage, network, and display cards: each an ASIC plus a P-mips processor with M MB of DRAM.
• Today: P = 50 mips, M = 2 MB.
• In a few years: P = 200 mips, M = 64 MB.

Data Gravity: Processing Moves to Transducers
• Move processing to the data sources.
• Move to where the power (and sheet metal) is.
• Processor in:
  – modem
  – display
  – microphones (speech recognition) & cameras (vision)
  – storage: data storage and analysis

It’s Already True of Printers: Peripheral = CyberBrick
• You buy a printer.
• You get:
  – several network interfaces
  – a PostScript engine (cpu, memory, software)
  – a spooler (soon)
  – and… a print engine.

Disks Become Supercomputers
• 100 x in 10 years: a 2 TB 3.5” drive.
• Shrunk to 1”, that is 200 GB.
• Disk replaces tape?
• The disk is a supercomputer!

All Device Controllers Will Be Cray-1’s
• TODAY:
  – The disk controller is a 10 mips risc engine with 2 MB of DRAM.
  – The NIC is of similar power.
• SOON:
  – They will become 100 mips systems with 100 MB of DRAM.
• They are nodes in a federation (you could run Oracle on NT in the disk controller).
• Advantages:
  – uniform programming model
  – great tools
  – security
  – economics (CyberBricks)
  – move computation to data (minimize traffic)
[Diagram: central processor & memory plus device controllers, all on a Tera-Byte backplane]

With Tera-Byte Interconnect and Supercomputer Adapters
• Processing is incidental to:
  – networking
  – storage
  – UI
• The disk controller/NIC is:
  – faster than the device
  – close to the device
  – able to borrow the device’s package & power
• So use the idle capacity for computation.
• Run the app in the device.
• Both Kim Keeton’s (UCB) and Erik Riedel’s (CMU) theses investigate this and show the benefits of the approach.

Implications
• Conventional:
  – Offload device handling to the NIC/HBA.
  – Higher-level protocols: I2O, NASD, VIA, IP, TCP…
  – SMP and cluster parallelism is important.
• Radical:
  – Move the app to the NIC/device controller.
  – Higher-higher-level protocols: CORBA / COM+.
  – Cluster parallelism is VERY important.

How Do They Talk to Each Other?
• Each node has an OS.
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other: CORBA? COM+? RMI? One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed-system story.
[Diagram: two application stacks talking via RPC / streams / datagrams over an SIO SAN]

Basic Argument for x-Disks
• The future disk controller is a supercomputer:
  – 1 bips processor
  – 128 MB of DRAM
  – 100 GB of disk plus one arm
• It connects to the SAN via high-level protocols:
  – RPC, HTTP, DCOM, Kerberos, Directory Services, …
  – Commands are RPCs: management, security, …
  – It services file/web/db/… requests.
  – It is managed by a general-purpose OS with a good dev environment.
• Move apps to the disk to save data movement:
  – need a programming environment in the controller.

The Slippery Slope
• If you add function to the server, then you add more function to the server.
• Function gravitates to data.
[Diagram: a slope from “Nothing = Sector Server” through “Fixed App: Something Server” up to “Everything = App Server”]

Why Not a Sector Server? (Let’s Get Physical!)
• Good idea; that’s what we have today.
• But:
  – a cache was added for performance
  – sector remap was added for fault tolerance
  – error reporting and diagnostics were added
  – SCSI commands (reserve, …) are growing
  – sharing is problematic (space mgmt, security, …)
• Slipping down the slope to a 1-D block server.

Why Not a 1-D Block Server? (Put a LITTLE on the Disk Server)
• Tried and true design:
  – HSC - VAX cluster
  – EMC
  – IBM Sysplex (3980?)
• But look inside:
  – has a cache
  – has space management
  – has error reporting & management
  – has RAID 0, 1, 2, 3, 4, 5, 10, 50, …
  – has locking
  – has remote replication
  – has an OS
  – security is problematic
  – the low-level interface moves too many bytes

Why Not a 2-D Block Server? (Put a LITTLE on the Disk Server)
• Tried and true design:
  – Cedar -> NFS
  – file server, cache, space, …
  – an open file is many fewer msgs
• Grows to have:
  – directories + naming
  – authentication + access control
  – RAID 0, 1, 2, 3, 4, 5, 10, 50, …
  – locking
  – backup/restore/admin
  – cooperative caching with the client
• File servers are a BIG hit: NetWare™.
  – SNAP! is my favorite today.

Why Not a File Server? (Put a Little on the Disk Server)
• Tried and true design:
  – Auspex, NetApp, …
  – NetWare
• Yes, but look at NetWare:
  – a file interface gives you an app-invocation interface
  – it became an app server (mail, DB, web, …)
  – NetWare had a primitive OS: hard to program, so it optimized the wrong thing

Why Not Everything? (Allow Everything on the Disk Server: Thin Clients)
• Tried and true design:
  – mainframes, minis, …
  – web servers, …
• Encapsulates data.
• Minimizes data moves.
• Scalable.
• It is where everyone ends up.
• All the arguments against are short-term.

The Slippery Slope
• If you add function to the server, then you add more function to the server.
• Function gravitates to data.
[Diagram: a slope from “Nothing = Sector Server” through “Fixed App: Something Server” up to “Everything = App Server”]

Outline
• The Surprise-Free Future (5 years)
  – Astonishing hardware progress.
• Some consequences
  – Absurd(?) consequences
  – Auto-manage storage
  – RAID 10 replaces RAID 5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”.