Designing for 20 TB Disk Drives and "Enterprise Storage"
Jim Gray, Microsoft Research

Disk Evolution
Capacity: 100x in 10 years.
1 TB 3.5" drive in 2005; 20 TB? in 2012?!
System on a chip; high-speed SAN.
Disk replacing tape.
Disk is a super computer!
(Scale graphic: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta.)

Disks are becoming computers
Smart drives: camera with micro-drive, Replay / TiVo / Ultimate TV, phone with micro-drive, MP3 players, Tablet, Xbox, many more…
Applications (Web, DBMS, Files); OS; Disk Ctlr + 1 GHz cpu + 1 GB RAM.
Comm: Infiniband, Ethernet, radio…

Intermediate Step: Shared Logic
Brick with 8-12 disk drives, 200 mips/arm (or more), 2 x Gbps Ethernet, a general purpose OS; 10 k$/TB to 100 k$/TB.
Shared: sheet metal, power, support/config, security, network ports.
These bricks could run applications (e.g. SQL or Mail or …).
Examples (Snap, NetApp, Maxtor, IBM TotalStorage): ~1 TB (12 x 80 GB), ~0.5 TB (8 x 70 GB), ~2 TB (12 x 160 GB), ~360 GB (10 x 36 GB) NAS bricks.

Hardware
Homogeneous machines lead to quick response through reallocation.
HP desktop machines: 320 MB RAM, 3U high, 4 x 100 GB IDE drives.
$4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB.
3 weeks from ordering to operational.
Slide courtesy of Brewster Kahle, @ Archive.org

Disk as Tape
Tape is unreliable, specialized, slow, low density, not improving fast, and expensive.
Using removable hard drives to replace tape's function has been successful.
When a "tape" is needed, the drive is put in a machine and it is online; no need to copy from tape before it is used.
Portable, durable, fast, media cost = raw tapes, dense.
Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, @ Archive.org

Disk As Tape: What Format?
Today I send NTFS/SQL disks, but that is not a good format for Linux.
Solution: ship NFS/CIFS/ODBC servers (not disks).
Plug the "disk" into the LAN:
- DHCP, then a file or DB server via a standard interface;
- a Web Service in the long term.

State is Expensive
Stateless clones are easy to manage:
- App servers are the middle tier.
- Cost goes to zero with Moore's law.
- One admin per 1,000 clones.
- Good story about scaleout.
Stateful servers are expensive to manage:
- 1 TB to 100 TB per admin.
- Storage cost is going to zero (2 k$ to 200 k$).
- The cost of storage is the management cost.

Databases (== SQL)
VLDB survey (Winter Corp): 10 TB to 100 TB DBs.
- Size doubling yearly.
- Riding disk Moore's law.
- 10,000 disks at 18 GB is 100 TB cooked.
- Mostly DSS and data warehouses; some media managers.

Interesting facts
No DBMSs beyond 100 TB. Most bytes are in files.
The web is file centric. eMail is file centric. Science (and batch) is file centric.
But… SQL performance is better than CIFS/NFS (CISC vs RISC).

BaBar: the biggest DB
500 TB. Uses Objectivity™.
SLAC events; a Linux cluster scans the DB looking for patterns.

Hotmail / Yahoo: 300 TB (cooked)
Clone front ends: ~10,000 @ Hotmail.
Application servers: ~100 @ Hotmail; get mail box, get/put mail.
Disk bound, with ~30,000 disks.
~20 admins.

AOL (MSN): 1 PB?
10 B transactions per day (MSN is ~10% of that).
Huge storage, huge traffic, lots of eye candy.
DB used for security/accounting.
GUESS: AOL is a petabyte (40 M x 10 MB = 400 x 10^12 bytes).

Google
1.5 PB as of last spring.
8,000 no-name PCs: each 1/3U, 2 x 80 GB disks, 2 cpus, 256 MB RAM.
1.4 PB online, 2 TB of RAM online, 8 TeraOps.
Slice price is 1 k$, so 8 M$ total.
15 admins (!) (= 1 admin per 100 TB).

Astronomy
I've been trying to apply DB technology to astronomy.
Today they are at 10 TB per data set, heading for petabytes.
Using Objectivity; trying SQL (talk to me offline).

Scale Out: Buy Computing by the Slice
709,202 tpmC! == 1 billion transactions per day.
Slice: 8 cpus, 8 GB RAM, 100 disks (= 1.8 TB).
20 ktpmC per slice, ~300 k$ per slice.
(Clients and 4 DTC nodes not shown.)

ScaleUp: A Very Big System!
UNISYS Windows 2000 Datacenter Limited Edition.
32 cpus, 32 GB of RAM, and 1,061 disks (15.5 TB) on 24 fiber channel links.
Will be helped by 64-bit addressing.

Hardware
8 Compaq DL360 "Photon" web servers.
4 Compaq ProLiant 8500 DB servers; one SQL database per rack (SQLInst1, SQLInst2, SQLInst3, plus a spare).
Each rack contains 4.5 TB; 261 total drives / 13.7 TB total.
Meta data stored on 101 GB of "fast, small disks" (18 x 18.2 GB).
Imagery data stored on 4 x 339 GB of "slow, big disks" (15 x 73.8 GB).
To add 90 x 72.8 GB disks in Feb 2001 to create an 18 TB SAN.
(Diagram: the web servers connect through fiber SAN switches to the DB servers.)

Amdahl's Balance Laws
Parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S.
Balanced system law: a system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps.
Memory law: alpha = 1; the MB/MIPS ratio (called alpha) in a balanced system is 1.
IO law: programs do one IO per 50,000 instructions.

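To make the four rules concrete, here is a minimal back-of-envelope sketch in Python. The 500 MIPS processor and the 10% serial fraction are hypothetical inputs chosen only to illustrate the ratios quoted above; nothing here is measured.

```python
# Back-of-envelope checks of Amdahl's four rules of thumb: max speedup (S+P)/S,
# ~8 MIPS per MB/s of IO, alpha = MB/MIPS = 1, one IO per 50,000 instructions.
# The inputs below are illustrative, not measurements.

def max_speedup(serial: float, parallel: float) -> float:
    """Parallelism law: speedup is capped at (S + P) / S."""
    return (serial + parallel) / serial

def balanced_io_mbps(mips: float) -> float:
    """Balanced-system law: about 1 MB/s of IO for every 8 MIPS."""
    return mips / 8.0

def balanced_ram_mb(mips: float, alpha: float = 1.0) -> float:
    """Memory law: MB of RAM = alpha * MIPS, with alpha ~ 1."""
    return alpha * mips

def ios_per_second(mips: float) -> float:
    """IO law: one IO per 50,000 instructions."""
    return mips * 1e6 / 50_000

if __name__ == "__main__":
    print(max_speedup(serial=1, parallel=9))   # 10% serial code -> at most 10x
    cpu_mips = 500                             # hypothetical processor
    print(balanced_io_mbps(cpu_mips))          # ~62 MB/s of IO to stay balanced
    print(balanced_ram_mb(cpu_mips))           # ~500 MB of RAM
    print(ios_per_second(cpu_mips))            # ~10,000 IOs per second
```
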
Amdahl's Laws Valid 35 Years Later?
Parallelism law is algebra: so SURE!
Balanced system laws? Look at TPC results (TPC-C, TPC-H) at http://www.tpc.org/
Some imagination needed:
- What's an instruction (CPI varies from 1-3)? RISC, CISC, VLIW, … clocks per instruction, …
- What's an I/O?

TPC systems
Normalize for CPI (clocks per instruction):
- TPC-C has about 7 instructions per byte of IO.
- TPC-H has about 3 instructions per byte of IO.
- TPC-H needs half as many disks: sequential vs random access.
- Both use 9 GB, 10 krpm disks (they need arms, not bytes).

                     MHz/cpu  CPI  mips  KB/IO  IO/s/disk  Disks  Disks/cpu  MB/s/cpu  Ins/Byte IO
Amdahl                               1      6                                              8
TPC-C (random)         550    2.1   262     8      100      397      50         40         7
TPC-H (sequential)     550    1.2   458    64      100      176      22        141         3

TPC systems: What's alpha (= MB/MIPS)?
Hard to say:
- Intel: 32-bit addressing (= 4 GB limit), known CPI.
- IBM, HP, Sun have a 64 GB limit, unknown CPI.
- Look at both; guess the CPI for IBM, HP, Sun.
Alpha is between 1 and 6.

                 Mips                    Memory   Alpha
Amdahl           1                       1        1
tpcC Intel       8 x 262  = 2 Gips       4 GB     2
tpcH Intel       8 x 458  = 4 Gips       4 GB     1
tpcC IBM         24 cpus x ? = 12 Gips   64 GB    6
tpcH HP          32 cpus x ? = 16 Gips   32 GB    2

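A small sketch of the two normalizations used in these tables: instructions per byte of IO (mips per cpu divided by MB/s per cpu) and alpha (memory divided by aggregate instruction rate). All inputs are the figures quoted above; the rounding is mine.

```python
# Reproduce the derived columns above: instructions per byte of IO
# (mips per cpu / MB/s per cpu) and alpha (GB of memory / aggregate Gips).
# All inputs are the slide's own figures; nothing here is newly measured.

def ins_per_byte(mips_per_cpu: float, mb_per_sec_per_cpu: float) -> float:
    return mips_per_cpu / mb_per_sec_per_cpu

def alpha(memory_gb: float, total_gips: float) -> float:
    # MB / MIPS == GB / Gips, so the units cancel the same way.
    return memory_gb / total_gips

# TPC-C (random): 262 mips/cpu; 50 disks/cpu * 100 IO/s * 8 KB ~= 40 MB/s/cpu
print(round(ins_per_byte(262, 40)))    # ~7 instructions per byte of IO
# TPC-H (sequential): 458 mips/cpu; 22 disks/cpu * 100 IO/s * 64 KB ~= 141 MB/s/cpu
print(round(ins_per_byte(458, 141)))   # ~3 instructions per byte of IO

print(round(alpha(4, 8 * 0.262)))      # TPC-C Intel: 4 GB / ~2 Gips   -> alpha ~2
print(round(alpha(4, 8 * 0.458)))      # TPC-H Intel: 4 GB / ~4 Gips   -> alpha ~1
print(round(alpha(64, 12)))            # TPC-C IBM (guessed CPI): ~5, the slide rounds to ~6
print(round(alpha(32, 16)))            # TPC-H HP (guessed CPI)        -> alpha ~2
```
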
Performance (on current SDSS data)
Run times on a 15 k$ COMPAQ server (2 cpus, 1 GB RAM, 8 disks):
- some take 10 minutes, some take 1 minute; the median is ~22 sec.
~1,000 IOs per cpu-second; ~64 MB of IO per cpu-second.
GHz processors are fast! (10 mips/IO, 200 ins/byte); 2.5 M records/s/cpu.

How much storage do we need?
Soon everything can be recorded and indexed.
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
(Log-scale chart, kilo through yotta: a book, a photo, a movie, all LoC books (words), all books multimedia, everything recorded.)

Standard Storage Metrics
Capacity:
- RAM: MB and $/MB; today at 512 MB and 200 $/GB.
- Disk: GB and $/GB; today at 80 GB and 70 k$/TB.
- Tape (nearline): TB and $/TB; today at 40 GB and 10 k$/TB.
Access time (latency):
- RAM: 100 ns.
- Disk: 15 ms.
- Tape: 30 second pick, 30 second position.
Transfer rate:
- RAM: 1-10 GB/s.
- Disk: 10-50 MB/s (arrays can go to 10 GB/s).
- Tape: 5-15 MB/s (arrays can go to 1 GB/s).

New Storage Metrics: Kaps, Maps, SCAN
Kaps: how many kilobyte objects served per second
- the file server / transaction processing metric; this is the OLD metric.
Maps: how many megabyte objects served per second
- the multi-media metric.
SCAN: how long it takes to scan all the data
- the data mining and utility metric.
And: Kaps/$, Maps/$, TBscan/$.

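Here is a minimal sketch of how the three metrics fall out of a drive's capacity, access time, and transfer rate. The drive parameters below are illustrative stand-ins (roughly a circa-2005 unit), not a specific product.

```python
# Kaps, Maps, and SCAN for a single drive, derived from capacity, access time,
# and transfer rate. The drive below is hypothetical; plug in your own numbers.

def kaps(access_s: float, transfer_mb_s: float, object_kb: float = 1.0) -> float:
    """Kilobyte objects served per second (access-time dominated)."""
    per_object = access_s + (object_kb / 1024.0) / transfer_mb_s
    return 1.0 / per_object

def maps(access_s: float, transfer_mb_s: float, object_mb: float = 1.0) -> float:
    """Megabyte objects served per second (transfer time starts to matter)."""
    per_object = access_s + object_mb / transfer_mb_s
    return 1.0 / per_object

def scan_hours(capacity_gb: float, transfer_mb_s: float) -> float:
    """Hours to read every byte sequentially."""
    return capacity_gb * 1024.0 / transfer_mb_s / 3600.0

access = 0.010        # ~10 ms seek + rotate (assumed)
rate = 50.0           # ~50 MB/s sequential (assumed)
cap = 400.0           # 400 GB (assumed)

print(round(kaps(access, rate)))        # ~100 Kaps
print(round(maps(access, rate), 1))     # ~33 Maps
print(round(scan_hours(cap, rate), 1))  # ~2.3 hours to scan the whole drive
```

The same arithmetic gives the scan times quoted on the next slides: 100 GB at 30 MB/s is about an hour, and a 1 TB drive at 100 MB/s is about 2.5-3 hours.
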
More Kaps and Kaps/$, but…
Disk accesses got much less expensive: better disks, cheaper disks!
But disk arms are expensive: the scarce resource.
Scan time is 1 hour today (100 GB at 30 MB/s) vs 5 minutes in 1990.

Data on Disk Can Move to RAM in 10 Years
The disk:RAM cost gap is roughly 100:1, and prices improve about that much per decade, so data that is on disk today can economically live in RAM in about 10 years.

The "Absurd" 10x (= 4-years-out) Disk
1 TB, 100 MB/s, 200 Kaps.
2.5 hr scan time (poor sequential access).
1 aps per 5 GB (VERY cold data).
It's a tape!

It's Hard to Archive a Petabyte
It takes a LONG time to restore it: at 1 GBps it takes 12 days!
Store it in two (or more) places online, as a geo-plex (on disk?).
Scrub it continuously (look for errors).
On failure:
- use the other copy until the failure is repaired,
- refresh the lost copy from the safe copy.
Can organize the two copies differently (e.g. one by time, one by space).

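The 12-day figure is just bandwidth arithmetic; a small sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes per second):

```python
# Restore time for a petabyte at a given bandwidth. Decimal units assumed;
# the figures are the slide's, the arithmetic is the point.

def restore_days(bytes_total: float, bytes_per_sec: float) -> float:
    return bytes_total / bytes_per_sec / 86_400   # seconds per day

print(round(restore_days(1e15, 1e9), 1))    # ~11.6 days at 1 GB/s: call it 12
print(round(restore_days(1e15, 1e10), 1))   # ~1.2 days even at 10 GB/s
```
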
Auto Manage Storage
1980 rule of thumb: a DataAdmin per 10 GB, a SysAdmin per MIPS.
2000 rule of thumb: a DataAdmin per 5 TB, a SysAdmin per 100 clones (varies with the app).
Problem: 5 TB is 50 k$ today, 5 k$ in a few years; admin cost >> storage cost!!!!
Challenge: automate ALL storage admin tasks.

How to cool disk data:
- Cache data in main memory (see the 5-minute rule later in the presentation).
- Fewer, larger transfers.
- Larger pages (512 B -> 8 KB -> 256 KB).
- Sequential rather than random access: random 8 KB IO is 1.5 MBps; sequential IO is 30 MBps (and the 20:1 ratio is growing).
- RAID 1 (mirroring) rather than RAID 5 (parity).

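The 20:1 gap follows from arm time versus transfer time. A sketch, assuming ~5 ms per random access (short seek plus rotate) and a 30 MB/s media rate, which roughly reproduce the 1.5 MB/s vs 30 MB/s figures above and show why larger pages help:

```python
# Effective throughput of random page reads vs a pure sequential stream.
# The 5 ms access time and 30 MB/s media rate are assumptions chosen to match
# the slide's 1.5 MB/s (random 8 KB) vs 30 MB/s (sequential) figures.

def random_throughput_mb_s(page_kb: float, access_s: float, media_mb_s: float) -> float:
    secs_per_io = access_s + (page_kb / 1024.0) / media_mb_s
    return (page_kb / 1024.0) / secs_per_io

seq = 30.0                                   # sequential MB/s
for page in (8, 64, 256):                    # effect of larger transfers
    rnd = random_throughput_mb_s(page, 0.005, seq)
    print(page, round(rnd, 1), round(seq / rnd, 1))
# 8 KB   -> ~1.5 MB/s, ~20x slower than sequential
# 64 KB  -> ~8.8 MB/s
# 256 KB -> ~18.8 MB/s  (bigger pages cool the data)
```
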
Data delivery costs 1 $/GB today
Rent for "big" customers: 300 $ per megabit per second per month; improved 3x in the last 6 years (!).
That translates to about 1 $/GB at each end.
You can mail a 160 GB disk for 20 $:
- 3 x 160 GB ~ ½ TB.
- That's 16x cheaper.
- If overnight, it's 4 MBps.

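A sketch of the comparison, using the slide's prices (20 $ per 160 GB disk, ~1 $/GB at each end of the wire). The 12-hour and 36-hour transit times are assumptions used only to show where the ~4 MBps figure comes from.

```python
# "Sneakernet" vs the network, using the slide's prices: ~1 $/GB at each end
# of a fast link vs ~20 $ to ship a 160 GB disk. Transit times are assumed.

def ship_cost_per_gb(disk_cost: float, disk_gb: float) -> float:
    return disk_cost / disk_gb

def ship_bandwidth_mb_s(disk_gb: float, transit_hours: float) -> float:
    return disk_gb * 1024.0 / (transit_hours * 3600.0)

net_cost = 2.0                                    # ~1 $/GB at each end
disk_cost = ship_cost_per_gb(20.0, 160.0)         # ~0.125 $/GB
print(round(net_cost / disk_cost))                # ~16x cheaper to mail the disk
print(round(ship_bandwidth_mb_s(160.0, 12)))      # one disk overnight: ~4 MB/s
print(round(ship_bandwidth_mb_s(3 * 160.0, 36)))  # 3 disks (~½ TB): still ~4 MB/s
```
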