Scaleable Computing
Jim Gray, Microsoft Corporation, Gray@Microsoft.com
Thesis: Scaleable Servers
- Commodity hardware allows new applications
- New applications need huge servers
- Clients and servers are built of the same "stuff":
  - Commodity software and
  - Commodity hardware
- Servers should be able to:
  - Scale up (grow a node by adding CPUs, disks, networks)
  - Scale out (grow by adding nodes)
  - Scale down (can start small)
- Key software technologies:
  - Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark
- $14M computer (Tandem)
- A dozen people: manager, admin expert, hardware experts, network expert, performance expert, DB expert, OS expert, auditor
- False floor, 2 rooms of machines
- A 32-node processor array
- A 40 GB disk array (80 drives)
- Simulated 25,600 clients
1988: DB2 + CICS Mainframe, 65 tps
- IBM 4391, a $2M computer
- Staff of 6 to do the benchmark
- Refrigerator-sized CPU
- 2 x 3725 network controllers
- Simulated network of 800 clients
- 16 GB disk farm: 4 x 8 x 0.5 GB
1997: 10 years later: 1 person and 1 box = 1,250 tps
- One breadbox holds ~5x the 1987 machine room
- 23 GB is hand-held
- One person does all the work (hardware expert, OS expert, net expert, DB expert, app expert)
- Cost/tps is 1,000x less: 25 micro-dollars per transaction
- The box: 4 x 200 MHz CPUs, 1/2 GB DRAM, 12 x 4 GB disks, 3 x 7 x 4 GB disk arrays
What Happened?
- Moore's law: things get 4x better every 3 years (applies to computers, storage, and networks)
- New economics: commodity

    class           price/MIPS (k$)   software (k$/MIPS/year)
    mainframe       10,000            100
    minicomputer    100               10
    microcomputer   10                1

  (graph: price vs. time curves for mainframe, mini, micro)
- GUI: the human/computer tradeoff: optimize for people, not computers
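The 4x-every-3-years rule compounds quickly; a quick back-of-envelope check (a sketch, and the function name is an invention, not from the talk):

```python
# Moore's-law-style compounding: 4x improvement every 3 years.
def improvement(years, factor=4.0, period=3.0):
    """Total improvement after `years` of compounding `factor` per `period`."""
    return factor ** (years / period)

print(round(improvement(10)))  # ~100x in a decade
print(round(improvement(15)))  # 1024x in fifteen years
```

Ten years of this rule gives roughly a 100x gain; fifteen years gives about 1,000x, which matches the "last 10 years: 1000x" claim once price declines are folded in.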
What Happens Next?
- Last 10 years: 1,000x improvement
- Next 10 years: ??
- (graph: performance, 1985 to 1995 to 2005)
- Today: text and image servers are free; at 25 micro-dollars per hit, advertising pays for them
- Future: video, audio, ... servers are free
- "You ain't seen nothing yet!"
Kinds Of Information Processing

                 Point-to-point         Broadcast
    Immediate    conversation, money    lecture, concert     (network)
    Timeshifted  mail                   book, newspaper      (database)

- It's ALL going electronic
- Immediate is being stored for analysis (so ALL database)
- Analysis and automatic processing are being added
Why Put Everything In Cyberspace?
- Low rent: min $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots
- Immediate OR time-delayed; point-to-point OR broadcast
- (diagram: network and database that locate, process, analyze, summarize)
Magnetic Storage Cheaper Than Paper
- File cabinet:
    cabinet (four drawer)        $250
    paper (24,000 sheets)        $250
    space (2 x 3 ft @ $10/ft2)   $180
    total                        $700, i.e., 3¢/sheet
- Disk:
    disk (4 GB =)                $800
    ASCII: 2 million pages:      0.04¢/sheet (80x cheaper)
    Image: 200,000 pages:        0.4¢/sheet (8x cheaper)
- Store everything on disk
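The slide's arithmetic can be checked directly (variable names are mine; the dollar figures are the slide's):

```python
# Paper vs. disk cost per sheet, using the slide's 1997 figures.
cabinet_total = 250 + 250 + 180      # cabinet + paper + floor space, in $
sheets = 24_000
paper = cabinet_total / sheets       # ~$0.03 per sheet
ascii_pages = 2_000_000              # a 4 GB disk holds ~2M ASCII pages
disk = 800 / ascii_pages             # $0.0004 per sheet (0.04 cents)
print(f"paper {paper*100:.1f}c, disk {disk*100:.2f}c, ratio {paper/disk:.0f}x")
```

With these exact numbers the ratio is ~71x; the slide rounds 3¢ / 0.04¢ to 80x.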
Billions Of Clients
- Every device will be "intelligent"
- Doors, rooms, cars, ...
- Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers
- All clients are networked to servers; clients may be nomadic or on-demand
- Fast clients want faster servers
- Servers provide:
  - Shared data
  - Control
  - Coordination
  - Communication
- (diagram: mobile clients and fixed clients, servers, super server)
Thesis: Many Little Beat Few Big
- Price spectrum: $1M mainframe, $100K mini, $10K micro, ... nano, pico
- Disk form factors keep shrinking: 14" > 9" > 5.25" > 3.5" > 2.5" > 1.8"
- Storage hierarchy (latency: capacity):
    10 picosecond RAM:      1 MB (on the processor)
    10 nanosecond RAM:      100 MB
    10 microsecond RAM:     10 GB
    10 millisecond disc:    1 TB
    10 second tape archive: 100 TB
- The processor becomes a "smoking, hairy golf ball": 1M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event horizon on chip, VM reincarnated, multiprogram cache, on-chip SMP
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance?
Future Super Server: 4T Machine
- Array of 1,000 "4B machines":
  - 1 Bips processors
  - 1 BB DRAM
  - 10 BB disks
  - 1 Bbps comm lines
  - 1 TB tape robot
- (a "4B machine" CyberBrick: CPU, 5 GB RAM, 50 GB disc)
- A few megabucks
- Challenge:
  - Manageability
  - Programmability
  - Security
  - Availability
  - Scaleability
  - Affordability
- As easy as a single system
- Future servers are CLUSTERS of processors and discs
- Distributed database techniques make clusters work
Performance = Storage Accesses, not Instructions Executed
- In the "old days" we counted instructions and I/Os
- Now we count memory references: processors wait most of the time
- Where the time goes (clock ticks used by AlphaSort): disc wait, sort, OS, memory wait, B-cache data miss, I-cache miss, D-cache miss
- 70 MIPS; "real" apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not
Storage Latency: How Far Away is the Data?

    Location             Clock ticks   At human scale
    Registers            1             my head (1 min)
    On-chip cache        2             this room
    On-board cache       10            this campus (10 min)
    Memory               100           Sacramento (1.5 hr)
    Disk                 10^6          Pluto (2 years)
    Tape/optical robot   10^9          Andromeda (2,000 years)
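The analogy scales one clock tick up to one minute of human time; the arithmetic checks out (a quick sketch):

```python
# Scale 1 clock tick to 1 minute: how long do the waits feel?
MINUTES_PER_YEAR = 365 * 24 * 60     # 525,600
for name, ticks in [("registers", 1), ("memory", 100),
                    ("disk", 10**6), ("tape robot", 10**9)]:
    minutes = ticks                  # 1 tick == 1 "human minute"
    if minutes < MINUTES_PER_YEAR:
        print(f"{name}: {minutes} min")
    else:
        print(f"{name}: {minutes / MINUTES_PER_YEAR:.0f} years")
```

10^6 minutes is about 1.9 years (the slide's "Pluto, 2 years"); 10^9 minutes is about 1,900 years ("Andromeda, 2,000 years").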
The Hardware Is In Place... "and then a miracle occurs?"
- SNAP: scaleable networks and platforms
- A commodity distributed OS built on:
  - Commodity platforms
  - Commodity network interconnect
- Enables parallel applications
Thesis: Scaleable Servers
- Commodity hardware allows new applications
- New applications need huge servers
- Clients and servers are built of the same "stuff":
  - Commodity software and
  - Commodity hardware
- Servers should be able to:
  - Scale up (grow a node by adding CPUs, disks, networks)
  - Scale out (grow by adding nodes)
  - Scale down (can start small)
- Key software technologies:
  - Objects, Transactions, Clusters, Parallelism
Scaleable Servers: BOTH SMP And Cluster
- Grow up with SMP: 4 x P6 is now standard
- Grow out with a cluster
- The cluster is made of inexpensive parts
- (diagram: personal system, departmental server, SMP super server, cluster of PCs)
SMPs Have Advantages
- Single system image: easier to manage, easier to program
- Threads in shared memory, disk, net
- 4x SMP is commodity
- Software is capable of 16x
- Problems:
  - >4x is not commodity
  - Scale-down problem (starter systems expensive)
  - There is a BIGGEST one
TPC-C Web-Based Benchmarks
- Client is a Web browser (7,500 of them!)
- Submits order, invoice, and query to the server via a Web page interface
- The Web server translates requests to DB calls; SQL does the DB work
- Flow: browser > HTTP > IIS (Web server) > ODBC > SQL
- Net:
  - easy to implement
  - performance is GREAT!
TPC-C Shows How Far SMPs Have Come
- Performance is amazing:
  - 2,000 users is the min!
  - 30,000 users on a 4 x 12 Alpha cluster (Oracle)
- Peak performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
- Best price/performance: 8,040 tpmC @ $54/tpmC (MS SQL/Compaq)
- Graphs show the UNIX high price and diseconomy of scale-up
TPC-C SMP Performance
- SMPs do offer speedup, but a 4x P6 is better than some 18x MIPSco systems
What Happens To Prices?
- No expensive UNIX front end ($20/tpmC)
- No expensive TP monitor software ($10/tpmC)
- => $65/tpmC
What's a TeraByte?
- 1 Terabyte holds:
    1,000,000,000 business letters   150 miles of bookshelf
    100,000,000 book pages           15 miles of bookshelf
    50,000,000 FAX images            7 miles of bookshelf
    10,000,000 TV pictures (MPEG)    10 days of video
    4,000 LandSat images             16 earth images (100 m)
    100,000,000 web pages (HTML)     10 copies of the web
- Library of Congress (in ASCII) is 25 TB
- 1980: $200 million of disc (10,000 discs); $5 million of tape silo (10,000 tapes)
- 1997: $200K of magnetic disc (48 discs); $30K of nearline tape (20 tapes)
- Terror Byte!
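The letter and page counts follow from rough per-item sizes; the 1 KB and 10 KB figures below are my assumptions, chosen because they are consistent with the slide's totals:

```python
# How many items fit in a terabyte, given rough per-item sizes?
TB = 10**12                 # 1 terabyte, decimal bytes
LETTER = 1_000              # ~1 KB per ASCII business letter (assumed)
BOOK_PAGE = 10_000          # ~10 KB per book page (assumed)
print(TB // LETTER)         # 1,000,000,000 letters per terabyte
print(TB // BOOK_PAGE)      # 100,000,000 book pages per terabyte
```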
Building the Largest NT Node
- Build a 1 TB SQL Server database:
  - Show off NT and SQL Server scaleability
  - Stress-test the product
- Demo it on the Internet: WWW-accessible by anyone
- So the data must be:
  - 1 TB
  - Unencumbered
  - Interesting to everyone everywhere
  - AND not offensive to anyone anywhere
The Plan
- DEC Alpha 4100: 4 x 400 MHz Alpha processors, 4 GB DRAM
- 324 StorageWorks drives (1.4 TB)
- 30K BTU, 8 KW, 1.5 metric tons
- SQL Server 7.0, Microsoft BackOffice
- USGS data (1 meter); Russian space data, SPIN-2 (2 meter)
Image Data Sources
- DOQ: 300 GB (src: USGS & UCSB; missing some DOQs)
- SPIN-2: 500 GB
- WorldWide LoB app; new data coming
DOQ Coverage of the US
- 1-meter images of many places
- Problems:
  - Most of the data is not yet published
  - Interesting places are missing (LA, Portland, SD, Anchorage, ...)
- Loaded the published 130 GB
- CRDA for the unpublished 3 TB
SPIN-2 Coverage
- The rest of the world: the US Government can't help, but...
- The Russian Space Agency is eager to cooperate: 2-meter geo-rectified imagery of anywhere
- More data coming; the Earth has ~500 TeraMeters²
  - => ~30 TeraBytes of land at 2 x 2 meter
  - => we need 3% of the land (urban world = the red stuff)
Demo Interface
Grow UP and OUT
- Grow up: personal system, departmental server, SMP super server
- Grow out: cluster
- Cluster: a collection of nodes, as easy to program and manage as a single node
- Goals: a 1-terabyte DB; 1 billion transactions per day
Clusters Have Advantages
- Clients and servers are made from the same stuff
- Inexpensive: built with commodity components
- Fault tolerance: spare modules mask failures
- Modular growth: grow by adding small modules
- Unlimited growth: no biggest one
Billion Transactions per Day Project
- Built a 45-node Windows NT cluster (with help from Intel & Compaq)
- > 900 disks, all off-the-shelf parts
- Using SQL Server & DTC distributed transactions
- DebitCredit transaction
- Each node has 1/20th of the DB and does 1/20th of the work
- 15% of the transactions are "distributed"
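The 20-way split can be sketched as follows; the modulo partitioning and the 85/15 local/remote access mix are modeling assumptions, not the project's actual code:

```python
import random

N_NODES = 20                      # each node owns 1/20th of the DB

def owner(account_id):
    """Node that owns an account (simple modulo partitioning)."""
    return account_id % N_NODES

def pick_account(home_node, rng):
    """DebitCredit-style access: 85% a local account, 15% any account."""
    if rng.random() < 0.85:
        return home_node + N_NODES * rng.randrange(10_000)   # local
    return rng.randrange(10_000 * N_NODES)                   # random node

rng = random.Random(42)
home = 7
distributed = sum(owner(pick_account(home, rng)) != home
                  for _ in range(100_000))
print(f"{100 * distributed / 100_000:.1f}% distributed")     # roughly 14%
```

A transaction is "distributed" (and needs the DTC) only when the chosen account lives on another node, so roughly 15% x 19/20 of transactions cross nodes.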
How Much Is 1 Billion Transactions Per Day?
- 1 Btpd = 11,574 tps (transactions per second) = ~700,000 tpm (transactions per minute)
- AT&T: 185 million calls on the peak day (worldwide)
- Visa: ~20 Mtpd; 400M customers, 250,000 ATMs worldwide, 7 billion transactions/year (card + cheque) in 1994
- (chart: millions of transactions per day, log scale from 0.1 to 1,000: NYSE, BofA, Visa, AT&T, 1 Btpd)
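The per-second and per-minute figures fall straight out of the arithmetic:

```python
# 1 billion transactions per day, expressed per second and per minute.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
tps = 1_000_000_000 / SECONDS_PER_DAY
print(round(tps))          # 11574 transactions per second
print(round(tps * 60))     # 694444 tpm; the slide rounds to ~700,000
```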
Billion Transactions Per Day Hardware
- 45 nodes (Compaq Proliant)
- Clustered with 100 Mbps switched Ethernet
- 140 CPUs, 13 GB DRAM, 3 TB of disk

    Type          nodes                      CPUs    DRAM         ctlrs   disks                            RAID space
    Workflow/MTS  20 Compaq Proliant 2500    20 x 2  20 x 128 MB  20 x 1  20 x 1                           20 x 2 GB
    SQL Server    20 Compaq Proliant 5000    20 x 4  20 x 512 MB  20 x 4  20 x (36 x 4.2 GB + 7 x 9.1 GB)  20 x 130 GB
    DTC           5 Compaq Proliant 5000     5 x 4   5 x 256 MB   5 x 1   5 x 3                            5 x 8 GB
    TOTAL         45                         140     13 GB        105     895                              3 TB
1.2 B tpd
- 1 B tpd ran for 24 hrs
- Sized for 30 days
- Linear growth
- 5 micro-dollars per transaction
- Out-of-the-box software
- Off-the-shelf hardware
- AMAZING!
Parallelism: The OTHER Aspect of Clusters
- Clusters of machines allow two kinds of parallelism:
  - Many little jobs: online transaction processing (TPC-A, B, C, ...)
  - A few big jobs: data search and analysis (TPC-D, DSS, OLAP)
- Both give automatic parallelism
Kinds of Parallel Execution
- Pipeline: one sequential program feeds its outputs to the next
- Partition: inputs split N ways, outputs merge M ways; each partition runs an unchanged sequential program
- (Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey)
Data Rivers: Split + Merge Streams
- N producers, M consumers, N x M data streams
- Producers add records to the river; consumers consume records from the river
- Purely sequential programming: the river does flow control and buffering, and does the partition and merge of data records
- River = Split/Merge in Gamma = Exchange operator in Volcano
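A data river can be sketched with bounded queues: producers write, a hash split routes each record to one of M consumer streams, and both sides stay purely sequential (a toy model; all names are invented):

```python
import queue
import threading

M = 3                                     # number of consumers
DONE = object()                           # end-of-river marker
rivers = [queue.Queue(maxsize=64) for _ in range(M)]   # bounded = flow control

def producer(records):
    """Purely sequential producer: the river routes each record."""
    for rec in records:
        rivers[hash(rec) % M].put(rec)    # split: partition by hash

def consumer(i, out):
    """Purely sequential consumer: reads only its own stream."""
    while True:
        rec = rivers[i].get()
        if rec is DONE:
            return
        out.append(rec)

outs = [[] for _ in range(M)]
threads = [threading.Thread(target=consumer, args=(i, outs[i]))
           for i in range(M)]
for t in threads:
    t.start()
producer(range(100))                      # N=1 producer, for brevity
for q in rivers:
    q.put(DONE)
for t in threads:
    t.join()
print(sorted(x for o in outs for x in o) == list(range(100)))  # True
```

The bounded queues give the flow control the slide mentions: a fast producer blocks until a slow consumer catches up.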
Partitioned Execution
- Spreads computation and IO among processors
- Partitioned data gives NATURAL parallelism
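The idea can be illustrated with an ordinary sequential scan run once per partition (a sketch; the table, partitioning scheme, and names are invented):

```python
# Partitioned execution sketch: the same sequential scan runs on every
# partition; partitioned data supplies the parallelism.
from concurrent.futures import ThreadPoolExecutor

N = 4                                           # processors / partitions
rows = [(i, i % 10) for i in range(1000)]       # a small (key, value) table
partitions = [[r for r in rows if r[0] % N == p] for p in range(N)]

def scan(part):
    """Ordinary sequential aggregate over one partition."""
    return sum(v for _, v in part)

with ThreadPoolExecutor(N) as pool:
    partials = list(pool.map(scan, partitions))
print(sum(partials))                            # 4500, same as a serial scan
```

Nothing in `scan` knows about parallelism: partitioning the data is what spreads the computation and IO.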
N x M Way Parallelism
- N inputs, M outputs, no bottlenecks
- Partitioned data
- Partitioned and pipelined data flows
Clusters (Plumbing)
- Single system image:
  - naming
  - protection/security
  - management/load balance
- Fault tolerance: Wolfpack
- Hot-pluggable hardware & software
Windows NT Clusters
- Key goals:
  - Easy: to install, manage, program
  - Reliable: better than a single node
  - Scaleable: added parts add power
- Microsoft & 60 vendors are defining NT clusters; almost all big hardware and software vendors are involved
- No special hardware needed (but it may help)
- Enables:
  - Commodity fault tolerance
  - Commodity parallelism (data mining, virtual reality, ...)
  - Also great for workgroups!
- Initial release: two-node failover
  - Beta testing since December 96
  - SAP, Microsoft, Oracle giving demos
  - File, print, Internet, mail, DB, and other services
  - Easy to manage
  - Each node can be a 4x (or more) SMP
- Next (NT 5), "Wolfpack" is a modest-size cluster
  - About 16 nodes (so 64 to 128 CPUs)
  - No hard limit; algorithms designed to go further
So, What's New?
- When slices cost $50K, you buy 10 or 20; when slices cost $5K, you buy 100 or 200
- Manageability, programmability, usability become the key issues (total cost of ownership)
- PCs are MUCH easier to use and program
- MPP vicious cycle: no customers, so no new apps; a new MPP and new OS, but still no customers!
- CP/commodity virtuous cycle: standards allow progress and investment protection; a standard platform attracts customers, customers attract new apps, and new apps justify the new MPP and new OS
Thesis: Scaleable Servers
- Commodity hardware allows new applications
- New applications need huge servers
- Clients and servers are built of the same "stuff":
  - Commodity software and
  - Commodity hardware
- Servers should be able to:
  - Scale up (grow a node by adding CPUs, disks, networks)
  - Scale out (grow by adding nodes)
  - Scale down (can start small)
- Key software technologies:
  - Objects, Transactions, Clusters, Parallelism
The BIG Picture: Components and Transactions
- Software modules are objects
- An Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers)
- Standard interfaces allow software plug-ins
- A transaction ties execution of a "job" into an atomic unit: all-or-nothing, durable, isolated
Objects Meet Databases
The basis for universal data servers, access, & integration
- Object-oriented (COM-oriented) programming interface to data
- Breaks the DBMS into components
- Anything can be a data source
- Optimization/navigation "on top of" other data sources
- A way to componentize a DBMS
- Makes an RDBMS an O-R DBMS (assumes the optimizer understands objects)
- (diagram: DBMS engine over database, spreadsheet, photos, mail, map, document)
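The "anything can be a data source" idea can be sketched as a shared row-set contract that one engine queries uniformly; this is an invented illustration, not the actual component API:

```python
# "Anything can be a data source": a common row-set contract lets one
# engine query CSV text, mail, or anything else that yields rows.
def select(source, predicate):
    """A toy 'engine' that runs on top of any component source."""
    return [row for row in source.rows() if predicate(row)]

class CsvSource:
    def __init__(self, text):
        self.text = text
    def rows(self):
        header, *body = self.text.strip().splitlines()
        cols = header.split(",")
        for line in body:
            yield dict(zip(cols, line.split(",")))

class MailSource:
    def __init__(self, messages):
        self.messages = messages
    def rows(self):
        for sender, subject in self.messages:
            yield {"from": sender, "subject": subject}

csv = CsvSource("name,city\nann,rome\nbob,oslo")
mail = MailSource([("ann", "hello"), ("bob", "re: hello")])
print(select(csv, lambda r: r["city"] == "oslo"))      # bob's row
print(select(mail, lambda r: r["from"] == "ann"))      # ann's message
```

The engine never knows which component produced the rows, which is the componentized-DBMS point.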
A New Programming Paradigm
- Develop objects on the desktop; better yet, download them from the Net
- Script workflows as method invocations, all on the desktop
- Then move the workflows and objects to server(s)
- Gives:
  - desktop development
  - three-tier deployment
  - software CyberBricks
Transactions & Objects
- The application requests a transaction identifier (XID)
- The XID flows with method invocations
- Object managers join (enlist) in the transaction
- The Distributed Transaction Manager coordinates commit/abort
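The flow can be sketched as a toy coordinator (not the actual MS DTC API; all class and method names are invented): each call carries the XID, resource managers enlist as they are touched, and the coordinator polls every enlisted party before committing:

```python
class Coordinator:
    """Toy distributed transaction manager (two-phase commit sketch)."""
    def __init__(self):
        self.next_xid = 0
        self.enlisted = {}                 # xid -> set of resource managers

    def begin(self):
        self.next_xid += 1
        self.enlisted[self.next_xid] = set()
        return self.next_xid               # the XID that flows with calls

    def enlist(self, xid, rm):
        self.enlisted[xid].add(rm)         # object manager joins the txn

    def commit(self, xid):
        rms = self.enlisted.pop(xid)
        if all(rm.prepare(xid) for rm in rms):    # phase 1: all must vote yes
            for rm in rms:
                rm.commit(xid)                    # phase 2: commit everywhere
            return True
        for rm in rms:
            rm.abort(xid)                         # any 'no' vote aborts all
        return False

class ResourceManager:
    def __init__(self, healthy=True):
        self.healthy, self.state = healthy, None
    def prepare(self, xid):
        return self.healthy
    def commit(self, xid):
        self.state = "committed"
    def abort(self, xid):
        self.state = "aborted"

dtc = Coordinator()
xid = dtc.begin()
a, b = ResourceManager(), ResourceManager()
dtc.enlist(xid, a)                         # each touched object enlists
dtc.enlist(xid, b)
print(dtc.commit(xid), a.state, b.state)   # True committed committed
```

If any enlisted manager votes no in the prepare phase, every participant aborts, which is the all-or-nothing guarantee the previous slide describes.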
Thesis: Scaleable Servers
- Scaleable servers are built from CyberBricks
- Servers should be able to scale up, out, and down
- They allow new applications
- Key software technologies:
  - Clusters: tie the hardware together
  - Parallelism: uses the independent CPUs, stores, wires
  - Objects: software CyberBricks
  - Transactions: mask errors