A Sketch of Regres
Mike Carey, Joey Hellerstein, Michael Stonebraker

Outline
• Why we need to rethink everything!
  – All current DBMSs architected in the late 1970s
  – why the world is different now
• A sketch of Regres
  – a new data base architecture for the millennium

The World is Different
• CPU, memory, disk up by 10 ** 6 in the last 20 years
• Design point of 1 Tbyte buffer pool in 2005, up from 1 Mbyte in the 1970s
• It will NOT be 250 million 4K pages!
Need to rethink storage architectures!
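A quick check of the page-count figure on this slide. The sketch assumes "4K" means 4096 bytes and shows both the decimal and the binary reading of 1 Tbyte; either way the count lands near 250 million.

```python
# Rough page-count arithmetic for a 1 Tbyte buffer pool made of 4 Kbyte pages.
PAGE_BYTES = 4 * 1024        # assumption: "4K" means 4096 bytes

decimal_tb = 10 ** 12        # 1 Tbyte read as 10^12 bytes
binary_tb = 2 ** 40          # 1 Tbyte read as 2^40 bytes

pages_decimal = decimal_tb // PAGE_BYTES
pages_binary = binary_tb // PAGE_BYTES

print(pages_decimal)  # 244140625
print(pages_binary)   # 268435456
```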

The World is Different
• Most serious applications use a TP monitor
• i.e. a three tier application architecture
  – data at the bottom in a DBMS
  – code in middle tier in TP monitor
  – user interface on the client

Why?
• 2 tier doesn’t scale and is too hard to manage
• DBMS couldn’t execute code
Probably undesirable to decompose function this way!
Want code “near” the data it accesses
Need to rethink application architecture!

The World is Different
• 7 X 24 a serious requirement most everywhere!
• End-to-end issue
  – RAID not the (complete) answer
  – Require wide area network replication
Need to design in, not bolt on this capability!

The World is Different
• The web changes everything
  – not a client-server protocol!
  – stateless
  – requirement to deal with HTML, XML, ...
Need to have web-centric architecture!

The World is Different
• ERP and web applications require scalability unheard of previously
  – 100,000 ERP seats not uncommon
  – E-commerce on the web will entail huge transaction rates
Need to think at these levels!

The World is Different
• Warehousing is a new application area
  – typical app is data mining
  – queries run forever
Need to design in, not bolt on sampling!

The World is Different
• Multiprocessor architectures common
  – Clusters are here
  – NUMA is here
  – MPP is here
Need to design in, not bolt on load balance!

The World is Different
• The gizmo revolution is coming
  – mobile clients
  – disconnected operation
Need to design in, not bolt on disconnected operation!

The World is Different
• The gizmo revolution is coming
  – small footprint servers (coke machine as a data base)
Need to scale down as well as up in one system!

The World is Different
• SQL-3 is here
  – components (blades, extenders, OLE, Corba) in the data base
  – multiple language support required
  – inheritance required
Need to design in, not bolt on method support in a variety of component models!

And...
• DBMSs are currently “bloated”
  – stored procedures
  – object-relational features
  – warehouse features
  – triggers
  – standard benchmark hacks
• Users have a low tolerance for errors
Debugging the next release is getting hard!

Conclusion
• Need to rethink DBMS architecture from the ground up
• This comment also applies to operating systems
• and probably to networks

The Result -- Regres
• A mix of some discarded ideas (whose time has re-come)
• And some new ideas

Assumptions -- Must Design for a Data and Machine Federation
• 7 X 24 operation requires wide area replication
  – understood by the DBMS
  – transactionally consistent
  – fastest mechanism is to move the log
Argues for Federated DBMS!

Assumptions -- Must Design for a Data and Machine Federation
• Integrating code and data on multiple machines is a better idea than TP monitors
  – data and code on each machine in a network!
Requires a Federated DBMS!

Assumptions -- Must Design for a Data and Machine Federation
• Incredible scalability requires more than the biggest single system
Federated DBMS a good model!

Advantages of a Federated DBMS
• Mimics the enterprise, which is distributed
• Naturally supports mergers
• Allows “jelly bean” hardware components
• Can be incrementally built and extended

Assumptions -- Semantic Heterogeneity a Must!
• No systems to be federated have a common schema
  – salary in US is gross dollars
  – salary in France is net francs with a lunch allowance
Must deal with this!
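The salary example can be sketched as a conversion function that a federator might apply when combining the two sites. Every constant below is a made-up placeholder (not a real exchange rate, tax rate, or allowance), and the function name is hypothetical; the point is only that the mapping has to undo the allowance, the net/gross difference, and the currency.

```python
# All constants are hypothetical placeholders, not real exchange or tax figures.
FRANCS_PER_DOLLAR = 6.0       # hypothetical exchange rate
NET_FRACTION = 0.75           # hypothetical: net pay as a fraction of gross pay
LUNCH_ALLOWANCE_FF = 3000.0   # hypothetical lunch allowance, in francs

def french_to_us_salary(net_francs_with_lunch):
    """Map a French salary (net francs incl. lunch allowance) to gross dollars."""
    net_francs = net_francs_with_lunch - LUNCH_ALLOWANCE_FF   # strip the allowance
    gross_francs = net_francs / NET_FRACTION                  # net -> gross
    return gross_francs / FRANCS_PER_DOLLAR                   # francs -> dollars

print(french_to_us_salary(273000.0))  # 60000.0 under the placeholder constants
```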

Assumptions -- Local Autonomy a Must!
• Few systems to be federated are in the same “administrative moat”!
• Must allow local DBAs to control their own destiny!

Traditional Distributed DBMSs (and all commercial systems)
• Do neither of these
• Are a non-starter for a future architecture
Cannot have a traditional query optimizer!

Mariposa (and Cohera) made a good start
• Economic paradigm for federated query processing
  – each query has a budget
  – each site is an independent contractor
  – federator acts like a general contractor, trying to solve query under the budget
Agoric systems are starting to get traction!
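The contractor model above can be sketched as a toy bid-selection routine. This is not Mariposa's or Cohera's actual protocol: the `Bid` type, the site names, and the "cheapest bid per unit of work" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bid:
    site: str       # hypothetical site name (the "independent contractor")
    subquery: str   # unit of work the site offers to run
    price: float    # cost charged against the query budget

def award_contracts(subqueries, bids, budget):
    """Pick the cheapest bid per subquery; return None if any subquery
    has no bidder or the total price exceeds the budget."""
    chosen = {}
    for sq in subqueries:
        offers = [b for b in bids if b.subquery == sq]
        if not offers:
            return None
        chosen[sq] = min(offers, key=lambda b: b.price)
    total = sum(b.price for b in chosen.values())
    return chosen if total <= budget else None

bids = [
    Bid("site_a", "scan_orders", 4.0),
    Bid("site_b", "scan_orders", 3.0),
    Bid("site_a", "join_customers", 5.0),
]
plan = award_contracts(["scan_orders", "join_customers"], bids, budget=10.0)
print({sq: b.site for sq, b in plan.items()})
```

With a budget of 10.0 the federator awards `scan_orders` to `site_b` and `join_customers` to `site_a`; with a budget below 8.0 the same call returns `None`, meaning the query cannot be solved under the budget.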

Mariposa (and Cohera) made a good start
• Flexible heterogeneous replication
  – master-slave or peer-to-peer
  – bounded out-of-date-ness
• Mobile (and disconnected) sites ok
  – out-of-date replica
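One way to read "bounded out-of-date-ness" is that a query states how stale a replica it will tolerate, and the federator only uses replicas within that bound. A minimal sketch under that assumption; the replica names and timestamps are hypothetical.

```python
def pick_replica(replicas, now, max_staleness):
    """Return the name of the freshest replica whose age is within
    max_staleness, or None if every replica is too far out of date."""
    fresh = [
        (now - last_sync, name)
        for name, last_sync in replicas.items()
        if now - last_sync <= max_staleness
    ]
    return min(fresh)[1] if fresh else None

# Hypothetical last-sync times in seconds; "laptop" is a disconnected site.
replicas = {"hq": 100.0, "laptop": 40.0}

print(pick_replica(replicas, now=105.0, max_staleness=10.0))
print(pick_replica({"laptop": 40.0}, now=105.0, max_staleness=10.0))
```

A query with a loose bound can still be served by the disconnected laptop's replica; a query with a tight bound cannot.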

Mariposa (and Cohera) Data Model
• A collection of fragments of a SQL-3 table
  – range partitioning
  – type conversion of data types when federated
• Each “owned” by a local DBA
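Range partitioning of a table into fragments can be sketched as a lookup from key to owning site. The key ranges and site names are hypothetical, and a real federator would also need the per-fragment type conversions mentioned above.

```python
import bisect

# Hypothetical fragments of one table, range-partitioned on an integer key.
# Each entry is (exclusive upper bound of the key range, owning site).
FRAGMENTS = [(1000, "boston"), (5000, "paris"), (float("inf"), "tokyo")]
UPPER_BOUNDS = [upper for upper, _ in FRAGMENTS]

def owner_of(key):
    """Return the site that owns the fragment whose range contains key."""
    return FRAGMENTS[bisect.bisect_right(UPPER_BOUNDS, key)][1]

print(owner_of(999))    # boston
print(owner_of(1000))   # paris
print(owner_of(10**9))  # tokyo
```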

But there is much room for improvement!
• Query decomposition into economic units of work
  – bottom-up
  – top down
  – heuristic decomposition

But there is much room for improvement!
• Change the economic plan midflight if circumstances change
  – how to tell things have changed
  – what to do

But there is much room for improvement!
• Partial answers are often a good idea
  – how to integrate Control ideas into an agoric system
  – can it be done without knowing how much of the answer the user will want?
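The "partial answers" idea can be sketched as an aggregate that reports a running estimate after each chunk of input, so the user can stop whenever the answer is good enough. This toy omits confidence intervals and any economic accounting; the data and the chunk size are hypothetical, and the rows are assumed to arrive in random order.

```python
def running_average(values, chunk_size):
    """Yield (rows_seen, estimate) after each chunk so the caller can stop early."""
    total = 0.0
    seen = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        seen += len(chunk)
        yield seen, total / seen

values = [4, 8, 6, 2, 10, 6]   # hypothetical column values, assumed in random order
partials = list(running_average(values, chunk_size=2))
print(partials)  # [(2, 6.0), (4, 5.0), (6, 6.0)]
```

The last partial answer equals the exact average; earlier ones are what a user who stops early would see.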

But there is much room for improvement!
• Future data will be imprecise
  – imagine federating the Michelin and Fodor's restaurant guides
• Query processing must become evidence accumulation
  – built-in not bolted on
  – model of “likely sites” required

Local DBMS -- Storage Model
• Store segments
  – i.e. the unit of federation
• Also the unit of movement between disk and cache (segmented storage)
• Need “split” and “coalesce” to keep variable length segments reasonably sized
Shades of the Burroughs B5000!
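"Split" and "coalesce" on variable length segments can be sketched as below. The size thresholds and the single-pass policy are placeholders chosen for illustration; the slides leave the question of when to split and coalesce open.

```python
MAX_RECORDS = 4   # hypothetical upper size limit for a segment
MIN_RECORDS = 2   # hypothetical size below which a segment is merged

def split(segment):
    """Split an oversized segment (a list of records) into two halves."""
    mid = len(segment) // 2
    return segment[:mid], segment[mid:]

def coalesce(left, right):
    """Merge two adjacent segments into one."""
    return left + right

def rebalance(segments):
    """One pass: merge undersized neighbors, split oversized segments once."""
    out = []
    for seg in segments:
        can_merge = bool(out) and len(out[-1]) + len(seg) <= MAX_RECORDS
        undersized = bool(out) and (len(out[-1]) < MIN_RECORDS or len(seg) < MIN_RECORDS)
        if can_merge and undersized:
            out[-1] = coalesce(out[-1], seg)
        elif len(seg) > MAX_RECORDS:
            out.extend(split(seg))
        else:
            out.append(seg)
    return out

print(rebalance([[1], [2], [3, 4, 5, 6, 7, 8]]))
# [[1, 2], [3, 4, 5], [6, 7, 8]]
```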

Storage Model -- Open Issues
• When to coalesce and split segments
• LRU a bad model for eviction

Local DBMS -- System Services
• DBMS provides buffer pool, file system
  – Can provide file system abstraction easily
• Thread management from compiler
• Reliable message delivery from network
• DBMS is only application running on the machine
  – no need for a scheduler
Very thin OS will do...

Local DBMS -- No Knobs
• Current DBMSs are WAY too hard to use
• Not enough talented DBAs to go around
• Tuning typically done by vendor’s SE
Want to have NO tuning knobs!
Only control: go/stop
Not clear how to do this!

Protocol
• Federation components must communicate with an asynchronous (stateless) protocol
• Design challenge for a world where sessions are the norm
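The stateless half of this requirement can be illustrated with a handler that keeps no per-client session: every request carries all the context needed to answer it. The request fields (`page_size`, `offset`) and the result set are hypothetical, and the sketch does not model asynchrony.

```python
ROWS = list(range(10))   # hypothetical result set held at one site

def handle(request):
    """Answer a request using only its own fields; no session state is kept."""
    start = request.get("offset", 0)
    size = request["page_size"]
    page = ROWS[start:start + size]
    next_offset = start + size if start + size < len(ROWS) else None
    return {"rows": page, "next_offset": next_offset}

first = handle({"page_size": 4})
second = handle({"page_size": 4, "offset": first["next_offset"]})
print(first)   # {'rows': [0, 1, 2, 3], 'next_offset': 4}
print(second)  # {'rows': [4, 5, 6, 7], 'next_offset': 8}
```

Because the client hands back `next_offset` itself, any replica of the site could serve the second request.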

Local DBMS -- Attacking Bloat
• Basic Problem -- two data representations
  – the log
  – the data in the data base
• Consistency of these representations on crashes drives a lot of complexity

Idea Number One
• One representation -- no log
• “No overwrite” versioning storage system (like POSTGRES) for undo
• Wide area replication for recovery
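A toy illustration of why "no overwrite" storage needs no log for undo: updates append new versions, and versions written by transactions that never commit are simply never visible. This is a sketch of the general idea, not the POSTGRES storage system; the class and method names are made up.

```python
class NoOverwriteStore:
    """Toy versioned store: updates append new versions, nothing is overwritten."""

    def __init__(self):
        self.versions = {}      # key -> list of (txn_id, value), oldest first
        self.committed = set()  # ids of committed transactions

    def write(self, txn_id, key, value):
        self.versions.setdefault(key, []).append((txn_id, value))

    def commit(self, txn_id):
        self.committed.add(txn_id)

    def read(self, key):
        """Return the newest committed value; uncommitted versions are invisible."""
        for txn_id, value in reversed(self.versions.get(key, [])):
            if txn_id in self.committed:
                return value
        return None

store = NoOverwriteStore()
store.write(1, "balance", 100)
store.commit(1)
store.write(2, "balance", 50)   # txn 2 never commits (abort or crash)
print(store.read("balance"))    # 100
```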

Issue
• POSTGRES storage system required 4 writes to commit a transaction
  – too slow to be interesting in OLTP
• Can we design a “no overwrite” storage system with high performance?

Idea Number Two
• Log is the only storage system
• When data is brought into main memory, it is “swizzled” into a high performance format
• and “unswizzled” on cache eviction
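The log-only idea can be sketched with a toy store in which the log holds serialized records and the cache holds their in-memory form. JSON stands in for the on-disk format and a Python dict for the "high performance" format; both choices, and all names, are assumptions for illustration.

```python
import json

class LogStore:
    """Toy log-only store with an in-memory cache of swizzled records."""

    def __init__(self):
        self.log = []     # append-only list of serialized records
        self.index = {}   # key -> position of the latest log entry for that key
        self.cache = {}   # key -> swizzled (in-memory dict) form

    def _append(self, key, record):
        self.log.append(json.dumps({"key": key, "rec": record}))
        self.index[key] = len(self.log) - 1

    def put(self, key, record):
        self._append(key, record)
        self.cache[key] = dict(record)

    def get(self, key):
        if key not in self.cache:   # cold read: swizzle from the log
            self.cache[key] = json.loads(self.log[self.index[key]])["rec"]
        return self.cache[key]

    def evict(self, key):
        """Unswizzle: append the cached form to the log, then drop it."""
        self._append(key, self.cache.pop(key))

s = LogStore()
s.put("k", {"n": 1})
s.get("k")["n"] = 2   # update the swizzled copy in place
s.evict("k")          # the change reaches the log only on eviction
print(s.get("k"))     # {'n': 2}
```

In this toy, an in-place update is lost if the machine fails before eviction; a real design would have to address that, which the sketch does not attempt.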

Issue
• Can cache residency be made long enough to justify the overhead?
• Will “cold data” performance be unacceptably bad?

Semantic Heterogeneity
• Lots of approaches
  – code (Mariposa, Cohera)
  – Rules (Mergent)
  – Prolog
• Lots of past work
  – e.g. Multibase
Space well picked over!

Regres Focus
• Regres must be repository-based
• Regres must provide yellow pages for economic model
• Regres must provide “schema discovery” tools
Focus on the repository and building semantic heterogeneity support into it

Summary
• Thin local system; fat Federator
• Lots of interesting design challenges
• Focus of DBMS seminar this semester