Database Decay and What To Do About It
by Michael Stonebraker (with Dong Deng and Michael Brodie)
My Thesis
• Traditional database design is all wrong
  • Gold-standard techniques are not used "in the wild," and for good reasons
  • This contributes to database decay
• DBMS application development is all wrong
  • This also contributes to database decay
• There are kluges to slow down decay
  • Which I will talk about
• The best idea is a new paradigm for app development
  • Which I will talk about
Context
• Operational databases
• But "rot" is upstream of analytic databases
• And "flows downstream"
Traditional Database Design Wisdom
• Use a modeling tool (e.g., an Entity-Relationship diagram tool)
  • To construct/modify an E-R diagram
• When satisfied, push a button
  • The tool spits out a relational schema in 3rd normal form
• Code in ODBC/JDBC against this schema
Example

  Supplier (sno, sregion)  1 ---- Supply (qty) ---- N  Part (pno, psize, pcolor)

• Two entities (Supplier and Part, with attributes)
• One relationship (Supply, with attributes, which is 1 – N)
Lots of Elaboration
• Weak entities
• Inheritance
• …
• But the basics are all we need
Algorithm to Produce 3rd Normal Form
• A relation for each entity, with its attributes
• For each 1 – N relationship, add the key of the "1" side to the table for the "N" side, along with the relationship's attributes
• For each M – N relationship, add a table for the relationship with both keys and the attributes
For Our Data

  Supplier (sno, sregion)
  Part (pno, psize, pcolor, sno, qty)

• Application groups write applications against this relational schema, typically using ODBC/JDBC as the interface
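The two-table schema above can be sketched in SQLite (an illustrative sketch; table and column names follow the slides, the sample values are invented):

```python
import sqlite3

# The 3NF schema produced by the algorithm: the key of the "1" side
# (sno) and the relationship attribute (qty) live in the Part table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (
    sno     INTEGER PRIMARY KEY,
    sregion TEXT
);
CREATE TABLE Part (
    pno     INTEGER PRIMARY KEY,
    psize   INTEGER,
    pcolor  TEXT,
    sno     INTEGER REFERENCES Supplier(sno),  -- key of the "1" side
    qty     INTEGER                            -- relationship attribute
);
""")
conn.execute("INSERT INTO Supplier VALUES (1, 'Northeast')")
conn.execute("INSERT INTO Part VALUES (100, 4, 'red', 1, 500)")

# An application query joins on the embedded foreign key.
row = conn.execute(
    "SELECT sregion, qty FROM Part JOIN Supplier USING (sno)"
).fetchone()
print(row)  # ('Northeast', 500)
```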
Large Applications
• Usually designed/coded by separate application groups, reporting to different managers
  • Parts controlled by engineering
  • Supply controlled by procurement
  • Supplier controlled by finance
• In aggregate we have 3 applications
• Often on different development schedules and budgets
Now Suppose….
• Management decides to allow multiple suppliers for each part, as long as they are in different regions
  • Perhaps to get better terms …
• Such changes happen frequently "in the wild"
  • Often once a quarter or more
New Diagram

  Supplier (sno, sregion)  M ---- Supply (qty) ---- N  Part (pno, psize, pcolor)

• Two entities (Supplier and Part, with attributes)
• One relationship (Supply, with attributes, which is now M – N)
New Tables

  Supplier (sno, sregion)
  Part (pno, psize, pcolor)
  Supply (sno, pno, qty)

• The Supplier table is unchanged
• The "old" Part table is a join of Part and Supply (definable as a view over the "new stuff")
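A sketch of the restructured schema, with the "old" Part recreated as a view joining Part and Supply (the view name `old_part` is illustrative):

```python
import sqlite3

# Supply becomes its own table, so a part can now have several suppliers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sregion TEXT);
CREATE TABLE Part (pno INTEGER PRIMARY KEY, psize INTEGER, pcolor TEXT);
CREATE TABLE Supply (
    sno INTEGER REFERENCES Supplier(sno),
    pno INTEGER REFERENCES Part(pno),
    qty INTEGER,
    PRIMARY KEY (sno, pno)
);
-- The "old" Part table, defined as a view over the new tables.
CREATE VIEW old_part AS
    SELECT pno, psize, pcolor, sno, qty
    FROM Part JOIN Supply USING (pno);
""")
conn.execute("INSERT INTO Supplier VALUES (1, 'Northeast')")
conn.execute("INSERT INTO Supplier VALUES (2, 'West')")
conn.execute("INSERT INTO Part VALUES (100, 4, 'red')")
conn.execute("INSERT INTO Supply VALUES (1, 100, 500)")
conn.execute("INSERT INTO Supply VALUES (2, 100, 300)")  # now legal: 2nd supplier

rows = conn.execute("SELECT sno, qty FROM old_part ORDER BY sno").fetchall()
print(rows)  # [(1, 500), (2, 300)]
```

Reads against the old schema still work through the view; the trouble starts with writes, as the next slides show.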
Summary of the Traditional Wisdom
• Convert the "old" database to the "new" one
• Define the "old" tables as views
• Applications (in theory) continue to run
• The database stays in 3NF (defined as goodness by the research community)
A Dirty Little Secret
• Based on a survey of about 20 DBAs at 3 large companies in the Boston area…
  • None use this methodology, or
  • They use it for "green field" design and then abandon it
• I.e., the "gold standard" is not used "in the wild"
• So why not???
• The rest of the talk is a mix of speculation and ideas
Problem #1: View Update Issues
• Application: change the qty for part XXX to 500
  • An update to "old Part"
• But "old Part" is a view, which is a join
  • Updates to such views are disallowed in most RDBMSs
  • Either because of semantic ambiguity or vendor laziness
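SQLite exhibits the problem directly: views are read-only unless the DBA adds INSTEAD OF triggers, so the application's update fails (a minimal sketch; `old_part` is an illustrative view name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Part (pno INTEGER PRIMARY KEY, pcolor TEXT);
CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER);
CREATE VIEW old_part AS
    SELECT pno, pcolor, sno, qty FROM Part JOIN Supply USING (pno);
""")

# The application tries its old update against the join view.
failed = False
try:
    conn.execute("UPDATE old_part SET qty = 500 WHERE pno = 100")
except sqlite3.OperationalError as err:
    failed = True
    print(err)  # SQLite refuses: the target is a view, not a table
```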
Problem #2: Application Semantics May Change
• Delete the fact that supplier XXX supplies part YYY
  • In the "old" schema this deletes part YYY
  • Or turns fields in the Part table to "null"
• In the new world, where there are multiple suppliers of part YYY, it presumably only deletes a Supply tuple
Problem #3: Other Applications May Fail
• And they are in different departments
• With different budgets
Net Result
• Substantial risk!
  • Applications all over the enterprise must be found and corrected
  • Often with no budget for this activity
Less Risky Solution -- Kluge
• Leave the old schema as is:

  Supplier (sno, sregion)
  Part (pno, psize, pcolor, sno, qty)

• Replicate part info when multiple suppliers supply the same part
  • The result does not conform to any E-R diagram!
  • No longer 3NF
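A sketch of the kluge: pno can no longer be the primary key, and psize/pcolor are stored redundantly, which is exactly what 3NF exists to prevent (sample values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sregion TEXT);
-- Old shape kept, but pno is no longer a primary key.
CREATE TABLE Part (pno INTEGER, psize INTEGER, pcolor TEXT,
                   sno INTEGER, qty INTEGER);
""")
conn.execute("INSERT INTO Supplier VALUES (1, 'Northeast')")
conn.execute("INSERT INTO Supplier VALUES (2, 'West')")
conn.execute("INSERT INTO Part VALUES (100, 4, 'red', 1, 500)")
# Second supplier for the same part: replicate psize and pcolor.
conn.execute("INSERT INTO Part VALUES (100, 4, 'red', 2, 300)")

rows = conn.execute("SELECT * FROM Part WHERE pno = 100").fetchall()
print(len(rows))  # 2 rows describe one part: an update anomaly in waiting
```

If engineering later changes the part's color, every replicated row must be found and updated consistently, by application code.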
Our Two Examples Do the Right Thing
• No views to deal with
• Deleting a tuple in the Part table "does the right thing"
• Reasonable chance that other applications will continue to run correctly
The Net-Net (Our Conjecture)
• DBAs want to lower risk
  • Therefore DBAs try not to change the schema
• Our change does not conform to any E-R schema
  • The E-R diagram diverges from reality, if it ever existed
  • Hence, DBAs deal with table schemas, not E-R schemas
• Table schemas become increasingly distant from 3NF
  • We term this process "database decay"
• Which ultimately means that a total rewrite is the only way forward
So What to Do?
• Higher-level interfaces
• Kluges
  • Defensive schemas
  • Defensive applications
• A new development paradigm
Higher-Level Interfaces
• ORMs
  • Object model
• E-R frameworks
  • E-R model
• Solve neither of our problems!
• And they encapsulate the SQL, so it is inaccessible
Defensive Schemas
• Use the 3-table version, even though initially the 2-table one will work
  • Anticipate changes and "build them in"
  • And code against the larger schema initially
• Leave some "dead fields" in the initial schema
  • For later use
Defensive Schemas
• Think "tall and skinny"
• A single table:

  Data (entity-name, attribute-name, value)

• Or lots of binary tables:

  Attribute (entity-key, value)
Tall and Skinny
• Generally horrible performance
• But survives lots of changes
• And integrity control must be built into application logic
  • Which may need to be changed
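A minimal sketch of the tall-and-skinny idea (the table and entity-naming convention here are illustrative): one triple table absorbs schema changes because a "new column" is just a new attribute value, not an ALTER TABLE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The entire "schema" is one table of (entity, attribute, value) triples.
conn.execute("CREATE TABLE Data (entity TEXT, attribute TEXT, value TEXT)")
conn.executemany("INSERT INTO Data VALUES (?, ?, ?)", [
    ("part:100",   "psize",   "4"),
    ("part:100",   "pcolor",  "red"),
    ("supplier:1", "sregion", "Northeast"),
])
# Adding a brand-new attribute needs no schema change at all:
conn.execute("INSERT INTO Data VALUES ('part:100', 'weight_kg', '2.5')")

color = conn.execute(
    "SELECT value FROM Data WHERE entity = 'part:100' AND attribute = 'pcolor'"
).fetchone()[0]
print(color)  # red
```

The cost is visible even here: everything is stored as text, every "row" of the old schema becomes several self-joins to reassemble, and nothing stops an application from writing `pcolour` instead of `pcolor`.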
Defensive Applications
• Never use Select * From XXX
  • If a field you don't use is dropped, this code dies
• Put only the fields you actually need in the target list
Defensive Applications
• Use Select Sum(…) From XXX
  • Even if you know there is only one result
  • Later on there may be more than one, and this code will continue to work
• Additional examples in the paper
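A sketch of the Sum() defense (the `total_qty` helper is illustrative): the same query is correct both before and after the one-supplier-per-part assumption is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER)")
conn.execute("INSERT INTO Supply VALUES (1, 100, 500)")

def total_qty(pno):
    # SUM works for one row today and for many rows tomorrow;
    # a plain "SELECT qty" would silently return just one of them.
    return conn.execute(
        "SELECT SUM(qty) FROM Supply WHERE pno = ?", (pno,)
    ).fetchone()[0]

print(total_qty(100))  # 500 with one supplier
conn.execute("INSERT INTO Supply VALUES (2, 100, 300)")  # schema change lands
print(total_qty(100))  # 800: the same code still gives the right answer
```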
Best Idea – Change the Paradigm
• ODBC is coded all over the enterprise
  • Where it is unmanageable
• Application programmers must all understand the schema
  • Which is a big challenge for large applications with 10's to 100's of tables
• And they must understand defensive programming
  • Again, a big challenge
Old Way

  App 1        App 2        App 3
    |            |            |
    +------ ODBC/JDBC -------+
                 |
              database
New Way

  App 1        App 2        App 3
    |            |            |
    +------- messages -------+
                 |
             middleware
                 |
              database       (DBA-controlled)
Processing Model
• Application programmers cannot write SQL!
• Instead, they negotiate with the DBA
  • A programmer proposes a task
  • If the DBA agrees, the DBA writes the SQL
  • And defines a messaging interface for the application programmer
Processing Model
• Think web services
• Think stored procedures
• Think execute_task(parameters)
• We will focus on a SQL interface – since we want to figure out what will continue to run when the schema changes
DBA Effectively Has
• A database of (Task, SQL) pairs
• There are lots of details (multiple commands, code in between the SQL commands, transactions, …) which we ignore
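The (Task, SQL) registry can be sketched as follows (the names `TASKS` and `run_task` are illustrative, not from the talk): applications send a task name and parameters; only the middleware knows the schema and the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sregion TEXT);
CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER);
""")

# The DBA's database of (Task, SQL) pairs.
TASKS = {
    "order_part": "INSERT INTO Supply (sno, pno, qty) VALUES (:sno, :pno, :qty)",
    "total_qty":  "SELECT SUM(qty) FROM Supply WHERE pno = :pno",
}

def run_task(name, **params):
    # The application never sees the SQL; when the schema changes,
    # the DBA rewrites the registry entries, not the applications.
    return conn.execute(TASKS[name], params).fetchall()

run_task("order_part", sno=1, pno=100, qty=500)
print(run_task("total_qty", pno=100))  # [(500,)]
```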
And….
• The DBA constructs the initial relational schema
  • Perhaps iteratively
  • However, think tables, not some other notation
  • Remember: the schema may decay!
• App programmers implement business logic, but don't need to know the schema
• The DBA converts messages into SQL
Can We Help?
• With the initial schema
• With the replacement SQL
  • We will start here
Suppose….
• The schema is considered a collection of tables with primary-key/foreign-key relationships
• Think of this as a graph
  • Connected by PK/FK "join paths"
  • At scale: 10's to 100's of tables
In General
• The parameters of each message specify locations in this graph
• And we require the "shape" of the stored procedure
  • 1 input or N inputs
  • 1 output or N outputs
Example
• Task: Order part XXX from supplier YYY (generates an order of size qty) -- shape is one part and one supplier
• Graph:

  Supplier ---- Supply ---- Part
Example
• Straightforward to (mostly automatically) generate the SQL from the message
  • Because there is only a single join path:

  Supplier ---- Supply ---- Part
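The single-join-path idea can be sketched as follows (an illustrative sketch, not the authors' prototype; the graph encoding and function names are assumptions): walk the PK/FK graph between the tables named by the message's parameters and emit the join predicates along the path.

```python
from collections import deque

# PK/FK graph: each edge carries the join predicate between two tables.
GRAPH = {
    "Supplier": {"Supply": "Supplier.sno = Supply.sno"},
    "Supply":   {"Supplier": "Supplier.sno = Supply.sno",
                 "Part":     "Supply.pno = Part.pno"},
    "Part":     {"Supply":   "Supply.pno = Part.pno"},
}

def join_path(src, dst):
    # BFS for the shortest join path from src to dst in the schema graph.
    prev = {src: None}
    queue = deque([src])
    while queue:
        table = queue.popleft()
        for nbr in GRAPH[table]:
            if nbr not in prev:
                prev[nbr] = table
                queue.append(nbr)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return list(reversed(path))

def generate_sql(select_list, src, dst, where):
    path = join_path(src, dst)
    preds = [GRAPH[a][b] for a, b in zip(path, path[1:])]
    return "SELECT %s FROM %s WHERE %s" % (
        select_list, ", ".join(path), " AND ".join(preds + [where]))

sql = generate_sql("qty", "Supplier", "Part",
                   "Supplier.sno = :sno AND Part.pno = :pno")
print(sql)
```

With a unique path the generated SQL is determined entirely by the message's endpoints; this is exactly what breaks, and must be regenerated, when the schema change alters the path.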
When Business Conditions Change…
• Change the schema
  • It is the "right" thing to do!
• And map the SQL
Example
• Task: Order part XXX from supplier YYY (generates an order of qty)
• Graph:

  Supplier ---- Supply ---- Part
For Each SQL Command:
• We have the old path
• And the new path
• And the old SQL
For Each SQL Command:
• We can classify the command as one of:
  • Can be run unchanged (path and shape are the same)
  • Path has changed, but there is a unique replacement with the same shape (new SQL can be generated automatically)
  • There is no new path, or multiple new paths, or the shape has changed (human intervention required)
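The three-way classification can be sketched directly (an illustrative sketch; the function name and path encoding are assumptions, not the prototype's API):

```python
def classify(old_path, new_paths, old_shape, new_shape):
    """Classify a SQL command after a schema change.

    Returns 'unchanged', 'auto-rewrite', or 'human'.
    """
    if new_shape != old_shape:
        return "human"          # shape changed: semantics in question
    if old_path in new_paths:
        return "unchanged"      # same path and shape: command still runs
    if len(new_paths) == 1:
        return "auto-rewrite"   # unique replacement path: regenerate SQL
    return "human"              # zero or several candidate paths

# The running example: the old direct Supplier-Part path is gone,
# but exactly one replacement path exists, so SQL can be regenerated.
old_path = ("Supplier", "Part")
new_paths = [("Supplier", "Supply", "Part")]
shape = ("1 input", "1 output")
print(classify(old_path, new_paths, shape, shape))  # auto-rewrite
```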
Possible Strategies
• Run the tool in advance of committing the schema change, to see how extensive the changes are
• Use the tool at run time to fix the SQL at 1st access
  • Effectively "schema at 1st access"
• Or anything in between
We Have a Prototype
• Would love to find an "in the wild" situation
  • To try it on
  • Especially one with a lot of tables!
• This is a plea for a use case!!!!