Archetypal Databases or What Is a DBMS Zachary

Archetypal Databases, or, What Is a DBMS? Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 9, 2008

Today § Last time we saw the “vision” of the relational database – decoupling “what” from “how” § Storage scheme shouldn’t matter to the programmer § … Only the operations to be performed § But how do we realize this vision? § And what about all of the other aspects of a DBMS? § Concurrency, recovery § Creating tables, performing updates, … 2

The INGRES System § Interactive Graphics and Retrieval System § Probably the first usable RDBMS § One of the first real projects built on UNIX § On a minicomputer, the PDP-11, which greatly constrained things due to limited memory § No low-level access to disk – prevented clustering in storage § Based on a relational query language called QUEL Many religious wars between QUEL and SEQUEL camps § Stonebraker’s first commercialized project § Changed university IP rules forever § Today: was recently open-sourced by Computer Associates … but mostly overshadowed by My. SQL and Postgres (a later Stonebraker project) 3

Processes § Precompiler converts C+EQUEL into a C program that gets compiled C program § Due to 64 K process size limits, needed to break things into 4 processes communicating via pipes Lexer, parser concurrency query mod. 2 Query processing 3 Utilities, recovery 4 § Even some of these needed to be broken into overlays § Later added a Process 2. 5 to make 5 processes 4

System-R § Probably a bit closer to today’s DBMSs, at least at the low level – but didn’t run on UNIX § Based on years of experience with IMS and other IBM database systems § At this point, prior to Selinger (note limited optimization), Lindsay (not a huge performance focus) § Components were built to be generally reusable § Default language was called SEQUEL, by Chamberlin and Boyce 5

System-R Architecture SEQUEL or e-SEQUEL (RDI) Relational Data System (RDS) parsing, (limited) optimization (RSI) Relational Storage System (RSS) Storage, concurrency, access paths (“images”, “links”), triggers 6

Languages § QUEL vs. SEQUEL § The focus of many religious wars, though they borrowed each other’s ideas § Ultimately IBM won due to (1) market presence, (2) Oracle § SEQUEL had NULL concept, bag semantics, aggregation § Postgres was originally based on QUEL, hence Postgre. SQL QUEL: more orthogonal, simple EQUEL embedding RANGE OF X IS MYREL RETRIEVE (A. B) WHERE … ## C-block SEQUEL: more block-oriented, embedded via cursors and row sets, aggregation SELECT A. B FROM MYREL X WHERE … 7

Administration of a Database INGRES: § Only the DBA can create shared relations and grant access to them § It’s possible for a user to create temporary relations (and required for query processing) System-R: § Anyone can create private relations and grant access to them § It’s possible to create persistent or temporary relations § Mechanisms to add columns to tables (by default these become NULL) 8

Integrity and Security § Both systems allowed much more general notions of integrity than key constraints § Assertions as a means of validation! § Query modification is one method used to do this § Conjunction of assertions plus the query/update § Only limited expressiveness – in INGRES there must be at most one variable § Security models: § System-R used a view-based security model § INGRES used query modification § How do these differ? Which is better? 9

Data Storage

Physical Data Layout § Tuples must fit within a page § All tuples have TIDs t 1 t 2 t 3 § TID is a page + index § System-R § Allowed links between tuples § Built-in concept of NULL § Where else are TIDs useful? § Page-level clustering: done by System-R in extents § What about INGRES? 11

Access Paths – INGRES § INGRES Access Methods Interface (AMI): § Can have unordered “heap file” § Hash-based indexing § ISAM-like indexing on a primary key Predates B+ Tree – initially height-balanced, but afterwards index structure is static; requires overflow pages “ISAM-like” because can’t sort across pages – can only lock one page at a time!!! § Lookups are done via OPENR(), GET() § FIND() can be used to find a start or stop point § Can call PARAMD, PARAMI to get parameters of data or index § Idea of extensible access methods – later revisited in Generalized Search Trees (Gi. ST) 12

Access Paths – System-R § “Images” § Slightly more than indices (which are included there) – also linked lists and orderings § Index structures include ISAM and B Trees § Search arguments, aka sargable predicates § Note that clustering was a key concept here § Links via TIDs § Note that these were NOT relational in spirit! § Intention was to support IMS over RSS § Some code likely made its way back into IMS § How do these concepts relate to OO databases and today’s XML databases? 13

Catalogs and Indices as Logical Abstractions § System-R and INGRES decided to express catalogs as relations, making them accessible to queries as well as enabling reuse of code § How are they accessible in today’s RDBMSs? § INGRES even did this with indices – 1: 1 mapping between files and relations § Indices vs. views: how are they different? § How are they different in the presence of TIDs or links? 14

Query Processing

Query Processing - INGRES § DECOMP algorithm and One Variable Query Processor § Break every query into separate operations that generate a temp relation for every projection, selection, join § Pick a relation to iterate over, substitute value for variable § Repeat recursively § Note that (1) it’s interpreted, (2) it’s non-pipelined, and (3) it’s adaptive, always choosing the smallest relation for the substitution § Note that access paths were chosen in a pretty ad hoc way § Also, no concept of sargable predicates – instead, could FIND both upper and lower bounds, but needed to GET between 16

DECOMP Example § RANGE OF E, M IS EMPLOYEE RANGE OF D IS DEPT RETRIEVE (E. NAME) WHERE E. SALARY > M. SALARY AND E. MANAGER = M. NAME AND E. DEPT = D. DEPT AND D. FLOOR# = 1 AND E. AGE > 40 § First: apply selection and projection 17

DECOMP, Ctd. § RANGE OF D IS DEPT RETRIEVE INTO T 1 (D. DEPT) WHERE D. FLOOR# = 1 § RANGE OF E IS EMPLOYEE RETRIEVE INTO T 2(E. NAME, E. SALARY, E. MANAGER, E. DEPT) WHERE E. AGE > 40 § Now substitute these back into the main query 18

DECOMP, Ctd. § RANGE OF E IS T 2 RANGE OF M IS EMPLOYEE RANGE OF D IS T 1 RETRIEVE (E. NAME) WHERE E. SALARY > M. SALARY AND E. MANAGER = M. NAME AND E. DEPT = D. DEPT § Now pick the relation with smallest cardinality, e. g. , T 1, and substitute 19

One-Variable Substitution § foreach D in T 1, recursively process: § RANGE OF E IS T 2 RANGE OF M IS EMPLOYEE RETRIEVE (E. NAME) WHERE E. SALARY > M. SALARY AND E. MANAGER = M. NAME AND E. DEPT = *value. Of(D. DEPT)* § Now apply the selection to E… RANGE OF E IS T 2 RETRIEVE INTO T 3 (E. NAME, E. SALARY, E. MANAGER, E. DEPT) WHERE E. DEPT = *value. Of(D. DEPT)* 20

DECOMP, Recursively § So the query looks like: foreach D in T 1 RANGE OF E IS T 2 RETRIEVE INTO T 3 (E. NAME, E. SALARY, E. MANAGER, E. DEPT) WHERE E. DEPT = *value. Of(D. DEPT)* RANGE OF E IS T 3 RANGE OF M IS EMPLOYEE RETRIEVE (E. NAME) WHERE E. SALARY > M. SALARY AND E. MANAGER = M. NAME § Now choose the smallest relation for substitution (e. g. , T 3) 21

DECOMP, Recursively § foreach D in T 1 RANGE OF E IS T 2 RETRIEVE INTO T 3 (E. NAME, E. SALARY, E. MANAGER, E. DEPT) WHERE E. DEPT = *value. Of(D. DEPT)* § foreach E 2 in T 3 RANGE OF M IS EMPLOYEE RETRIEVE INTO (val 1) WHERE *value. Of(E 2. SALARY)* > M. SALARY AND *value. Of(E 2. MANAGER)* = M. NAME 22

Query Processing – System-R § Optimization is cost-based § Consider both disk cost (primary) and CPU cost (scaled by some H) § Compare clustered, non-clustered indices; sequential scan Needed to consider cases where data was interspersed with other relations § Every join is binary; query plans are compiled Merge join and nested loops join are present Also have a link-based join No dynamic programming yet – Selinger was not yet aboard 23

Views § INGRES: § Not supported at that time (though later) § System-R: § Single-table views are typically updatable, if one-toone (contrast with today’s SQL) Additionally, cursors could typically be modified! § Views as a security/encapsulation mechanism GRANT and REVOKE privileges 24

Triggers § Note that System-R had triggers from the beginning! § Later work led to the idea of active databases, which had very rich triggger languages 25

ACIDity

Rollback / Abort § Notion of transactions in System-R encompassed multiple operations § BEGIN_TRANS, END_TRANS, SAVE, RESTORE § Both systems had a notion of “old” and “new” pages (“shadow paging”) § Could roll back transactions by swapping back the old page § But how did that work with concurrency? 27

Concurrency & Locking § INGRES: § § Supports single QUEL statements as transactions Query locked all resources (table-level) before it began All page updates are atomic via locks Avoid deadlocks by preventing cases where they could occur – this is why they don’t do true ISAM § System-R: § Multiple levels of lock granularity logical: table-, range-level physical: page-, tuple-level § Shared, exclusive locks § Multiple isolation levels (READ UNCOMMITTED, READ COMMITTED, SERIALIZABLE; no REPEATABLE READ) § Resolve deadlocks by choosing a victim, restarting 28

Recovery § INGRES: § Deferred updates – for performance, isolation § Also used to make recovery possible § But how far does this take you? § System-R: § Notion of checkpoints and restarting § Transaction logging and replay § Not much mentioned about possibility of recovery from failed restart Today’s techniques even recover from that 29

Analysis

What Ideas from this Work Were Broken? § Shadow paging § Now everything is purely log-based § § SQL idiosyncrasies INGRES recovery Relations as files Query optimization 31

Discussion: What Ideas Are We Still Using? 32