Introduction Zachary G Ives University of Pennsylvania CIS

  • Slides: 35
Download presentation
Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems

Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 8, 2005 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

Welcome to CIS 550, Database and Information Systems! Instructor: Zachary Ives, zives@cis. upenn. edu

Welcome to CIS 550, Database and Information Systems! Instructor: Zachary Ives, zives@cis. upenn. edu § 576 Levine Hall North § Office hours: Thursday, 12: 30 -1: 30 PM (before class) TA: Weichen Wu, wewu@cis. upenn. edu § Office hours: Wednesday, 1: 00 -2: 00 PM, Levine 612 Newsgroup: upenn. cis 550 Home page: www. seas. upenn. edu/~zives/cis 550/ Texts and readings: § Ramakrishnan & Gerke, Database Systems, 3 rd ed. § Supplementary papers (to be handed out in class) 2

Course Format and Grading § Roughly one major topic area per week to two

Course Format and Grading § Roughly one major topic area per week to two weeks § Readings in the text & current or influential research papers § Occasionally, summaries/commentary on papers (5%) § Homework assignment for each topic area (30%) § One midterm (10%), one final exam (20%) § Project (30%) – groups of 3 § Build an RSS/Atom aggregator on top of a database, or § Build a P 2 P system for synchronizing tables § (Or propose your own idea) § General participation, discussion, intangibles (5%) 3

What’s this Course About? Most CS courses concentrate on code – our interest is

What’s this Course About? Most CS courses concentrate on code – our interest is managing and representing data Warning: this course doesn’t focus on teaching SQL or how to be an Oracle DBA (though it will get you started) It’s not an “application” course! … So what in the world are we studying for 14 weeks? ? ? 4

Data – What Do We Need to Do with It? 5

Data – What Do We Need to Do with It? 5

Ways Information Is Represented 6

Ways Information Is Represented 6

Example: An Encyclopedia Entry (www. wikipedia. com) § A database is a collection of

Example: An Encyclopedia Entry (www. wikipedia. com) § A database is a collection of data elements (facts) stored in a computer in a systematic way, such that a computer program can consult it to answer questions. The answers to those questions become information that can be used to make decisions that may not be made with the data elements alone. The computer program used to manage and query a database is known as a database management system (DBMS). The properties of database systems are studied in information science. § At the core of the concept of a database is the idea of a collection of generic facts, or pieces of knowledge. Facts may be structured in a number of ways, known as database models. For instance, one database model is to associate each fact with a record representing an entity (such as a person), and to arrange these entities into trees or hierarchies – the hierarchical database model. Another model is to arrange facts into sets of values which satisfy logical predicates – the relational database model. § The first database management systems were developed in the 1960 s. A pioneer in the field was Charles Bachman. Two key data models arose at this time: the network model (developed by CODASYL) followed by the hierarchical model (as implemented in IMS). These were later usurped by the relational model, which was contemporary with the so-called flat model designed for very small tasks. 7

Example: To-Do List § § Buy school supplies Go to orientation Exercise Buy Philly

Example: To-Do List § § Buy school supplies Go to orientation Exercise Buy Philly postcards due 9/7 on 9/7 every M/W/F How does this differ from the plain text model? What might you do with it that you couldn’t? 8

Example: Your PDA/Cell Phone Calendar Contacts Event Day Lunch 10/24 Advice Levine Biking 10/26

Example: Your PDA/Cell Phone Calendar Contacts Event Day Lunch 10/24 Advice Levine Biking 10/26 Dinner 10/26 When Who Where 1 pm Zack Cavanaugh’s 10/25 9 am Dr. Smith 599 9 am Jane 6 PM Jane Pottruck Food Court Who Zack Dr. Smith Jane Phone 6 -2789 6 -1234 543 -2198 jane Email Office zives 576 Levine N drsmith 599 L 2220 Walnut St. 9

What If We Want to Include Contact Info on Our Calendar? § Do we

What If We Want to Include Contact Info on Our Calendar? § Do we also want to keep e-mail addresses, telephone numbers etc. ? § Should we expand the number of columns in our table: Event When Who-name Lunch 1 pm Zack … Who-email zives Who-tel …. Where 6 -2789 …. Cav… What is the trade-off in terms of entering data? 10

“Link” Calendar with Contacts? § Why can’t we “link” calendar entries with contact info,

“Link” Calendar with Contacts? § Why can’t we “link” calendar entries with contact info, and show the results of the two? § The link could be based on something as simple as the person's name § (What’s the danger here? What else might work better? ) § This brings up an issue – how to “follow links” § If we were to do this in C, how might it be done? In Java? 11

Another Kind of Link: Classes and Subclasses Person has attributes: § § § ssn

Another Kind of Link: Classes and Subclasses Person has attributes: § § § ssn Penn. ID set of user IDs given name family name … Student IS A person who: § § takes courses is given grades is taught listens to lectures in class, OR over the Web, OR on videotape This is yet another kind of information § How have you previously seen such relationships encoded? 12

Data Representation and Modeling All of the data we’ve seen have an implicit data

Data Representation and Modeling All of the data we’ve seen have an implicit data model The data model includes some basic assumptions about what’s an “item” of data, how to interpret it, and so on The relational data model was the first model that is independent of its data structures and implementation § A theory of normalization guides you in designing relations § Concepts have been adapted to form object-oriented data models, XML, etc. § There are “sibling” fields to databases with semantic models: natural language (meanings of bags or lists of words) Information retrieval (associations between words and documents) ontologies (inferences about relationships between classes, and classes and subclasses of relationships) Ø We’ll mostly focus on the relational model and its descendants 13

The DBMS Provides an Interface over the Database § A database (DB) is a

The DBMS Provides an Interface over the Database § A database (DB) is a large, integrated collection of data Generally is cohesive in “some” way § A DB models a real-world organization or unit § A database management system (DBMS) is a software package designed to store and manage databases § Reliable storage & recovery of 100 s of GB § Querying/updating interface and API (for applications and Web pages) § Support for many concurrent users § Why do we need a DBMS, instead of coding in Java? 14

DBMS Benefit #1: Generality and Declarativity § Don’t require the programmer or user to

DBMS Benefit #1: Generality and Declarativity § Don’t require the programmer or user to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc. § Instead, the programmer/user programs with a logical model in mind § The DBMS “makes it happen” based on an understanding of relative costs of different methods 15

Benefit #2: Efficiency and Scale § Size of personal address book is probably less

Benefit #2: Efficiency and Scale § Size of personal address book is probably less than 100 entries, but there are things we'd like to do quickly and efficiently: § “Give me all appointments on 10/28” § “When am I next meeting Jim? ” § “Program” these as quickly as possible (and make them resilient to data format changes) § Scale to a corporate calendar with hundreds of thousands of entries 16

Benefit #3: Management of Concurrency and Reliability § Suppose other people are allowed access

Benefit #3: Management of Concurrency and Reliability § Suppose other people are allowed access to your calendar and are allowed to modify it? How do we stop two people changing the file at the same time and leaving it in a physical (or logical) mess? § Suppose the system crashes while we are changing the calendar. How do we recover our work? § This requires a basic concept… 17

Transactions § Key concept for concurrency is that of a transaction: an atomic sequence

Transactions § Key concept for concurrency is that of a transaction: an atomic sequence of database actions (read/write) on data items (e. g. calendar entry). § Key concept for recoverability is that of a log: keeping track of all actions carried out by the db. 18

Anatomy of a Typical DBMS (Simplification!) API/GUI Query Stats Optimizer Physical plan Exec. Engine

Anatomy of a Typical DBMS (Simplification!) API/GUI Query Stats Optimizer Physical plan Exec. Engine Catalog Schemas Data/etc Logging, Recovery, Requests Index/file/rec Mgr Data/etc Requests Buffer Mgr Pages Storage Mgr Data Red = logical Blue = physical Requests Storage 19

The Database Abstraction Provided by the DBMS We think of databases at two levels:

The Database Abstraction Provided by the DBMS We think of databases at two levels: § Logical structure: What users/programmers see – program or query interface § Physical structure: Organization on disk, indices, etc. The logical level is further split into: § Overall database design (conceptual; seen by the DB designer) § Views that various users get to see 20

The Three-level Architecture for Databases View 1 View 2 Schema … View N Logical,

The Three-level Architecture for Databases View 1 View 2 Schema … View N Logical, Conceptual Level Physical Level (file organization, indexing) 21

Data Independence A user of a relational database system should be able to use

Data Independence A user of a relational database system should be able to use the database without knowing about how the precisely how data is stored, e. g. SELECT When, Where FROM Calendar WHERE Who = “Jane" After all, you don't worry IEEE floating-point when you do division in a Java program or with a calculator 22

More on Data Independence Logical data independence Protects the user from changes in the

More on Data Independence Logical data independence Protects the user from changes in the logical structure of the data: could reorganize the calendar “schema” without changing how we query it Physical data independence Protects the user from changes in the physical structure of data: could add an index on who (or sort by when) without changing how the user would write the query, but the query would execute faster (query optimization) 23

Presentation Layer (4 th Tier): Data-Driven Web Sites HTML view Processing § “Data driven

Presentation Layer (4 th Tier): Data-Driven Web Sites HTML view Processing § “Data driven web sites” also add an HTML “presentation” layer on top of what we’ve seen § Or they use XML plus “style sheets” to get the same effect 24

An Issue: 80% of the World’s Data is Not in a DB! Examples: §

An Issue: 80% of the World’s Data is Not in a DB! Examples: § Scientific data (large images, complex programs that analyze the data) § Personal data § WWW and email (some of it is stored in something resembling a DBMS) Data management is expanding to tackle these problems § Flexibility – data management imposes many constraints to make problems solvable § Must deal with entities outside our control In this course, we’ll start by focusing on databases, but eventually look “outside the box” at the Web and at gluing together data from many places 25

Combining Databases with Mediators (a kind of middleware) “Mediated Schema” XML A layer above

Combining Databases with Mediators (a kind of middleware) “Mediated Schema” XML A layer above three-tiered architecture, to combine multiple databases/sources on the Web § Some of these are databases over which we have no control § Some must be accessed in special ways § We generally need to think about how to translate between different database formats § This problem of data integration is a particular focus here at Penn (and several other top departments) 26

How Does One Build a Database? § Start with a conceptual model § Design

How Does One Build a Database? § Start with a conceptual model § Design & implement schema § Write applications using DBMS and other tools § Many ways of doing this where the hard problems are taken care of by other people (DBMS, API writers, library authors, web server, etc. ) § Common applications include PHP/JSP/servlet-driven web sites § The DBMS takes care of query optimization and execution 27

Conceptual Design fid PROFESSOR name Teaches STUDENT sid name COURSE Takes cid name semester

Conceptual Design fid PROFESSOR name Teaches STUDENT sid name COURSE Takes cid name semester 28

Designing a Schema (Set of Relations) STUDENT COURSE Takes sid name sid cid name

Designing a Schema (Set of Relations) STUDENT COURSE Takes sid name sid cid name sem 1 Jill 1 550 -0105 550 -0103 DB F 05 2 Bo 1 700 -1005 700 -1003 AI S 05 3 Maya 3 500 -0105 501 -0103 Arch F 05 § Convert to tables + constraints § Then need to do “physical” design: the layout on disk, indices, etc. PROFESSOR Teaches fid name fid cid 1 Ives 1 550 -0105 2 Saul 2 700 -1005 8 Roth 8 501 -0105 29

Applications Use Queries in SQL § Structured Query Language, often embedded (e. g. ,

Applications Use Queries in SQL § Structured Query Language, often embedded (e. g. , in servlets, JSP) § Based on restricted first-order logic expressions over relations § Not procedural – defines constraints on the output § Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine <html> <body> <!-- hypothetical. Embedded. SQL: SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT. sid = Takes. s. ID AND Takes. c. ID = cid --> </body> </html> 30

Processing the Query Web Server / UI / etc Hash STUDENT Optimizer Takes by

Processing the Query Web Server / UI / etc Hash STUDENT Optimizer Takes by cid Execution Engine Merge COURSE by cid Storage Subsystem SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT. sid = Takes. s. ID AND Takes. c. ID = cid 31

DBMSs in the Real World A huge industry for 20% of the world’s data!

DBMSs in the Real World A huge industry for 20% of the world’s data! § Big, mature relational databases § IBM DB 2, Oracle, Microsoft SQL Server § Adding advanced features, including “native XML” support § “Middleware” above these systems § SAP, Siebel, People. Soft, dozens of special-purpose apps § “Application servers, ” aka TP Monitors § Web. Sphere, Web. Logic, Tomcat, … § Support transactional applications in EJB, . NET, etc. § Integration and warehousing systems § BEA Aqua. Logic, DB 2 Information Integrator § Current trends: § Web services; XML everywhere § Smarter, self-tuning systems 32

So What about Database Research? § Not focusing on the problems of Oracle… §

So What about Database Research? § Not focusing on the problems of Oracle… § Understanding what’s possible to do with XML § Better query processing § Better languages for meta-info (e. g. , constraints) § Better interaction with the non-database world § § Data streams and sensors Peer-to-peer architectures and data cooperatives Integrating data from different formats Lots of theory and systems-building § You’ll see familiar concepts in this course from operating systems and from complexity theory/logic § … And from programming languages, AI planning, … 33

In this Course. . . § Study relational databases, their design, how to query,

In this Course. . . § Study relational databases, their design, how to query, what forms of indices to use. § Beyond relational algebra: a logical model of data (Datalog), recursion § XML and semi-structured data models § Understanding DB internals § How DBs are built § Performance implications § Integrating and mediating between databases (a huge problem today) 34

Your First Assignment… § Read the Codd paper § Write a one-paragraph summary of

Your First Assignment… § Read the Codd paper § Write a one-paragraph summary of the key ideas in this paper and post to the newsgroup: upenn. cis 550 35