CS 519 Big Data Exploration and Analytics Review
CS 519: Big Data Exploration and Analytics Review: Relational Query Languages
Announcements • The due data for selecting papers is Tuesday. – Your top 5 choices 2
Relational model and query languages • Relational model defines data organization • Relational query languages define data retrieval/manipulation operations. 3
Relational model Relation name Attribute names Book: Title Price Category Year My. SQL $102. 1 computer 2001 Cell biology $201. 69 biology 1954 French cinema $53. 99 art 2002 NBA History sport 2010 $63. 65 tuples 4
Relational Model • Attributes – Atomic values – Domain: string, integer, real, – Keys: no duplicate values • Each relation must have keys – A relation does not contain duplicate tuples. 5
Database Schema vs. Database Instance • Schema: S: Book(Title, Price, Category, Year) • Instance: Title Price Category Year My. SQL $102. 1 computer 2001 Cell biology $201. 69 biology 1954 French cinema $53. 99 art 2002 NBA History sport 2010 $63. 65 • Values of each attribute A in I: active domain of A, adom(A) 6
SQL • A declarative language for querying data stored in relational databases. • Much easier to use than procedural languages. – Say what instead of how SELECT returned attribute(s) FROM table(s) WHERE conditions on the tuples of the table(s) 7
SQL Example Movie(id, title, year, total-gross) Actor(id, name, b-year) Plays(mid, aid) What movies are made in 1998? SELECT title FROM movie WHERE year = 1998; 8
SQL Example Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • Find actors who played in a movie whose total gross is more than $2, 000. SELECT * FROM Actor, Movie, Plays WHERE Movie. id = Plays. mid AND Plays. aid = Actor. id AND total-gross > 2000000; 9
Formal Relational Query Languages • Formal languages that express queries over relational schemas. • Relational Algebra • Datalog (recursion-free with negation) • Relational calculus • Used to explore the properties of relational model. • Easier to use than SQL in some application domains. 10
Relational Algebra (RA) • Used by RDBMS to execute queries • Six operators – Selection σ – Projection Π – Join ∞ – Union – Difference – – Renaming ρ (for named perspective) 11
Relational Algebra Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • Selection σ – σtitle=‘Fargo’ (Movie) • Projection Π – Πb-year (Actor) • Join ∞ – Movie ∞id=mid Plays 12
Datalog • First created to support recursive queries over relational databases. • Easier to use than relational algebra. • Used extensively in research and industry – Data integration, networking, logic programming, learning, distributed processing, … • We talk about the recursion-free datalog. 13
Datalog Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • Each tuple in database is a fact Movie(236878, ‘Godfather I’, 1972, 40000000) Movie(879900, ‘Godfather II’, 1974, 3900000) Actor(090988, ’Robert De Niro’, 1943) • Each query is a rule Movies that were produced in 1998 and made more than $2, 000. Q 1(y): - Movie(x, y, 1998, z), z > 2000. 14
Datalog Example Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • Actors who played in a movie whose total gross is more than $2, 000. Q 2(y): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), f > 2000. • Actors who played in a movie whose total gross is more than $2, 000 and a movie made in 1998. Q 3(y): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), f > 2000, Plays(g, x), Movie(g, l, 1998, h). 15
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) Datalog Q 2(y): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), f > 2000. atom head body y: head variable; x, z, t : existential variables • Extensional Database Predicates (EDB) – Movie, Actor, Plays • Intentional Database Predicate (IDB) – Q 2 16
Datalog programs Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • A collection of rules: union • Actors who played in a movie with gross of more than $2, 000 or a movie made after 1990. Q 4(y): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), f > 2000. Q 4(y): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), w > 1990. 17
Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) Views • Similar to views in SQL • Actors who played in a movie with gross of more than $2000 and in a movie with ‘Robert De Niro’. V(x, y, z): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), f > 20000. Q 5(y): - V(x, y, z), Plays(t, x), Plays(t, f), Actor(f, ’Robert De Niro’, g, h). • Unfolding Definition of V Q 5(y): - Actor(x, y, z), Plays(t, x), Movie(t, v, w, f), f > 20000, Plays(u, x), Plays(u, f), Actor(f, ’Robert De Niro’, g, h). 18
Datalog with negation Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • All actors who did not play in a movie with ‘Robert De Niro’. U(x, y, z): - Actor(x, y, z), Plays(t, x), Plays(t, f), Actor(f, ’Robert De Niro’, g). Q 6(y): - Actor(x, y, z), not U(x, y, z). 19
Safe datalog rules Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • Unsafe rules: V(x, y, z): - Actor(x, y, 1998), z > 200. W(x, y, z): - Actor(x, y, z), not Plays(t, x). • A datalog rule is safe if every variable appear in at least one positive predicate 20
Datalog to SQL • Non-recursive datalog with negation represents the core functionalities of SQL • We can translate each non-recursive datalog program to a core SQL query and vice versa. 21
Equivalency Theorem • RA and non-recursive datalog with negation the same set of queries. • Relational queries. 22
Conjunctive queries (CQ) • One datalog rule. • SELECT-DISTINCT-FROM-WHERE. • Select/project/join (σ, Π, ∞) fragment of RA. • Existential/ conjunctive fragment of RC • There is not any comparison operator (<, ≠, …) in CQ. – If used the family is called CQ<, CQ≠, … 23
CQ examples Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) Actors who played in “LTR”. Q 7(y): - Actor(x, y, z), Plays(t, x), Movie(t, ’LTR’, w, f). Non-CQ: Actors who played in some movies with only one actor. 24
Query equivalency and containment • Interesting and long standing problems in query processing. • Queries q 1 and q 2 are equivalent if and only if for every database instance I, q 1(I) = q 2(I) – Shown as q 1 q 2 • Query q 1 is contained in q 2 if and only if for every database instance I, q 1(I) q 2(I) – Shown as q 1 q 2 25
Containment examples Is q 1 q 2? q 1(x): - R(x, y), R(y, z), R(z, w). q 2(x): - R(x, y), R(y, z). q 1(x): -R(x, y), R(y, ’Joe’). q 2(x): -R(x, y), R(y, z). q 1(x): - R(x, y), R(y, z), R(z, x). q 2(x): - R(x, y), R(y, x). 26
Containment examples Is q 1 q 2? q 1(x): - R(x, y), R(y, y). q 2(x): - R(x, y), R(y, z), R(z, t). 27
Query semantics • Rules based form of a CQ q(u): - R 1(u 1), …, Rn(un). – ui is shorthand for (x, y, …, z). • Valuation v is a total function from a set of variables to domain (dom) and identity on constants in the domain. 28
Query semantics • The set of variables in q is shown as var(q) e. g. var(q 1) = {x, y} q 1(x): - R(x, y), R(y, y). • The image of database instance I under query q, q(u): - R 1(u 1), …, Rn(un) is q(I) = {v(u)| v is a valuation over var(q), for each } 29
Query homomorphism • A homomorphism is a function from var(q 2) to var(q 1) s. t. for each atom R(x, y, …) in the query q 1 there is an atom R(h(x), h(y), …) in q 2. • h leaves the constants in q 2 intact. • Example q 1(x): - R(x, y), R(y, z), R(z, w). q 2(x): - R(x, y), R(y, z). We treat head variables, ‘x’, as constants, i. e. , the same in q 1 and q 2. 30
Homomorphism Theorem • Given CQs q 1 and q 2, we have q 1 if there exists a homomorphism q 2 if and only. • Example: q 1(x): -R(x, y), R(y, z), R(z, w). q 2(x): -R(x, y), R(y, z). – Since is a homomorphism, we have . 31
Homomorphism examples q 1(x): -R(x, y), R(y, ’Joe’). q 2(x): -R(x, y), R(y, z). q 1(x): - R(x, y), R(y, z), R(z, x). q 2(x): - R(x, y), R(y, x). There is no homomorphism: 32
Homomorphism examples Is q 1 q 2? q 1(x): - R(x, y), R(y, y). q 2(x): - R(x, y), R(y, z), R(z, t). 33
Homomorphism Theorem • Given CQs q and q’, we have q there exists a homomorphism q’ if and only if. • Proof: For each w in q(I), there is a valuation v that maps free tuples in q to I such that v(u)= w. Thus, h(v) will map free tuples in q’ into I and h(v(u’)) = w, where w is in q’(I). Using canonical instances: read the book page 117. 34
Checking containment • Check if there exists a homomorphism between queries. • The problem is NP-complete, proved by reducing from 3 -SAT. • Since the size of queries are relatively small, the process is sufficiently fast. 35
Query minimization • A conjunctive query q is minimal if for every other conjunctive query q’ , if q’ q, q’ has at least as many atoms as q. • Examples: q 1(x): - R(x, z), R(z, t), R(x, w). q 2(x): - R(x, z), R(z, t), R(x, ’Joe’). 36
Query minimization algorithm 1. Remove an atom from q. Let’s call new query q’. 2. We have q q’. 3. Check to see if q’ q, if it is then remove the atom permanently. • Example: q 1(x): - R(x, z), R(z, t), R(x, w). q 2(x): - R(x, z), R(z, t). We have a homomorphism from q 1 to q 2. 37
Larger families: UCQ Movie(mid, title, year, total-gross) Actor(aid, name, b-year) Plays(mid, aid) • CQ with union Movies that were produced in 1998 or made more than $2, 000. Q 1(y): - Movie(x, y, 1998, z). Q 1(y): - Movie(x, y, z, t), t > 2000. • We can extend homomorphism theorem for UCQs. 38
Homomorphism Theorem for UCQ • Given UCQs there is a and , we have if and only if for every , such that • Thus, we can use apply homomorphism theorem to each CQ in a UCQ to check the containment. • Containment checking for UCQs is NP-complete. 39
Larger families: relational queries • Containment checking for relational queries is undecidable. • Proved by reduction from finite satisfiability problem: – Given a query, is there any (finite) database where the query as at least one answer. 40
- Slides: 40