Big Data and Databases CIS 5517 DataIntensive and
Big Data and Databases CIS 5517 – Data-Intensive and Cloud Computing Lecture 6 Slides based on Z. Ives’ at University of Pennsylvania 1
To This Point… We’ve discussed how to encode and operate on structured data (using Data. Frames) We’ve looked at how to convert from unstructured to structured data and how to manage heterogeneity Today, the first part of “big” (in this case, indeed large – not heterogeneous) data University of Pennsylvania 2
Why Is “Big” Hard? 1. Combinatorial explosion 2. Inherent sequence 3. Non-uniform access / storage costs 3
Inherent Sequence: Some Examples These algorithms are written sequentially. Parallel algorithms may exist, but they typically fun min(L): look different, and compilers may not “find” them min = L[0] based on sequential versions. for i in range(1, len(L)): if L[i] < min: = L[i] Also, min sometimes, instead, the algorithms are NP-hard, and the best we can do is hope to approximate them. return min But some problems also can’t be handled with approximation with any guarantees. 5
Non-Uniform Data Access / Storage Cost Let’s take a simple algorithm: Do a Cartesian product between pairs of items (r, s) in (R, S) Find the pair where fn(r, s) is minimum • (Informally) how would we write this in Python with Data. Frames, with iteration? 6
Nested Loops over Lists min = sys. maxsize for r_index, r_row in R_df. iterrows(): for s_index, s_row in S_df. iterrows(): new_val = fn(r_row, s_row) if (new_val < min): min = new_val Suppose R has 1 M tuples, S has 1 M tuples, each comparison takes 1 nsec 7
But Now Let’s Add Some Real-World Complexities We double R =2 M tuples, S = 2 M tuples, but now only half of R and half of S fit in memory! RM R X SM S R_on_disk. open() S_on_disk. open() min = sys. maxsize while r_row = R_on_disk. read_row(): while s_row = S_on_disk. read_row(): new_val = fn(r_row, s_row) if (new_val < min): min = new_val S_on_disk. move_to_start() 8
The “On Disk” Case: An Old Hard Disk • We can read 1 GBPS, but each request takes 4 msec • Latency vs throughput RM R X SM S R_on_disk. open() S_on_disk. open() min = sys. maxsize while r_row = R_on_disk. read_row(): while s_row = S_on_disk. read_row(): new_val = fn(r_row, s_row) if (new_val < min): min = new_val S_on_disk. move_to_start() 9
Other Options • SSD: We can read 10 GBPS, but each request takes 100 usec • Network: Another PC can take half of R, half of S • But it costs 4 msec to communicate 10
General Techniques, and Issues that Arise 1. Put data on disk (or disks!) and come up with schemes that don’t require iteration over all rows 2. Distribute data and computation and come up with schemes that don’t require iteration over all rows • • Across CPUs (or CPUs + GPUs) Across machines (eg in a compute cluster or cloud data center) But now: • • How do we efficiently retrieve the data? And what if the data is changing “in flight”? 11
Some Examples • • • Logs in a Web company – platforms like Data. Dog, Apache Flume Order processing and analytics at Amazon Ad analytics at Facebook, Google, Apple, … Twitter meme (or bot) detection A search engine – Google, Bing, Yandex, Baidu, … • • • Facial (or entity) recognition from images Gene sequence matching pipelines Social network friend recommendation 12
Let’s Explore the Issues in Stages 1. What do I need to do when I have (possibly updatable) data that must be stored on disk? Issues: performance, expressing the computation, consistency, … 2. What if I have data that needs to be stored in a compute cluster? (similar issues as CPU vs GPU etc) Ideally: we’ll see some commonalities 13
Challenges of Disk-bound Data Disk is slow (even SSD is slow)! • • We don’t want to retrieve all of the data in order to do our computation Need a notion of indexing to directly access the data we want • If queries + updates happen simultaneously, need to handle concurrency via isolation • Also need to think (a lot) about error detection These requirements led to the database management system (DBMS) 14
Big Data, Persistent Data: Databases
A Constant Tension: Constraints vs Flexibility How flexible should we be in allowing people to store their data? Very flexible: üif needs change, or we have special fields, we can accommodate �we don’t prevent typos, nonsensical data, or data we don’t know how to handle! Very constrained: üwe restrict the data to what we believe should be allowable, and catch many errors at data entry �if needs change, or we have special fields, they don’t “fit” 16
A More Controlled Environment for Data: The Database Management System A DBMS is a software layer designed to manage: • • Data storage, where we can enforce certain constraints • Data durability, i. e. the data is retrievable even with crashes and hardware failures • Data relationships, i. e. , links in the data Data consistency in the presence of updates • A DBMS also provides query capabilities with some sort of optimization – equivalent to but different from Pandas DFs 17
Constraints on Data Structures, Values, and Relationships 18
The Simplest Kinds of Constraints: on Data Structure Suppose I have records on People – what should I model there? How should I restrict the values for certain fields? 19
People “Entity Set” Constraints (Schema) Person(id: string, name: string, age: int) • id is unique, ie a “key” • name is non-empty • age is >= 0 and <= 150 and is a floating-point value in Earth-years USPerson: special case (subtype of) Person where id is of form nnn-nn-nnnn 20
Relationships Suppose I also know that People have Friends? • Think of a graph • What constraints should I add here? 21
Assembling Results (Using Data. Frames) 22
Assembling Results (Using Data. Frames) People Friends AND WHAT IS MISSING? ? ? 23
Type of relationships between data One-One One-Many-Many 24
One to Many Relationships: “Trees” JSON or nested collections: family_tree] = '}id': '123 -45 -6789', 'name': 'jie, ' ' '} '} ' children': [{'id': '654 -32 -1987', 'name': 'feng, {[{' id': '234 -56 -7890', 'name': 'trevor, ' children': [{'id': '987 -65 -4321', 'name': 'jill, {' id': '999 -87 -2211', 'name': 'ben, {[{' id': '345 -67 -8901', 'name': 'jayant, ' children[{[] : ' 25
One-to-Many Relations (Child Parent) 26
Many-Many relationships Not every relationship is hierarchical / tree-structured! • Consider our example of the friends relationship • We want two tables: • • The entities (People) The relationships (Friends) 27
Summary So Far: Modeling Data in Relations • Both entities (“nodes”) and relationships (“edges”) can be modeled using tables • Constraints such as keys, foreign keys, domain constraints • 1: 1, 1: m, m: n constraints restrict how we would encode relationships • Beware symmetry 28
Querying data: SQL 29
What do we do with tables? • Select rows Pandas df[Boolean vector] • Project columns Pandas df[column names] • Join tables together Pandas df. merge() • Compute functions over subsets of rows, columns Pandas df. groupby() • Insert/delete rows Pandas df. append() • Modify the content of a row Pandas assign value to a cell You can practice additional examples with: 'https: //raw. github. com/pandas-dev/pandas/master/pandas/tests/data/tips. csv https: //pandas. pydata. org/pandas-docs/stable/comparison_with_sql. html 30
A Simple Relational Database Student Takes Sid Name Course Num Name Dept 1 Jill 1 550 Database CIS 2 Bo 1 520 Machine Learning CIS 3 Maya 3 521 BE 3 550 Brain Computer Interfaces 31
Selecting rows Data. Frames: SQL: SELECT * FROM Student WHERE Name=“Maya” 32
Projecting columns Data. Frames: SQL: SELECT Name FROM Student SELECT DISTINCT Name FROM Student 33
Merging (Joining) Student Sid Data. Frames: Takes Name Sid Num 1 Jill 1 550 2 Bo 1 520 3 Maya 3 521 3 550 SQL: SELECT Student. Name, Student. Sid, Takes. Num FROM Student JOIN Takes ON Student. Sid=Takes. Sid SELECT * FROM Student NATURAL JOIN Takes 34
Grouping • “For each course, count the number of students. ” Data. Frames: SQL: SELECT Num, count(*) AS Count FROM Takes GROUP BY Num 35
EXISTS/NOT-EXISTS in SQL “Print all students who are not taking any course. ” SELECT * FROM Student AS s WHERE NOT EXISTS (SELECT * FROM Takes AS t WHERE s. Sid=t. Sid) Student Takes Sid Name Num 1 Jill 1 550 2 Bo 1 520 3 Maya 3 521 3 550 36
How could you do this with Data. Frames? “Print students who are not taking any course. ” 37
How could you do this with Data. Frames? “Print all students who are not taking any course. ” 38
How could you do this with Data. Frames? “Print all students who are not taking any course. ” 39
Working between Data. Frames and Databases 40
We often need to work between Data. Frames and Databases • The data “in the wild” that I need is in a database • I have too much data to put into a single Data. Frame • I want to pass a Data. Frame between programs • We need to be able to store Data. Frames in a database and get Data. Frames from a database 41
Talking between Data. Frames and sqlite • Create a connection object using sqlite 3. connect() • Write a Data. Frame to the database using. to_sql. • Read a Data. Frame from the database using. read_sql. Can set index = None in to_sql read_sql_query is equivalent to read_sql 42
Install SQLite • Step 1 − Go to SQLite download page, and download precompiled binaries from Windows section. • Step 2 − Download sqlite-shell-win 32 -*. zip and sqlite-dll-win 32 -*. zipped files. • Step 3 − Create a folder sqlite and unzip above two zipped files in this folder, which will give you sqlite 3. def, sqlite 3. dll and sqlite 3. exe files. • Step 4 − Add sqlite in your PATH environment variable and finally go to the command prompt and issue sqlite 3 command. • Step 5 – Install SQLite browser. • https: //sqlitebrowser. org/ 43
What Should I Do Where? Using Pandas vs Using SQL to Do Our Work 44
A Simple Example Who is interviewing now, who is also taking (or has taken) CIS 5517? Interviewing(student, month, day, year) Taking 5517(sid, term) 45
Computing Using Data. Frames vs SQL, Given Data in an RDBMS read_sql_query Interviewing. merge(Taking 5517) Interviewing[ Interviewing Taking 5517[‘sid’] [month==10]] Interviewing. merge(Taking 5517) SELECT * FROM Interviewing JOIN Taking 5517 Interviewing[ ON student=sid Interviewing WHERE month=10 Taking 5517[‘sid’] [month==10]] read_sql_table Interviewing Taking 5517 index Interviewing Taking 5517 46
What about Non-Relational (No. SQL) Databases? • Mongo. DB, Cassandra, Hbase, Accumulo, Redis, … • As a general rule: • They store keys/values, and don’t have much in terms of constraint enforcement • They often don’t expressly capture relationships • They typically don’t support transactions (described in a moment) 47
Other factors: Query Optimization and Transactions 48
The Relational Data Model It all began with a breakthrough paper by E. F. Codd in 1970: “A relational model of data for large shared data banks”. Communications of the ACM 13 (6): 377 Codd’s insights: • • Separate physical implementation from logical Model the data independently from how it will be used (accessed, printed, etc. ) • • Describe the data minimally and mathematically Use standard mathematical (logical) operations over the data – these are the relational calculus (~ SQL) or relational algebra (~ sequence of Data. Frame operations) 49
What are Codd’s Basic Operations? • A relation is a set of tuples with the same schema • The relational algebra: • • • Select items from a set, based on a predicate • • Union tuples from two different sets Project / map tuples based on a subset of their attributes Join tuples from two relations based on a predicate over their relationships Group tuples by a subset of their fields, and compute an aggregate over the other fields 50
Optimization was key to adoption of relational systems There was a gap between the breakthrough idea and commercial adoption • • • Codd’s original work: 1969 -70 • • Indexing Earliest relational database research: ~1976 Oracle “ 2. 0”: 1979 • Key to success was optimization Optimization based on rewriting queries Optimizing joins “Materialized views” – precomputed, always-updated tables derived from others Pat Selinger 51
Indexing: Direct Access by a Key • In “selection”, rows are selected from tables/Data. Frames based on values in one or more columns. SELECT * FROM Students WHERE Name= “Maya” • Looking at all rows of a table can be very expensive • Indices reduce the number of rows in a table that must be examined. • • Think of an index in a book, or over an html page. Database indices (B+-tree or Hash index) work between disk and main memory 52
An Index Is Like a Dictionary, But (Typically) On Disk Makes joins (and other operations) much faster! R_on_disk. open() S_on_disk. open() ret = [] while r_row = R_on_disk. read_row(): for s_row in S_on_disk. match(r_row. key): ret. add(r_row. join(s_row)) return ret 53
Creating an index in sqlite 54
Creating an index in sqlite 55
When is an index used? • Suppose the following index is created: CREATE INDEX idx_R ON R(a, b, c); • • The index can be used if the initial columns of the index (columns a, b, and so forth) appear in WHERE clause terms as • • column = expression column IS expression column IN (subquery) column IS NULL The right-most column of an index that is used can use inequality, restricting the value for that column within a range • column > expression … 56
Examples CREATE INDEX idx_R ON R(a, b, c); … WHERE a=5 AND b =3 AND c IS NULL • All columns of the index are usable … WHERE a=5 AND b >10 AND b<50 • First two columns of the index are usable. . . WHERE a=5 AND b >12 AND c='hello’ • First two columns of the index are usable. . . WHERE b =3 AND c ='hello’ • The index is not usable 57
Optimization based on query rewriting • The basic operations we discussed in Data. Frames and SQL (project, select, join, groupby) are part of the relational algebra, a core set of of table operations • Often we can apply them in different orders to get equivalent results at different performance • Query optimization finds the most efficient representation to evaluate (or one that’s not bad) 58
An Example Using Data. Frames projectname_x, name_y merge/joinid, a people_df merge/joinb, id friends_df people_df 59
An Example Using Data. Frames projectname_x, name_y merge/joinid, a projectid, name merge/join b, id people_df friends_df projectid, name people_df 60
An Example Using Data. Frames 61
Transactions: Managing updates to data • Transactions in SQL embody an “all or nothing” unit of work • • Despite failures in the system • • One transaction (issued by me) might enter all occurrences of CIS 5517 Despite concurrent activity • E. g. suppose we have a calendar application Another transaction (issued by my assistant) might schedule a meeting for me 12 -1: 30 pm on Monday, 1/29/2018 • How should conflicts be handled? This becomes especially complicated in distributed (“sharded”) environments. Description Time Location Lunch 1/29/2018 White Dog @ 12 Cafe 2 pm Faculty Meeting 1/29/2018 307 Levine @ 9: 3010: 30 … … … 62
Example T 1: BEGIN TRANSACTION; CREATE TABLE Student (sid INTEGER NOT NULL, name CHAR(20), PRIMARY KEY (sid)); INSERT INTO Student VALUES(1, 'Jill'); INSERT INTO Student VALUES(2, 'Bo'); INSERT INTO Student VALUES(3, 'Maya'); COMMIT; T 2: INSERT INTO Student VALUES(1, ’Bob'); • If this transaction is successful, all 3 students will appear in Student (and won’t disappear!) • Key constraints will also be guaranteed: no two students will have the same sid Serializable behavior vs Eventual consistency? 63
Choosing a Database System • There are many different databases available, and many issues to consider in determining what database platform to use. • SQL vs No. SQL • SQL offers transactional semantics and expects very regular structure • No. SQL typically offers less support for updates, but will typically offer faster queries as a result • My. SQL vs Postgre. SQL vs Oracle vs. . . • With Docker containers you can install and use these quite easily • During the course, we’ll survey a few options 64
To recap… • • • Data in a database is “tamed” • • Constraints: domain, key, foreign key Relationships between data are understood Data. Frames ≈ tables • Selection, projection, joining (merge), grouping Can “persist” Data. Frames to a database, or extract Data. Frames from a database Databases can run queries over very large datasets quickly • • Indexing improves performance of selections, joins Query optimization reorders expressions (especially joins) to improve performance Persistent data may need to be updated, need to understand what “transactions” mean in the persistent store (e. g. Oracle, my. SQL, Mongo. DB, …) 65
- Slides: 64