1 LECTURE 3 DATABASES DATABASES E 4520 Data
1 LECTURE #3: DATABASES, DATABASES E 4520: Data Science for Mechanical Systems Instructor: Josh Browne, Ph. D Guest Lecturer: Gilman Callsen Feb 6, 2019
2 Gilman Callsen gilman. callsen@rho. ai • “Entrepreneur with a penchant for technology companies. ” • Yale, BA Psychology • Started as a physics major, though! • Databases have been a big part of my entire career. • Not a database administrator.
3 Basic Timeline Late 90 s & Early 00 s Websites Chromic Décor (2006) MC 10(2008) Pit Rho (2012) Rho AI (2016)
4 I learned the value of databases pretty quickly at this point. . .
5 Let’s Get Those Laptops Out cd ~/Documents git clone https: //github. com/gcallsen/database-class-examples. git clone https: //github. com/rhoai/python-dev. git cd ~/Documents/python-dev docker-compose up -d docker run -it --rm --net python-dev_default -v ~/Documents/database-class-examples: /code rhoai/python-dev: v 0. 1. 0
6 OVERVIEW
7 Data, Data • Who remembers what Erik Allen (Lecture #1) said about “Data Collection and Preparation”?
8 Data Collection and Preparation • Be prepared to spend a lot of time (~80%) on data collection and cleaning • If you’ve got a data set, be very grateful! • Expect to be a partner in the data generation process • Have a full tool belt of models, to prepare for a paucity of data early in the process
9 Where do the data live? • With that in mind. . . where do you think all those collected and prepared data points live?
10 You’re Right! DATABASES!
11 Cooking Analogy As we go through this lecture keep cooking in mind Grocery Store Blue Apron Family Cook
12 What is a database? • An organized collection of data ? Database Do Stuff
13 What is a database? • An organized collection of data Collection of Ingredients Grocery Store Blue Apron Chef ? The Food
14 Who Should Care? • Everyone! • Databases are the backbone of nearly every digital ‘thing’ we interact with today. • Examples • Software engineer • Mechanical engineer • Business person • Marketing • Generic consumer • Even Schools. . .
15 https: //xkcd. com/327/
16 DBs for DS in Mechanical Systems? • Application needs will vary wildly • Super fast (real-time mechanical systems) • Highly accurate • Complex connections • Research • Development • Unlikely anyone here will be DBA but an understanding of how to think of databases goes a long way • Rubicon Global (waste management) • Automated pickup detection - on board vs cloud • Video processing • Accelerometer data • GPS location + time
17 Types of Databases • No silver bullet • An incredibly wide array of databases exist; all with strengths and weaknesses • All situations require considering what mix of DBs are used. • Polyglot persistence • Grocery Store vs Blue Apron vs Cook Relational Non. Relational Why are there only two ‘types’ here when our analogy has 3 ‘types’?
18 Relational Overview • • A relational database is a digital databased on the relational model of data, as proposed by E. F. Codd in 1970. Virtually all relational database systems use SQL (Structured Query Language) for querying and maintaining the database. Straight from wikipedia https: //en. wikipedia. org/wiki/Relational_database
19 Non-Relational Overview • • • A No. SQL (originally referring to "non SQL" or "non relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. No. SQL databases are increasingly used in big data and real-time web applications. No. SQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages, or sit alongside SQL database in a polyglot persistence architecture. Straight from wikipedia https: //en. wikipedia. org/wiki/No. SQL
20 https: //www. alooma. com/blog/types-of-modern-databases
21 In Practice • Amazon. com • How do you think they store your data? https: //neo 4 j. com/blog/neo 4 j-doc-manager-polyglot-persistence-mongodb/
22 In Practice • In most cases, more than one database is used! Polyglot Persistence https: //neo 4 j. com/blog/neo 4 j-doc-manager-polyglot-persistence-mongodb/
23 Real World Example • Many speakers in this lecture series will use Pit Rho as an example…and so will I! • Databases - how many do you think we’d need? • Take away: production systems require thoughtfully using tech components. No silver bullet.
24 RELATIONAL A man walks into a bar and sees two tables. He says, “May I join you? ”
25 https: //www. alooma. com/blog/types-of-modern-databases
26 Summary Version • • A database is a means of storing information in such a way that information can be retrieved from it. A relational database is one that presents information in tables with rows and columns. A table is a collection of objects of the same type (rows). Data in a table can be related to other tables (typically using ‘keys’) • The ability to retrieve these related data provides us the term relational database.
27 Relational Databases • Core concepts for today • Tables • Normalization • SQL query language
28 Tables • Grocery store Products id int id name price description department name string 1 Apple 1. 99 Delicious Apple Produce price float 2 Banana 3. 49 Bunch of Bananas Produce description string 3 Bread 3. 99 Loaf of whole wheat Bakery department string 4 Cheddar 2. 79 Sliced cheddar Deli
29 Normalization • • Database Normalization is a technique of organizing the data in the database. Basically, you break apart tables to: • • • eliminate data redundancy and reduce malformed data when performing CRUD operations Importantly, this allows the database itself to enforce data integrity. Once you’ve done this, you now need the concept of of ‘joins’. To perform a join you need two items: • • two tables and a join condition the tables contain the rows to be combined, and the join condition the instructions to match rows together
30 Things can get crazy. . .
31 Basic Normalization • Grocery store Departments Products id int name string department_id int description string name string price float description string
32 Basic Normalization • Grocery store Products Departments id name description id name price descripti on departme nt_id 1 Produce Healthy stuff! 1 Apple 1. 99 Delicious Apple 1 2 Bakery Breads and goodies 2 Banana 3. 49 Bunch of Bananas 1 3 Deli Meats, cheeses, etc 3 Bread 3. 99 Loaf of whole wheat 2 4 Cheddar 2. 79 Sliced cheddar 3
33 SQL • • • The fundamentals of most SQL languages are the same Variations exist based on the database’s functionality https: //www. w 3 schools. com/sql_intro. asp • Worth going through that as a primer
34 SQL Common Commands • • • SELECT - extracts data from a database UPDATE - updates data in a database DELETE - deletes data from a database INSERT INTO - inserts new data into a database CREATE DATABASE - creates a new database ALTER DATABASE - modifies a database CREATE TABLE - creates a new table ALTER TABLE - modifies a table DROP TABLE - deletes a table CREATE INDEX - creates an index (search key) DROP INDEX - deletes an index
35 SQL On Our Tables Command Outcome SELECT * FROM products; Get all the Products SELECT * FROM departments; Get all the Departments SELECT p. name, d. name, price FROM products p FULL OUTER JOIN departments d ON d. id = p. department_id; Get all of the products, display their name, price, and name of the department they are in.
36
37 NON-RELATIONAL Database Admins walked into a No. SQL bar. …a little while later they walked out because they couldn’t find a table.
38 https: //www. alooma. com/blog/types-of-modern-databases
39 Summary Version • • • A database is a means of storing information in such a way that information can be retrieved from it. A non-relational database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. There a lot of options. Why? • • Data do not always fit nicely into columns and rows Wide array of use cases that can have highly optimized solutions (e. g. time-series data) Scalability (horizontal scalability) Often, CAP Theorem is at play
40 Non-Relational Databases • Core concepts for today • CAP Theorem • Documents • Unnormalized Form • “Query languages”
41 CAP Theorem • High Level: • • • Consistency • Every read receives the most recent write or an error • Every request receives a (non-error) response – without the guarantee that it contains the most recent write Availability Partition tolerance • • The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes No distributed system is safe from network failures The CAP theorem says that when network failures happen, you must choose between consistency and availability Difference between development and production
42 Unnormalized Form • • • Group all those tables back together! Unnormalized form is frequently used in nonrelational databases (certainly not always) An unnormalized data model has redundancy • Fields are grouped together Multiple values and complex structures inside single fields etc. • Frequently to improve the read performance of a database • • • Why? • • • at the expense write performance Not all data fit nicely into tables NOTE: this does not mean there is ‘no data model’
43 Query Languages • • • All over the map Majority of them are specific to the database Many attempt to have “SQL-Like” query languages, you’ll often see “mydb. QL”
44 Redis • What’s my ham sandwich recipe? Keys key value sandwich_ham ingredients: [‘wheat bread’, ‘cheddar, ‘ham’] sandwich_turkey ingredients: [‘wheat bread’, ‘cheddar, ‘turkey’] cheese_platter_1 ingredients: [‘cheddar’, ‘mozzarella’, ‘olives’] cheese_platter_2 ingredients: [‘pepper jack’, ‘mozzarella’, ‘olives’]
45 Redis Search? • • Against the keys, yes! • Let’s try to find all sandwich recipes • Try to find all recipes that use cheddar cheese Against the values. . . have fun.
46 Elasticsearch • What can I make if I have extra cheddar cheese? Documents _id value sandwich_ham ingredients: [‘wheat bread’, ‘cheddar, ‘ham’] sandwich_turkey ingredients: [‘wheat bread’, ‘cheddar, ‘turkey’] cheese_platter_1 ingredients: [‘cheddar’, ‘mozzarella’, ‘olives’] cheese_platter_2 ingredients: [‘pepper jack’, ‘mozzarella’, ‘olives’] Note: specifying the _id is completely unnecessary, adding here for consistency.
47 Elasticsearch. . . search? • • • Let’s hope the name isn’t lying to us… Against the keys, yes! • Let’s try to find all sandwich recipes • Try to find all recipes that use cheddar cheese Against the values. . . have actual fun!
48 BREAK A programmer’s wife sends him to the grocery store with the instructions, “get a loaf of bread and, if they have eggs, get a dozen. ” He comes home with a dozen loaves of bread and tells her, “they had eggs. ”
49 Group activity • Get a head start on your homework! • Won’t be the same actual use case but same thought process • Design your polyglot architecture for the “Amazon” scenario: • • • Define and create at least two tables in Postgres Define and create at least one use for Redis Define and create at least one mapping for Elasticsearch
50 Homework • “Mechanical systems” database design • Choose any problem you have worked on • • • Describe the problem Describe the data available Describe how you need to use data • • Who’s accessing it? Is this a big team, small team, etc? Are there applications accessing it programmatically? Are you exploring the data to find new features? • Define your database design • Define how the data fits in to each database you’ve chosen • Submit HW#2 to courseworks (posted) • What database(s) will you use? • Why? Justify each choice. • Mock out the tables, database structure, etc. • Format options: 1) presentation-style (< 8 slides), or report-style (< 750 words).
51 Another real world example • Document AI • If there’s time. . .
52 Reading & Reference • https: //www. red-gate. com/simple-talk/sql/database- administration/five-simple-database-design-errors-you -should-avoid • https: //www. learndatasci. com/tutorials/usingdatabases-python-postgres-sqlalchemy-and-alembic/ • https: //www. w 3 schools. com/sql/default. asp • https: //dzone. com/articles/23 -useful-elasticsearchexample-queries
53
54
- Slides: 54