CSE 482 Big Data Analysis Lecture 4 SQL

CSE 482: Big Data Analysis Lecture 4: SQL and Database Querying 1

Outline of Today’s Lecture l Previous lecture talks about how to collect online data from the Web l Today’s lecture focuses on how to store the data into a database – – What is a database? What is a database management system? What is SQL (structured query language)? Using SQL with python 2

What is a Database? l Collection of related data central to a given enterprise – Examples: u Banking – savings/checking accounts, mortgage, etc u Vehicle registration – car registration, year, make, etc u Student registration – name, PID, GPA, last semester enrolled, etc u Electronic Medical Records – name, SSN, date of birth, address, symptoms, diseases, medication, test results, etc u Sales – customer purchase transactions, customer complaints, etc 3

Database Management System (DBMS) l A collection of programs that enables users to create and maintain a database l Examples of DBMS – MS Access, MS SQL Server, IBM DB 2, Oracle, Sybase, Postgres, my. SQL, and many more – These are relational DBMS (which mean, these systems would model the data as a collection of relations/tables) l We will use my. SQL as our DBMS for this class 4

My. SQL Database Account l Every registered student will have access to a My. SQL account on mysql-user. cse. msu. edu l To log in, go to: – http: //www. cse. msu. edu/facility/php. My. Admin/index. php – Username is your CSE username Password is your PID (if you’re using it the first time) Server Choice: mysql-user l Send an email to manager@cse. msu. edu if you have problems logging in – Note that your database password is not the same as your CSE account password (unless you set them to be the same) 5

My. SQL Database Account 6

SQL (Structured Query Language) l A language provided by relational DBMS to – – – Create a table Alter the schema (structure) of a table Insert data records into a table Delete data records from a table Query a table etc 7

Example: Wikipedia Database Edit is a relationship between user and article 8

CREATE TABLE User ( Username VARCHAR(50) PRIMARY KEY, IS_Admin BOOLEAN, Content VARCHAR(500) ); CREATE TABLE Article ( Title VARCHAR(50) PRIMARY KEY, Content VARCHAR(500) ); A primary key is used to uniquely identify the records in a table 9

CREATE TABLE Edit ( Editor VARCHAR(50), Edited_Title VARCHAR(50), Modified_Time DATETIME, PRIMARY KEY(Editor, Edited_Title, Modified_Time); FOREIGN KEY (Editor) REFERENCES User(Username), FOREIGN KEY (Edited_Title) REFERENCES Article(Title) ); A foreign key is used to enforce the constraint that rows which were referenced in the Edit table exists in the referenced table 10

Example of Foreign Key Inserting a row (XYZ, My. SQL, 2013 -01 -01 10: 10) is a violation of foreign key constraint because there is no Article titled My. SQL 11

My. SQL Example You can also use php. My. Admin interface to directly access the database: https: //www. cse. msu. edu/php. My. Admin/ Option to create new table 12

My. SQL Example Enter the columns of the table and their constraints 13

Summary of useful My. SQL commands l l l l l set password=password(‘new password’); show databases; -- show the list of databases available to you use dbname; -- use the database called dbname show tables; -- show tables available describe User; truncate table User; drop table User; source script-file; -- executing SQL commands from a script-file load data infile /path/file. txt into table User; 14

DROP TABLE l Delete a table from the database l After deletion, the table can no longer be used in queries, updates, or any other commands since its description no longer exists l Example: DROP TABLE USER; 15

ALTER TABLE l Add a new column to an existing table: ALTER TABLE USER ADD BIRTHDATETIME; l Remove a column from an existing table: ALTER TABLE USER DROP COLUMN BIRTHDATE; l Modify an existing column in a table: ALTER TABLE USER MODIFY BIRTHDATE INTEGER DEFAULT 1900; l Modify constraints of a table: ALTER TABLE EDIT DROP PRIMARY KEY; 16

INSERT or DELETE Operation INSERT INTO Article VALUES (‘Database’, ‘$@%@23’); INSERT INTO Article VALUES (‘Michigan’, ‘#525 s 36’); DELETE FROM Article WHERE Title=‘Database’; DELETE FROM Article WHERE Content=‘#525 s 35’; 17

Retrieval Queries in SQL l Basic form of the SQL retrieval queries: SELECT FROM WHERE <attribute list> <table list> <condition> – <attribute list> is a list of column names whose values are to be retrieved by the query – <table list> is a list of table names required to process the query – <condition> is a conditional (Boolean) expression that identifies the rows to be retrieved by the query 18

Example: Wikipedia Database 19

SQL Query (for 1 Table) l Query: Find the names of users who have administrative privileges SELECT Username FROM User WHERE is. Admin = true; 20

SQL Query (for 1 Table) l Query: Show all the records in the User table SELECT * FROM User; A wildcard (*) in the SELECT clause means retrieve all the columns Query has no WHERE clause (which means all rows will satisfy the condition) 21

SQL Query (for 1 Table) l Query: Count the number of edits made by XYZ SELECT Count(*) FROM Edit Count is an aggregate function; used to aggregate the rows WHERE Editor=‘XYZ’; 22

SQL Query (for 1 Table) l Query: Count the number of edits made by each user SELECT Editor, Count(*) FROM Edit GROUP BY Editor; GROUP BY clause means group the results based on the Editor column before summarizing them 23

SQL Query (for 1 Table) l Query: Count the number of distinct articles edited by each user SELECT Editor, Count(DISTINCT Edited_Title) FROM Edit GROUP BY Editor; 24

SQL Query (for 1 Table) l Query: Find the top 2 editors who made the most edits SELECT Editor, Count(*) FROM Edit GROUP BY Editor ORDER BY Count(*) DESC LIMIT 2; 25

SQL Query for 2 Tables l Query: Find the titles of Wikipedia articles edited by administrators SELECT Edited_Title FROM User, Edit WHERE Is. Admin = True; SQL is wrong. Missing the join condition. 26

SQL Query for 2 Tables l Query: Find the titles of Wikipedia articles edited by administrators SELECT Edited_Title FROM User, Edit WHERE Editor = Username AND Is. Admin = True; Join condition 27

SQL Query for 2 Tables l Query: For each non-administrator, count the number of distinct articles they have edited SELECT Editor, COUNT(DISTINCT Edited_Title) FROM User, Edit WHERE Editor = Username AND Is. Admin = False GROUP BY Editor; 28

Using My. SQL with Python l Install mysql-connector-python library – conda install mysql-connector-python l Steps 1. Import mysql-connector module 2. Create database connection cnx = mysql. connector. connect(username, password, host, database) 3. Create a cursor = cnx. cursor() 4. Pass query to cursor for execution cursor. execute(query) If it is a database update query: cnx. commit() 5. Close cursor and database connection 29

Using My. SQL with Python l Example: load wiki_edit. txt file into My. SQL database – Schema: Revision. Id, Article_name, Edit. Date, Editor 30

Using My. SQL with Python l Create a new table (and drop the old one) in My. SQL 31

Using My. SQL with Python l Insert tuples from Data. Frame into My. SQL (after removing the unicode characters in article name) Make sure you commit the database changes 32

Using My. SQL with Python l Query table and print the results 33

Accessing CSE My. SQL from Home l You may not be able to access the database directly from your code outside of MSU network (firewall) l You may need to login to MSU VPN or EGR VPN 34
- Slides: 34