Introduction to Database Technology Theres a database Behind

Introduction to Database Technology

There’s a database Behind every great application

What is a Database? A collection of data organized to make it easy to create, retrieve, update and delete information. C. R. U. D. Create Retrieve Update Delete

Another purpose of a database is to preserve the meaning of your data. Data is useless if we forget or lose what it means.

Databases consist of A data repository A database engine Applications access the data repository thru the database engine The database engine enforces business rules (so that applications don’t have to)

Examples of Business Rules ¡ ¡ ¡ Employee IDs are Unique (Uniqueness guarantee) Account balances cannot go below their minimum value (check constraint) Purchase price must be above zero (check constraint) An in province parking ticket must be issued to a known licence plate (referential integrity – pp 317 & pp 352) An employee’s manager must also be an employee (referential integrity)

Applications that Use Databases ¡ ¡ ¡ ¡ ¡ General Ledgers (Accounting Systems) Inventory: Libraries, Reservation Systems Source Code Control Systems Help Desk Telephone Personnel Records (Employment, Student) CRM (Customer Relationship Management) Safety and Repair Systems

Data Dictionary The 1 st step in database design is to create a Document known as a “Data Dictionary” • Name all your “fields” • Organize the fields into related groups • Record what each field means • Record the data type (string, integer, currency…) and format (ie: day/month/year) for each field. See pp 167 -169 of Shelly, Cashman et al

Record the meaning of your fields ¡ ¡ The data dictionary will be a reference for the life of the system Field names may not be clear – the definition makes it clear. What do we mean by “Price”, “Phone. No”, “Sex” (Cattle have 5 sexes) If a field name can have 2 different meanings – we actually have 2 different fields In order to use a field consistently we need to have a single definition

Data Type and Format ¡ ¡ ¡ Phone #s are strings, not numbers – we don’t use them for calculation Postal codes have multiple formats depending on jurisdiction, as do phone numbers Dates are a big problem 11/1/06 - what day is this U. S. : Month/Day/Year Britain: Day/Month/Year ISO: Year/Month/Day Canada: uh, whatever… Be clear as to the internal storage format of your data as well as the acceptable input format.

Potential Problems Inconsistent use of data fields ¡ Too many copies of the same information (redundancy) ¡ l l All copies need to be updated at the same time – this could be a programming nightmare If you have 2 (or more copies of the same data – which one is right? ) Failure to supply necessary information ¡ Intentional Drift ¡

A Data Dictionary Solves… • Inconsistent use of data fields • Intentional Drift at least by developers and maintainers ; -)

How do we encourage the user to enter the right data?

Database Organization ¡ ¡ ¡ A schema is a collection of tables organized around a them, ie: Accounting, Manufacturing, Me Entity is a term used during the design stage referring to a collection of related information about people, objects or relationships. In the implementation stage we turn these entities into tables. A table consists of 0 or more rows of information A row is made up of one or more fields (or columns) A collection of fields is called a tuple. An index is an organized lookup table (usually but not necessarily sorted) used to speed access to a table.

What’s a field? A field is a column of data in a table: weather. Info( location, when, temp, windspeed, humidity) Entity/table name Fields

Why discourage composite fields? Composite fields present a problem of inconsistent data entry by the user l Composite fields require extra programming effort l Databases provide functions for accessing parts of a date. l Spatial Database systems (Geographical Information Systems) provide functions for accessing longitude, latitude and altitude. l

Each table has a primary key ¡ ¡ ¡ The purpose of the primary key is to uniquely identify any single row in the table A key allows us to look up and manipulate a single row (ie: my Bank balance) There is no such thing as a secondary key. (the book makes this mistake) The primary key is an example of an index The primary key may be made up of 1 or more fields in its table marks( student. No, course. No, final. Grade)

The text book is Wrong! (pp 402) ¡ ¡ ¡ What the book calls a “secondary key” is just another index on a table. An index may be declared as UNIQUE – this means that the database engine enforces a business rule Indices are automatically updated by the database engine on every update, insert and delete operation – other than having to declare the index programmer doesn’t have to worry about it.

Are Indices a Good Idea? ¡ Indices speed up lookup of information – required for retrieving information (Why not index everything) ¡ Indices slow down update, insert and delete operations – every indexed value that changed required another access to the disk drive.

Data Warehouse vs Transaction Database One solution is to create 2 databases – a Transaction DB to handle records subject to frequent changes, such as purchase information. This database would have very few indices And a Data Warehouse containing historical information that changes very slowly – for MIS and Research (Data Mining) activities. Data Mining is the business practice of looking through your data to find new relationships (for fun and profit!)

Database Design Preliminary Step KNOW YOUR DATA (Well, we did a Data Dictionary)

Database Design Preliminary Step: Draw an Entity-Relationship diagram: Student Show Cardinality: 1: 1 1: N N: M Course Professor

Database Normalization was developed at IBM by E. F. Codd (who passed away in April 2003). Normalization is based on the mathematics of Set Theory. C. J. Date worked with Edgar Codd and has developed theory more fully.

Database Normalization is a technique for organizing the data in your database so that you avoid the following problems: • Insertion anomaly • Deletion anomaly • Update anomaly • Minimizing and managing redundancy

Referential Integrity A database engine allows us to avoid insertion, deletion and update anomalies. Demo Using Sample. DB (Access)

Database Normalization 1. 2. 3. Identify the minimal primary key in each table (1 NF) Check each of the non-key fields in each table – make sure that each nonkey field relies on all of the key fields in the table (2 NF) If a field is calculated, remove it from the table definition; (3 NF) If it can be figured out from other fields, put it in a lookup table using the other fields as keys (Boyce/Codd Normal Form)

Database Normalization The Key The Whole Key And Nothing But the Key so help me Codd

1 NF – First Normal Form Student ( s. No, Course. No, CName, Lab 1, Lab 2, Lab 3 …, Avg, Dept) becomes: Student( s. No, Course. No, CName, Lab. No, Mark, Avg, Dept) Eliminate repeating fields – to do this we create a new field Lab. No and document it in the data dictionary

1 NF – First Normal Form Student( s. No, Course. No, CName, Lab. No, Mark, Avg, Dept) Identify the smallest possible key. This is called the candidate key - because we may change our minds later. Don’t be afraid to discuss alternative candidate keys

2 NF – Second Normal Form Student( s. No, Course. No, CName, Lab. No, Mark, Avg, Dept) => Lab. Marks( s. No, Course. No, Lab. No, Mark) Student. Info(s. No, Student. Dept) Course. Info(Course. No, CName, Dept) Course. Registration(s. No, Course. No, Avg) Check that each of the nonkey fields depends on the whole key – if not split the table(s). What does Dept really mean – is it the department of the course? Of the Student? We may need to add another field to the Data Dictionary!

3 NF – 3 rd Normal Form Student( s. No, Course. No, CName, Lab. No, Mark, Avg, Dept) => Lab. Marks( s. No, Course. No, Lab. No, Mark) Student. Info(s. No, Student. Dept) Course. Info(Course. No, CName, Dept) Course. Registration(s. No, Course. No, Avg) Eliminate calculated fields from your table (but leave them in the data dictionary – we will calculate them as as needed – this way the calculated field will never be inconsistant with the underlying values.

3 NF – 3 rd Normal Form Marks( s. No, Course. No, Lab. No, Mark) Student. Info(s. No, Student. Dept) Course. Info(Course. No, CName, Dept) Dept. Info(Dept) Where field needs to be “Looked Up”, we set up a Foreign Key referring to the field in a table where it’s the leading part of the primary key. This allow referential integrity rules to apply. Here we’ve bolded the foreign keys. In Marks, s. No and Course. No are foreign keys referring to Student. Info and Course. Info both refer to Dept. Info.

Normalization As you go through each stage, don’t be afraid to go back and revise previous stages. You may need to create or remove a field or two. If you change your key fields, go back and check that your tables still fit the normalization rooms.

Concluding Statement Normalization requires that you understand your data, what it means and how data items relate A good technique is to argue about it with other members of your team.

Database Transactions have A. C. I. D. properties Atomicity ¡ Consistancy ¡ Isolation ¡ Durability ¡

Atomicity ¡ A transaction is an all or nothing event Simple Transaction $$ is withdrawn from account A $$ is deposited into account B If only one of these occurs either A or B will be very unhappy.

Atomicity (2) A transaction is an all or nothing event Complex Transaction A reserves a hotel room and a flight and a rental car Puts down a deposit Put together this makes a single transaction. If even one part of the deal fails the database system should cancel the others. You can’t divide a transaction (or an atom) without things exploding and falling apart.

Consistancy ¡ A transaction may be inconsistant while it is still incomplete. ¡ It is consistant both before and after. Before $ is withdrawn from the Expense Fund. During: (at this point it looks like we lost $) The $ is moved to Accounts Payable After

Isolation ¡ ¡ When a transaction is still incomplete it is independent of other transactions. Maintaining isolation involves some fairly sophisticated rules because of the problem of concurrency. Before $ is withdrawn from the Expense Fund. During: (at this point it looks like we lost $) The $ is moved to Accounts Payable After A Database Engine prevents viewing the data in midstep by another process, either by locking the records (preventing a read of the records in the mid-transaction), or by providing a view of the record as it stood, before the transaction took place.

Isolation (2) A database engine will also lock all records involved in a transaction – preventing a 2 nd transaction from writing to the same record at the same time. If the engine cannot lock all the records that it needs, the transaction fails gracefully by undoing any changes. Accepting all the changes is called a commit. Undoing all the changes is called a rollback.

Durability ¡ Database storage is persistant. Once a change is made it is stored in a file – not in memory.

Database Types

Flat Files Name; Age; Sex; Height; First. Examination; Health. Ins; Temperature Smith, John; 37; M; 180; 21/12/2004; Y; 36. 2 Shiah, Mary; 22; F; 155; 1/7/2003; Y; 36. 1 Raoul, Carlos; M; 171; 17/7/2004; N; 35. 8 Ankh, David; M; 172; 20/4/1999; N; 37. 0 Talik, Mona; F; 165; 18/8/2001; Y; ; • One record per line • All records have the same structure • Fields have the same size or are delimited (ie: CSV files) • (optionally) 1 st row may have field names • Easy to read and write • Each program must contain code to correctly manage the data • Spreadsheets are often used as flat file databases

ISAM (Indexed Sequential Access Method) n Person* (Multiple Records per File) q q Name (indexed field) Street Date of Birth Checkup* (Multiple Records per Person) n n Checkup Date HDL Cholesterol Reading LDL Cholesterol Reading Attending Physician (indexed Field)

ISAM continued n n n ISAM developed at IBM in the 1960 s ADABAS (from Software AG – used by Humber College for student records) best current example. Others: IMS (IBM), RMS (Dec), BTRIEVE ( Novell) Hierarchical (regular, tree like) structure Fast – other technologies such as Relational Databases can be built on top of ISAM Most file systems are ISAMlike in their implementation Person Doctor Person Index Doctor Index

Minor Database Technologies n n n Networked Database (ie: Cognos) Object Oriented Database Management Systems (OODBMS), ie: Gemstone, Object. Store Concept Oriented Database (ie: CYC)

The 3 Elephants in the Room XML Documents ) L Q S ( l a n io t la e R Databases Search

Relational Database § § Information is organized into tables using a process known as Normalization Each row in a table is called a record Each row is uniquely indexed using one or more fields known as a primary key Indices are automatically maintained by a database engine The Largest Elephant in the Room • Oracle • DB 2 • Sybase • SQL Server • My. SQL

Relational Database SQL – Structured Query Language Most of this course will be about using SQL and connecting SQL code to Java. This is “bread and butter” material – every significant application interfaces with a database of some kind.

XML (Extended Markup Language) <? xml version="1. 0"? >  <svg width="505" height="366" view. Box="0 0 505 366"> <defs> <linear. Gradient id="AIgd 1" units="user. Space" x 1="32. 2" y 1="122. 1“ x 2="473. 9“ y 2="122. 1"> <stop offset="0" style="stop-color: #FFFFFF"/> <stop offset="0. 2135" style="stop-color: #CDDA 62"/> <stop offset="0. 882" style="stop-color: #A 9 C 992"/> <stop offset="1" style="stop-color: #FFFFFF"/> </linear. Gradient> </defs> </svg>

XML n n n (cont’d) Hierarchical/Tree Like Human Readable (? ? ? ) Tagged architecture q Every tag has an an end tag q Tags have names, content and attributes q All data comes with its own label <tag attrib 1=“value” attrib 2=“value 2”> <subtag 1>Content – text or binary data</subtag> <another. Tag /> </tag>

XML q q q (cont’d) Regular Rules make data machine readable; can automatically translate one XML format to another (XSLT) Document Oriented, not record oriented Used to store application preferences, electronic data exchange, documents (ie: Office 2003 documents; Open Office) Gets bulky quickly – lots of overhead Brittle. Don’t let humans edit an XML document – one missing bracket, quote or slash can make a document unreadable

Search S e a r c h

Search Apple’s Spotlight and Google Desktop use Search to create virtual directories based on content - potentially replacing the User’s view of the file system

Search Documents Retrieved Precision of Result Documents Desired In a perfect search, Documents Retrieved would always match Documents Desired Search is not always accurate or reliable • We don’t always find what we want • We find things we don’t want • Order of retrieval changes – outside of our control • Links can disappear • Would you trust your bank acct to Search?

Search Documents Retrieved • Search works when the data has little or no organization Precision of Result Documents Desired • Search is imprecise – we often find things what we don’t know we want • A “Page Rank” algorithm (Sergey Brin & Larry Page, 1995, Stanford University) taps the collective consciousness of the community – we individually vote for the most relevant information, so that it bubbles up to the 1 st few positions. Google uses Page Rank to tap into the collective wisdom (or foolishness) of crowds.

Comparison n Relational Databases are for reliable management of identically structured records of information. RDBMS technology scales easily so that one can consistently manage and summarize large numbers of records. XML is for application preferences and storing documents where the organization is irregular but the possible field types are known. Not all fields have to appear in every record. XML does not currently scale well up to large numbers of records. Search is for quickly retrieving documents based on words or features of text, regardless of its organization. While it lacks consistency and reliability it scales well and takes less planning and effort than the other two approaches.

End of Slide Show