Physical Data Modeling Implementation BINF 697 Outline l

Physical Data Modeling – Implementation BINF 697

Outline l DBMS infrastructure & performance l DBMS data-types, sizes, collation l Create, read, update, delete (CRUD) l Logical data-access architectures l Denormalization, summary tables BCHB 697 - Edwards 2

The Data Modeling Process l Conceptual Data Modeling l l Logical Data Modeling l l l Define the entities and their relationships Define each entities’ attributes, incl. types Choose primary key(s) for each entity Attribute details (cardinality, optional) Normal Forms. Physical Data Modeling l Implementation BCHB 697 - Edwards 3

Database Management Systems (Oracle, …) l Monolithic servers, expensive software l l l Single point of failure! l l Many databases, each with many tables Users, permissions, etc. the focal point of all projects, etc. High priests (DBAs) manage as infrastructure Big memory, fast disk, fast network Clients connect from other computers via the network l Clients may be web-applications, scripts, or ad-hoc query tools BCHB 697 - Edwards 4

Database Management Systems (My. SQL, Maria. DB, …) l Monolithic servers (Free software) l l l Anyone can install (Linux, PC, etc. ) l l l Many databases, each with many tables Users, permissions, etc. One per project, not one per machine Commodity hardware, free OS, ad-hoc Clients connect from the same or other computers via the network l Clients may be web-applications, scripts, or adhoc query tools BCHB 697 - Edwards 5

LAMP Stack BCHB 697 - Edwards 6

Database Management Systems (SQLite) l No servers, free l l l Nothing to install (Linux, PC, etc. ) l l l One database per file, with many tables No users, just filesystem permissions One/many per project, sometimes temporary Personal computers, often directly integrated with client software Clients open file directly l Web-applications, scripts, or ad-hoc query tools BCHB 697 - Edwards 7

Data access hierarchy Memory (RAM) Cost & Speed Disk Capacity (Hard Drives, SSD) Intranet / Local Network (DBMS, File Server) Internet / Wide Area Network (Web, Cloud Storage) BCHB 697 - Edwards 8

Database Management Systems (Common) l Use the filesystem (disk) to store the data l l l Access data using fixed-size disk “blocks” l l Database size >> RAM; Lots of data-structures and low-level tricks to retrieve some data from disk to memory quickly. Many rows at a time; # of rows depends of # of bytes/row; keep small! Keep frequently accessed data in memory Avoid sending large results over the network! BCHB 697 - Edwards 9

DBMS data-types: Numeric My. SQL Documentation, Table 11. 1 l Also, FLOAT (4 bytes), DOUBLE (8 bytes) l l Floating point numbers DECIMAL (digits, decimals) l Exact decimal representation, expensive BCHB 697 - Edwards 10

DBMS data-types: String l l l CHAR(n): fixed length string, characters VARCHAR(n): variable length string, characters BLOB(n): variable length bytes, binary ENUM: fixed set of possible string values, internally represented as integers Characters come from a “character set” with a “collation” – often case insensitive: l Use utf 8_general_ci (or similar) BCHB 697 - Edwards 11

Semantic considerations l ID columns are typically integer, unsigned l l l l Primary key id columns are auto increment Consider the number of instances for size Use CHAR for accessions, VARCHAR for “human” strings, descriptions etc. Use BLOBs for data Use ENUM when possible Could represent DECIMAL using CHAR or INTEGER… Usually DATE and DATETIME data-types are available too BCHB 697 - Edwards 12

Use-cases drive performance! l Consider the lifecycle of the data stored l l Are updates, additions, deletions interleaved with read-only data-access? l l Create, read, update, delete Changing the underlying data can invalidate inmemory caches, pre-computes, and summaries Writes (to disk) are typically slower than reads Consistency issues abound… Which data is accessed most frequently, needed with least latency? What is the data-access timescale? BCHB 697 - Edwards 13

Use-cases drive performance! l Data-access hierarchy considerations: l l l Is the relevant disk-block already in memory? Is the result in a single disk-block? Can we determine the disk-block to retrieve without reading all of a table’s disk blocks? Does the query require scanning and/or retrieving an entire table? Understanding the use-cases for data CRUD will help identify performance bottlenecks BCHB 697 - Edwards 14

Database Application Architectures l Presentation/View Layer l l Application/Controller Layer l l User-facing, user-interface, display Generates use-cases to support view e. g. web-browser, client software Translates the Presentation/View layer requests into queries of the DBMS/Data/Model layer Handles business logic, Based on logical data-model DBMS/Data Layer l Executes the requested query on the physical data-model implementation BCHB 697 - Edwards 15

Database Application Architectures l Two-tier applications: l l Presentation and application together Direct connection to DBMS Change in the physical data-model forces change on the client Three-tier applications: l l l Presentation communicates with “middle” (application) tier Middle tier interprets logical data model requests Middle tier queries DBMS Change in the physical data-model forces changes only in the middle tier. Changes in presentation might require change in the 16 BCHB 697 - Edwards middle tier only.

Database Application Architectures l Middle tiers provide rich, aggregate, denormalized data for presentation layer l l l Application Programming Interface (API) Often return their results as XML or JSON Can be implemented as using one, or many, queries to the DBMS also provide “views” which behave like tables, but which are constructed on the fly DBMS optimization can be carried out without affecting the client application BCHB 697 - Edwards 17

Denormalization l Derived values and/or summary tables violate the normalized forms (1 NF, especially)… l l l Pre-computing derived or summary values shifts the expense away from the query l l …but some are expensive to compute e. g. total drug expenditure across all operations …but must be updated whenever source changes …leading to slow updates/deletes/additions …or must be allowed to be out-of-date (how much? ) Multi-table updates can lead to inconsistencies l l Transactions guarantee atomic changes… …but can block read access to many tables. BCHB 697 - Edwards 18

Denormalization l Large or infrequently accessed columns l l l Split problem columns to secondary table Reduces size of primary table’s rows in memory BLOBs are rarely needed to determine queries Put large BLOBs on the filesystem and store the filenames… Redundant values: l l Avoid accessing multiple tables. e. g. scientific name in Taxonomy database BCHB 697 - Edwards 19

Exercise l l l Install and start your BINF 697 virtual machine Start a web-browser on the host, and connect to http: //localhost: 8888/dashboard Click on php. My. Admin Create New Databases: taxonomy, sakila Use the Import tab to populate l l SQL dumps are in the course data-directory Poke around…. what is good, what is bad? BCHB 697 - Edwards 20

Homework l Create a physical data model for My. SQL using php. My. Admin for your Class Registration logicaldata model Populate the rows for each table, using Import from CSV format Export as SQL, including the data. l Submit exported SQL dump. l Due Feb 7 th, 10 am l l BCHB 697 - Edwards 21
- Slides: 21