Innovations in Database Technology IRMAC BIDW SIG May
Innovations in Database Technology IRMAC BI/DW SIG May 28, 2009
Agenda § About Infobright § Data Warehousing Challenge § Use Cases § Infobright Approach § Infobright Architecture § Infobright Versions & System Requirements 2
About Infobright 3
About Infobright Founded 2006 Headquarters Toronto, Canada; offices in Boston, MA and Warsaw, Poland The Infobright Data Warehouse Simplicity: No new schemas, no indices, no data partitioning, easy to maintain Scalability: Designed for rapidly growing volumes. Ideal for up to 30 TB Low TCO: Industry-leading compression, less storage, industry standard servers, low software costs, minimal ongoing operational expenses The Open Source Solution Community (open source) and Enterprise Editions are available § Leverages My. SQL connectivity to ETL and BI My. SQL Integration § Provides My. SQL customers with scalable, enterprise-ready data warehouse § My. SQL/SUN Microsystems invests in Infobright Sept 15, 2008 4
Data Warehousing Challenge 5
. Data Warehousing Challenges Limited Resources and Budget More Data, More Data Sources 10 10110 10 01 10 1 0 01 01 10 1 01 01 01 10 1 0 1 10 101 10 0 100 01 1010 01010101010 External Sources 01 1 0 0 10 101101 0 10 10 1 0101010101010101 Real time data 01 1 0101010101010101010101010 Multiple databases More Kinds of Output Needed by More Users, More Quickly 0 0 1 10 011 0 1010 10 1 0 1010 10 1 1 010 1 0 1 1010 0100 1 0 1 1 01 0101 0 10 1 0 0 1 10 10 1 0 § Labor intensive, heavy 6 6 Traditional Data Warehousing indexing and partitioning § Hardware intensive: massive storage; big servers § Expensive and complex
Data Warehousing – Raising The Bar New Demands: Larger transaction volumes driven by the internet Impact of Cloud Computing More -> Faster -> Cheaper Data Warehousing Matures: Near real time updates Integration with master data management Data mining using discrete business transactions Provision of data for business critical applications Early Data Warehouse Characteristics: Integration of internal systems Monthly and weekly loads Heavy use of aggregates 7
Use Cases 8
Use Cases Infobright is a good fit for; • • Loading millions of transactions with a limited batch window Summarizing transactional data for trend analysis Extracting transactional detail based on specific constraints Ad hoc query support across many dimensional attributes Avoid using Infobright for; • Real-time transactional updates (operational data entry) • Full data extracts (select * from …) • Row based operations that need to access all columns of a table are typically better suited to row based databases 9
Customer Experience – Load Speed Business Requirement • Mavenir - OEM customer deploying a world wide telco application • Application provides operators with access to detailed SMS traffic • Needed a low cost solution with the ability to load 20 K records per second • Peak of 70 M messages per hour during Chinese New year Solution • Custom front end developed using My. SQL JDBC driver • Completed design, test, deployment in < 3 months with no assistance from Infobright • Allowed for expansion from 7 to 90 days of online SMS history • Supports plan for 70% annual growth • Rollout to allow for 120 concurrent users 10
Customer Experience – Query Performance Business Requirement • Sulake - Online Social Networking service with 126 M users across 31 countries • 990 M page impressions per month • Need to quickly analyze online spend on a daily basis to enhance online experience and drive additional revenue • Existing Inno. DB solution was able to process business queries in a reasonable time frame (queries taking hours to complete) Solution • Business opportunities were being lost due to inability to analyze subscriber behavior using transactions • Customer used existing data model and deployed the application using Business Objects – Data Integrator for ETL, Web-Intelligence for BI • Existing ETL workflows were converted to Infobright in less than 4 weeks without assistance • Historically long running queries (hours) now running in minutes and seconds • Additional benefits due to compression were a reduced need for disk storage and an overall reduction in I/O and network traffic 11
Customer Experience - TCO $ Business Requirement • A global provider of electronic trading solutions across 22 time zones and 700 financial exchanges • Wanted to expand analytical access to financial transactions to include both current (30 days) and archived transactions (4 years) • Expansion of existing Sybase solution was too costly Solution • Infobright was able to achieve performance benchmarks within the first 3 days of a proof of concept using production data • 28, 000 records per second load speed • Join 100 M row with a 30 Mrow table -> 400 k rows, returned in 185 seconds • Additional queries that did not complete using Sybase, finished in minutes using Infobright • Final solution deployed using Pentaho Kettle for ETL and Crystal Reports for BI • Success with modest data size (150 GB) has opened opportunities for additional more detailed transactional analysis 12
Customer Experience – Query Performance and TCO Business Requirement • Trade. Doubler – Based in Sweden, a global digital marketing company, serving 1600+ online advertisers across Europe and Asia. • Trade. Doubler optimizes Web marketing campaigns by analyzing Web clicks, impressions and purchases. • Analyzing terabytes of data about the results of its programs is central to the company’s success. • Selected Infobright to produce analytical results rapidly, seamless interoperability with their My. SQL database and low TCO Solution • Deployed solution using a single, $12, 500 Dell server with 8 CPU cores and 16 GB RAM • Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI • Needed to process and analyze data 20 billion online transactions/month • In POC, loaded > 3. 2 billion rows at > 300, 000 rows / second • In production, achieved 30 x data compression • Extremely fast query speed. 3 queries that previously did not return, now returned within a minute 13
Infobright Approach 14
Introducing Infobright Column Orientation Knowledge Grid – statistics and metadata “describing” the supercompressed data Data Packs – data stored in manageably sized, highly compressed data packs Data compressed using algorithms tailored to data type 1 5 15 Smarter architecture § Load data and go § No indices or partitions to build and maintain § Knowledge Grid automatically updated as data packs are created or updated § Super-compact data footprint can leverage off-theshelf hardware
Column vs. Row-Oriented EMP_ID 1 2 3 FNAME Moe Curly Larry LNAME Howard Joe Fine SALARY 10000 12000 9000 Row Oriented (1, Moe, Howard, 10000; 2, Curly, Joe, 12000; 3, Larry, Fine, 9000; ) § Works well if all the columns are needed for every query. § Efficient for transactional processing if all the data for the row is available Column Oriented (1, 2, 3; Moe, Curly, Larry; Howard, Joe, Fine; 10000, 12000, 9000; ) § Works well with aggregate results (sum, count, avg. ) § Only columns that are relevant need to be touched § Consistent performance with any database design § Allows for very efficient compression 16
Data Packs and Compression Data Packs 64 K § Each data pack contains 65, 536 data values § Compression is applied to each individual data pack § The compression algorithm varies depending on data type and data distribution Compression 64 K 17 § Results vary depending on the Patent Pending Compression Algorithms distribution of data among data packs § A typical overall compression ratio seen in the field is 10: 1 § Some customers have seen results have been as high as 40: 1
Knowledge Grid Data Pack Nodes (DPN) A separate DPN is created for every data pack created in the database to store basic statistical information Character Maps (CMAPs) Every Data Pack that contains text creates a matrix that records the occurrence of every possible ASCII character Histograms are created for every Data Pack that contains numeric data and creates 1024 MIN-MAX intervals. Pack-to-Pack Nodes (PPN) PPNs track relationships between Data Packs when tables are joined. Query performance gets better as the database is used. This metadata layer = 1% of the compressed volume 18
A Simple Query using the Knowledge Grid SELECT count(*) FROM employees WHERE salary > 50000 AND age < 65 AND job = ‘Shipping’ AND city = ‘TORONTO’; 1. Find the Data Packs with salary > 50000 2. Find the Data Packs that contain age < 65 3. Find the Data Packs that have job = ‘Shipping’ 4. salary age job Rows 1 to 65, 536 All packs ignored 65, 537 to 131, 072 All packs ignored 131, 073 to …… Find the Data Packs that have City = “Toronto’ 5. Now we eliminate all rows that have been flagged as irrelevant. 6. Finally we have identified the data pack that needs to be decompressed All packs ignored Only this pack will be decompressed Completely Irrelevant Suspect All values match 19 city
A Join Query using the Knowledge Grid Car Sales id sale discount prov Sales Person id date SELECT MIN(sale), MAX(discount), name FROM carsales, salesperson WHERE carsales. id = salesperson. id AND carsales. prov = ‘ON’ AND carsales. date = ‘ 2008 -02 -29’ GROUP BY name; name Pack-to-Pack carsales_id vs salesperson_id salesperson. id 1. Eliminate the Car Sales Data Packs that are irrelevant based on constraints in the SQL 2. Determine the related Sales Person Data Packs based on the values of carsales_id found in the relevant Car Sales Data Packs. 3. Create a Pack-to-Pack node that stores the results of the join condition between Car Sales and Sales Person. 4. Any subsequent queries will be able to use the PPN to resolve joins between Car Sales and Sales Person 0 1 0 carsales. id 20 Indicates that the Data Packs are related
Infobright Architecture 21
Infobright – Embedded With My. SQL/Infobright Architecture CONNECTORS: Native C API, JDBC, ODBC, . NET, PHP, Python, Perl, Ruby, VB My. ISAM • Views • Users • Permissions • Tables Defs Caches & Buffers Parser My. SQL Loader Management Services & Utilities SQL Interface CONNECTION POOL: Authentication, Thread Reuse, Connection Limits, Check Memory, Caches Infobright Loader / Unloader Infobright Optimizer and Executor My SQL Optimizer Knowledge Grid Knowledge Knowledge Knowledge Knowledge Node Node Node Node Data Pack Data Pack Data Pack Data Pack Node Node Node Node Infobright ships with the full My. SQL binaries. The My. SQL architecture is used to support database components such as connectors, security and memory management. Compressor // Decompressor Data Data Pack Pack Data Pack Data Pack Infobright Components • IB Storage Engine consisting of 64 Kb Data Packs, Compressor, and the Knowledge Grid • IB Optimizer that uses rough set algorithms and the knowledge grid to navigate the database • IB Loader supports text based and binary data formats 22
Optimized SQL for Infobright My. SQL The Infobright Optimizer supports a large amount of My. SQL syntax and functions. When the optimizer encounters SQL syntax that is not supported, then the query is executed using the My. SQL optimizer. Infobright Optimized SQL • Select Statements • Comparison Operators • Logical Operators • String Comparison Functions (LIKE, . . ) • Aggregate Functions • Arithmetic Operators • Data Manipulation Language (I/U/D) 23 • Data Definition Language (CREATE & DROP) • String Functions • Date/Time Functions • Numeric Functions • Trigonometric Functions • Case Statements
String Numeric Date Infobright Data Types 24 Most of the data types expected for a My. SQL database engine are fully supported. The data types that are currently not implemented within Infobright include BLOB, ENUM, SET and Auto Increment.
ETL Integration Leverage existing IT tools and resources for fast, simple deployments and low TCO § Increased efficiency with popular platforms § Deeper ETL Integration § Jaspersoft, Talend, Pentaho § Leverages end-to-end data management provided by ETL tools § Improved support for Data Manipulation Language (DML) 25
Data Loading with & without custom ETL connectors § Loading Infobright tables with custom connectors: § Kettle from Pentaho § Talend ETL from Talend § Jaspersoft ETL (Talend) from Jaspersoft Two ways to invoke Infobright loader without connectors 1. Generate a CSV or binary file and invoke the Infobright loader to load the file 2. Named pipe technique: § Create a named pipe (i. e. mkfifo /home/mysql/s_mysession 1. pipe) § Launch the Infobright loader in the background to read from the pipe § Launch the ETL process that writes data to the named pipe § When the ETL process runs, as records are written to the named pipe, the loader reads them and writes them to an Infobright database table 26
Infobright Versions & System Requirements 27
Comparison of ICE and IEE Features Forums and/or one-time 4 -hr support pack Available Warranty and Indemnification No Included INSERT/UPDATE/DELETE No Supported Infobright Loader Up to 50 GB/hr Multi-threaded, Up to 300 GB/hr Data Load Types Text only Text & Binary (100% faster) No Supported Technical Support My. SQL Loader Platform Support 28 64 -bit Intel and AMD RHEL 5, Cent. OS 5, Debian Windows Server 2003, 32 -bit Intel and AMD for Windows Server 2008, RHEL 5, Cent. OS 5, Debian, Solaris 10 XP, Ubuntu 8. 04, Fedora 9
System Requirements 29
For More Information Data Warehouse Evangelist Bob Newell Bob. Newell@infobright. com Or join our open source community at www. infobright. com Thank you 30
Query performance Infobright # Query name Traditional DB Intervall No cache Cache 1 Affiliate/minor/sum(order)/year 20060101 -20061231 7, 72 0, 99 13, 00, 91 4, 03, 21 2 Affiliate/major/sum(order)/year 20060101 -20061231 31, 52 7, 81 N/A 1 Affiliate/minor/sum(order)/month 20060101 -20060131 1, 32 0, 43 1, 00, 43 10, 69 2 Affiliate/major/sum(order)/month 20060101 -20060131 3, 23 0, 65 2, 12, 34 18, 55 3 Events/Cat=2/Country/sum(no of)/year 20060101 -20061231 37, 16 24, 42 N/A 4 Events/Cat=*/Country/sum(no of)/year 20060101 -20061231 41, 67 29, 62 N/A 3 Events/Cat=2/Country/sum(no of)/month 20060101 -20060131 15, 16 7, 15 8, 08, 13 2, 10, 15 4 Events/Cat=*/Country/sum(no of)/month 20060101 -20060131 22, 12 8, 01 15, 08, 32 3, 12, 82 Time in minutes, seconds, milliseconds 3 311
- Slides: 31