IBM Information Server: Simplifying the Creation of the Data Warehouse
© 2006 IBM Corporation
The New Role of the Data Warehouse

§ The data warehouse is becoming a more active and integrated participant in enterprise architectures
  – A source of the best information in the business
  – An active source of analytics
§ Because of this, the data warehouse has new requirements
  – Must be more flexible and adaptable to change
  – Must have trustworthy, auditable information
  – Must represent the business view
  – Must be capable of scaling to meet ever-growing information volumes
Critical Success Factors for Data Warehousing

Auditable Data Quality
§ Ensure quality is embedded in data flows
§ Understand quality changes over time
§ Provide proof of quality and lineage

Scalability
§ Seamless expansion of capacity
§ Resource estimation
§ Accurate performance analysis and balancing

Metadata-Driven Design, Acceleration & Automation
§ Automate the connection between design and build tasks
§ Provide in-tool metadata visibility
§ Easily connect to any data source

Collaboration
§ Seamless flow of metadata across roles
§ Shared understanding between business & IT
§ Team development

Reuse
§ Integrated object search
§ Object reuse optimization
§ Reuse of data flows through shared services
The IBM Solution: IBM Information Server – Delivering Information You Can Trust

Unified Deployment
§ Understand – Discover, model, and govern information structure and content
§ Cleanse – Standardize, merge, and correct information
§ Transform – Combine and restructure information for new uses
§ Deliver – Synchronize, virtualize, and move information for in-line delivery

Unified Metadata Management • Parallel Processing • Rich Connectivity to Applications, Data, and Content
Critical Success Factors for Data Warehousing: Collaboration
Collaboration for Data Warehouse Design

§ Implementers – IBM QualityStage: database, application, and transformation development
§ Architects, Data Admins – Rational Data Architect: metadata- and data-driven data modeling and management
§ Subject Matter Experts, Data Stewards – IBM Business Glossary: business definitions & ontology mapped to physical data
§ Analysts – IBM Information Analyzer: data-driven analysis, reporting, monitoring, data rule and integration specification

All roles collaborate through the IBM Metadata Server:
§ Simplify integration
§ Facilitate change management & reuse
§ Increase compliance to standards
§ Increase trust and confidence in information
Collaborative Metadata: From Analysis to Build (Collaboration)

§ A common metamodel provides a seamless flow of metadata
  – Analysis activities populate information into the data flow design
  – DataStage users can see the table metadata from Information Analyzer
  – Analysis results and notes are visible
    • Provides insight into the quality of the source
    • Provides guidance on how the flow should be defined
    • Notes allow free-form collaboration across roles, ensuring knowledge is completely transferred from analysis to build
Critical Success Factors for Data Warehousing: Auditable Data Quality
Easily Embed Data Quality with Unified Design (Auditable Data Quality)

§ One design experience
§ Speeds development time
§ Extended user orientation in a simplified design environment
§ Performance oriented
Measure Data Quality Over Time Using Baseline Reporting (Auditable Data Quality)

§ Compare quality results to a baseline to understand quality changes over time
§ Embed profiling tasks into a sequencer to take before-and-after snapshots of data quality and rules adherence
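The baseline idea can be sketched as follows. This is an illustrative Python sketch, not the Information Analyzer API: `compare_to_baseline`, the column names, and the tolerance threshold are all hypothetical.

```python
# Illustrative sketch: compare current column-profiling results against a
# saved baseline snapshot to flag data-quality drift over time.
def compare_to_baseline(baseline, current, tolerance=0.02):
    """Return columns whose null rate changed beyond the tolerance."""
    drift = {}
    for col, base in baseline.items():
        cur = current.get(col)
        if cur is None:
            drift[col] = "column missing"
            continue
        if abs(cur["null_rate"] - base["null_rate"]) > tolerance:
            drift[col] = f"null rate {base['null_rate']:.2%} -> {cur['null_rate']:.2%}"
    return drift

# Hypothetical before/after profiling snapshots
baseline = {"CUST_ID": {"null_rate": 0.00}, "EMAIL": {"null_rate": 0.05}}
current  = {"CUST_ID": {"null_rate": 0.00}, "EMAIL": {"null_rate": 0.12}}
drift = compare_to_baseline(baseline, current)  # flags the EMAIL null-rate jump
```

A sequencer would run such a comparison before and after each load, persisting each snapshot as the next baseline.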
Critical Success Factors for Data Warehousing: Metadata-Driven Design, Acceleration & Automation
In-Tool Metadata Visibility (Metadata-Driven Design)

Impact Analysis:
§ Find dependencies – what does this item depend on?
§ Find where used – where is this item used?

Results are shown in the Advanced Find window.
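The two impact-analysis questions are just forward and reverse traversals of a dependency graph. A minimal sketch, assuming a hypothetical in-memory repository (the class and method names are illustrative, not the Metadata Server API):

```python
from collections import defaultdict

# Hypothetical metadata repository modeled as a dependency graph:
# an edge A -> B means "A uses B" (e.g. a job uses a table definition).
class Repository:
    def __init__(self):
        self.uses = defaultdict(set)      # object -> objects it depends on
        self.used_by = defaultdict(set)   # object -> objects that use it

    def add_dependency(self, user, dependency):
        self.uses[user].add(dependency)
        self.used_by[dependency].add(user)

    def find_dependencies(self, obj):
        """What does this item depend on? (transitive)"""
        return self._walk(obj, self.uses)

    def find_where_used(self, obj):
        """Where is this item used? (reverse, transitive)"""
        return self._walk(obj, self.used_by)

    @staticmethod
    def _walk(start, edges):
        seen, stack = set(), [start]
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

repo = Repository()
repo.add_dependency("LoadWarehouseJob", "CUSTOMER_table_def")
repo.add_dependency("CUSTOMER_table_def", "ODBC_connection")
# find_where_used("ODBC_connection") surfaces both the table definition
# and, transitively, the job that uses it.
```

The traversal is transitive on purpose: changing a connection object impacts not just the table definitions imported through it but every job built on those definitions.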
Job Difference – Integrated Report (Metadata-Driven Design)

§ Difference report displayed in Designer; jobs open automatically from report hot links
§ Options available to:
  – Print report
  – Save report as HTML
Slowly Changing Dimension Design Acceleration (Metadata-Driven Design)

§ New engine capabilities
  – Surrogate key management
  – Updatable in-memory lookups
§ New & enhanced stages
  – Surrogate Key Generator
  – Slowly Changing Dimension
§ Single stage per dimension
  – Quick setup and definition
  – Easy single point of maintenance
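What the Slowly Changing Dimension stage automates is Type-2 dimension maintenance: expire the current row and insert a new surrogate-keyed version when tracked attributes change. A minimal in-memory sketch of that logic (function and field names are illustrative, not the DataStage stage API):

```python
from datetime import date

# Minimal Type-2 SCD sketch: one dimension row per version of a customer,
# with surrogate keys assigned by a simple incrementing generator.
def apply_scd2(dimension, incoming, next_key, today):
    """Expire the current row and insert a new version when attributes change."""
    current = next((r for r in dimension
                    if r["cust_id"] == incoming["cust_id"] and r["end_date"] is None),
                   None)
    if current and current["city"] == incoming["city"]:
        return next_key                      # no change: nothing to do
    if current:
        current["end_date"] = today          # expire the old version
    dimension.append({"sk": next_key, "cust_id": incoming["cust_id"],
                      "city": incoming["city"], "start_date": today,
                      "end_date": None})     # insert the new current version
    return next_key + 1

dim = []
sk = apply_scd2(dim, {"cust_id": 1, "city": "Boston"}, 100, date(2006, 1, 1))
sk = apply_scd2(dim, {"cust_id": 1, "city": "Austin"}, sk, date(2006, 6, 1))
# dim now holds two versions: the Boston row closed, the Austin row current
```

The "updatable in-memory lookup" in the engine plays the role of the `next(...)` scan here: finding the current version of each business key without a round trip to the database per row.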
Rapid Connectivity: Common Connectors (Metadata-Driven Design)

§ Connection objects allow properties to be dropped onto a stage
§ Test the connection instantly
§ The diagram lets you select the link to edit as though you're on the canvas
§ A warning sign tells you which fields are mandatory
§ Parameter button on every field
§ Graphical ODBC-specific SQL builder
Critical Success Factors for Data Warehousing: Reuse
Reuse: Find It

§ Find item in Repository tree
  – In-place find
  – Find by name (full or partial)
  – Wild-card support
  – Find next
  – Filter on type
Find – Advanced Search Criteria (Reuse)

§ Search on the following criteria:
  – Object type (job, table definition, stage, etc.)
  – Creation (date/time, by user)
  – Last modification (date/time, by user)
  – Where used (what other objects use this object?)
  – Dependencies of (what does this object use?)
§ Options
  – Case sensitivity
  – Match on "name & description" or "name or description"
Reuse: Connection Objects

§ Allow saving of a reusable connection path to a specific source or target
  – Username, password, database name, etc.
§ Can be used for:
  – Stage connection properties
    • Loading onto a stage in the stage editor
    • Drag-and-drop from the Repository tree
  – Metadata import from that source or target
  – Drag-and-drop of a table imported from that source or target onto the canvas to create a pre-configured stage instance
Reuse: Job Parameter Sets

§ A new repository object that contains the names and values of job parameters
§ A job parameter set can be referenced by one or more jobs, enabling easier deployment of jobs across machines and easy propagation of a changed parameter value
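The value of a shared parameter set is that jobs hold a reference, not a copy, so one change propagates everywhere. A toy sketch of that idea (class names and parameter names are hypothetical, not a DataStage API):

```python
# Sketch of the parameter-set concept: one named set of job parameters
# referenced by many jobs, so a change in one place propagates to all of them.
class ParameterSet:
    def __init__(self, **values):
        self.values = dict(values)

class Job:
    def __init__(self, name, param_set):
        self.name = name
        self.param_set = param_set           # reference, not a copy

    def resolve(self, key):
        return self.param_set.values[key]

warehouse_env = ParameterSet(db_host="dwhost", db_name="DWPROD")
load_job = Job("LoadFacts", warehouse_env)
audit_job = Job("AuditQuality", warehouse_env)

warehouse_env.values["db_host"] = "dwhost2"  # one edit...
# ...and both jobs now resolve the new host.
```

Deploying to a new machine then means shipping one parameter set, not editing every job.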
Reuse: Simply Deploy Data Flows as Shared Services

§ Automates the creation of information integration services, including federation
§ Provides fundamental infrastructure services (security, logging, monitoring)
§ Provisions to leading web-service bindings: JMS, EJB, and SOAP over HTTP
§ Provides load balancing & fault tolerance for requests across multiple service providers
Critical Success Factors for Data Warehousing: Scalability
Job Performance Analysis (Scalability)

A new visualization tool that:
§ Provides deeper insight into runtime job behavior
§ Offers several categories of visualizations, including:
  – Record throughput
  – CPU utilization
  – Job timing
  – Job memory utilization
  – Physical machine utilization
§ Hides runtime complexity by emphasizing the stages the customer placed on the Designer canvas
Record Throughput (Scalability)

§ Breakdown of records read and records written per second
§ Initially filtered to show one line for each link drawn on the canvas
§ Names used for each dataset are the actual stage names on the canvas
§ Advanced users can turn off the filters and see every runtime dataset, including the inner operators of composites and inserted-operator datasets
§ One tab per partition, plus an overlay view of every partition for smaller jobs
CPU Utilization (Scalability)

§ Visualizes the CPU time consumed by each operator (total CPU and system time)
§ Shows which operators dominated the CPU at different points during the run
§ A percentage view (pie chart) shows each canvas stage's share of the job's CPU load
§ Inserted operators and composite sub-operators are automatically bundled into these results
§ Advanced users can see combination, which changes the chart to reflect each process and the stages it contains
Physical Machine Utilization (Scalability)

[Charts: average process distribution, disk throughput, free memory (box-and-whisker), percent CPU utilization]
Resource Estimation (Scalability)

§ Provides estimates for required disk space and CPU utilization. Helps with:
  – Job design – detect bottlenecks and optimize transformation logic to improve performance
  – Error protection – run with a range of data of particular interest for better protection from job aborts due to bad data formats or insufficient null handling
  – Resource allocation – determine allocation of scratch space and disk space to protect the job from aborts due to lack of space
§ Two statistical models
  – Static – provides worst-case disk space estimates based on schema and job design
  – Dynamic – runs the job and statistically samples actual resource usage, then provides calculated estimates per node
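The difference between the two models can be sketched in a few lines. This is an illustrative sketch with made-up numbers and hypothetical function names, not the Resource Estimation tool's actual calculation:

```python
# Static model: worst case, assuming every column fills its maximum
# declared width in the schema.
def static_estimate(row_count, schema):
    return row_count * sum(col["max_bytes"] for col in schema)

# Dynamic model: project measured bytes-per-row from a sampled run
# out to the full expected volume.
def dynamic_estimate(sample_rows, sample_bytes, total_rows):
    return int(sample_bytes / sample_rows * total_rows)

schema = [{"name": "cust_id", "max_bytes": 8},
          {"name": "name", "max_bytes": 100}]
worst_case = static_estimate(1_000_000, schema)            # 108,000,000 bytes
observed = dynamic_estimate(10_000, 420_000, 1_000_000)    # 42,000,000 bytes
```

The gap between the two numbers is typical: variable-width columns rarely fill their declared maximum, which is why the dynamic model gives tighter scratch and disk allocations once a sample run is available.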
Resource Estimation Tool Layout (Scalability)

[Screenshot of the Resource Estimation tool]
Migration Path

§ Seamless upgrade for WebSphere DataStage users to the IBM Information Server
  – All DataStage jobs, along with all other objects, will migrate into the IBM Information Server
§ Upgrade for WebSphere QualityStage into the IBM Information Server
  – Existing QualityStage projects will migrate into the IBM Information Server
  – Conversion utilities for the Standardize, Match, and Survive stages
  – All other stages will continue to execute
Summary

§ Data warehouses are becoming a tier-one operational system in many companies
§ They must adapt to change more quickly, must have authoritative information, and must be scalable
§ Platforms for building data warehouses must support metadata-driven design, collaboration, reuse, and auditable data quality, and must scale to support growing data volumes
§ IBM Information Server provides all of this in a unified platform: metadata-driven design acceleration & automation, auditable data quality, scalability, collaboration, and reuse