What Does Ab Initio Mean?
§ Ab Initio is a Latin phrase that means:
– Of, relating to, or occurring at the beginning; first
– From first principles, in scientific circles
– From the beginning, in legal circles
About Ab Initio
§ Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, clickstream processing, data movement, data transformation, and analytics.
§ Supports integration of arbitrary data sources and programs, and provides complete metadata management across the enterprise.
§ Proven best-of-breed ETL solution.
§ Applications of Ab Initio:
– ETL for data warehouses, data marts, and operational data sources.
– Parallel data cleansing and validation.
– Parallel data transformation and filtering.
– High-performance analytics.
– Real-time, parallel data capture.
Ab Initio Platforms
§ No problem is too big or too small: Ab Initio runs on a few processors or a few hundred processors.
§ Ab Initio runs on virtually every kind of hardware:
– SMP (Symmetric Multiprocessor) systems
– MPP (Massively Parallel Processor) systems
– Clusters
– PCs
Ab Initio runs on many operating systems:
§ Compaq Tru64 UNIX
§ Digital UNIX
§ Hewlett-Packard HP-UX
§ IBM AIX
§ NCR MP-RAS
§ Red Hat Linux
§ IBM/Sequent DYNIX/ptx
§ Siemens Pyramid Reliant UNIX
§ Silicon Graphics IRIX
§ Sun Solaris
§ Windows NT and Windows 2000
Ab Initio base software consists of three main pieces:
§ Ab Initio Co>Operating System and core components
§ Graphical Development Environment (GDE)
§ Enterprise Metadata Environment (EME)
Ab Initio Architecture (layered, top to bottom)
§ Applications
§ Application Development Environments: Graphical, C++, Shell
§ Component Library, User-defined Components, Third-Party Components; Ab Initio Metadata Repository
§ Ab Initio Co>Operating System
§ Native Operating System (UNIX, Windows NT)
Ab Initio Overview
§ GDE: where the user creates and runs all graphs. A graph, when deployed, generates a .ksh script.
§ EME: stores all variables in a repository; also used for control, and collects all metadata about graphs developed in the GDE.
§ Co>Operating System: executes the graphs.
§ DTM: used to schedule graphs developed in the GDE; also capable of maintaining dependencies between graphs.
Co>Operating System
§ The Co>Operating System is the core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe reliability.
§ The Co>Operating System is layered on top of the native operating systems of a collection of computers. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.
Graphical Development Environment (GDE)
§ The GDE lets you create applications by dragging and dropping components onto a canvas, configuring them with familiar, intuitive point-and-click operations, and connecting them into executable flowcharts.
§ These diagrams are architectural documents that developers and managers alike can understand and use. The Co>Operating System executes these flowcharts directly, so there is a seamless and solid connection between the abstract picture of the application and the concrete reality of its execution.
Graphical Development Environment (GDE)
§ The Graphical Development Environment (GDE) provides a graphical user interface to the services of the Co>Operating System.
§ Unlimited scalability: data parallelism yields speedups proportional to the hardware resources provided; double the number of CPUs and execution time is halved.
§ Flexibility: provides a powerful and efficient data transformation engine and an open component model for extending and customizing Ab Initio's functionality.
§ Portability: runs heterogeneously across a huge variety of operating systems and hardware platforms.
Graphical Method for Building Business Applications
§ A graph is a picture that represents the various processing stages of a task and the streams of data as they move from one stage to another.
§ If one picture is worth a thousand words, is one graph worth a thousand lines of code? Ab Initio application graphs often represent in a diagram or two what might have taken hundreds to thousands of lines of code. This can dramatically reduce the time it takes to develop, test, and maintain applications.
What is Graph Programming?
§ Ab Initio has based the GDE on the Data Flow Model.
§ Data flow diagrams allow you to think in terms of meaningful processing steps, not microscopic lines of code.
§ Data flow diagrams capture the movement of information through the application.
§ Ab Initio calls this development method Graph Programming.
Graph Programming? § The process of constructing Ab Initio applications is called Graph Programming. § In Ab Initio’s Graphical Development Environment, you build an application by manipulating components, the building blocks of the graph. § Ab Initio Graphs are based on the Data Flow Model. Even the symbols are similar. The basic parts of Ab Initio graphs are shown below.
Symbols
§ Boxes for processing and data transforms
§ Arrows for data flows between processes
§ Cylinders for serial I/O files
§ Divided cylinders for parallel I/O files
§ Grid boxes for database tables
Graph Programming
§ Working with the GDE on your desktop is easier than drawing a data flow diagram on a whiteboard. You simply drag and drop functional modules called components and link them with a swipe of the mouse. When it's time to run the application, the Ab Initio Co>Operating System turns the diagram into a collection of processes running on servers.
§ The Ab Initio term for a running data flow diagram is a graph. The inputs and outputs are dataset components; the processing steps are program components; and the data conduits are flows.
Anatomy of a Running Job
What happens when you push the "Run" button?
Ø Your graph is translated into a script that can be executed in the Shell Development Environment.
Ø This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
Ø The script is invoked (via REXEC or TELNET) on the server.
Ø The script creates and runs a job that may run across many nodes.
Ø Monitoring information is sent back to the GDE client.
Anatomy of a Running Job
§ Host Process Creation
Ø Pushing the "Run" button generates a script.
Ø The script is transmitted to the Host node.
Ø The script is invoked, creating the Host process.
Anatomy of a Running Job
§ Agent Process Creation
Ø The Host process spawns Agent processes.
Anatomy of a Running Job
§ Component Process Creation
Ø Agent processes create Component processes on each processing node.
Anatomy of a Running Job
§ Component Execution
Ø Component processes do their jobs.
Ø Component processes communicate directly with datasets and each other to move data around.
Anatomy of a Running Job
§ Successful Component Termination
Ø As each Component process finishes with its data, it exits with success status.
Anatomy of a Running Job
§ Agent Termination
Ø When all of an Agent's Component processes exit, the Agent informs the Host process that those components are finished.
Ø The Agent process then exits.
Anatomy of a Running Job
§ Host Termination
Ø When all Agents have exited, the Host process informs the GDE that the job is complete.
Ø The Host process then exits.
Ab Initio S/w Versions & File Extensions
§ Software Versions
– Co>Operating System Version => 2.8.32
– GDE Version => 1.8.22
§ File Extensions
– .mp Stored Ab Initio graph or graph component
– .mpc Program or custom component
– .mdc Dataset or custom dataset component
– .dml Data Manipulation Language file or record type definition
– .xfr Transform function file
– .dat Data file (either serial file or multifile)
Versions
§ To find the GDE version, select Help >> About Ab Initio from the GDE window.
§ To find the Co>Operating System version, select Run >> Settings from the GDE window and look for the Detected Base System Version.
Connecting to Co>op Server from GDE
Host Profile Setting
1. Choose Settings from the Run menu.
2. Check the Use Host Profile Setting checkbox.
3. Click the Edit button to open the Host Profile dialog.
4. If running Ab Initio on your local NT system, check the Local Execution (NT) checkbox and go to step 6.
5. If running Ab Initio on a remote UNIX system, fill in the path to the Host, plus Host Login and Password.
6. Type the full path of the Host directory.
7. Select the Shell Type from the pull-down menu.
8. Test Login and, if necessary, make changes.
Host Profile
§ Enter Host, Login, Password & Host directory.
§ Select the Shell Type.
Ab Initio Components
§ Ab Initio provides a rich set of components; the Datasets, Partition, Transform, Sort, and Database components are the most frequently used.
Creating Graph
§ Type the Label.
§ Specify the input .dat file.
Creating Graph - DML
Specify the .dml file:
§ Propagate from Neighbors: copy record formats from a connected flow.
§ Same As: copy record formats from a specific component's port.
§ Path: store record formats in a local file, host file, or in the Ab Initio repository.
§ Embedded: type the record format directly in a string.
Creating Graph - DML
§ DML is Ab Initio's Data Manipulation Language.
§ A .dml file can be edited through the Record Format Editor's Grid View.
§ DML describes data in terms of:
– Record Formats that list the fields and format of input, output, and intermediate records.
– Expressions that define simple computations, for example, selection.
– Transform Functions that control reformatting, aggregation, and other data transformations.
– Keys that specify grouping, ordering, and partitioning relationships between records.
Creating Graph - Transform
Specify the .xfr file:
§ A transform function is either a DML file or a DML string that describes how you manipulate your data.
§ Ab Initio transform functions mainly consist of a series of assignment statements. Each statement is called a business rule.
§ When Ab Initio evaluates a transform function, it performs the following tasks:
– Initializes local variables
– Evaluates statements
– Evaluates rules
§ Transform function files have the .xfr extension.
Creating Graph - XFR
§ Transform function: a set of rules that compute output values from input values.
§ Business rule: the part of a transform function that describes how you manipulate one field of your output data.
§ Variable: an optional part of a transform function that provides storage for temporary values.
§ Statement: an optional part of a transform function that assigns the values of variables in a specific order.
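Actual transform functions are written in DML, but their evaluation order (variables first, then one rule per output field) can be sketched in Python. All names here (tax_rate, full_name, net_amount, the input fields) are hypothetical, purely for illustration:

```python
# Sketch of how a transform function evaluates: local variables are
# initialized first, then each business rule assigns one output field.
def transform(record):
    tax_rate = 0.1                                            # local variable
    out = {}
    out["full_name"] = record["first"] + " " + record["last"]  # rule 1
    out["net_amount"] = record["amount"] * (1 - tax_rate)      # rule 2
    return out

result = transform({"first": "John", "last": "Doe", "amount": 100.0})
```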
Sample Components
§ Sort
§ Dedup
§ Join
§ Replicate
§ Rollup
§ Filter by Expression
§ Merge
§ Lookup
§ Reformat
etc.
Creating Graph – Sort Component
Specify the Key for the Sort:
§ Sort: the Sort component reorders data. It has two parameters: key and max-core.
§ Key: the key parameter describes the collation order.
§ Max-core: the max-core parameter controls how often the Sort component dumps data from memory to disk.
Creating Graph – Dedup Component
§ The Dedup component removes duplicate records.
§ The Dedup criterion is one of unique-only, First, or Last.
§ Select the Dedup criterion.
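The First criterion can be sketched in Python, assuming (as Dedup does) that the input arrives grouped on the key. The field names are hypothetical:

```python
def dedup_first(records, key):
    """Keep only the first record of each group of records sharing a
    key value; later duplicates in the group are dropped ('First')."""
    out = []
    prev = object()                    # sentinel: no previous key yet
    for rec in records:
        if rec[key] != prev:
            out.append(rec)
            prev = rec[key]
    return out

rows = [{"zip": "02114", "name": "Mark"},
        {"zip": "02114", "name": "Bill"},
        {"zip": "02241", "name": "Sue"}]
firsts = dedup_first(rows, "zip")      # Mark and Sue survive
```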
Creating Graph – Replicate Component § Replicate combines the data records from the inputs into one flow and writes a copy of that flow to each of its output ports. § Use Replicate to support component parallelism.
Creating Graph – Join Component
§ Specify the key for the Join.
§ Specify the type of Join.
Database Configuration (.dbc)
§ A file with a .dbc extension provides the GDE with the information it needs to connect to a database. A configuration file contains the following information:
– The name and version number of the database to which you want to connect.
– The name of the computer on which the database instance or server runs, or on which the database remote access software is installed.
– The name of the database instance, server, or provider to which you want to connect.
§ You generate a configuration file by using the Properties dialog box for one of the Database components.
Creating Parallel Applications
§ Types of parallel processing:
– Component-level parallelism: an application with multiple components running simultaneously on separate data uses component parallelism.
– Pipeline parallelism: an application with multiple components running simultaneously on the same data uses pipeline parallelism.
– Data parallelism: an application with data divided into segments, operating on each segment simultaneously, uses data parallelism.
Partition Components
§ Partition by Expression: dividing data according to a DML expression.
§ Partition by Key: grouping data by a key.
§ Partition with Load Balance: dynamic load balancing.
§ Partition by Percentage: distributing data so the output is proportional to fractions of 100.
§ Partition by Range: dividing data evenly among nodes, based on a key and a set of partitioning ranges.
§ Partition by Round-robin: distributing data evenly, in block-size chunks, across the output partitions.
Departition Components
§ Concatenate: produces a single output flow that contains all the records from the first input partition, then all the records from the second input partition, and so on.
§ Gather: collects inputs from multiple partitions in an arbitrary manner and produces a single output flow; does not maintain sort order.
§ Interleave: collects records from many sources in round-robin fashion.
§ Merge: collects inputs from multiple sorted partitions and maintains the sort order.
Multifile Systems
§ A multifile system is a specially created set of directories, possibly on different machines, which have an identical substructure.
§ Each directory is a partition of the multifile system. When a multifile is placed in a multifile system, its partitions are files within each of the partitions of the multifile system.
§ Multifile systems give better performance than flat file systems because they can divide your data among multiple disks or CPUs.
§ Typically (an SMP machine is the exception), a multifile system is created with the control partition on one node and data partitions on other nodes, to distribute the work and improve performance.
§ To do this, use full Internet URLs that specify file and directory names and locations on remote machines.
Multifile
SANDBOX
§ A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration.
§ A sandbox can be a file system copy of a datastore project.
§ In a graph, instead of specifying the entire path for a file location, we specify only a sandbox parameter variable. For example: $AI_IN_DATA/customer_info.dat, where $AI_IN_DATA contains the entire path with reference to the sandbox $AI_HOME variable.
§ The actual in_data directory is $AI_HOME/in_data in the sandbox.
SANDBOX
§ The sandbox provides an excellent mechanism for maintaining uniqueness while moving from the development to the production environment, by means of switch parameters.
§ Parameters defined in a sandbox can be used across all the graphs belonging to that sandbox.
§ The topmost variable, $PROJECT_DIR, contains the path of the home directory.
SANDBOX
Deploying
§ Every graph, after validation and testing, has to be deployed as a .ksh file into the run directory on UNIX.
§ This .ksh file is an executable file which is the backbone of the entire automation/wrapper process.
§ The wrapper automation consists of .run and .env files, a dependency list, a job list, etc.
§ For a detailed description of the wrapper and the different directories and files, please refer to the documentation on the wrapper / UNIX presentation.
Parallelism § Component parallelism § Pipeline parallelism § Data parallelism
Component Parallelism Sorting Customers Sorting Transactions
Component Parallelism
§ Comes "for free" with graph programming.
§ Limitation:
– Scales to the number of "branches" in a graph.
Pipeline Parallelism Processing Record: 100 Processing Record: 99
Pipeline Parallelism
§ Comes "for free" with graph programming.
§ Limitations:
– Scales to the length of "branches" in a graph.
– Some operations, like sorting, do not pipeline.
Data Parallelism
(diagram: data divided into partitions)
Two Ways of Looking at Data Parallelism Expanded View: Global View:
Data Parallelism § Scales with data. § Requires data partitioning. § Different partitioning methods for different operations.
Data Partitioning Expanded View: Global View:
Data Partitioning: The Global View Degree of Parallelism Fan-out Flow
Session III Partitioning
Partitioning Review
Fan-out Flow
§ For the various partitioning components:
– Is it key-based? Does the problem require a key-based partition?
– Performance: are the partitions balanced or skewed?
Partitioning: Performance
§ Balanced: processors get neither too much nor too little.
§ Skewed: some processors get too much, others too little.
Sample Data to be Partitioned
§ Customers:
42 John 02116 30
43 Mark 02114 9
44 Bob 02116 8
45 Sue 02241 92
46 Rick 02116 23
47 Bill 02114 14
48 Mary 02116 38
49 Jane 02241 2
§ Record format:
record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
Partition by Round-robin
Partition 0: 42 John 02116 30, 45 Sue 02241 92, 48 Mary 02116 38
Partition 1: 43 Mark 02114 9, 46 Rick 02116 23, 49 Jane 02241 2
Partition 2: 44 Bob 02116 8, 47 Bill 02114 14
Partition by Round-robin
§ Not key-based.
§ Results in very well balanced data, especially with a block size of 1.
§ Useful for record-independent parallelism.
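Round-robin partitioning with a block size of 1 can be sketched in Python, applied to the customer ids from the sample data:

```python
def partition_round_robin(records, n):
    """Deal records across n partitions in turn, like dealing cards."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

ids = [42, 43, 44, 45, 46, 47, 48, 49]
p0, p1, p2 = partition_round_robin(ids, 3)
# p0 -> [42, 45, 48], p1 -> [43, 46, 49], p2 -> [44, 47]
```

With block size 1 the partition sizes differ by at most one record, which is why the result is so well balanced.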
Partition by Key
Partition on zipcode:
Partition 0: 43 Mark 02114 9, 45 Sue 02241 92, 47 Bill 02114 14, 49 Jane 02241 2
Partition 1: 42 John 02116 30, 44 Bob 02116 8, 46 Rick 02116 23, 48 Mary 02116 38
Partition by Key, often followed by a Sort on zipcode:
Partition 0 (sorted): 43 Mark 02114 9, 47 Bill 02114 14, 45 Sue 02241 92, 49 Jane 02241 2
Partition 1 (sorted): 42 John 02116 30, 44 Bob 02116 8, 46 Rick 02116 23, 48 Mary 02116 38
Rollup by zipcode:
Totals by Zipcode (partition 0): 02114 23, 02241 94
Totals by Zipcode (partition 1): 02116 99
Partition by Key
§ Key-based.
§ Usually results in well balanced data.
§ Useful for key-dependent parallelism.
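Key partitioning can be sketched as hashing the key value to pick a partition. Python's built-in hash stands in for Ab Initio's internal hash function; the guarantee illustrated is only that equal keys always land in the same partition:

```python
def partition_by_key(records, keyfunc, n):
    """Records with equal key values always land in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(keyfunc(rec)) % n].append(rec)
    return parts

customers = [(42, "John", "02116"), (43, "Mark", "02114"),
             (45, "Sue", "02241"), (46, "Rick", "02116"),
             (47, "Bill", "02114"), (49, "Jane", "02241")]
parts = partition_by_key(customers, lambda r: r[2], 2)
# Every zipcode appears in exactly one partition
```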
Partition by Expression: amount/33
Partition 0: 42 John 02116 30, 43 Mark 02114 9, 44 Bob 02116 8, 46 Rick 02116 23, 47 Bill 02114 14, 49 Jane 02241 2
Partition 1: 48 Mary 02116 38
Partition 2: 45 Sue 02241 92
Partition by Expression
§ Key-based, depending on the expression.
§ The resulting balance is very dependent on the expression and on the data.
§ Various application-dependent uses.
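The amount/33 example from the slide can be sketched in Python using integer division, reproducing the same three partitions:

```python
customers = [(42, "John", 30), (43, "Mark", 9), (44, "Bob", 8),
             (45, "Sue", 92), (46, "Rick", 23), (47, "Bill", 14),
             (48, "Mary", 38), (49, "Jane", 2)]

def partition_by_expression(records, expr, n):
    """Route each record to the partition number the expression yields."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[expr(rec)].append(rec)
    return parts

# amount / 33 with integer division, as on the slide
parts = partition_by_expression(customers, lambda r: r[2] // 33, 3)
# parts[1] holds only Mary (38); parts[2] only Sue (92)
```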
Partition by Range
With splitter values of 9 and 23:
Partition 0: 43 Mark 02114 9, 44 Bob 02116 8, 49 Jane 02241 2
Partition 1: 46 Rick 02116 23, 47 Bill 02114 14
Partition 2: 42 John 02116 30, 45 Sue 02241 92, 48 Mary 02116 38
Range + Sort: Global Ordering
Sort following a Partition by Range:
Partition 0: 49 Jane 02241 2, 44 Bob 02116 8, 43 Mark 02114 9
Partition 1: 47 Bill 02114 14, 46 Rick 02116 23
Partition 2: 42 John 02116 30, 48 Mary 02116 38, 45 Sue 02241 92
Partition by Range
§ Key-based.
§ The resulting balance depends on the set of splitters chosen.
§ Useful for "binning" and global sorting.
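With the slide's splitter values of 9 and 23, range partitioning can be sketched with the standard bisect module. bisect_left places a key equal to a splitter in the lower partition, which is why Mark (amount 9) lands in partition 0:

```python
import bisect

def partition_by_range(records, keyfunc, splitters):
    """len(splitters)+1 partitions: keys <= splitters[0] go to partition 0,
    keys in the next range to partition 1, and so on."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        parts[bisect.bisect_left(splitters, keyfunc(rec))].append(rec)
    return parts

customers = [("John", 30), ("Mark", 9), ("Bob", 8), ("Sue", 92),
             ("Rick", 23), ("Bill", 14), ("Mary", 38), ("Jane", 2)]
parts = partition_by_range(customers, lambda r: r[1], [9, 23])
# parts[0]: Mark, Bob, Jane; parts[1]: Rick, Bill; parts[2]: John, Sue, Mary
```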
Partition with Load Balance
If the middle node is highly loaded:
Partition 0: 42 John 02116 30, 43 Mark 02114 9, 44 Bob 02116 8, 49 Jane 02241 2
Partition 1: 45 Sue 02241 92
Partition 2: 46 Rick 02116 23, 47 Bill 02114 14, 48 Mary 02116 38
Partition by Load Balance
§ Not key-based.
§ Results in a skewed data distribution that complements the skewed load.
§ Useful for record-independent parallelism.
Partition with Percentage
With percentages 4, 20:
Partition 0: 42 John 02116 30, 43 Mark 02114 9, 44 Bob 02116 8, 45 Sue 02241 92
Partition 1: 46 Rick 02116 23, 47 Bill 02114 14, 48 Mary 02116 38, 49 Jane 02241 2 (the next 16 records would go here)
Partition 2: (the next 76 records would go here)
Partition by Percentage
§ Not key-based.
§ Results in a usually skewed data distribution conforming to the provided percentages.
§ Useful for record-independent parallelism.
Broadcast (as a Partitioner)
Unlike all other partitioners, which write a record to ONE output flow, Broadcast writes each record to EVERY output flow.
Customers:
42 John 02116 30
43 Mark 02114 9
44 Bob 02116 8
45 Sue 02241 92
46 Rick 02116 23
47 Bill 02114 14
48 Mary 02116 38
49 Jane 02241 2
Broadcast
§ Not key-based.
§ Results in perfectly balanced partitions.
§ Useful for record-independent parallelism.
Session IV De-Partitioning
Departitioning combines many flows of data to produce one flow. It is the opposite of partitioning. Each departition component combines flows in a different manner.
Departitioning
§ Expanded View: Score 1, Score 2, and Score 3 flows feed a Departition component that writes one Output File.
§ Global View: (diagram)
Departitioning
Fan-in Flow
§ For the various departitioning components:
– Key-based?
– Result ordering?
– Effect on parallelism?
– Uses?
Concatenation
Globally ordered, partitioned data:
Partition 0: 49 Jane 02241 2, 44 Bob 02116 8, 43 Mark 02114 9
Partition 1: 47 Bill 02114 14, 46 Rick 02116 23
Partition 2: 42 John 02116 30, 48 Mary 02116 38, 45 Sue 02241 92
Sorted data, following concatenation:
49 Jane 02241 2
44 Bob 02116 8
43 Mark 02114 9
47 Bill 02114 14
46 Rick 02116 23
42 John 02116 30
48 Mary 02116 38
45 Sue 02241 92
Concatenation
§ Not key-based.
§ Result ordering is by partition.
§ Serializes pipelined computation.
§ Useful for:
– creating a serial flow from partitioned data
– appending headers and trailers
– writing DML
§ Used infrequently.
Merge
Round-robin partitioned and sorted by amount:
Partition 0: 42 John 02116 30, 48 Mary 02116 38, 45 Sue 02241 92
Partition 1: 49 Jane 02241 2, 43 Mark 02114 9, 46 Rick 02116 23
Partition 2: 44 Bob 02116 8, 47 Bill 02114 14
Sorted data, following a merge on amount:
49 Jane 02241 2
44 Bob 02116 8
43 Mark 02114 9
47 Bill 02114 14
46 Rick 02116 23
42 John 02116 30
48 Mary 02116 38
45 Sue 02241 92
Merge
§ Key-based.
§ Result ordering is sorted if each input is sorted.
§ Possibly synchronizes pipelined computation; may even serialize.
§ Useful for creating ordered data flows.
§ Used more than Concatenate, but still infrequently.
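Since each input partition is already sorted on amount, the merge behavior matches Python's heapq.merge, which combines sorted iterables into one sorted flow (only the amounts are shown here):

```python
import heapq

# The three partitions from the slide, each already sorted by amount
p0 = [30, 38, 92]   # John, Mary, Sue
p1 = [2, 9, 23]     # Jane, Mark, Rick
p2 = [8, 14]        # Bob, Bill
merged = list(heapq.merge(p0, p1, p2))
# merged -> [2, 8, 9, 14, 23, 30, 38, 92]
```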
Interleave
Round-robin partitioned and scored:
Partition 0: 42 John 02116 30 A, 45 Sue 02241 92 A, 48 Mary 02116 38 A
Partition 1: 43 Mark 02114 9 C, 46 Rick 02116 23 B, 49 Jane 02241 2 C
Partition 2: 44 Bob 02116 8 C, 47 Bill 02114 14 B
Scored dataset in original order, following an interleave:
42 John 02116 30 A
43 Mark 02114 9 C
44 Bob 02116 8 C
45 Sue 02241 92 A
46 Rick 02116 23 B
47 Bill 02114 14 B
48 Mary 02116 38 A
49 Jane 02241 2 C
Interleave
§ Not key-based.
§ Result ordering is the inverse of round-robin.
§ Synchronizes pipelined computation.
§ Useful for restoring original order following a record-independent parallel computation partitioned by round-robin.
§ Used in rare circumstances.
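Taking one record from each partition in turn undoes a block-size-1 round-robin partition, which can be sketched on the customer ids from the earlier slides:

```python
def interleave(partitions):
    """Collect one record from each partition in turn (the inverse of
    round-robin), restoring the original record order."""
    out = []
    longest = max(len(p) for p in partitions)
    for i in range(longest):
        for p in partitions:
            if i < len(p):
                out.append(p[i])
    return out

# The round-robin partitions of ids 42..49 from the earlier slide
restored = interleave([[42, 45, 48], [43, 46, 49], [44, 47]])
# restored -> [42, 43, 44, 45, 46, 47, 48, 49]
```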
Gather
Round-robin partitioned and scored:
Partition 0: 42 John 02116 30 A, 45 Sue 02241 92 A, 48 Mary 02116 38 A
Partition 1: 43 Mark 02114 9 C, 46 Rick 02116 23 B, 49 Jane 02241 2 C
Partition 2: 44 Bob 02116 8 C, 47 Bill 02114 14 B
Scored dataset in random order, following a gather:
43 Mark 02114 9 C
46 Rick 02116 23 B
42 John 02116 30 A
45 Sue 02241 92 A
48 Mary 02116 38 A
44 Bob 02116 8 C
47 Bill 02114 14 B
49 Jane 02241 2 C
Gather § Not key-based. § Result ordering is unpredictable. § Neither serializes nor synchronizes pipelined computation. § Useful for efficient collection of data from multiple partitions and for repartitioning. § Used most frequently
Layout § Layout determines the location of a resource. § A layout is either serial or parallel. § A serial layout specifies one node and one directory. § A parallel layout specifies multiple nodes and multiple directories. It is permissible for the same node to be repeated.
Layout
§ The location of a dataset is one or more places on one or more disks.
§ The location of a computing component is one or more directories on one or more nodes. By default, the node and directory are unknown.
§ Computing components propagate their layouts from their neighbors, unless the user specifically assigns a layout.
Session V Join
Join Types • Inner join — sets the record-required parameters for all ports to True. • Outer join — sets the record-required parameters for all ports to False. • Explicit — allows you to set the record-required parameter for each port individually.
Join Types (contd.)
§ Case 1: Inner join.
§ Case 2: Full outer join.
§ Case 3: Explicit — record-required0: false; record-required1: true.
§ Case 4: Explicit — record-required0: true; record-required1: false.
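The record-required semantics can be sketched in Python. Here required0/required1 stand in for the record-required parameters: True on a port means unmatched records from the other port are dropped, False keeps them (outer behavior). This is an illustration of the four cases above, not Ab Initio's actual implementation:

```python
def join(in0, in1, key, required0=True, required1=True):
    """Inner join: both required flags True. Full outer: both False.
    Explicit: set each flag individually."""
    keys0 = {r[key] for r in in0}
    out = []
    for l in in0:
        matches = [r for r in in1 if r[key] == l[key]]
        for r in matches:
            out.append({**l, **r})          # matched pairs always emitted
        if not matches and not required1:   # keep unmatched in0 record
            out.append(dict(l))
    if not required0:                       # keep unmatched in1 records
        out.extend(dict(r) for r in in1 if r[key] not in keys0)
    return out

a = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
b = [{"id": 2, "y": "c"}, {"id": 3, "y": "d"}]
inner = join(a, b, "id")                                     # 1 record
full = join(a, b, "id", required0=False, required1=False)    # 3 records
```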
Some Key Join Parameters
§ key: name(s) of the field(s) in the input records that must have matching values for Join to call the transform function.
§ driving: number of the port to which you want to connect the driving input. The driving input is the largest input; all other inputs are read into memory. The driving parameter is only available when the sorted-input parameter is set to "In memory: Input need not be sorted".
Some Key Join Parameters
§ dedupn: set to true to remove duplicates from the corresponding inn port before joining. This allows you to choose only one record from a group with matching key values as the argument to the transform function. The default is false, which does not remove duplicates.
§ override-keyn: alternative name(s) for the key field(s) for a particular in port.
References
§ Ab Initio Tutorial
§ Ab Initio Online Help
§ Website (abinitio.com)