Hadoop and DB2 11 – both sides of the story
Emil Kotrc, CA Technologies
Session Code: E7
14:00-15:00 Tuesday, 17th November 2015
Platform: DB2 for z/OS
Abstract
DB2 for z/OS and Hadoop have been good friends since version 11. DB2 11 can easily access data stored in a Hadoop/IBM BigInsights cluster, and it can also submit a MapReduce job and read the results back via IBM-supplied user-defined functions. On the other hand, a typical Hadoop installation can access DB2 data out of the box using the standard interfaces. In this presentation we will show and discuss the possibilities of one-way and two-way integration of DB2 for z/OS and Hadoop.
Objectives
• A very brief introduction to Hadoop, IBM BigInsights, and related technologies.
• Show how DB2 can access data in Hadoop or submit jobs to Hadoop using IBM-supplied UDFs. Explain the implementation steps.
• Show how Hadoop can access data in DB2 using Sqoop and other tools.
• Discuss possible integrations of DB2 and Hadoop and show some examples.
Agenda
• Hadoop overview – HDFS, MapReduce
• IBM BigInsights
• DB2 and Hadoop
• Hadoop and DB2
• Next Big Thing… Spark?
• References
HADOOP OVERVIEW
Hadoop Overview
• Hadoop – open source framework for
  1. Storage, and
  2. Large-scale parallel processing
• Batch oriented
• Designed to run on commodity hardware
• All components can cope with hardware failures
• Implemented in Java, Apache License, first release in 2007
• Based on two Google papers
  • The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 2003
  • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, 2004
Hadoop Overview
Main Components, HDFS
1. Hadoop Distributed File System (HDFS)
• Data stored in blocks and replicated across multiple nodes
• An HDFS cluster consists of
  • NameNode – keeps the metadata
  • DataNodes – serve blocks of data
• Interfaces – REST API, HDFS shell
  • http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=<OPERATION>
  • $ curl --user user:password "http://server:14000/webhdfs/v1/tmp/test.csv?op=OPEN"
  • $ hadoop fs -ls /tmp
  • $ hadoop fs -cat file
  • $ hadoop fs -copyFromLocal local.csv remote.csv
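The WebHDFS URL pattern above is simple enough to sketch in a few lines of Python. The host, port, and user name below are placeholders for illustration, not values from the slides:

```python
# Minimal sketch of building a WebHDFS v1 request URL:
# http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=<OPERATION>
from urllib.parse import quote

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 URL for the given HDFS path and operation."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

url = webhdfs_url("server", 14000, "/tmp/test.csv", "OPEN", user="biadmin")
print(url)
# -> http://server:14000/webhdfs/v1/tmp/test.csv?op=OPEN&user=biadmin
```

Any HTTP client (curl, as on the slide, or Python's urllib) can then issue the request against the NameNode's HTTP port.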
Hadoop Overview
Main Components, HDFS
1. Hadoop Distributed File System (HDFS)
Source: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Hadoop Overview
Main Components, MapReduce
2. MapReduce
• Framework for parallel and distributed computation
• Takes advantage of data locality
  • moves the code to the data
  • keeps the work as close to the data as possible
• Two steps
  1. Map – takes the input, divides the problem into sub-problems and distributes them to worker nodes; performs filtering, shuffling, and sorting
  2. Reduce – collects the outputs from the workers, performs summary operations, and forms the output, often saved in HDFS
• MapReduce programs can be written in
  • Java (the "assembler" of Hadoop)
  • C++, Python, Ruby, …
  • High-level languages that translate to MapReduce (Pig, Hive, JAQL)
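The two steps can be illustrated with a toy word count in plain Python. This runs in one process; a real MapReduce job would distribute the same map and reduce logic across the cluster, close to the data:

```python
# Toy word count showing the Map and Reduce phases in-memory.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: turn each input line into (key, value) pairs.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group the pairs by key; Reduce: sum each group.
    pairs.sort(key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["to be or not", "to be"]))
print(counts)  # -> {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```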
Hadoop Overview
Main Components, MapReduce
2. MapReduce
Source: https://developer.yahoo.com/hadoop/tutorial/module4.html
Hadoop Overview
Related Projects
• Oozie – workflow scheduler
• Avro – data serialization system
• Hive – data warehouse for Hadoop
  • Provides an SQL-like language (HiveQL)
• HBase – BigTable (Google) like capabilities for Hadoop
  • Non-relational model
• Sqoop – application for transferring data between relational databases and Hadoop
• Flume – service for collecting, aggregating, and moving large amounts of data
• Pig – high-level platform for MapReduce (Pig Latin)
DB2 AND HADOOP
DB2 and Hadoop
Source: DB2 11 Technical Overview, SG24-8180
DB2 and Hadoop
Overview
• IBM provides DB2 connectors for IBM BigInsights
  • Implemented as two DB2 UDFs – part of the DB2 11 base installation
  • Integration of DB2 with Big Data analytics
  • Think of it as a proof of concept
• DB2 can submit a MapReduce job to Hadoop/BigInsights
  • Using a JAQL query via JAQL_SUBMIT()
• DB2 can read data from HDFS
  • Can be used in SQL as relational data via HDFS_READ()
• Combined – DB2 can submit a MapReduce job, read the outputs back, and process them using SQL.
DB2 and Hadoop
BigInsights
• IBM Big Data platform based on Hadoop, Spark
• Runs on x86-64, POWER, z Systems (zLinux)
• Quick Start Edition – free, downloadable, non-production
  • VMware image
  • Native installation for Linux
• Version 3.0 vs 4.0
  • IBM Open Platform with Apache Hadoop
  • IBM BigInsights for Apache Hadoop v4
  • Value-added features for analytics (Data Scientist, Analyst, Enterprise Management)
  • No migration from 3.0 to 4.0, should come in 4.1 (?)
DB2 and Hadoop
DB2 and BigInsights Integration
1. Install and configure BigInsights (version 3.0)
2. Deploy the JAQL ad hoc query application
3. Configure the WLM environment
4. Create the UDFs to support Big Data
• JAQL_SUBMIT() and HDFS_READ()
• Need the DSNWLM_WEBSERVICES WLM environment
• Part of the DB2 11 installation, FMID HDBBB10 (DB2 Base), prefix.SDSNLOD2 load library
• Documented in IBM support document #7040438 (for BI 2.1.1)
• See http://www.ibm.com/developerworks/library/ba-integrate-db2-biginsights/index.html
DB2 and Hadoop
JAQL_SUBMIT()
• Submits a JAQL script to the BigInsights ad hoc JAQL application
• Uses the REST API
• Parameters
  • JAQL script – up to 8000 characters
  • Return string – for HDFS_READ integration
  • Console URL
  • Option string – user and password
DB2 and Hadoop
JAQL_SUBMIT(), JAQL
• JSON Query Language
• Data processing and query language for BigInsights
• A JAQL query is translated to MapReduce
• Pipeline processing – source -> operator_1 -> … -> operator_n -> output

/tmp/employees.json:
[
  {name: "Jon Doe", income: 20000, mgr: false},
  {name: "Vince Wayne", income: 32500, mgr: false},
  {name: "Jane Dean", income: 72000, mgr: true},
  {name: "Alex Smith", income: 25000, mgr: false}
]

read(jsonText("/tmp/employees.json"))
  -> filter not $.mgr
  -> sort by [$.income desc]
  -> write(del("/tmp/emplist.csv", { schema: schema { name, income }}));
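To make the pipeline semantics concrete, here is the same computation restated in plain Python: drop the managers, sort by income descending, and keep only name and income (JAQL itself would run this as a MapReduce job on the cluster):

```python
# The JAQL pipeline above, restated as ordinary Python over the same records.
import json

employees_json = """[
  {"name": "Jon Doe",     "income": 20000, "mgr": false},
  {"name": "Vince Wayne", "income": 32500, "mgr": false},
  {"name": "Jane Dean",   "income": 72000, "mgr": true},
  {"name": "Alex Smith",  "income": 25000, "mgr": false}
]"""

rows = json.loads(employees_json)
# filter not $.mgr -> sort by [$.income desc]
result = sorted((r for r in rows if not r["mgr"]),
                key=lambda r: r["income"], reverse=True)
for r in result:
    print(r["name"], r["income"])
# Vince Wayne 32500
# Alex Smith 25000
# Jon Doe 20000
```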
DB2 and Hadoop
JAQL_SUBMIT(), DDL

CREATE FUNCTION SYSFUN.JAQL_SUBMIT(
    SCRIPT  VARCHAR(8000),
    PARMS   VARCHAR(512),
    URL     VARCHAR(512),
    OPTIONS VARCHAR(256))
  RETURNS VARCHAR(512)
  LANGUAGE C
  EXTERNAL NAME DSN8JAQL
  PARAMETER STYLE DB2SQL
  PARAMETER CCSID UNICODE
  PARAMETER VARCHAR NULTERM
  FENCED
  NOT DETERMINISTIC
  EXTERNAL ACTION
  DISALLOW PARALLEL
  WLM ENVIRONMENT DSNWLMW
  STAY RESIDENT YES
  RUN OPTIONS 'POSIX(ON),XPLINK(ON)';

SELECT JAQL_SUBMIT(
    '[1,2,3,4,5] -> write(del("/tmp/test.csv"))',
    '',
    'http://myserver:8080',
    '')
  FROM SYSIBM.SYSDUMMY1;
DB2 and Hadoop
HDFS_READ()
• Reads a delimited file in HDFS
• Uses the REST API
  • HttpFS API
  • http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=<OPERATION>
• Returns a generic table – new in DB2 11
  • Schema of the table is given at query time
• Parameters
  • REST API request
  • Option string – delimiter, user, password
DB2 and Hadoop
HDFS_READ(), DDL

CREATE FUNCTION SYSFUN.HDFS_READ(
    URL     VARCHAR(256),
    OPTIONS VARCHAR(256))
  RETURNS GENERIC TABLE
  LANGUAGE C
  EXTERNAL NAME DSN8HDFS
  PARAMETER STYLE DB2SQL
  PARAMETER CCSID UNICODE
  PARAMETER VARCHAR NULTERM
  FINAL CALL
  FENCED
  NOT DETERMINISTIC
  EXTERNAL ACTION
  DISALLOW PARALLEL
  SCRATCHPAD 200
  WLM ENVIRONMENT DSNWLMW
  STAY RESIDENT YES
  RUN OPTIONS 'POSIX(ON),XPLINK(ON)'
  CARDINALITY 100000;
DB2 and Hadoop
HDFS_READ(), Example

SELECT *
  FROM TABLE (HDFS_READ('http://myserver:14000/webhdfs/v1/tmp/cars.csv', ''))
  AS CARS(YEAR INT,
          MAKE VARCHAR(10),
          MODEL VARCHAR(30),
          DESCRIPTION VARCHAR(40));

/tmp/cars.csv:
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""",4900.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
2000,Toyota,Camry,,6700.0
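HDFS_READ hands DB2 the delimited file as a typed table, including correct handling of quoted fields that contain commas. As a sketch of what that parsing involves, the same two of the rows above can be parsed in Python with the standard csv module:

```python
# Parsing delimited HDFS output with quoted fields, as HDFS_READ must do
# before presenting the rows as a typed table to SQL.
import csv
import io

data = '''1997,Ford,E350,"ac, abs, moon",3000.00
2000,Toyota,Camry,,6700.00'''

# Unpack year/make/model/description, converting YEAR to INT.
rows = [(int(year), make, model, desc)
        for year, make, model, desc, _price in csv.reader(io.StringIO(data))]
print(rows[0])  # -> (1997, 'Ford', 'E350', 'ac, abs, moon')
```

Note how `"ac, abs, moon"` stays a single field despite its embedded commas, and the missing description for the Camry parses as an empty string.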
DB2 and Hadoop
JAQL_SUBMIT() and HDFS_READ()

SELECT *
  FROM TABLE (HDFS_READ(JAQL_SUBMIT('
      read(jsonText("/tmp/employees.json"))
        -> filter not $.mgr
        -> sort by [$.income desc]
        -> write(del("/tmp/emplist", { schema: schema { name, income }}));
    ',
    'http://myserver:14000/webhdfs/v1/tmp/emplist',
    'http://myserver:8080',
    ''))
  AS EMPLIST(NAME VARCHAR(128), INCOME INTEGER);

Input /tmp/employees.json:
[
  {name: "Jon Doe", income: 20000, mgr: false},
  {name: "Vince Wayne", income: 32500, mgr: false},
  {name: "Jane Dean", income: 72000, mgr: true},
  {name: "Alex Smith", income: 25000, mgr: false}
]

Result:
NAME         INCOME
Vince Wayne   32500
Alex Smith    25000
Jon Doe       20000
DB2 and Hadoop
Key points for integration
• Lessons learned from JAQL_SUBMIT() and HDFS_READ()
• Hadoop and related projects
  • REST APIs
  • MapReduce high-level languages
• DB2
  • external UDFs
  • DB2 11 generic table UDFs
DB2 and Hadoop
Generic functions
• New feature of DB2 11
• Generic functions return a generic table
  • Schema is given at query time
• Can be implemented as an external UDF
  • Can be implemented in C
  • See header file SDSNC.H(DSNUDF) for the structures used for a generic table UDF
• APAR PI42256 provides samples
DB2 and Hadoop
BigInsights and Big SQL
• Big SQL – provides SQL access to data across BigInsights
  • Supports JDBC and ODBC
• Query engine for Hadoop data
  • A table is like a view on Hadoop data (files in HDFS, Hive, …)
  • Local query vs MapReduce
• Competition – Cloudera Impala, Hive (Stinger), Oracle Big Data SQL, Pivotal HAWQ (Greenplum), Teradata SQL-H, …
HADOOP AND DB2
Hadoop and DB2
Overview
• Sqoop – SQL to Hadoop
• Flume – collecting, aggregating, and moving large amounts of data
• UNLOAD, LOAD
• BigInsights – Data Import
• IBM and 3rd party
  • Veristorm vStorm
  • IBM System z Connector for Hadoop
Sqoop
Overview
• Originally "SQL to Hadoop"
• Transfers bulk data between Hadoop and an RDBMS
• Import data
  • incremental loads or free-form SQL queries
  • to text/binary files or to Hive/HBase
  • import from the mainframe
• Export data from Hadoop to an RDBMS
• Uses the MapReduce framework
  • Parallel operation
  • Fault tolerance
Sqoop
Overview, cont.
• Command line interface
  • sqoop COMMAND [generic and specific args]
• Implemented in Java
• Top-level Apache project since 2012
• Licensed under Apache License 2.0
• Sqoop stable version > 1.0
• Sqoop 2 – not intended for production
Sqoop Overview, cont. Import Incremental Import Export 31
Sqoop
Configuration
• Built-in support for several databases (DB2 not included)
• JDBC interface
  • Copy the JARs to $SQOOP_HOME/lib
  • DB2 JDBC driver
  • DB2 for z/OS license JAR
• Connectors – specific vs generic, 3rd party
Sqoop
Import from DB2
• Imports data in a table-centric fashion
  • Selecting data using parameters (--table, --columns, --where),
  • Or using free-form queries (--query)
• Data can be imported as
  • Text files (default)
    • Null values as "null"
  • Binary files
    • Sequence files
    • Avro
• Supports compression
• A Java class is generated to hold a row
Sqoop
Import from DB2, cont.
1. Examine the table
  • List of columns, types
2. Create a table-specific class to hold a row
  • Generated code
  • e.g. Widgets.java
3. MapReduce job launched
  • Divides the query across multiple nodes
  • Splitting column
    • Guesses a column (the primary key)
    • -m 1 avoids splitting
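The idea behind step 3 can be sketched as follows: take MIN/MAX of the split column and cut the range into near-equal intervals, one WHERE clause per map task. This is only an illustration of the concept; Sqoop's actual boundary arithmetic lives in its Java source, and EMPNO is borrowed from the later free-form query example:

```python
# Illustrative split of a query range across m map tasks, Sqoop-style.
def split_ranges(lo, hi, m):
    """Return m (start, end) half-open intervals covering [lo, hi]."""
    size = (hi - lo + 1) / m
    bounds = [lo + round(i * size) for i in range(m)] + [hi + 1]
    return list(zip(bounds, bounds[1:]))

# 4 mappers over split-column values 0..99:
for start, end in split_ranges(0, 99, 4):
    print(f"WHERE EMPNO >= {start} AND EMPNO < {end}")
```

Each mapper then runs the generated query with its own WHERE clause, which is why `-m 1` (a single mapper) needs no split column at all.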
Sqoop
Import from DB2, Considerations
• Consistency – multiple parallel jobs
  • Map tasks reading from the database run in parallel as separate jobs
  • Thus they do not share a single transaction
  • Disable updates during the import
  • Transaction isolation – read committed, or read uncommitted (--relaxed-isolation)
• Direct-mode imports – only for some systems
  • Must be explicitly enabled
Sqoop
Import from DB2, Considerations
• Text-based imports can be used in streaming or MapReduce as a TextInputFormat
• The class generated by Sqoop can be used to parse the data
• Importing to Hive – the Hive schema is based on the source schema
• Importing large objects
  • In separate files (lobfile), --inline-lob-limit
Sqoop
Import from DB2, Example

$ sqoop import --connect "jdbc:db2://server:port/LOCATION" --username user -P \
    --table DSN81010.DEPT -m 1 --target-dir /tmp/dept

$ hdfs dfs -ls /tmp
Found 2 items
drwx--x--x  - biadmin biadmin    0 2015-07-02 11:48 DSN81010.DEPT
drwx--x--x  - biadmin biadmin    0 2014-07-04 11:29 DSN81010.EMP

$ hdfs dfs -ls /tmp/dept/DSN81010.DEPT
Found 3 items
-rw-r--r--  1 biadmin biadmin    0 2015-07-02 11:48 /tmp/dept/DSN81010.DEPT/_SUCCESS
drwx--x--x  - biadmin biadmin    0 2015-07-02 11:47 /tmp/dept/DSN81010.DEPT/_logs
-rw-r--r--  1 biadmin biadmin  844 2015-07-02 11:48 /tmp/dept/DSN81010.DEPT/part-m-00000
Sqoop
Import from DB2, Example

$ hdfs dfs -cat /tmp/dept/DSN81010.DEPT/part-m-00000
A00,SPIFFY COMPUTER SERVICE DIV,null,A00,
B01,PLANNING,000020,A00,
C01,INFORMATION CENTER,000030,A00,
D01,DEVELOPMENT CENTER,null,A00,
D11,MANUFACTURING SYSTEMS,000060,D01,
D21,ADMINISTRATION SYSTEMS,000070,D01,
E01,SUPPORT SERVICES,000050,A00,
E11,OPERATIONS,000090,E01,
E21,SOFTWARE SUPPORT,000100,E01,
F22,BRANCH OFFICE F2,null,E01,
G22,BRANCH OFFICE G2,null,E01,
H22,BRANCH OFFICE H2,null,E01,
I22,BRANCH OFFICE I2,null,E01,
J22,BRANCH OFFICE J2,null,E01,
Sqoop
Import from DB2, Example
• Import with a free-form query
• $CONDITIONS is where Sqoop adds the split conditions

$ sqoop import --connect "jdbc:db2://server:port/LOCATION" --username user -P \
    --query 'SELECT * FROM DSN81010.EMP WHERE $CONDITIONS' --split-by EMPNO \
    --target-dir /tmp/emp

• Sqoop determines the splits
• An integral split column is preferred over a textual one
Sqoop
Incremental Imports
• Retrieve only rows newer than those already imported
• Two modes
  • append – based on a value in a column
  • lastmodified – based on a timestamp
    • Did not work for DB2 for z/OS (as of 1.4.3)
  • --check-column
  • --last-value
• At the end of an import the last value is printed
• A saved job allows recurring incremental imports
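Append mode boils down to a simple rule, sketched below: keep only rows whose check column exceeds the last imported value, and remember the new maximum for the next run. This is a toy restatement, not Sqoop's implementation; the sample EMP rows are illustrative:

```python
# Toy sketch of Sqoop's append-mode incremental import.
def incremental_import(rows, check_column, last_value):
    """Return rows newer than last_value, plus the new last value."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last

rows = [{"EMPNO": "000010", "HIREDATE": "1995-01-01"},
        {"EMPNO": "000020", "HIREDATE": "2003-10-30"},
        {"EMPNO": "000030", "HIREDATE": "2005-06-19"}]
imported, last = incremental_import(rows, "HIREDATE", "2001-01-01")
print(len(imported), last)  # -> 2 2005-06-19
```

The printed `last` value is what Sqoop reports at the end of the run (see the `--last-value 2005-06-19` line in the example that follows) and what a saved job carries forward.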
Sqoop
Incremental Imports, Example

$ sqoop import --connect "jdbc:db2://server:port/LOCATION" --username user -P \
    --table DSN81010.EMP --target-dir /tmp/emp --incremental append \
    --last-value '2001-01-01' --check-column HIREDATE \
    --driver com.ibm.db2.jcc.DB2Driver -m 1
...
15/07/13 11:40:09 INFO tool.ImportTool:  --last-value 2005-06-19

$ hdfs dfs -ls /tmp/emp3/
Found 4 items
drwx--x--x  - biadmin biadmin    0 2015-07-13 11:34 /tmp/emp3/_logs
-rw-r--r--  1 biadmin biadmin  559 2015-07-13 11:39 /tmp/emp3/part-m-00001
Sqoop
Import from mainframe
• sqoop import-mainframe
• Imports all members of a PDS to HDFS via FTP
• --dataset is used to specify the PDS (all members are imported)
• The target directory by default is /user/yourid/pds
  • Can be altered by --target-dir or --warehouse-dir
• Each record in a member is a text record terminated with a newline
  • Single text field (DEFAULT_COLUMN)
• Data can be imported to HDFS, Hive, HBase, and others

$ sqoop import-mainframe --connect server --dataset DEPT --username user \
    --password mypassword -m 8
Sqoop
Export
• Exports files from HDFS to a table
• The table must already exist
• Specify the directory to export
• Files are parsed according to the defined delimiters
• Three modes of operation
  1. INSERT mode
  2. UPDATE mode
  3. CALL mode – calls a stored procedure for each record
• Allows validation of the copied data
Sqoop
Export, cont.
• Works in similar steps as import:
1. Examine the metadata of the table
2. Generate a class to parse the text files
3. Insert into the table
  • MapReduce job – batched INSERT statements
  • Degree of parallelism
Sqoop
Export, Considerations
• INSERT mode is the default
  • The export fails if an INSERT fails – be aware of constraints
  • Multi-row INSERTs
• UPDATE mode is set by --update-key
  • Each record is treated as an UPDATE
  • No rows updated is not considered an error
  • --update-key does not need to be unique
  • Optional --update-mode with allowinsert – UPDATE or INSERT
• Export from
  • a Hive table
  • a sequence file
Sqoop
Export, Considerations
• Export and transactions
  • Multiple independent map tasks
  • Export is not an atomic operation due to parallelism
  • Commit scope is a few thousand rows (10000 = 100 statements * 100 rows)
• Optionally use --staging-table
  • Auxiliary table – staged data are moved to the target in a single transaction
  • Must exist and be empty, or specify --clear-staging-table
• Use codegen and build your own MapReduce job if you need something special
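The commit-scope arithmetic above can be made concrete with a rough sketch: rows are grouped into multi-row INSERTs of 100 rows each, and a commit is issued every 100 statements, i.e. every 10,000 rows (this models only the batching math, not Sqoop's actual JDBC code):

```python
# Rough model of the export commit scope: 100 rows per INSERT statement,
# a commit every 100 statements (= 10,000 rows), plus a final commit.
ROWS_PER_STATEMENT = 100
STATEMENTS_PER_COMMIT = 100

def export_batches(n_rows):
    """Yield (rows_in_statement, commit_now) per INSERT statement."""
    rows_per_commit = ROWS_PER_STATEMENT * STATEMENTS_PER_COMMIT
    for start in range(0, n_rows, ROWS_PER_STATEMENT):
        rows = min(ROWS_PER_STATEMENT, n_rows - start)
        done = start + rows
        yield rows, done % rows_per_commit == 0 or done == n_rows

stmts = list(export_batches(25000))
print(len(stmts), sum(1 for _, commit in stmts if commit))  # -> 250 3
```

Because each map task commits independently on this schedule, a failed export can leave partial data behind, which is exactly what the staging-table option guards against.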
Sqoop
Export, Example
• INSERT mode

$ sqoop export --connect "jdbc:db2://server:port/LOCATION" --username user -P \
    --table SCHEMA.DEPT --export-dir /user/use/DSN81010.DEPT

• UPDATE mode

$ sqoop export --connect "jdbc:db2://server:port/LOCATION" --username user -P \
    --table SCHEMA.DEPT --export-dir /user/DSN81010.DEPT --update-key DEPTNO

• Mixed UPDATE/INSERT mode
  • Did not work for DB2 for z/OS (as of Sqoop 1.4.3)
Sqoop
Other tools
• sqoop-job
  • Manages saved jobs
  • Allows re-executing the jobs
• sqoop-validation
  • Compares row counts
• sqoop-import-all-tables
• sqoop-list-tables
• …
BigInsights
Imports from DB2
• Data Import
  • JDBC based
  • Create a properties file
  • Copy the DB2 driver to the /biginsights/oozie/sharedLibraries/dbDrivers/ HDFS directory
• Big SQL LOAD
  • Loads tables or delimited files from external sources
  • Uses Sqoop for an RDBMS – LOAD HADOOP USING JDBC CONNECTION URL … WITH PARAMETERS
Flume
• Collecting, aggregating, and moving data
• Designed to work with log data
• Based on streaming data flows
• Allows complex flows
• Uses a transactional approach for reliability
• The flume-ng-sql-source plugin allows Flume to communicate with SQL databases
• Unlike Sqoop, Flume streams data: it keeps running and picks up new data as it becomes available
NEXT BIG THING… SPARK?
Spark • • Open source cluster computing framework Uses in-memory primitives instead of Map. Reduce Well suited for machine learning Requires distributed file system and cluster management • Builds on to of Hadoop technologies – HDFS, YARN, … • Spark. SQL • Spark. R • Stay tuned… 52
REFERENCES
References
• Integrating DB2 for z/OS and Hadoop: http://www.ibm.com/developerworks/library/ba-integrate-db2-biginsights/index.html
• Description of HDFS_READ and JAQL_SUBMIT: http://www-01.ibm.com/support/docview.wss?uid=swg27040438&aid=1
• https://hadoop.apache.org/
• http://sqoop.apache.org/
Emil Kotrc
CA Technologies
Emil.Kotrc@ca.com
Hadoop and DB2 11 – both sides of the story
Please fill out your session evaluation before leaving!