Intro to Talend Open Studio for Data Integration

  • Slides: 25
Download presentation
Intro to Talend Open Studio for Data Integration Philip Yurchuk http: //philip. yurchuk. com

Intro to Talend Open Studio for Data Integration Philip Yurchuk http: //philip. yurchuk. com

What is Talend? �Eclipse-based visual programming editor �Generates executable Java code �Jobs can run

What is Talend? �Eclipse-based visual programming editor �Generates executable Java code �Jobs can run standalone or embedded (no special server) �Batch or interactive (user input)

What is ETL? �Extract: suck up data �Transform: mess with it �Load: blow it

What is ETL? �Extract: suck up data �Transform: mess with it �Load: blow it out �Batch, integration, migration, etc.

Extract from/load to where? �Over 600 components �Over 450 connectors �Allows multiple inputs/outputs in

Extract from/load to where? �Over 600 components �Over 450 connectors �Allows multiple inputs/outputs in single job

Connectors � Flat files � Applications/Platforms �Delimted (tab, CSV…) �Alfresco �XML �Microsoft Dynamics �JSON

Connectors � Flat files � Applications/Platforms �Delimted (tab, CSV…) �Alfresco �XML �Microsoft Dynamics �JSON �Excel �Positional �Apache HTTP logs, HL 7. . . (CRM, AX) �SAP �Sage ERP X 3 �Salesforce �Sugar. CRM

Connectors (continued) �Relational Databases �My. SQL �Postgresql �MS SQL �Oracle �Many more � No.

Connectors (continued) �Relational Databases �My. SQL �Postgresql �MS SQL �Oracle �Many more � No. SQL/Columnar/OLAP/ Other �Amazon Red. Shift �Greenplum �Hive �OLAP cubes �LDAP �Vector. Wise �Teradata �More in Big Data ed.

How do we transport data? � File system � FTP � SFTP/SCP � Web

How do we transport data? � File system � FTP � SFTP/SCP � Web service (SOAP, REST) � HTTP � Mail, POP � XMLRPC, Sockets, JMS, RSS. . .

Other Components �Process data: join, filter, aggregate �Flow control: loops, job invocation �Logs, statistics

Other Components �Process data: join, filter, aggregate �Flow control: loops, job invocation �Logs, statistics �Code: Java, Groovy �On row data or standalone �Can load libraries

Demo

Demo

Nifty Components �Fuzzy. Match - calculate Levenshtein distance or phonetic similarity �Interval. Match –

Nifty Components �Fuzzy. Match - calculate Levenshtein distance or phonetic similarity �Interval. Match – perform lookup/join based on values falling within an interval �Replace, Replace. List - search and replace, substitution �Uniq. Row - output distinct rows based on defined key columns

More Nifty Components �XMLMap - Allows joins, column or row filtering, transformations, and multiple

More Nifty Components �XMLMap - Allows joins, column or row filtering, transformations, and multiple outputs �Normalize/Denormalize - split delimited strings into columns or join columns into a string �Aggregate. Row – GROUP BY; min, max, sum, other functions used to aggregate rows on a column

Tips and Tricks �Camel. Case job names for embedded jobs. �Or prefix with ETL

Tips and Tricks �Camel. Case job names for embedded jobs. �Or prefix with ETL phase and order of execution �Whenever appropriate (esp. for inserting data), use the schema from the repository. �When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated.

Tips and Tricks �Propagating changes to a DB component will change it to a

Tips and Tricks �Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. �On the other hand, remember that for lookup/join (i. e. , SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.

Tips and Tricks �Failure handling subjob: �It’s an unconnected job (no triggers point to

Tips and Tricks �Failure handling subjob: �It’s an unconnected job (no triggers point to it) �Use Log. Catcher to catch, record component failures. �Record failure in DB, file, email, etc. �Add rollback component to undo DB changes if necessary. May need to do this in the job if strategic placement is needed.

Tips and Tricks �In Java expressions, use methods, not operators. E. g. , concat(String)

Tips and Tricks �In Java expressions, use methods, not operators. E. g. , concat(String) instead of the dot operator, equals(Object) instead of ==. �Technical components (like hash maps) are hidden by default. See: http: //www. talendforge. org/forum/viewtopic. p hp? pid=110860

Tips and Tricks �When connecting, propagating changes to a DB component will change it

Tips and Tricks �When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. �On the other hand, remember that for lookup/join (i. e. , SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.

Tips and Tricks �Use a context for job variables. �Note you can specify type

Tips and Tricks �Use a context for job variables. �Note you can specify type for variables. �You can read from a file or database, or pass in a context if an embedded Java job.

Tips and Tricks �For multi-host deployment: �Export the job with a “bootstrap” context that

Tips and Tricks �For multi-host deployment: �Export the job with a “bootstrap” context that has all variables, but populates only a context config location that is the same for all machines. �The context config file has all values required for that host, e. g. test DB connection for test machine. �You can rely on the fact that Windows will interpret root as the main system drive, so “/Data/” will translate to C: Data �Be mindful of file permissions for sensitive context data (e. g. , DB password)

Tips and Tricks �Use “Bulk” output components when possible. �For transactional behavior: �Start the

Tips and Tricks �Use “Bulk” output components when possible. �For transactional behavior: �Start the job with DB connection �Check “use existing connection” in all relevant components �Check "Die on error" in all relevant components �End job with commit component

Room for Improvement �UI stability �Documentation

Room for Improvement �UI stability �Documentation

Books �Getting Started with Talend Open Studio for Data Integration by Bowen Jonathan �Talend

Books �Getting Started with Talend Open Studio for Data Integration by Bowen Jonathan �Talend Open Studio Cookbook by Rick Daniel Barton �Big Data book coming…

Talend Forge �http: //www. talendforge. org/ �Forum – super helpful �Exchange – free community

Talend Forge �http: //www. talendforge. org/ �Forum – super helpful �Exchange – free community components! �Tutorials �Bug tracker �Source code

Talend Resources �http: //www. talend. com/resources �Help Center �Knowledge Base �Webinars, screencasts �Tutorials �Docs

Talend Resources �http: //www. talend. com/resources �Help Center �Knowledge Base �Webinars, screencasts �Tutorials �Docs are on download page �And by pressing F 1 on a component

Questions? Compliments? Consulting gigs? �Contact me: �philip@yurchuk. com �http: //philip. yurchuk. com �http: //www.

Questions? Compliments? Consulting gigs? �Contact me: �philip@yurchuk. com �http: //philip. yurchuk. com �http: //www. linkedin. com/in/philipyurchuk/

Thank You!

Thank You!