Intro to Talend Open Studio for Data Integration
- Slides: 25
Intro to Talend Open Studio for Data Integration Philip Yurchuk http: //philip. yurchuk. com
What is Talend? �Eclipse-based visual programming editor �Generates executable Java code �Jobs can run standalone or embedded (no special server) �Batch or interactive (user input)
What is ETL? �Extract: suck up data �Transform: mess with it �Load: blow it out �Batch, integration, migration, etc.
Extract from/load to where? �Over 600 components �Over 450 connectors �Allows multiple inputs/outputs in single job
Connectors � Flat files � Applications/Platforms �Delimted (tab, CSV…) �Alfresco �XML �Microsoft Dynamics �JSON �Excel �Positional �Apache HTTP logs, HL 7. . . (CRM, AX) �SAP �Sage ERP X 3 �Salesforce �Sugar. CRM
Connectors (continued) �Relational Databases �My. SQL �Postgresql �MS SQL �Oracle �Many more � No. SQL/Columnar/OLAP/ Other �Amazon Red. Shift �Greenplum �Hive �OLAP cubes �LDAP �Vector. Wise �Teradata �More in Big Data ed.
How do we transport data? � File system � FTP � SFTP/SCP � Web service (SOAP, REST) � HTTP � Mail, POP � XMLRPC, Sockets, JMS, RSS. . .
Other Components �Process data: join, filter, aggregate �Flow control: loops, job invocation �Logs, statistics �Code: Java, Groovy �On row data or standalone �Can load libraries
Demo
Nifty Components �Fuzzy. Match - calculate Levenshtein distance or phonetic similarity �Interval. Match – perform lookup/join based on values falling within an interval �Replace, Replace. List - search and replace, substitution �Uniq. Row - output distinct rows based on defined key columns
More Nifty Components �XMLMap - Allows joins, column or row filtering, transformations, and multiple outputs �Normalize/Denormalize - split delimited strings into columns or join columns into a string �Aggregate. Row – GROUP BY; min, max, sum, other functions used to aggregate rows on a column
Tips and Tricks �Camel. Case job names for embedded jobs. �Or prefix with ETL phase and order of execution �Whenever appropriate (esp. for inserting data), use the schema from the repository. �When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated.
Tips and Tricks �Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. �On the other hand, remember that for lookup/join (i. e. , SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
Tips and Tricks �Failure handling subjob: �It’s an unconnected job (no triggers point to it) �Use Log. Catcher to catch, record component failures. �Record failure in DB, file, email, etc. �Add rollback component to undo DB changes if necessary. May need to do this in the job if strategic placement is needed.
Tips and Tricks �In Java expressions, use methods, not operators. E. g. , concat(String) instead of the dot operator, equals(Object) instead of ==. �Technical components (like hash maps) are hidden by default. See: http: //www. talendforge. org/forum/viewtopic. p hp? pid=110860
Tips and Tricks �When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. �On the other hand, remember that for lookup/join (i. e. , SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
Tips and Tricks �Use a context for job variables. �Note you can specify type for variables. �You can read from a file or database, or pass in a context if an embedded Java job.
Tips and Tricks �For multi-host deployment: �Export the job with a “bootstrap” context that has all variables, but populates only a context config location that is the same for all machines. �The context config file has all values required for that host, e. g. test DB connection for test machine. �You can rely on the fact that Windows will interpret root as the main system drive, so “/Data/” will translate to C: Data �Be mindful of file permissions for sensitive context data (e. g. , DB password)
Tips and Tricks �Use “Bulk” output components when possible. �For transactional behavior: �Start the job with DB connection �Check “use existing connection” in all relevant components �Check "Die on error" in all relevant components �End job with commit component
Room for Improvement �UI stability �Documentation
Books �Getting Started with Talend Open Studio for Data Integration by Bowen Jonathan �Talend Open Studio Cookbook by Rick Daniel Barton �Big Data book coming…
Talend Forge �http: //www. talendforge. org/ �Forum – super helpful �Exchange – free community components! �Tutorials �Bug tracker �Source code
Talend Resources �http: //www. talend. com/resources �Help Center �Knowledge Base �Webinars, screencasts �Tutorials �Docs are on download page �And by pressing F 1 on a component
Questions? Compliments? Consulting gigs? �Contact me: �philip@yurchuk. com �http: //philip. yurchuk. com �http: //www. linkedin. com/in/philipyurchuk/
Thank You!
- Talend open data studio
- Talend integration suite
- Talend adalah
- Open innovation open science open to the world
- Xna game studio express
- Xna game studio visual studio 2017
- Data integration in data preprocessing
- Mashups meaning
- Data mashups and gis are data integration technologies.
- Data preprocessing
- Forward integration and backward integration
- Forward backward integration
- Simultaneous integration and sequential integration
- Open ispac file in visual studio
- Sas open source integration
- Fspos vägledning för kontinuitetshantering
- Novell typiska drag
- Nationell inriktning för artificiell intelligens
- Ekologiskt fotavtryck
- Varför kallas perioden 1918-1939 för mellankrigstiden
- En lathund för arbete med kontinuitetshantering
- Underlag för särskild löneskatt på pensionskostnader
- Tidböcker
- A gastrica
- Densitet vatten
- Datorkunskap för nybörjare