Setting the Trend: A Unique Methodology for Building Data Lakes
Rob Nocera | Partner, NEOS
Introduction
Rob Nocera, Founding Partner, NEOS LLC
• Business management consulting and technology consulting
• Based in Hartford, CT; founded in 2000
• In IT for 20+ years
• Databases and big data since 1994
• Java since 1995
• Enterprise architecture since 2005
RNOCERA@NEOSLLC.COM
Overview
• Benefits of a Data Lake
• Two-Speed Architecture
• The Five Zones
  • Raw Zone
  • Structured Zone
  • Curated Zone
  • Analytics Zone
  • Consumer Zone
• Two-Speed Data Lake
• Q&A
Benefits of a Data Lake
SEARCH, SCALE, & USE
✓ Quick and easy to ingest new data
✓ All enterprise data available in one place
✓ Data is quickly made available to analysts and data scientists
✓ Unstructured data is easily captured and stored
✓ Streaming data is captured in real time
✓ Cheaper storage than traditional warehouses
✓ Zoning provides the benefits of a bimodal architecture to a data lake
Two-Speed Architecture
A bimodal architecture supports traditional back-end systems alongside agile, quick development and support of analytics.

SLOW AND STEADY
• 100% of functionality needed
• Very robust
• Standards necessary
• Support available
• Slow to change
• Supports existing systems

NIMBLE AND QUICK
• Functionality increases incrementally
• Each user programs as they see fit
• Support for underlying platform only
• Can evolve quickly
• Provides insight into data

Can a bimodal architecture help in designing a data lake? If so, how?
Two-Speed Data Lake
Feed analytics quickly and painlessly while simultaneously supporting back-end systems and their requirements. While it is ideal to provide new data for analytics as soon as possible, validated, quality data matters more to the back-end systems than speed.
(Diagram: the Data Lake feeds both Analytics and Back-End Systems.)
Data Flow
(Diagram: raw data in multiple forms enters the Raw Zone and flows through the Structured Zone and Curated Zone to the Consumer Zone, which feeds downstream systems and applications; the Analytics Zone draws from the other zones.)
Raw Zone
WHAT
• Raw data is stored in the format in which it's received
• Many different forms of input:
  • Flat files
  • Sqoop data from relational data stores
  • Messages and streaming
WHY
• Storing original data allows for always validating against the original source
• Allows re-running of historical data with new processes or process improvements
• New sources can be ingested into the lake quickly

A small amount of processing is done to make data accessible from the raw schema, when appropriate.
Raw Zone
New sources are ingested quickly and in their original formats so that history is preserved as it was. Note: sources should not be loaded into the data lake without proper documentation in a source registry.
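As a sketch of how a landing directory might be derived for each ingestion run, partitioned by process timestamp so history is preserved exactly as received (the helper name and base path are illustrative, modeled on the Sqoop target directory shown later in the deck, not actual NEOS tooling):

```python
from datetime import datetime

def raw_landing_path(source, base="/apps/demo/raw/certified", ts=None):
    """Build a raw-zone directory for one ingestion run.

    Partitioning by process timestamp keeps every run's data intact,
    so historical loads can be replayed against new processes later.
    """
    ts = ts or datetime.now()
    return f"{base}/{source}/process_timestamp={ts.strftime('%Y%m%d_%H%M%S')}"

path = raw_landing_path("positions", ts=datetime(2017, 11, 30, 16, 20, 54))
# path == "/apps/demo/raw/certified/positions/process_timestamp=20171130_162054"
```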
The Tools – Apache NiFi
Apache NiFi was originally created by the NSA to automate the flow of data between systems. In a data lake it is used for the ingestion of data.
The Tools – Apache NiFi
Process groups consist of processors that are configured and strung together to form flows. Creating the flows is a matter of pointing and clicking.
The Tools – Sqoop
Sqoop can be used to import data from an RDBMS, just about anything with a JDBC connection. (With a free-form --query and parallel mappers, Sqoop requires the literal token $CONDITIONS in the WHERE clause.)

sqoop import \
  --connect "jdbc:oracle:thin:@(description=(address_list=(address=(protocol=tcps)(port=[PORT])(host=[HOSTNAME])))(connect_data=(SERVICE_NAME=[DB_SERVICE_NAME])))" \
  --username [USERNAME] --password [PASSWORD] \
  --query "select * from POSITIONS where LAST_CHANGE_TS >= [LAST_INGESTION_TS] and \$CONDITIONS" \
  -m 6 \
  --target-dir /apps/demo/raw/certified/[SOURCE]/process_timestamp=20171130_162054 \
  --split-by POSITIONS_ID \
  --temporary-rootdir /apps/demo/tmp/[SOURCEDIR] \
  --append
The Tools – Hive
Apache Hive provides a SQL-like interface to databases and files stored in Hadoop.

Managed Hive table:
CREATE TABLE price (id INT, security STRING, price DECIMAL(9,2))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/hive/data/price';

External Hive table:
CREATE EXTERNAL TABLE price (id INT, security STRING, price DECIMAL(9,2))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/app/demo/raw/certified/price';

Source file:
Id,Security,Price
1,AMZN,1149.71
2,TSLA,305.78

Resulting table:
Id | Security | Price
1  | AMZN     | 1149.71
2  | TSLA     | 305.78
The Tools – Oozie
Oozie is a Hadoop workflow manager. It is used to schedule and run jobs or sequences of jobs in the Hadoop environment. It consists of the Oozie Workflow and the Oozie Coordinator.

Simple fork-and-join Oozie example:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="concat-workflow">
  <credentials>
    <credential name="HiveCreds" type="hive2"> ... </credential>
  </credentials>
  <start to="wave1" />
  <fork name="wave1">
    <path start="job1"/>
    <path start="job2"/>
    ...
  </fork>
  <action name="job1" cred="HiveCreds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <jdbc-url>${jdbc_url}</jdbc-url>
      <script>${hqlPath}/backup_and_concat_hive_table.hql</script>
      <param>TABLE_NAME=demo</param>
      <param>SRC_SCHEMA=${typedDB}</param>
    </hive2>
    <ok to="wave1Join"/>
    <error to="kill"/>
  </action>
  <join name="wave1Join" to="wave2"/>
  <fork name="wave2">
    <path start="job4"/>
    <path start="job5"/>
    ...
  </fork>
  ...
Structured Zone
WHAT
• 1st stage of transformation
• Data stored in typed table structures for easy querying
• Some sources require complicated transformation
HOW
• Structured schema created, typically in Hive
• Provide metadata for table structures:
  • For XML, XSDs can be used
  • For RDBMS data, the original table structure
  • For others, custom metadata files may be needed
• Transformation from raw to typed should be automated, even for new sources when possible
WHY
• Easy access for SQL-like queries
• Additional transformations are easier
• Other transformations are not necessary for the analytics layer
Structured Zone This is the first destination where data research can be performed to begin making sense of the data in the Lake. Going forward, any new sources that are added can be quickly put into a form that is useful for analytics without extensive IT work.
Structured Zone
The benefits are greatest when ingestion and transformation to the structured zone are as automated as possible. For structured and semi-structured data, providing a proper metadata file with any new source should allow an automated process to be developed. An automated process allows new sources to be ingested and made available for querying without extensive IT resources.
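One way to picture that automation: given a small metadata description of a new source, the typed Hive DDL can be generated rather than hand-written. This is only a sketch of the idea under assumed inputs (the metadata shape, function name, and paths are hypothetical, not the actual NEOS tooling):

```python
def generate_typed_ddl(table, columns, location):
    """Generate an external Hive table DDL from source metadata.

    `columns` is a list of (name, hive_type) pairs taken from the
    source's metadata (an XSD, the RDBMS catalog, or a custom file).
    """
    cols = ",\n  ".join(f"{name} {htype}" for name, htype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}';"
    )

ddl = generate_typed_ddl(
    "price",
    [("id", "INT"), ("security", "STRING"), ("price", "DECIMAL(9,2)")],
    "/apps/demo/structured/price",
)
print(ddl)
```

Because the generator is driven entirely by the metadata file, adding a new source becomes a registration step rather than a development project.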
Structured Zone
Unstructured data can be handled similarly: it is stored as raw data, with some analysis done to get it into a structured format.
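As an illustration of what that analysis step might look like for unstructured input, a minimal sketch that pulls typed fields out of free-form log lines (the line format and field names are invented for the example):

```python
import re

# Hypothetical raw line: an ISO date, a level word, then free text.
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\w+) (.*)$")

def structure(line):
    """Extract (date, level, message) so the record can land in a
    typed table; returns None for lines that don't fit the pattern."""
    m = LINE.match(line)
    return m.groups() if m else None

row = structure("2017-11-30 WARN price feed delayed")
# row == ("2017-11-30", "WARN", "price feed delayed")
```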
Curated Zone
WHAT
• Heavily transformed data available for data analytics and consumers
• Transformations contain any business logic necessary in the lake
• Can feed a data warehouse, or can be the data warehouse
• Only contains data that will be used by consumers
WHY
• Provides validated data for analytics and consumers
• Data is consolidated and validated for consumers
Curated Zone
Data has been quality-checked and combined with like sources, and is transformed into a common model. The curated zone forms the base for the consumer zone.
Curated Zone
In this implementation, all transformation was done in HiveQL, with the source and destination targets both being Hive.
• Oozie jobs are kicked off by either a timer or the arrival of a file
• A total of about 30 HiveQL scripts, some very complex
• 3 main sets of jobs to split the load, based on what is needed when
• Each "wave" runs as much in parallel as is feasible
• The curated zone is the data warehouse
• RI (referential integrity) checks are done after the curation runs
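The "wave" pattern, fork a set of independent jobs, join, then start the next wave, is the same control flow the Oozie workflow earlier in the deck encodes. A sketch of that flow (the job callables here are stand-ins for the HiveQL scripts, not real pipeline code):

```python
from concurrent.futures import ThreadPoolExecutor

def run_waves(waves):
    """Run each wave's jobs in parallel; a wave starts only after all
    jobs in the previous wave have finished (fork/join semantics)."""
    results = []
    for wave in waves:
        with ThreadPoolExecutor() as pool:  # exiting the block joins the wave
            results.extend(pool.map(lambda job: job(), wave))
    return results

out = run_waves([
    [lambda: "job1", lambda: "job2"],  # wave 1: independent scripts
    [lambda: "job4", lambda: "job5"],  # wave 2: depends on wave 1
])
# out == ["job1", "job2", "job4", "job5"]
```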
Consumer Zone
WHAT
• Data provided in structured tables
• Flattened structures
• Can be used for data marts
WHY
• Easy access point
• All of a consumer's needs in a small number of structures
• Provides access for downstream feeds
Consumer Zone The consumer zone provides data to consumers in a friendly format, making it easily accessible for business applications and reporting.
Consumer Zone
The consumer zone contains the data cubes as well as other tables for feeding downstream systems.
• Oozie jobs are kicked off by a timer
• More HiveQL scripts, but not very complex
• Scripts validate the data and alert support if there are problems
Alternative consumer "mart" technologies: Oracle databases, NoSQL
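A minimal sketch of the kind of validation check that might run after a consumer-zone load (the row-count threshold and alert hook are illustrative; in the real pipeline the count would come from a HiveQL query and the alert would notify support):

```python
def validate_load(row_count, expected_min, alert=print):
    """Flag a consumer-zone load whose row count falls below a floor.

    Returns True when the load looks healthy, False after alerting.
    """
    if row_count < expected_min:
        alert(f"ALERT: load produced {row_count} rows, expected >= {expected_min}")
        return False
    return True

ok = validate_load(120, expected_min=100)   # healthy load, no alert
bad = validate_load(3, expected_min=100)    # triggers the alert
```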
Analytics Zone
WHAT
• Dedicated area for data analysts and data scientists
• Users can add their own datasets
• Users can select from any of the zones (typically not the Consumer Zone)
WHY
• Data scientists can mine all existing data in one place
• Existing data is combined with new data sets without compromising the source data
• The sandbox is used to develop analytics, which are later operationalized in the Curated Zone
• No need to involve IT until something needs to be operationalized
Analytics Zone
Data scientists can access all data in the lake and combine it with data from outside the lake, without expending time and effort to set up an entire project.
Analytics Zone
Data scientists have access to data in all the zones.
• Currently jobs are manual
• Moving toward self-service
• There are some tools available to help
• Data dictionaries are key for certified data
• Users can import their own files
Two-Speed Data Lake Quick ingestion and population of the Raw and Structured Zones provide quick access to data from the Analytics Zone. The more rigid business rules and transformations populate the Curated and Consumer Zones consistently, but change is slower in these areas.
Questions?
For additional information on data lakes and client stories, please email me at RNOCERA@NEOSLLC.COM or visit WWW.NEOSLLC.COM/DATALAKE