Conclusion: Move from Big Data Spaghetti [diagram labels: EDWs, Marts, Servers, Document Stores, Storage; layered view: Speed Layer, Optimised Layer, Shared Layer]

An Information Architecture for Hadoop Mark Samson – Systems Engineer, Cloudera

Background
• The trend is for organisations to build business-wide Hadoop implementations
• Enterprise Data Hub / Data Lake / Hadoop as a Service
• Many data sources
• Many lines of business
• Many use cases
• Many engines and tools available to process and analyse data
• Need to meet SLAs for data consumers
• How do I organise my information architecture within Hadoop to cope with this variety?
• Need a Logical Information Architecture for Hadoop!
© Cloudera, Inc. All rights reserved.

What are the requirements?
• Ingest data in its full fidelity, in as close to its original, raw form as possible
• Provide a data discovery and exploration facility for analysts and data scientists
• Bring together and link multiple data sets
• Serve data efficiently to business users and applications, meeting SLAs

Where does an Enterprise Data Hub fit?
Data consumers can be:
• Analysts
• Data Scientists
• Business Users (Reports)
• Applications
Data sources can be:
• Databases / DWs
• File Sources
• Machines, Sensors (IoT)
• Internet (Social Media etc.)
• Mobile
The Enterprise Data Hub sits in between! (But it's not the only thing in between.)

How does data arrive?
Data can arrive in any form, e.g.
• Event data
• Log files
• Streaming, e.g. via MQ, Kafka
• Relational tables with any data model
  • Star schema
  • 3NF
• Files with any format
  • Text
  • JSON
  • XML
  • Avro
  • …

Raw Layer
Principle: Ingest data raw, in full fidelity, as close as possible to the form in which it arrives
• Data organised in HDFS by data source, e.g. /landing/<source>
• Writeable by ingestion processes, e.g. Flume, Sqoop
• Readable by transformation processes, e.g. Hive, Pig, MR, Spark
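The "organised by data source" convention can be sketched as a small path helper. This is only an illustration: the /landing root comes from the slide, while the function name, the example source name and the per-day `ingest_date=` sub-directory are assumptions added here, not part of the talk.

```python
from datetime import date
from pathlib import PurePosixPath

# Root taken from the slide's /landing/<source> example.
LANDING_ROOT = PurePosixPath("/landing")


def raw_landing_path(source: str, ingest_date: date) -> PurePosixPath:
    """Build the Raw Layer directory for one source and one ingestion day.

    The ingest_date=YYYY-MM-DD sub-directory is an assumption for this
    sketch; the slide only specifies /landing/<source>.
    """
    if not source or "/" in source:
        raise ValueError("source must be a single, non-empty path component")
    return LANDING_ROOT / source / f"ingest_date={ingest_date.isoformat()}"


# Hypothetical source name, used only for illustration.
print(raw_landing_path("crm", date(2015, 6, 1)))
```

Keeping the source name as the first path component means that write access for an ingestion process and read access for a transformation process can both be granted per source directory.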

Discovery Layer
• Used for discovery and exploration by small teams of Analysts and Data Scientists
• Users or teams given their own "sandpits" (at a cost?)
• Mix of views and materialised data
• Some data sets "enriched", e.g. by joining reference data
• Tools: Impala, Solr, Spark
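As a rough illustration of "enriching" a data set by joining reference data, the sketch below performs a left join in plain Python. In practice this would more likely be a query in one of the tools listed on the slide; the function name, field names and sample records are all hypothetical.

```python
def enrich(events, reference, key):
    """Left-join each event with reference attributes looked up by `key`.

    Events with no matching reference entry are kept unchanged, and fields
    already present in the raw event take precedence over reference fields.
    """
    enriched = []
    for event in events:
        extra = reference.get(event.get(key), {})
        enriched.append({**extra, **event})
    return enriched


# Hypothetical raw events and reference data, for illustration only.
events = [
    {"customer_id": 1, "amount": 20},
    {"customer_id": 2, "amount": 5},
]
reference = {1: {"segment": "retail"}}

for row in enrich(events, reference, "customer_id"):
    print(row)
```

The raw events are left untouched. The enriched copy is the kind of materialised data set that would live in a team's sandpit.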

Shared Layer
• Available across LOBs (subject to security constraints)
• Incentives for Analyst / Data Science teams to move their data and use cases into this layer
• Data from multiple sources joined together
• Tools: Impala, Hive, Pig, Spark

Optimised Layer
• Build this when you need to operationalise the use case
• Organised by data consumer and use case, not by source
• Data modelled to provide optimised performance
  • Often denormalised
  • Uses optimised storage formats, e.g. Parquet with partitioning, HBase
  • Accessed by low-latency query engines, e.g. HBase, Impala, Solr
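To show why partitioning helps low-latency queries, here is a minimal in-memory sketch. It groups denormalised rows under Hive-style `key=value` partition names, so a query for one partition value touches only that group. No Parquet files are written, and the function names and sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def partition_rows(rows, partition_key):
    """Group rows under Hive-style 'key=value' partition names."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{partition_key}={row[partition_key]}"].append(row)
    return dict(partitions)


def read_partition(partitions, partition_key, value):
    """Return only the rows of one partition, skipping all the others."""
    return partitions.get(f"{partition_key}={value}", [])


# Hypothetical denormalised rows for a page-views use case.
rows = [
    {"day": "2015-06-01", "page": "/home", "views": 3},
    {"day": "2015-06-01", "page": "/about", "views": 1},
    {"day": "2015-06-02", "page": "/home", "views": 7},
]

partitions = partition_rows(rows, "day")
print(sorted(partitions))
print(read_partition(partitions, "day", "2015-06-02"))
```

The partition key should match the filter that the data consumer uses most often. That is one reason this layer is organised by use case rather than by source.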

What About Real Time?
Speed Layer: to operationalise use cases in real time:
• Low-latency components, e.g. Kafka, Flume, Spark Streaming
• Consume straight from sources
• Transform/analyse it
• Deliver it direct to the Optimised Layer for low-latency query
• Or deliver direct to consumer
• Generally still persist raw data in Raw Layer
• Follows the Lambda Architecture
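The Lambda Architecture idea can be sketched in a few lines. A batch view is computed from data already persisted in the Raw Layer, a speed view covers events that arrived since the last batch run, and a query combines the two. This is a simplified, in-memory illustration with hypothetical event data and function names. It does not show how Kafka, Flume or Spark Streaming would be wired together.

```python
from collections import Counter


def build_view(events):
    """Aggregate page-view counts; used for both the batch and speed views."""
    return Counter(event["page"] for event in events)


def query(batch_view, speed_view, page):
    """Answer a query by combining the batch view with the speed view."""
    return batch_view[page] + speed_view[page]


# Hypothetical events: the first list was processed by the last batch run,
# the second arrived afterwards and is only visible through the speed layer.
batch_events = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
recent_events = [{"page": "/home"}, {"page": "/pricing"}]

batch_view = build_view(batch_events)
speed_view = build_view(recent_events)

print(query(batch_view, speed_view, "/home"))
print(query(batch_view, speed_view, "/pricing"))
```

Because the raw events are still persisted, the next batch run can recompute the batch view from scratch, and the speed view for that period can then be discarded.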

This is a Complex, Multi-Tenant Architecture
Critical enablers:
• A broad and open ecosystem
• Security and Governance
  • Authentication
  • Authorisation
  • Auditing
  • Lineage and Metadata
  • Encryption
• Resource Management
• Chargeback
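One way to picture per-layer authorisation is as a simple access matrix. The Raw Layer entries below follow the Raw Layer slide (writeable by ingestion processes, readable by transformation processes). Every other role, layer permission and name is a hypothetical assumption for this sketch, which is not a description of any particular Hadoop security tool.

```python
# Access matrix: layer -> role -> allowed actions.
# Only the "raw" entries come from the slides; the rest are assumptions.
ACCESS = {
    "raw": {
        "ingestion": {"write"},
        "transformation": {"read"},
    },
    "discovery": {
        "transformation": {"write"},
        "analyst": {"read", "write"},
    },
    "shared": {
        "transformation": {"write"},
        "analyst": {"read"},
    },
    "optimised": {
        "transformation": {"write"},
        "application": {"read"},
    },
}


def is_allowed(role: str, layer: str, action: str) -> bool:
    """Return True if `role` may perform `action` on `layer`; deny by default."""
    return action in ACCESS.get(layer, {}).get(role, set())


print(is_allowed("ingestion", "raw", "write"))
print(is_allowed("analyst", "raw", "write"))
```

Denying by default keeps any unknown role or layer locked out until it is explicitly added to the matrix.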

Considerations
This is not prescriptive
• There could be more or fewer layers, depending on use cases
This is a logical architecture
• There may be multiple physical clusters due to non-functional requirements, e.g.
  • Compliance and security, e.g. some data can only be kept in the EU
  • If there are tight SLAs, some engines perform better on dedicated clusters, e.g. HBase, Kafka

Thank you