Hadoop or Hadon’t
Saqib Mustafa
Webinar Series 2016

How Hadoop is used
• Friction-free data loading repository
• Easy loading of data in HDFS
• Easy scaling up
• Scaling down can be an issue
• Scalable engine for data transformation
• Used to prep data for more expensive data warehouse systems
• Exploratory analysis
• Short-term projects
• Store historical data
Hadoop is a 10-year-old open source project that was originally created to solve search-engine woes.

Challenges of Hadoop
• Struggles with structured data from enterprise applications
- Built for semi-structured data
- Inefficient for processing structured data
- Joins of multiple data sets can be slow
• Complexity of use
- Needs Java programmers to write MapReduce jobs
- IT dependencies
• Lack of access control
- Difficult to secure: security is hard to provide throughout the environment
• Fast analytics and concurrency at scale
- Hadoop does not support workload management
- Serializes workload

Today’s realities
Diagram labels: Enterprise apps, 3rd-party, IoT, Web, Data warehouse(s), Datamarts, Hadoop & NoSQL
Data challenges
• Silos of diverse data from diverse sources, growing rapidly
Costly, complex infrastructure
• Significant resources consumed to build and maintain data platforms
• Data cannot be combined in one system and queried efficiently
Barriers to insight
• Analytics hindered by slow, incomplete access to data
• Data pipeline itself becomes a roadblock

Limitations of current solutions
NoSQL platforms like Hadoop
→ Complex: new skills & tools required, like Java/MapReduce
→ Slow: poor performance on analytics at scale
→ Incomplete: patchwork of tools, incomplete SQL
→ Security: open access to environment, lack of enterprise controls
Legacy data warehousing
→ Costly: upfront capital costs, overprovisioning, …
→ Complex: partitioning, indexes, replication, …
→ Inflexible: forklift migrations, fixed schema, …

Key unaddressed use cases
• Integrated data analytics: combine structured + semi-structured data for reporting & analytics
• Exploratory & ad hoc analytics: easy access to data for SQL analysts to explore data, identify correlations, build & test models
• Datamart & data silo consolidation: consolidate legacy datamarts to eliminate silos and serve data quicker

Our vision: Reinvent the data warehouse
Cloud: elasticity & agility
Data warehousing: performance & enterprise capabilities
• Using SQL
• Connectivity to existing tools in the ecosystem
Big data: flexibility & scalability
• Can accommodate semi-structured data
• JSON, Avro, XML

What we built: The Snowflake Elastic Data Warehouse
• All-new SQL data warehouse: no legacy code or constraints
• Designed for the cloud: running in Amazon Web Services
• Delivered as a service: no infrastructure, knobs or tuning to manage
• All your data: deploy structured and semi-structured data in one place

Traditional Big Data Pipeline to Analyst
Diagram labels: IoT, Web, Preprocessing, Hadoop & NoSQL, CSV file, Data warehouse(s), Datamarts
How JSON data is adopted
Tweet sample (JSON file):
{name = “Saqib”}
{city = “Madison”}
{college = “Wisconsin”}
CSV file:
Name | City | College
Saqib | Madison | Wisconsin
Customer table:
Cust ID | Name | City | College | Time stamp
0001 | Saqib | Madison | Wisconsin | XX:YY:ZZ
Disadvantages
• Involves extra steps from JSON to CSV to DB table
• Any changes to the model need changes to the whole environment

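The JSON-to-CSV step in the traditional pipeline can be sketched in a few lines of Python. This is only an illustration, not the presenter's tooling: the function name `json_to_csv`, the `CSV_COLUMNS` list and the extra `lang` field are all hypothetical, and the tweet sample is rewritten as valid JSON.

```python
import csv
import io
import json

# Hypothetical fixed column list, mirroring the CSV header on the slide.
CSV_COLUMNS = ["name", "city", "college"]


def json_to_csv(json_lines, columns=CSV_COLUMNS):
    """Flatten one JSON object per input line into CSV text with a fixed header."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in json_lines:
        # Keys that are not in the column list are dropped by extrasaction="ignore".
        writer.writerow(json.loads(line))
    return out.getvalue()


tweets = ['{"name": "Saqib", "city": "Madison", "college": "Wisconsin"}']
print(json_to_csv(tweets))

# The source starts sending a new field (hypothetical "lang").
tweets_v2 = [
    '{"name": "Saqib", "city": "Madison", "college": "Wisconsin", "lang": "en"}'
]
# The CSV is unchanged: the new field is lost until the column list, the
# load job and the target table are all updated.
print(json_to_csv(tweets_v2))
```

The second call shows the disadvantage listed on the slide: a change in the source data does not reach the analyst until every stage of the pipeline has been changed to match.
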
Big Data Pipeline to Snowflake
Diagram labels: IoT, Web, JSON file, Customer table
Snowflake automatically ingests, columnarizes and optimizes the data
Tweet sample (JSON file):
{name = “Saqib”}
{city = “Madison”}
{college = “Wisconsin”}
How it is stored (Customer table):
Cust ID | Time stamp | Tweet_text (type VARIANT)
0001 | XX:YY:ZZ | {name = “Saqib”} {city = “Madison”} {college = “Wisconsin”}
Query:
Select Cust_id, Tweet_text.name, Tweet_text.city, Tweet_text.college From Customer
Result:
Cust ID | Name | City | College
0001 | Saqib | Madison | Wisconsin
You can create joins on the variant type too
Advantages
• Direct ingestion into table
• No changes to schema for any change in the source data

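The idea of querying fields straight out of a stored JSON document can be mimicked in plain Python. This sketch does not show how Snowflake implements VARIANT (the slide says the data is columnarized and optimized on ingest). It only illustrates the query-time behaviour, and the `customer` list, the `select` helper and the second row are hypothetical.

```python
import json

# Hypothetical in-memory stand-in for the Customer table: the tweet is stored
# as one raw JSON document per row instead of one column per field.
customer = [
    {
        "cust_id": "0001",
        "time_stamp": "XX:YY:ZZ",
        "tweet_text": '{"name": "Saqib", "city": "Madison", "college": "Wisconsin"}',
    },
]


def select(rows, fields):
    """Return (cust_id, field values...) per row; a missing field becomes None."""
    result = []
    for row in rows:
        doc = json.loads(row["tweet_text"])
        result.append((row["cust_id"],) + tuple(doc.get(f) for f in fields))
    return result


print(select(customer, ["name", "city", "college"]))

# A later tweet carries an extra field (hypothetical "lang"). The table
# definition does not change; the field is simply available to queries.
customer.append(
    {
        "cust_id": "0002",
        "time_stamp": "XX:YY:ZZ",
        "tweet_text": '{"name": "Ana", "city": "Austin", "lang": "en"}',
    }
)
print(select(customer, ["name", "lang"]))
```

Rows that lack a requested field return `None` rather than failing, which is the behaviour this sketch assumes for schema-on-read queries.
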
A new architecture: Multi-cluster, shared data
Hadoop / NoSQL architectures / some data warehouses
• Shared-nothing
• Decentralized, local storage
• Single cluster
Snowflake
• Multi-cluster, shared data
• Centralized, scale-out storage
• Multiple, independent compute clusters (for example Analysts, Test/Dev, Sales)

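One practical consequence of the two layouts can be shown with a toy model. The sketch below assumes the shared-nothing system places rows by `key % node_count`, which is an assumption made for illustration (real systems may use other schemes, such as consistent hashing, that move fewer rows). Both function names are hypothetical.

```python
def rows_moved_shared_nothing(keys, old_nodes, new_nodes):
    """Count rows whose owning node changes when the cluster is resized.

    Assumes each row lives on node (key % node_count), so local storage
    has to be redistributed whenever node_count changes.
    """
    return sum(1 for k in keys if k % old_nodes != k % new_nodes)


def rows_moved_shared_data(keys, old_clusters, new_clusters):
    """Compute clusters own no data in this model, so resizing moves nothing."""
    return 0


keys = range(12)
print(rows_moved_shared_nothing(keys, 3, 4))  # 9 of the 12 rows change nodes
print(rows_moved_shared_data(keys, 3, 4))     # 0
```

In this model, adding a compute cluster for a new team leaves the stored data untouched, which matches the centralized storage and independent compute clusters described on the slide.
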
Enabling Strong Concurrency to allow Analytics at Scale
Enabling concurrency through scaling and warehouse/storage separation
• Single service (management, security, optimization, metadata): scalable, resilient cloud services layer coordinates access & management
• Elastically scalable compute: multiple “virtual warehouse” compute clusters scale horsepower & concurrency
• Centralized storage (database storage): instant, automatic scalability & elasticity

No infrastructure, knobs, or tuning
• Infrastructure management: hardware, software, availability, resiliency, disaster recovery managed by Snowflake
• Data storage management: adaptive data distribution, automatic compression, automatic optimization
• Metadata management: automatic statistics collection, scaling, and redundancy
• Manual query optimization: dynamic optimization, parallelization, and concurrency management

Fits with your Ecosystem
Diagram labels: Diverse data sources (big or traditional), Data management & transformation, Reporting & analytics, Custom, Scripting, Java

Protected by industrial-strength security
Authentication
• Embedded multi-factor authentication
• Federated authentication available
Access control
• Role-based access control model
• Granular privileges on all objects & actions
Data encryption
• All data encrypted, always, end-to-end
• Encryption keys managed automatically
External validation
• Certified against enterprise-class requirements (e.g. SOC 2 Type II, HIPAA)

Ad-tech analytics (JSON processing)
Scenario
• Analyze and monetize large data set of website traffic
• Growth through website acquisition
Environment:
- 100 N Hadoop environment
- 36 N Data warehouse
Pain Points
- Large data volumes of traditional and JSON data requiring exploration and analysis
- Separate data warehouse and Hadoop environments
- Varying formats of JSON from different websites: Ask.com, about.com, investopedia.com, okcupid.com, dictionary.com
- No single version of truth
- Unpredictable performance on both data warehouse and Hadoop
- No Dev/Test environment

Enabling Ad-tech Analytics
Solution
• Use Snowflake to load all data into one warehouse
• Load JSON and traditional data from different sources into Snowflake for data analysts to directly explore, build, test, and deploy new algorithms
• Tableau with native connection to Snowflake for analytics
• Use Snowflake’s cloning feature to provide an up-to-date Dev/Test environment
“Because of [Snowflake], business intelligence is moving from a cost center to a value center”
Keith Lavery, Sr. Director, BI, Data and Analytics

Serving diverse customers
“Snowflake is faster, more flexible, and more scalable than the alternatives on the market. The fact that we don’t need to do any configuration or tuning is great because we can focus on analyzing data instead of on managing and tuning a data warehouse.”
Craig Lancaster, CTO

Recap
Snowflake is…
• An all-new data warehouse
• Designed for the cloud
• Built to combine structured and semi-structured data in an optimized manner
Snowflake delivers...
• One place for diverse data
• Easier, faster analytics
• Elastic scaling for any scale of data, workload, & concurrency
• Without the cost and complexity of alternatives

Sql> select questions from Audience
Historical Results (also a Snowflake Feature)
SQL:
Select Name, Email, website, twitter, json_example From presenter Where question = “Not addressed”
Result:
Saqib Mustafa, Saqib.mustafa@snowflake.net, www.snowflake.net, @snowflakedb & @drkoalz, THANK YOU
CONTACT US

Enabling Machine Learning analysis
Scenario
• Digital advertising click fraud detection through analytics
Pain Points
• Large data volumes of JSON data requiring exploration and analysis
Solution
• Load JSON ad impression and click data into Snowflake for data analysts to directly explore, build, test, and deploy new algorithms
• Able to speed up algorithms from 24 hours to 2 hours
“Snowflake enables us to unlock large datasets to make it possible for business analysts, developers and account managers to ask their own questions directly of the data.”
Tamer Hassan, Co-Founder and CTO

www.synerzip.com
Ranjani Shah
ranjani.shah@synerzip.com
469.374.0500

Synerzip in a Nutshell
• Software product development partner for small/mid-sized technology companies
- Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase
- By definition, all Synerzip work is the IP of its respective clients
- Deep experience in full SDLC: design, dev, QA/testing, deployment
• Dedicated team of high-caliber software professionals for each client
- Seamlessly extends client’s local team, offering full transparency
- Stable teams with very low turnover
- NOT just “staff augmentation”, but provides full management support
• Actually reduces risk of development/delivery
- Experienced team that uses an appropriate level of engineering discipline
- Practices Agile development: responsive yet disciplined
• Reduces cost: dual-site team, 50% cost advantage
• Offers long-term flexibility: allows (facilitates) taking offshore team captive, aka “BOT” option

Synerzip Clients

Next Webinar
To Estimate or #NoEstimates, That is the Question
Thursday, June 23, 2016 @ Noon CST
Presented by: Todd Little, Vice President of Product Development for IHS, a leading global provider of information, analytics, and expertise.

Connect with Synerzip
@Synerzip
linkedin.com/company/synerzip
facebook.com/Synerzip
Ranjani Shah
ranjani.shah@synerzip.com
469.374.0500