Avito: Big Data Normalization for Massively Parallel Processing Databases

Avito: Big Data Normalization for Massively Parallel Processing Databases
Lars Rönnbäck, Nikolay Golov
© 2014 NIMBLE STORAGE | CONFIDENTIAL: DO NOT DISTRIBUTE

AVITO - Clear #1 in Russia

Avito Business Development
[Chart: weekly page views (millions), y-axis 0 to 2,400, January 2009 to January 2015, annotated with the milestones below and the product layers C2C, +B2C Goods, +RE & Cars, +Services, +Jobs, +Vertical, +Listing Fees, +Pro tools]
• Q1 2010: Focus on Moscow and St. Petersburg
• September 2010: Target 13 additional cities
• August 2011: Target total of 28 cities
• January 2012: Avito has national coverage
• Q2 2013: Merger with Slando and Olx reaffirmed #1 position in the Russian market
• Q2 2014: Launch of Domofond, a dedicated real estate classified
• Q4 2014: Launch of a new revenue stream: Listing Fees

Path from Investment Stage to Cash Flow Generation
Stage 1
• Position: Competing with others
• Economics: Heavy investment
• Focus: Build user base
Stage 2
• Position: Ahead of competition
• Economics: Approximately break-even
• Focus: Develop business model and build leading brand
Stage 3
• Position: x times ahead of competition
• Economics: High EBITDA margin
• Focus: Monetization enhancement; attract professional classifieds market spend

Source: Google Analytics, LiveInternet, Internal data

[Diagram: a five-node cluster (Node 01 to Node 05) shown together with Back office, Antifraud, Click stream, MDM, CRM, and BI Team]

Table types of Anchor Modeling
§ Anchor. Table for an entity.
§ Attribute. Table for a single attribute of an entity.
§ Tie. Table for a link between entities.
§ Knot. Table for a dictionary.
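
The four table types can be sketched as a tiny schema. The example below is only an illustration, not the schema from the talk: it uses Python's built-in `sqlite3` module rather than Vertica, and every table and column name (`anchor_user`, `attr_user_name`, `tie_user_item`, `knot_status`, and so on) is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Knot: dictionary of shared values (hypothetical item statuses)
CREATE TABLE knot_status (
    status_id INTEGER PRIMARY KEY,
    status    TEXT NOT NULL UNIQUE
);
-- Anchors: one row per entity, holding only the surrogate key
CREATE TABLE anchor_user (user_id INTEGER PRIMARY KEY);
CREATE TABLE anchor_item (item_id INTEGER PRIMARY KEY);
-- Attributes: one table per attribute of an entity
CREATE TABLE attr_user_name (
    user_id INTEGER PRIMARY KEY REFERENCES anchor_user(user_id),
    name    TEXT NOT NULL
);
CREATE TABLE attr_item_status (
    item_id   INTEGER PRIMARY KEY REFERENCES anchor_item(item_id),
    status_id INTEGER NOT NULL REFERENCES knot_status(status_id)
);
-- Tie: link between two entities
CREATE TABLE tie_user_item (
    user_id INTEGER REFERENCES anchor_user(user_id),
    item_id INTEGER REFERENCES anchor_item(item_id),
    PRIMARY KEY (user_id, item_id)
);

-- Made-up sample data
INSERT INTO knot_status VALUES (1, 'active'), (2, 'closed');
INSERT INTO anchor_user VALUES (10);
INSERT INTO anchor_item VALUES (100), (101);
INSERT INTO attr_user_name VALUES (10, 'alice');
INSERT INTO attr_item_status VALUES (100, 1), (101, 2);
INSERT INTO tie_user_item VALUES (10, 100), (10, 101);
""")

# A report has to join the pieces back together.
rows = conn.execute("""
    SELECT n.name, t.item_id, k.status
    FROM tie_user_item t
    JOIN attr_user_name n   ON n.user_id = t.user_id
    JOIN attr_item_status s ON s.item_id = t.item_id
    JOIN knot_status k      ON k.status_id = s.status_id
    ORDER BY t.item_id
""").fetchall()
print(rows)
```

Even this small report needs three joins, which is why join efficiency dominates query cost in such a model.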

Avito DWH evolution
[Figure: four charts for 2013, 2014, and 2015: cluster size (servers), cluster size (TB), integrated systems count, and ClickStream size (million events/day)]

Benefits and drawbacks of normalization in an MPP environment
Benefits:
▪ Semi-automatic addition of new entities, attributes, and links to the data model
▪ Universal approach to choosing data segmentation for tables
▪ Efficient logical compression of data
Drawback:
▪ The existing query optimizer in HP Vertica cannot produce efficient query execution plans for reports over a normalized data model.
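
The slides do not spell out the segmentation rule. One common choice, assumed here purely for illustration, is to hash-segment every attribute table on its anchor's surrogate key, so that all rows for one entity land on the same node and can be joined without moving data between nodes. The sketch below simulates that in plain Python; `node_for`, `segment`, the node count, and the sample data are all hypothetical.

```python
import zlib

N_NODES = 5  # hypothetical cluster size, for illustration only


def node_for(surrogate_key: int, n_nodes: int = N_NODES) -> int:
    """Map a surrogate key to a node using a stable hash."""
    return zlib.crc32(str(surrogate_key).encode()) % n_nodes


def segment(rows, n_nodes: int = N_NODES):
    """Distribute (key, value) rows across nodes by hashing the key."""
    nodes = [[] for _ in range(n_nodes)]
    for key, value in rows:
        nodes[node_for(key, n_nodes)].append((key, value))
    return nodes


# Two attribute tables of the same (made-up) user anchor
user_name = [(k, f"user{k}") for k in range(1, 21)]
user_city = [(k, "Moscow" if k % 2 else "St. Petersburg") for k in range(1, 21)]

names_by_node = segment(user_name)
cities_by_node = segment(user_city)

# Each node joins only its own slices; no row crosses node boundaries.
joined = []
for names, cities in zip(names_by_node, cities_by_node):
    local_cities = dict(cities)
    joined.extend((k, name, local_cities[k]) for k, name in names)

print(len(joined))
```

Because both tables use the same key and the same hash, every lookup in `local_cities` succeeds locally, which is the property that makes one rule applicable to every table in the model.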

Efficient query plans for highly normalized data in an MPP environment
▪ Prefer merge joins over hash joins wherever possible to minimize the risk of RAM depletion
▪ Use temporary tables to avoid repeatedly reading table data from disk
▪ Generate query plans automatically from Anchor Modeling metadata
▪ Reports over hundred-billion-row tables can be processed in minutes instead of hours
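
To see why merge joins reduce memory pressure, compare the two join strategies in a minimal Python sketch. This is a generic textbook illustration, not Vertica's implementation. It assumes both inputs are already sorted by key and that each key appears at most once per input, as it would for one attribute row per anchor key; the sample data is made up.

```python
def merge_join(left, right):
    """Join two (key, value) streams sorted by unique key.

    Only the current row of each input is held in memory.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield (l[0], l[1], r[1])
            l = next(left_it, None)
            r = next(right_it, None)
        elif l[0] < r[0]:
            l = next(left_it, None)  # left key has no match, advance left
        else:
            r = next(right_it, None)  # right key has no match, advance right


def hash_join(left, right):
    """Join by building an in-memory hash table over the whole right input."""
    build = dict(right)  # memory grows with the size of the right input
    for key, value in left:
        if key in build:
            yield (key, value, build[key])


# Hypothetical attribute tables, both sorted by surrogate key
names = [(1, "a"), (2, "b"), (4, "d")]
prices = [(2, 10), (3, 20), (4, 30)]

merged = list(merge_join(names, prices))
hashed = list(hash_join(names, prices))
print(merged)
```

Both functions return the same rows, but `hash_join` must hold one full input in RAM while `merge_join` streams both, which matters when the inputs have billions of rows.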