
Scalable data store and analytics platform for monitoring WLCG, a distributed data-intensive scientific infrastructure
Uthay Suthakar
Brunel University
eepguus@brunel.ac.uk

Topics
• Introduction to current architecture
• Proposed architecture
• Lambda architecture
• Review of technologies

Current architecture:
• Robust architecture.
• It does the job!
But
• Expensive.
• Does not scale well.
• Does not support real-time analytics.

Proposed architecture:
• Batch Layer: stores the constantly growing dataset.
• Serving Layer: stores the batch-processed views for interactive querying.
• Real-Time Processing Layer: performs analytics on fresh data.

Lambda Architecture
Three-layer architecture:
• Batch Layer – for batch processing on Big Data and producing queryable views.
• Serving Layer – for ad-hoc queries (ideally from views generated by the batch layer).
• Speed Layer – for real-time views based on incremental algorithms.
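How the three layers cooperate can be sketched in a few lines of Python. This is a minimal, single-process illustration and not the platform's implementation: the event shape, the site names and the function names `build_batch_view`, `update_speed_view` and `query` are all assumptions made for the example.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute a view from the whole immutable dataset."""
    return Counter(event["site"] for event in master_dataset)

def update_speed_view(speed_view, event):
    """Speed layer: incrementally fold one fresh event into a real-time view."""
    speed_view[event["site"]] += 1
    return speed_view

def query(batch_view, speed_view, site):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view.get(site, 0) + speed_view.get(site, 0)

# Hypothetical monitoring events already stored in the master dataset.
master_dataset = [{"site": "CERN"}, {"site": "RAL"}, {"site": "CERN"}]
batch_view = build_batch_view(master_dataset)

# Events that arrived after the last batch run.
speed_view = Counter()
for fresh in [{"site": "CERN"}, {"site": "FNAL"}]:
    update_speed_view(speed_view, fresh)

print(query(batch_view, speed_view, "CERN"))  # 3
```

The batch view is rebuilt from scratch on every run, while the speed view only covers the events the batch layer has not seen yet; a query merges the two.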

Batch Layer (i): Hadoop & MapReduce
• Programming model proposed by Google.
• Solves the complex issues of distributed processing (parallel computation, load balancing and fault tolerance).
• Two primitive parallel methods (Map and Reduce).
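The two primitives can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model only, not Hadoop's API: Hadoop would run many map and reduce tasks in parallel across a cluster. The function names and the sample records are assumptions for the example.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = map_reduce(["job done", "job failed", "Job done"])
print(counts["job"], counts["done"], counts["failed"])  # 3 2 1
```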

Batch Layer (ii): Stratosphere
• Stratosphere extends the well-known MapReduce model with new operators.
• All operators start working in memory.
• Supports Java and Scala.
• Scales horizontally.
• Integrates seamlessly into existing Hadoop clusters.
• Built-in optimizer.

Serving Layer (i): Apache Drill
• Inspired by Google’s Dremel.
• Drill provides a distributed execution engine for interactive queries.
• Low-latency ad-hoc queries to many different data sources.
• Goal is to scale to 10,000 servers and process petabytes of data within seconds.
• Supports multiple data models:
  - Schema: Protocol Buffers & Apache Avro
  - Schema-less: JSON, BSON, etc.

Serving Layer (ii): Cloudera Impala
• Massively Parallel Processing query engine.
• Low-latency SQL queries.
• Interactive analytics directly on data stored in Hadoop without data movement or predefined schemas.
• Shares workload management, metadata, ODBC driver, SQL syntax and user interface with Apache Hive.
• Supports SQL-92 features of Hive Query Language, including SELECT, joins and aggregate functions.

Serving Layer (iii): Presto (Facebook)
• Distributed SQL query engine optimized for ad-hoc analysis.
• Supports complex queries, aggregations, joins and window functions.
• Read-only.

Speed Layer (i): Storm
• Exposes a parallel real-time computation model.
• Highly scalable.
• Guarantees that every message will be processed.
• Transactional topologies.
• Stream processing.
• Continuous computation.
• Distributed RPC.
• Stream groupings.
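A stream grouping decides which task of a downstream component receives each tuple. The sketch below illustrates the idea behind a fields grouping, where tuples with the same value for a chosen field always reach the same task. It is a plain-Python illustration and not Storm's API (Storm topologies are normally written in Java or Clojure); the function names, the tuple layout and the choice of CRC32 as the hash are assumptions.

```python
import zlib

def fields_grouping(value, num_tasks):
    """Pick a task index from a stable hash of the grouping field's value."""
    return zlib.crc32(value.encode("utf-8")) % num_tasks

def route(tuples, field, num_tasks):
    """Distribute tuples over tasks so that equal field values share a task."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in tuples:
        tasks[fields_grouping(tup[field], num_tasks)].append(tup)
    return tasks

# Hypothetical monitoring tuples, grouped by the "site" field over 4 tasks.
stream = [
    {"site": "CERN", "status": "ok"},
    {"site": "RAL", "status": "fail"},
    {"site": "CERN", "status": "fail"},
]
tasks = route(stream, "site", num_tasks=4)
for index, assigned in enumerate(tasks):
    print(index, assigned)
```

Because every tuple for one site lands on the same task, that task can keep a per-site counter without coordinating with the others.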

Speed Layer (ii): Amazon Kinesis
• Streaming data as a managed service (cloud service).
• Metered pricing (charged per shard and per HTTP PUT transaction).
• The capacity of a stream is configured in shards (throughput capacity).
• Kinesis Client Library – responsible for load balancing, coordination and error handling.
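Since both capacity and cost are expressed in shards, sizing a stream comes down to simple arithmetic. The sketch below assumes per-shard limits of 1 MB/s and 1,000 records/s for writes and 2 MB/s for reads; these defaults are assumptions that should be checked against the current AWS documentation, and the function name and the example workload are hypothetical.

```python
import math

def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s,
                  shard_write_mb=1.0, shard_records=1000, shard_read_mb=2.0):
    """Return the smallest shard count that satisfies every throughput limit.

    The per-shard defaults are assumed values, not taken from the slides.
    """
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_s / shard_write_mb),
        math.ceil(records_per_s / shard_records),
        math.ceil(read_mb_per_s / shard_read_mb),
    )

# Hypothetical workload: 3.5 MB/s in, 2,500 records/s, 7 MB/s read by consumers.
print(shards_needed(3.5, 2500, 7.0))  # 4
```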

Speed Layer (iii): Samza
• Three layers: streaming layer, execution layer and processing layer.
• Samza is pluggable.
• Streams are partitioned and ordered sequentially.
• A stream is composed of immutable messages of a similar type (Kafka topics).
• State is co-located with each task.
• Checkpointing for failure recovery.
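The combination of ordered partitions, task-local state and checkpointing can be sketched as follows. This is a simplified plain-Python model and not Samza's API: the class `CountingTask`, its methods and the choice to store a state snapshot together with the offset in one checkpoint are all assumptions made for illustration.

```python
class CountingTask:
    """Consumes one ordered partition and keeps its state next to the task."""

    def __init__(self, partition, checkpoint=None):
        self.partition = partition  # immutable, ordered messages
        if checkpoint is None:
            checkpoint = {"offset": 0, "state": {}}
        self.offset = checkpoint["offset"]      # next message to read
        self.state = dict(checkpoint["state"])  # task-local counts

    def process(self, max_messages):
        """Process up to max_messages messages, in partition order."""
        end = min(self.offset + max_messages, len(self.partition))
        for message in self.partition[self.offset:end]:
            self.state[message] = self.state.get(message, 0) + 1
        self.offset = end

    def checkpoint(self):
        """Snapshot the position and state so a restarted task can resume."""
        return {"offset": self.offset, "state": dict(self.state)}

partition = ("ok", "fail", "ok", "ok", "fail")
task = CountingTask(partition)
task.process(3)
saved = task.checkpoint()
task.process(1)  # this progress is lost when the task crashes

# A replacement task resumes from the last checkpoint and replays the rest.
recovered = CountingTask(partition, saved)
recovered.process(10)
print(recovered.state)  # {'ok': 3, 'fail': 2}
```

Because the messages are immutable and ordered, replaying from the checkpointed offset reproduces the same result as an uninterrupted run.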

Speed Layer (iv): S4
• Distributed stream processing engine inspired by MapReduce.
• Combination of MapReduce and the Actors model.
• Provides a simple programming interface.
• Decentralized and symmetric architecture (managed by ZooKeeper).
• Pluggable architecture.
• Lossy failover is acceptable – processes are moved to a standby.
• Several Processing Elements (PEs) are available for standard tasks such as count, aggregate, join and so on.

Spark, Shark, Spark Streaming, etc. (i)
• In-memory distributed computing framework.
• Provides a general programming model (operators such as Map, Reduce, Join, Filter, GroupBy, Sort, LeftOuterJoin, RightOuterJoin, Count, Union, Cross, etc.).
• Low-latency computations by caching the working dataset in memory.
• Fault tolerance by lineage or checkpointing.
• Spark extends its engine for stream processing.
• Provides the same Spark APIs for processing streams.
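Fault tolerance by lineage means a lost in-memory result is rebuilt by replaying the recorded transformations on the source data, rather than by restoring a replica. The sketch below models that idea in plain Python. It is not the Spark API: the class `Dataset` and its methods, including `lose_cache`, are hypothetical and only mimic the behaviour on a single machine.

```python
class Dataset:
    """A lazily evaluated dataset that remembers how it was derived."""

    def __init__(self, source, lineage=()):
        self.source = source    # original input data
        self.lineage = lineage  # recorded (operator, function) steps
        self._cache = None      # in-memory copy of the computed result

    def map(self, fn):
        return Dataset(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return Dataset(self.source, self.lineage + (("filter", fn),))

    def collect(self):
        """Return the cached result, recomputing it from lineage if missing."""
        if self._cache is None:
            data = list(self.source)
            for op, fn in self.lineage:
                if op == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None

squares = Dataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = squares.collect()
squares.lose_cache()
second = squares.collect()  # rebuilt by replaying the lineage
print(first, second)  # [0, 4, 16] [0, 4, 16]
```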

Summary
• MapReduce to generate reports and answer historical queries.
• Interactive computation for ad-hoc queries.
• Stream processing for real-time analytics.
Separate technologies == Complex to manage and maintain.