BIG DATA TECHNOLOGIES
LECTURE 4: SCALABILITY: ALGORITHM + DATA + HARDWARE
Assoc. Prof. Marc FRÎNCU, Ph.D. Habil.
marc.frincu@e-uvt.ro

SCALABILITY
- Ability of a system to manage an increasing volume of work
- Capacity of a system to grow to process larger data
- Ideally, doubling the processing power doubles the volume that can be processed
- λ: slope of the scaling line

SCALABILITY
- Horizontal (in/out)
  - Adding processing nodes to existing ones
  - Commodity clusters
    - Group of machines networked through Gigabit Ethernet, InfiniBand, Myrinet, …
  - Requires data replication and synchronization mechanisms
- Vertical (up/down)
  - Adding more resources to existing nodes
  - Virtualization
    - Adding more cores, RAM, disk, etc. to a VM
  - Cloud computing (on demand)
  - Limited by the physical capacity of a node

VIRTUALIZATION
- Creates a virtual version of an OS, server, storage device, network, …
- Allows sharing physical resources among multiple VMs (multi-tenancy)
- Enables the installation of hardware-independent software
- Enables the configuration of images usable on a wide range of devices
- VMs are managed by a hypervisor (VMM)
  - Hardware abstraction
  - The OS controls the hardware through the VMM

VIRTUALIZATION
[Figure: classic software stack vs. virtualized software stack]

CONTAINERS
- Lightweight alternative to VMs
- Emulate the OS interface through the native interface
- No VMM
  - The OS offers all the required support
- Examples
  - Linux containers, Solaris containers, BSD jails
- Advantages
  - Fast allocation
  - Performance similar to running directly on the OS
  - Lightweight

CONTAINERS

DOCKER
- Extension of Linux containers (LXC)
- Previously named dotCloud
- namespaces
  - Restrict what a container can see
- cgroups
  - Restrict how much of a resource a container can use
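
As a rough illustration of the cgroups bullet, the sketch below assembles a `docker run` command line that caps CPU, memory and process count. The flags `--cpus`, `--memory` and `--pids-limit` are standard Docker CLI options backed by cgroups; the helper name `limited_run_command`, the image name and the limit values are assumptions made for this example. Nothing is executed: the function only returns the argument list.

```python
from typing import List


def limited_run_command(image: str, cpus: float, memory: str, pids: int) -> List[str]:
    """Return the argv for a `docker run` with cgroup-backed resource limits.

    Hypothetical helper: it only assembles the command, it does not run Docker.
    """
    return [
        "docker", "run", "--rm",
        f"--cpus={cpus}",        # cap on CPU time (cgroups CPU controller)
        f"--memory={memory}",    # cap on RAM (cgroups memory controller)
        f"--pids-limit={pids}",  # cap on the number of processes (cgroups pids controller)
        image,
    ]


if __name__ == "__main__":
    # Example values are illustrative only.
    print(" ".join(limited_run_command("ubuntu:22.04", 1.5, "512m", 100)))
```

Namespaces need no flag here: by default each container already gets its own PID, network and mount namespaces, which is what restricts what it can see.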

SCALABILITY
- Strong
  - Measuring execution time while keeping the data volume constant but increasing the no. of processors
  - Expectation: execution time drops k times if k processors are used
- Weak
  - Measuring execution time while increasing the no. of processors but keeping the work volume per processor constant
  - Expectation: execution time stays constant
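
The two expectations above can be checked against measured run times. The sketch below computes strong-scaling speedup and efficiency, and weak-scaling efficiency, from timings keyed by processor count. The function names and the timings are made up for illustration; they are not measurements from the lecture.

```python
from typing import Dict


def strong_scaling(times: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Fixed total problem size: speedup = T(1)/T(p), efficiency = speedup/p."""
    t1 = times[1]
    return {
        p: {"speedup": t1 / t, "efficiency": t1 / (t * p)}
        for p, t in times.items()
    }


def weak_scaling(times: Dict[int, float]) -> Dict[int, float]:
    """Fixed work per processor: efficiency = T(1)/T(p), ideally 1.0."""
    t1 = times[1]
    return {p: t1 / t for p, t in times.items()}


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by number of processors.
    strong = strong_scaling({1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0})
    weak = weak_scaling({1: 100.0, 2: 104.0, 4: 110.0, 8: 125.0})
    for p in sorted(strong):
        print(p, round(strong[p]["speedup"], 2),
              round(strong[p]["efficiency"], 2), round(weak[p], 2))
```

Ideal strong scaling gives a speedup of k on k processors (efficiency 1.0); ideal weak scaling keeps the weak efficiency at 1.0.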

SCALABILITY
- Myth
  - The more we parallelize code the faster it runs
  - Ideally 2 x resources = 2 x faster
  - In reality
    - Code is not 100% parallelizable
    - Communication & I/O
    - Resources are limited
- By adding resources we do get an improvement, but it is limited
- σ: percentage of code that is not parallelizable
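
The limit described above is usually formalized as Amdahl's law. The slide does not show the formula, so reading σ as the serial fraction in that law is an assumption of this sketch: speedup on n processors is 1 / (σ + (1 − σ)/n), which can never exceed 1/σ. The function name and the sample values are illustrative.

```python
def amdahl_speedup(sigma: float, n: int) -> float:
    """Speedup on n processors when a fraction sigma of the run time is serial."""
    if not 0.0 <= sigma <= 1.0 or n < 1:
        raise ValueError("sigma must be in [0, 1] and n must be >= 1")
    return 1.0 / (sigma + (1.0 - sigma) / n)


if __name__ == "__main__":
    sigma = 0.1  # assumed: 10% of the code cannot be parallelized
    for n in (1, 2, 4, 8, 64, 1024):
        print(n, round(amdahl_speedup(sigma, n), 2))
    # The speedup approaches, but never reaches, 1 / sigma.
    print("upper bound:", 1.0 / sigma)
```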

UNIVERSAL SCALABILITY LAW
- Beyond a certain point, the more load the system receives the less work it performs
- k: communication penalty coefficient
- Sweet spot
  - There is no point in adding more resources beyond it
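
A minimal sketch of the sweet spot, assuming the standard form of Gunther's Universal Scalability Law, C(N) = N / (1 + σ(N − 1) + κN(N − 1)), where σ is the serial (contention) fraction and κ plays the role of the communication penalty coefficient k on the slide. The slide itself does not show the formula, and the parameter values below are made up. Under this form the capacity peaks at N* = sqrt((1 − σ)/κ).

```python
import math


def usl_capacity(n: int, sigma: float, kappa: float) -> float:
    """Relative capacity (throughput) of n processors under the USL."""
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak(sigma: float, kappa: float) -> float:
    """Processor count at which capacity is maximal (the sweet spot)."""
    if kappa <= 0.0:
        raise ValueError("kappa must be > 0; with kappa = 0 there is no peak")
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    sigma, kappa = 0.05, 0.001  # assumed values, for illustration only
    print("sweet spot near N =", round(usl_peak(sigma, kappa), 1))
    for n in (1, 8, 16, 31, 64, 128):
        print(n, round(usl_capacity(n, sigma, kappa), 2))
```

With κ = 0 the expression reduces to Amdahl's law; any κ > 0 makes the curve bend downwards after the peak, which is why adding resources beyond it lowers throughput.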

EXAMPLES
- Community detection in social networks
- Weather forecast

COMMUNICATION PRICE
- Communication leads to low speedup
- Communication price: more processors lead to a drop in speedup
- Advantage of the hybrid approach

COMMUNICATION ADVANTAGE
- Example: matrix multiplication
- OpenMP
  - For small dimensions: advantage of shared memory
  - For large dimensions: application does not scale
- MPI
  - For small dimensions: communication cost
  - For large dimensions: scalability (throughput, speedup)
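
The MPI bullets can be made concrete with a toy cost model. This is a sketch under strong assumptions, not a measurement: an n x n matrix product costs 2n³ floating-point operations split evenly over p processes, and every process pays a fixed latency plus a per-word cost to receive all of B, its slice of A, and to send back its slice of C. The constants `t_flop`, `latency` and `t_word` are invented, and the model covers only the message-passing case, not OpenMP.

```python
def matmul_time(n: int, p: int, t_flop: float = 1e-9,
                latency: float = 1e-4, t_word: float = 1e-8) -> float:
    """Estimated time (s) of a row-partitioned n x n matrix product on p processes."""
    compute = 2.0 * n ** 3 * t_flop / p
    if p == 1:
        return compute  # a single process exchanges no messages
    # Words moved per process: all of B, plus a slice of A in and a slice of C out.
    words = n * n + 2.0 * n * n / p
    return compute + latency + words * t_word


def speedup(n: int, p: int) -> float:
    """Modelled speedup of p processes over one process."""
    return matmul_time(n, 1) / matmul_time(n, p)


if __name__ == "__main__":
    for n in (100, 1000, 4000):
        print(n, [round(speedup(n, p), 2) for p in (2, 4, 8)])
```

Communication grows as n² while computation grows as n³, so in this model the communication share shrinks as the matrices get larger and the speedup moves towards p.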

IMPACT OF DATA & ALGORITHM
- For the same algorithm, different data can impact its scalability
- Example: graph processing
  - Platform: Amazon EC2 m3.large (2 Intel Xeon E5-2670 cores, 7.5 GB RAM, 100 GB SSD, 1 Gb Ethernet)
  - 2 data sets: CARN, WIKI
  - No. of nodes: 3, 6, 9
  - 3 algorithms
    - Hashtag Aggregation: at each step, compute a statistic about a given tag in the graph
    - Meme Tracking: analyze meme spread in a graph
    - TDSP (Time Dependent Shortest Path): used in routing; recompute the shortest path at each step

IMPACT OF DATA & ALGORITHM
- CARN
  - Large diameter
  - Node degree distribution: uniform
- WIKI
  - Small diameter
  - Node degree distribution: power law
- Idea
  - Partition the graph across many processors
  - The number of interprocessor edges impacts communication
  - Increasing the no. of partitions reduces scalability due to interprocessor communication (TDSP, MEME)
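
The number of interprocessor edges can be counted directly once each vertex is assigned to a partition. The sketch below uses a naive contiguous block partition on two tiny made-up graphs (a ring and a star). It is not the partitioner or the data from the cited experiments; the function names are illustrative.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def block_partition(n_nodes: int, k: int) -> Dict[int, int]:
    """Assign vertices 0..n_nodes-1 to k contiguous blocks (naive partitioner)."""
    return {v: v * k // n_nodes for v in range(n_nodes)}


def cut_edges(edges: List[Edge], part: Dict[int, int]) -> int:
    """Number of edges whose endpoints live on different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])


if __name__ == "__main__":
    n = 6
    ring = [(i, (i + 1) % n) for i in range(n)]  # every vertex has degree 2
    star = [(0, i) for i in range(1, n)]         # one hub, skewed degrees
    for k in (2, 3):
        part = block_partition(n, k)
        print(k, "partitions:", cut_edges(ring, part), "ring cuts,",
              cut_edges(star, part), "star cuts")
```

Each cut edge is a message that has to cross the network at every step, so the cut count is a first-order proxy for the communication cost of a partitioning.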

IMPACT OF DATA & ALGORITHM
- Example: detect influence spread in parallel on large graphs
- Phases shown: Setup, I/O, Processing (% parallel), Shutdown

IMPACT OF PARALLEL APIS
- Various MPI implementations

IMPACT OF HARDWARE PLATFORM
- Example: weather forecast (WRF)
- Blue Gene scales well
- No. of procs/speedup ratio

LECTURE SOURCES
- https://www.slideshare.net/vividcortex/quantifyingscalability-with-the-usl
- http://www1.chapman.edu/~radenski/research/papers/mergesort-pdpta11.pdf
- https://arxiv.org/pdf/1012.2273.pdf
- http://serc.iisc.ernet.in/~simmhan/pubs/simmhanipdps-2015.pdf
- https://books.google.ro/books?id=Jtha3wRWCkQC&pg=PA485&lpg=PA485
- http://lass.cs.umass.edu/~shenoy/courses/spring16/lectures/Lec06.pdf
- https://robinsystems.com/blog/containers-deep-dive-lxc-vs-docker-comparison/

NEXT LECTURE
- Data analysis
  - Independent
  - Dependent
    - Graphs
    - BSP model
  - Data flows
- Heterogeneous vs. homogeneous data
- Processing platforms
  - MapReduce
  - Spark Streaming
  - Apache Giraph