Supercomputing versus Big Data processing: What's the difference?

Helmut Neukirchen, helmut@hi.is
Professor of Computer Science and Software Engineering

The Big Data buzz
• Google search requests, 1/2004–10/2016: “Supercomputer” vs. “Big Data”.
[Figure: Google search requests over time, one curve each for “Supercomputer” and “Big Data”]

Excursion: Moore’s Law
• “Number of transistors in an integrated circuit doubles every two years.”
• Clock speed and performance per clock cycle each doubled every two years as well.
  – Not true anymore!
http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/

Consequences of hitting physical limits
• Today, the only way to achieve more speed:
  – Parallel processing:
    • Many cores per CPU,
    • Many CPU nodes.
• Both Big Data processing and Supercomputing use this approach.
  – Investigate them to see their difference!
https://hpc.postech.ac.kr/~jangwoo/research.html
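As a minimal sketch of the parallel-processing idea above, the following Python snippet splits one computation into independent parts and hands them to a pool of workers. The function names (`split_range`, `partial_sum`, `parallel_sum_of_squares`) are made up for illustration. Threads are used only to keep the sketch portable; in CPython, CPU-bound work would normally need a process pool or several nodes to actually gain speed.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, parts):
    """Split range(n) into at most `parts` contiguous (lo, hi) pieces."""
    if n <= 0:
        return []
    step = -(-n // parts)  # ceiling division
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def partial_sum(bounds):
    """Work done by one worker: sum of squares over its own piece."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


def parallel_sum_of_squares(n, workers=4):
    """Distribute the pieces over a pool of workers and combine the results."""
    pieces = split_range(n, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, pieces))


result = parallel_sum_of_squares(1000)
```

The pieces are independent, so the only communication is the final combination of the partial sums.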

Supercomputing / High-Performance Computing (HPC)
• Computationally intensive problems. Mainly:
  – Floating-point operations (FLOP),
  – Numerical problems, e.g. weather forecasting.
• HPC algorithms are implemented rather low-level (= close to hardware, fast):
  – Programming languages: Fortran, C/C++.
  – Explicit exchange of intermediate results.
• Input and output data processed by a node typically fit into its main memory (RAM).
  – Output of similar size as input.
http://www.vedur.is/vedur/frodleikur/greinar/nr/3226
https://www.quora.com/topic/Message-Passing-Interface-MPI
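To illustrate what "explicit exchange of intermediate results" means, here is a small sketch in plain Python that simulates it on one machine. A 1D grid is split among several hypothetical "nodes"; in every step each node first receives the boundary value of its neighbours and then updates its own part. Real HPC codes would typically do this exchange with a message-passing library such as MPI from Fortran or C/C++; the function names and the simple averaging rule here are assumptions chosen only for illustration, and the grid is assumed to have at least two points.

```python
def step_serial(u):
    """One update on the whole grid: interior points become the mean of
    their two neighbours, the two end points stay fixed."""
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]


def step_decomposed(parts):
    """The same update, but the grid is split into parts owned by 'nodes'."""
    # 1) Explicit exchange: every node gets its neighbours' boundary values.
    halos = []
    for rank in range(len(parts)):
        left = parts[rank - 1][-1] if rank > 0 else None
        right = parts[rank + 1][0] if rank < len(parts) - 1 else None
        halos.append((left, right))

    # 2) Local computation on each node, using only its part plus the halo.
    new_parts = []
    for part, (left, right) in zip(parts, halos):
        padded = ([] if left is None else [left]) + part + ([] if right is None else [right])
        offset = 0 if left is None else 1
        new = []
        for i in range(len(part)):
            j = i + offset
            if j == 0 or j == len(padded) - 1:
                new.append(padded[j])  # global end point: stays fixed
            else:
                new.append((padded[j - 1] + padded[j + 1]) / 2)
        new_parts.append(new)
    return new_parts


grid = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
parts = [grid[0:3], grid[3:6], grid[6:8]]
for _ in range(5):
    grid = step_serial(grid)
    parts = step_decomposed(parts)
flattened = [value for part in parts for value in part]
```

After every step the decomposed grid matches the serial one, but only because the boundary values were exchanged first. This per-step communication is why a fast interconnect matters for HPC.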

HPC hardware
• Compute nodes: fast CPUs.
• Nodes connected via fast interconnects (e.g. InfiniBand).
• Parallel file system storage: accessed by compute nodes via the interconnect.
  – Many hard disks in parallel (RAID): high aggregate bandwidth.
• Expensive, but needed for the highest performance of the HPC processing model:
  – Read input once, compute and exchange intermediate results, write final result.
Example: supercomputer at the Icelandic Meteorological Office, owned by the Danish Meteorological Institute:
• Thor: 100 TFLOP/s
• Freya: 100 TFLOP/s
• Storage: 1500 terabytes
For comparison: Garpur @ Reiknistofnun Háskóla Íslands: 37 TFLOP/s
http://www.dmi.dk/nyheder/arkiv/nyheder-2016/marts/ny-supercomputer-i-island-en-billedfortaelling/

Big Data
• Data created in the age of the Internet:
  – Volume (amount of data),
    • Unlikely to fit into the main memory (RAM) of a cluster.
      ⇒ Need to process data chunk by chunk.
    • Extract a condensed summary as output.
  – Variety (range of data types and sources),
  – Velocity (speed of data in and out).
https://youtu.be/H7NLECdBnps
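The chunk-by-chunk idea can be sketched in a few lines of Python. The example below never holds more than one chunk in memory and reduces the whole input to a small summary. The helper names (`read_chunks`, `summarize`), the one-number-per-line input format and the tiny in-memory "file" are assumptions made for illustration only.

```python
import io


def read_chunks(lines, chunk_size):
    """Yield lists of at most `chunk_size` items from any iterable."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def summarize(lines, chunk_size=1000):
    """Process the input one chunk at a time and keep only a small summary."""
    count, total, largest = 0, 0.0, None
    for chunk in read_chunks(lines, chunk_size):
        values = [float(line) for line in chunk if line.strip()]
        if not values:
            continue
        count += len(values)
        total += sum(values)
        chunk_max = max(values)
        largest = chunk_max if largest is None else max(largest, chunk_max)
    mean = total / count if count else None
    return {"count": count, "mean": mean, "max": largest}


# Stand-in for a file that would be far too large to load at once.
data = io.StringIO("1\n2\n3\n4\n5\n")
summary = summarize(data, chunk_size=2)
```

The output (three numbers) is much smaller than the input, in contrast to the HPC case where output and input are of similar size.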

Big Data processing
• Typically simple operations instead of number crunching.
  – E.g. a search engine crawling the web: index words and links on web pages.
• Algorithms do not require much exchange of intermediate results.
  ⇒ Input/output (I/O) of data is the most time-consuming part.
  – Computation and communication are less critical.
  ⇒ Big Data algorithms can be implemented rather high-level:
  – Programming languages: Java, Scala, Python.
  – Big Data platforms: Apache Hadoop, Apache Spark:
    • Automatically read new data chunks,
    • Automatically execute the algorithm implementation in parallel,
    • Automatically exchange intermediate results as needed.
http://www.semantic-evolution.com
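The search-engine example can be sketched as a map, shuffle and reduce pipeline, the pattern that platforms such as Hadoop and Spark run in parallel. The code below is plain Python and does not use the Hadoop or Spark APIs; the function names and the two example pages are invented for illustration.

```python
import re
from collections import defaultdict


def map_page(url, text):
    """Map step: emit one (word, url) pair per distinct word on a page."""
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield word, url


def shuffle(pairs):
    """Shuffle step: group all values by key (the only data exchange needed)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(word, urls):
    """Reduce step: one index entry per word."""
    return word, sorted(set(urls))


def build_index(pages):
    """Build a word -> list of URLs index from a dict of url -> page text."""
    pairs = (pair for url, text in pages.items() for pair in map_page(url, text))
    return dict(reduce_word(word, urls) for word, urls in shuffle(pairs).items())


pages = {
    "a.example": "Big Data is big",
    "b.example": "HPC and big data",
}
index = build_index(pages)
```

Each page is mapped independently of all others, so a platform can process pages on whichever node stores them and only needs to exchange the small (word, url) pairs.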

Big Data hardware
• Cheap standard PC nodes with local storage, Ethernet network.
  – Distributed file system: each node locally stores a part of the whole data.
  – Hadoop/Spark move the processing of data to where the data is stored locally.
    ⇒ Slow network connection is not critical.
  – Cheap hardware is more likely to fail: Hadoop and Spark are fault tolerant.
• Processing model: read a chunk of local data, process the chunk locally, repeat; finally, combine and write the result.
https://www.flickr.com/photos/cmnit/2040385443
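"Moving the processing to the data" can be pictured as a scheduling decision. The following sketch is a simplified illustration in Python, not how Hadoop or Spark are actually implemented: each data block is assigned to the least-loaded live node that stores a copy of it, and only if no such node is available does the block fall back to a remote read. The function name, node names and block names are hypothetical.

```python
def schedule(block_locations, live_nodes):
    """Assign each block to a node; returns block -> (node, is_local)."""
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block in sorted(block_locations):
        # Prefer nodes that hold a local copy and are still alive.
        local = [node for node in block_locations[block] if node in load]
        candidates = local or list(live_nodes)
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = (chosen, bool(local))
        load[chosen] += 1
    return assignment


# Which nodes store a copy of which block (n9 is assumed to have failed).
block_locations = {
    "b0": ["n1", "n2"],
    "b1": ["n1"],
    "b2": ["n3"],
    "b3": ["n9"],
}
assignment = schedule(block_locations, live_nodes=["n1", "n2", "n3"])
```

Three of the four blocks are processed where they are stored, so only one block has to cross the slow network.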

HPC vs. Big Data
• We need both, HPC and Big Data processing:
  – Do not run compute- or communication-intensive HPC jobs on a Big Data cluster:
    • Slower CPUs,
    • Slower communication,
    • Slower high-level implementations.
  – Do not run Big Data jobs on an HPC cluster:
    • Typically slower (fast local access missing),
    • Waste of money to use expensive HPC hardware.

HPC and Big Data @ HÍ
• Research and teaching of both at the Computer Science department:
  – Guest Prof. Dr. Morris Riedel, Prof. Dr. Helmut Neukirchen:
    • HPC: REI101F High Performance Computing A.
    • Big Data: REI102F High Performance Computing B, TÖL503M/TÖL102F Distributed Systems.
  – By inventing clever algorithms, HPC/Big Data may not even be needed.
    • 15:45–16:00, Askja 131: Páll Melsted: “Kallisto: how RNA analysis that took half a day now takes 5 minutes”
Thank you for your attention! Any questions or comments?