BIG DATA TECHNOLOGIES
LECTURE 1: BIG DATA & BIG DATA ANALYSIS
Assoc. Prof. Marc FRÎNCU, Ph.D. Habil.
marc. [email protected] ro

LECTURE STRUCTURE
• 1 lecture hour + 2 lab hours / week
• Aim of this lecture
  • Importance of Big Data analysis
  • Impact of Big Data in science and technology
  • Parallel and distributed architectures
  • Parallelization of algorithms
  • Importance of hardware and data structures in the design of Big Data processing algorithms
  • Analysis of independent, dependent and streaming data (homogeneous and heterogeneous)
• In practice (lab)
  • Using Google Cloud to run parallel and distributed applications
  • Parallelizing basic sequential algorithms
  • Design, testing, evaluation

MINIMAL REQUIREMENTS
• Passing grade (5)
  • 1 parallel algorithm implemented (in a single technology) and evaluated
  • One scientific presentation (10 min presentation + 2 questions) about a scientific paper (published or tech report) with a focus on Big Data, Cloud Computing, Bioinformatics or Security in Big Data
• Maximum grade (10)
  • All algorithms given during lab hours implemented, and a final technical report presented
  • One scientific presentation (10 min presentation + 2 questions) about a top scientific paper (IPDPS, Supercomputing, Euro-Par, CCGrid, ICDCS, IEEE Trans. PDC, IEEE Trans. Computing, FGCS, TPDS) with a focus on Big Data, Cloud Computing, Bioinformatics or Security in Big Data

A WORLD INCREASINGLY CONNECTED
• Je Suis Charlie: 6500 retweets per minute

A WORLD INCREASINGLY INTERCONNECTED
• Cyber-physical systems: IT + communication + intelligence

KNOWLEDGE = POWER = DATA
• Data: decision, control, autonomy, intelligence

WHAT IS BIG DATA?
• Oxford English Dictionary (OED)
  • Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges
• Wikipedia
  • An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications
• Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze
• “The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”
• The broad range of new and massive data types that have appeared over the last decade or so
• The new tools helping us find relevant data and analyze its implications
• The convergence of enterprise and consumer IT
• The shift (for enterprises) from processing internal data to mining external data
• The shift (for individuals) from consuming data to creating data
• The merger of Madame Olympe Maxime and Lieutenant Commander Data
• The belief that the more data you have the more insights and answers will rise automatically from the pool of ones and zeros
• A new attitude by businesses, non-profits, government agencies, and individuals that combining data from multiple sources could lead to better decisions
• Source: https://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/#66e783be13ae

WHAT IS BIG DATA?
• Volume
• Velocity
• Variety
• ...

WHAT IS BIG DATA?
• Big Data: TB or PB of data (> TB, PB); 30 KiB - 30 GiB / sec
• Small Data: GB; fixed data
• Furthermore, Big Data means:
  • Using multiple data sources
  • Data ambiguities and human/machine errors
• Big Data != Better Data
• Unprocessed data has no meaning!
• Data analysis increases its value → information!

BIG DATA IN NUMBERS

BIG DATA IN THE CURRENT GLOBAL CONTEXT

WHY NOW?
“We could have gotten started a lot earlier. We simply weren’t stepping back and looking at how to use the data” – Brad Smith, Intuit
Data are too precious to be erased!
• Hardware/price
  • Low storage cost
  • Powerful multicore processors
  • Low latency for distributed systems
    • Fast links: 40 Gbps, 100 Gbps
  • Virtualization/containers
    • Isolate resources
    • VMware, VirtualBox, Docker
  • Cheap access to resources
    • Cloud Computing
• Technologies
  • A better understanding of process distribution
    • MapReduce
  • New database systems
    • NoSQL (key-value store, columnar): Redis, Cassandra, Dynamo, MonetDB
  • Advanced data analysis methods
    • Machine Learning
  • Easily accessible Big Data platforms
    • Google Cloud, Amazon Web Services
  • Open-source software
    • OpenStack, OpenNebula, HDFS
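
The MapReduce idea mentioned above can be illustrated with a minimal word-count sketch. It runs sequentially in a single process, so it only shows the map, shuffle and reduce steps, not the distribution across machines that a real framework provides. The function names and the sample documents are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce one after another (no parallelism here)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: three tiny "documents".
counts = word_count(["big data", "Big Data analysis", "data"])
print(counts)
```

Because each call to `map_phase` depends on one document only and each call to `reduce_phase` on one key only, both steps could in principle be spread over many workers.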

WHAT DO WE DO WITH THE DATA?
• Ethics!
  • Private data
  • Sensitive data

INFORMATION EXTRACTION
• Exploratory
  • Theory based on observing phenomena
• Constructive
  • Theory based on axioms and theorems
• Modelling (theory), analysis, hypothesis, experiment

THE 4TH PARADIGM
• Big Data + analysis
  • Prediction of the future
• Analysis
  • Follows an exploratory path and studies data
  • Infers knowledge based on statistics or machine learning techniques
  • Constructs models and validates them based on data

DATA ANALYSIS
• The process of studying various types of data to identify previously unknown correlations and extract other useful information
• Based on data mining
• Data flow

TYPES OF DATA ANALYTICS
• Descriptive
  • What has happened?
• Diagnostic
  • Why did it happen?
• Predictive
  • What will happen?
• Prescriptive
  • What should we do with the data?
• Level of understanding of the data & its value
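
To make the difference between descriptive and predictive analytics concrete, here is a small sketch on a made-up series of readings. The descriptive part summarizes what has happened; the predictive part fits a least-squares line and extrapolates one step ahead. The data, the function names and the choice of a linear model are assumptions for illustration only.

```python
def describe(values):
    """Descriptive: summarize what has happened."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def fit_line(values):
    """Least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_next(values):
    """Predictive: extrapolate the fitted line to the next time step."""
    slope, intercept = fit_line(values)
    return slope * len(values) + intercept


# Hypothetical sensor readings taken at equal time intervals.
readings = [10.0, 12.0, 14.0, 16.0]
print(describe(readings))
print(predict_next(readings))
```

Diagnostic and prescriptive analytics need more than the series itself (causes and possible actions), which is why they are not covered by this sketch.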

A FEW EXAMPLES
• Medical monitoring of children to alert doctors and parents when an intervention is needed
• Predicting the status of industrial machinery
• Preventing traffic jams, saving fuel, cutting down pollution

DATA VALUE

DATA ANALYSIS FLOW
• Data acquisition
• Data cleaning, annotation and extraction
  • Missing values, outliers, duplicates
  • Between 50% and 70% of the effort goes here!
• Heterogeneous data integration and representation in a common format
• Data analysis
• Automated and visual interpretation of results
  • People often see patterns that algorithms fail to identify!
• Decision making
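
The cleaning step above (missing values, duplicates, outliers) can be sketched as follows. This is one possible approach under stated assumptions: records are plain dictionaries, a missing value is represented by `None`, and outliers are flagged with a modified z-score based on the median absolute deviation. The threshold of 3.5 and the sample records are illustrative choices, not taken from the lecture.

```python
from statistics import median


def clean(records):
    """Drop records with missing values, then drop exact duplicates (order kept)."""
    seen, result = set(), []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # missing value
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        result.append(rec)
    return result


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical body-temperature records.
records = [
    {"id": 1, "temp": 36.6},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 36.6},
    {"id": 3, "temp": 36.9},
    {"id": 4, "temp": 36.7},
    {"id": 5, "temp": 42.5},
]
cleaned = clean(records)
flags = flag_outliers([rec["temp"] for rec in cleaned])
print(cleaned)
print(flags)
```

Whether a flagged value is an error or a real event (here, a possible fever) is a decision for the analyst, which is one reason this stage takes so much of the effort.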

BIG DATA ROLES
• Data scientist
  • Data science = systematic method dedicated to uncovering knowledge through data analysis
  • In business
    • Process optimizations for increased efficiency
  • In science
    • Observed/experimental data analysis with the aim of drawing a conclusion
  • Requirements
    • Statistics
    • Java, Python, R, ...
    • Domain knowledge
• Data engineering = field that designs, implements and provides systems for Big Data analysis
  • Builds scalable and modular platforms for data scientists
  • Installs Big Data solutions
  • Requirements
    • Databases, software engineering, parallel and cloud processing, real-time processing
    • C++, Java, Python
    • Understanding performance factors and the limitations of the systems/algorithms

AREAS OF INTEREST

DATA VS. PROCESSING SPEED

PRACTICAL EXAMPLE: CLASSIFICATION IN DNA MICROARRAY STUDIES
• Classification and prediction of the diagnosis based on the gene profile
• Measuring gene expression on a sample of 4,026 genes from 59 patients (39 used for training) exhibiting lymphoma, divided into 3 classes based on the type of lymphoma
• Problem
  • Few classes, hard to classify data (volume)
• Algorithm
  • Find the centroid (mean expression of each gene) of each lymphoma
  • Find the genes that belong to it
• http://statweb.stanford.edu/~tibs/ftp/ncshrink2.pdf
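
The centroid step of the algorithm can be sketched as a plain nearest-centroid classifier. This is a simplified illustration: it omits the gene-selection (shrinkage) step described in the linked paper, uses Euclidean distance as an assumed metric, and runs on a made-up toy dataset with 3 genes and 2 classes instead of the 4,026 genes and 3 lymphoma classes of the study.

```python
def class_centroids(samples, labels):
    """Mean expression of each gene, computed separately for every class."""
    grouped = {}
    for sample, label in zip(samples, labels):
        grouped.setdefault(label, []).append(sample)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(sample, centroids):
    """Assign the class whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))


# Hypothetical training data: rows are patients, columns are genes.
train_samples = [
    [1.0, 0.0, 2.0],
    [1.2, 0.2, 1.8],
    [5.0, 4.0, 0.0],
    [4.8, 4.2, 0.2],
]
train_labels = ["A", "A", "B", "B"]

centroids = class_centroids(train_samples, train_labels)
print(classify([1.1, 0.3, 2.1], centroids))
print(classify([5.0, 3.9, 0.3], centroids))
```

With thousands of genes and only tens of patients, most genes carry noise rather than signal, which is why the full method also selects the genes that actually characterize each class.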

USEFUL LINKS
• http://www.comp.nus.edu.sg/~tankl/cs5344/slides/2016/intro.pdf
• http://infolab.stanford.edu/~echang/BigDat2015-Lecture1-Edward-Chang.pdf
• https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2015_2016/bd-1516-einfuehrung.pdf
• https://www.ee.columbia.edu/~cylin/course/bigdata/EECS6893-BigDataAnalytics-Lecture1.pdf

NEXT LECTURE
• Parallel and distributed architectures
  • Parallel systems
    • Shared memory
    • Distributed memory
  • Distributed systems
    • Cloud computing