Introduction to Big Data Analytics Course
August 28th, 2017
Kyung Eun Park, D.Sc.
kpark@towson.edu
Contents
1. Course Information
2. Scope of the Class
3. Introduction to Big Data Analytics
4. Project Groups with Analytical Platform
Welcome aboard!
• Instructor: Kyung Eun Park
  • Office: YR 457
  • Email: kpark@towson.edu
  • Office hours: Tuesdays (11:00 AM to 12:00 PM) or Mondays by appointment
• TA: Chung Hao Juan
  • Email: cjuan1@students.towson.edu
• Homepage
  • http://tigerweb.towson.edu/kpark/courses/bda/index.html
• Textbook
  • Practical Data Science with Hadoop and Spark by Mendelevitch, Stella, and Eadline, Addison-Wesley, 2017
Scope of the Class
1. New Analytical Ecosystem
2. Overview of Big Data, Data Science, and Analytics Methods
3. Innovations in processing large datasets by Internet Giants
4. Data Scientist
5. Big Data Analytics Environment
6. Big Data Processing Platforms
New Approach for Big Data
Massively parallel software running on tens, hundreds, or even thousands of servers, plus horizontal scaling
• Relational database management systems and existing desktop statistics and visualization packages often have difficulty handling big data.
• Paradigm shift in working with data
• Hadoop Big Data framework: a distributed file system with a big data processing algorithm such as MapReduce (similar to the traditional divide-and-conquer approach)
• But it provides static batch analysis with less support for real-time data
https://en.wikipedia.org/wiki/Big_data
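The divide-and-conquer idea behind MapReduce can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop's actual API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample documents are assumptions made for this sketch.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over all documents."""
    pairs = []
    for document in documents:
        # On a real cluster each document (split) would be mapped on a different node.
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


# Hypothetical sample input, one string per input split.
documents = ["big data needs big clusters", "data science uses data"]
counts = word_count(documents)
print(counts)
```

In a real framework the map calls run in parallel on many servers and the shuffle moves data across the network; here everything runs in one process so that the three steps stay visible.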
Hadoop Big Data Cluster Platform
Organizations with an Interest in Big Data, and Their Concerns about It
What would you suggest or build for those organizations as a data expert?
• Design and build a modern data warehouse (DW) / business intelligence (BI) / data analytics architecture
• Provide a flexible, multi-faceted analytical ecosystem.
• Leverage both internal and external data to obtain valuable, actionable insights that allow the organization to make better decisions.
Transforming long-established data warehousing architectures into vibrant, multi-faceted analytical ecosystems with Hadoop and MapReduce!!!
Hadoop makes it possible for organizations to cost-effectively consume and analyze large volumes of semi-structured data
http://www.b-eye-network.com/blogs/eckerson/archives/2012/02/the_new_analyti.php
http://www.rosebt.com/blog/big-data-analytics-infrastructure
New Analytical Ecosystem
• Traditional data warehousing environment
• New modern BI architecture with Hadoop, NoSQL DB, etc.
http://www.b-eye-network.com/blogs/eckerson/assets_c/2012/02/BI%20Ecosystem-474.php
http://www.rosebt.com/blog/big-data-analytics-infrastructure
New DW / BI / Data Analytics Architecture beyond Hadoop!!!
• From Hadoop, which stores semi-structured data (log files and machine-generated data), and from traditional structured data sources, to a data warehousing hub that is queried with SQL-based reporting and analysis tools
• Direct analysis of raw data inside Hadoop, NoSQL DBs, and in-memory DBs by writing MapReduce jobs or using familiar SQL- or JSON-based query tools
http://www.b-eye-network.com/blogs/eckerson/archives/2012/02/the_new_analyti.php
http://www.rosebt.com/blog/big-data-analytics-infrastructure
Big Data
No single definition; according to Wikipedia:
• Big data is the term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.
• Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.
• Big data tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data sets.
• Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."
https://en.wikipedia.org/wiki/Big_data
Big Data 3 V's
Characterized by the 3 V's
• Volume: larger than "normal"
  • Expensive to do ETL (Extract, Transform, and Load), index, and search
• Velocity: almost real-time arrival rate
  • Service on-the-fly
  • cf. batch ETL operations
• Variety: mix of unstructured and semi-structured data from various sources
  • Images
  • Videos
  • Text content such as blogs, tweets, media, etc.
  • Reactions on social media
  • Machine-generated data
  • Log files
Text Data: Biggest Driver of Growth of Data
• Text data is the biggest driver of the growth of Big Data in the real world.
• MapReduce was Google's solution to the problem of processing large amounts of text data
  • Indexing web pages
  • Re-indexing continues
  • Simple search performed on unstructured data at large scale
• The MapReduce algorithm is not universally applicable to all big data
  • Works well for counting and simple statistics
  • Still valid for real-time data?
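Web-page indexing, the use case mentioned above, boils down to building an inverted index that maps each word to the pages containing it. The following is a minimal single-machine sketch in plain Python, not Google's implementation: the page IDs, page texts, and the function names `build_inverted_index` and `search` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(page_ids) for word, page_ids in index.items()}


def search(index, query):
    """Return the IDs of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical pages standing in for crawled web documents.
pages = {
    "page1": "hadoop stores big data",
    "page2": "spark processes big data in memory",
    "page3": "hadoop runs mapreduce jobs",
}
index = build_inverted_index(pages)
print(search(index, "big data"))
```

Building the index is a counting-style job that fits MapReduce well (map emits word and page ID, reduce collects the page IDs per word), which is one reason the slide lists indexing as its main example.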
Data Science
No clear consensus around one definition
• Data science is (1) the exploration of data via the scientific method to discover meaning or insight and (2) the construction of software systems that utilize such meaning and insight in a business context
• Interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining
• Also known as data-driven science
• Concept to unify statistics, data analysis, and their related methods in order to understand and analyze actual phenomena with data
• Uses techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.
https://en.wikipedia.org/wiki/Data_science
Two Keys of Data Science
(1) Exploration of data using the scientific method
• Cycle: Ask a Question → Form a Hypothesis → Implement the Analysis → Evaluate Results → (back to Ask a Question)
(2) Implementation of software systems
• Make the output of the system available and usable
Data Science History
• Rise of data science with key technological and scientific achievements
  • Statistics and machine learning: machines learn patterns from data
  • Open source libraries: fast and robust machine learning algorithms
  • Computer technology: easy to collect, store, and process large sets of data at lower cost
• Innovation from Internet companies
  • Yahoo!, Google, Amazon, Netflix, Facebook, and PayPal realized that they had huge amounts of data.
  • They were driven to apply machine learning and statistical techniques to that data, expecting significant benefits to their business:
    • Business growth
    • New business opportunities
    • Innovative products
Machine Learning and Statistical Techniques
Recognition of the potential of using large existing raw datasets in new, innovative ways.
• Google, Yahoo! (now Bing): improvement in search engine results, search suggestions, and spelling
• Many search giants: analyze page view and click information to predict CTR (click-through rate) and deliver relevant online ads to search users
• LinkedIn, Facebook: analyze the social graph of relationships between users for “People You May Know” (PYMK)
• Netflix, eBay, Amazon: automated product or movie recommendations
• PayPal: large-scale graph algorithms to detect payment fraud
Wave of innovation with new tools and technologies: Google File System, MapReduce, Hadoop, Pig, Hive, Cassandra, Spark, Storm, HBase, etc.
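To make the social-graph example concrete, here is a small sketch of a "People You May Know" style suggestion that ranks friends-of-friends by their number of mutual friends. This common-neighbors heuristic is an assumption chosen for illustration, not the algorithm LinkedIn or Facebook actually uses; the graph, the user names, and the function `people_you_may_know` are hypothetical.

```python
from collections import Counter


def people_you_may_know(graph, user, top_n=3):
    """Suggest non-friends of `user`, ranked by number of mutual friends."""
    friends = graph.get(user, set())
    mutual = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone who is already a direct friend.
            if candidate != user and candidate not in friends:
                mutual[candidate] += 1
    # Sort by descending mutual-friend count, then by name for stable output.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical undirected friendship graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cara"},
    "bob": {"ann", "dan", "eve"},
    "cara": {"ann", "dan"},
    "dan": {"bob", "cara"},
    "eve": {"bob"},
}
suggestions = people_you_may_know(graph, "ann")
print(suggestions)
```

At the scale of a real social network the same counting would be distributed across a cluster, which is where tools such as Hadoop or Spark come in.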
Data Science in Modern Enterprise
Key technologies from Internet giants became commercial tools and open source products
• Cheap, fast storage, cluster computing technologies, and Hadoop:
  • Capability to collect and store vast amounts of data inexpensively
  • Data (in raw form) as a valuable asset without additional cost
Data science enables enterprise applications that were previously not possible
• Machine learning and statistical data mining algorithms
  • R, Python scikit-learn, Spark MLlib, etc.
  • Easy and flexible application of advanced algorithms to datasets
Reduction of the overall effort, time, and cost to achieve business results from data assets
Data Scientist Role
• Generally speaking, a successful data scientist needs a balanced skillset from both data engineering and applied science
• Data Engineer
  • A software engineer experienced in building high-quality, production-grade software systems, specializing in fast (and distributed) data pipelines
  • Expertise in a major programming language
  • Data collection, storage, and processing capability with RDBMS, NoSQL DB, and the Hadoop stack (HDFS, MapReduce, HBase, Pig, Hive, and Storm)
• Applied Scientist
  • Primarily interested in solving a real-world problem by applying the right algorithm to data
  • Hands-on with statistical tools and some scripting languages (R, Python, or SAS)
• Soft skills
  • Curiosity, continuous learning, persistence, communication or story-telling skills
Expected Skillset of Data Scientist
• Data Engineering: Distributed Systems, Data Processing, Computer Science, Software Engineering
• Applied Science: Data Analytics, Experiment Design, Machine Learning, Statistics
Successful Transition to Data Scientist
• Ideal data scientist: stronger development experience as well as more depth in machine learning and statistics
• Recommendation in reality: the data engineer and the applied scientist work together on the same problem, learn from each other, and thus accelerate their transition to becoming data scientists
Data Science Project Life Cycle
• Ask the right question
  • Understanding the business problem and translating it into an easy-to-understand form
  • Well-defined success criterion
• Data acquisition
  • Getting data into a Big Data framework (e.g. Hadoop)
  • Hadump: data dumped into Hadoop with no plan
• Data cleaning
  • Keeping data consistent and free of human error
• Explore data and design model features
• Build and tune the model
• Deploy to production
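The data cleaning step of the life cycle can be illustrated with a small sketch in plain Python. The record layout (a `city` name and a `temp_c` reading), the sample rows, and the function `clean_records` are hypothetical; a real project would define its own consistency rules for its own data.

```python
def clean_records(raw_records):
    """Normalize city names, drop unusable rows, and remove duplicates."""
    cleaned = []
    seen = set()
    for record in raw_records:
        # Normalize whitespace and capitalization of the city name.
        city = (record.get("city") or "").strip().title()
        try:
            temp_c = float(record.get("temp_c"))
        except (TypeError, ValueError):
            continue  # drop rows with a missing or non-numeric reading
        if not city:
            continue  # drop rows without a city name
        key = (city, temp_c)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append({"city": city, "temp_c": temp_c})
    return cleaned


# Hypothetical raw input with typical human-entry problems.
raw_records = [
    {"city": " baltimore ", "temp_c": "21.5"},
    {"city": "Baltimore", "temp_c": "21.5"},  # duplicate after normalization
    {"city": "Towson", "temp_c": "n/a"},      # non-numeric reading
    {"city": "", "temp_c": "19.0"},           # missing city
    {"city": "towson", "temp_c": 18},
]
cleaned = clean_records(raw_records)
print(cleaned)
```

Whether a bad row should be dropped, repaired, or flagged is a project decision; this sketch simply drops it to keep the example short.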
Iterative Data Science Project Life Cycle
• Cycle: Ask a Question from a Hypothesis → Acquire Data → Clean Data → Explore Data and Design Features → Build Model → Evaluate and Visualize Results → Deploy and Implement the Analysis → (back to Ask a Question)
Data Science Project Management
• It seems natural to manage a data science project using general techniques from other software development projects
• But that is not the case! Different nature, so a different approach and mindset are needed!
  • Unknown data quality at the start of the project
  • Measurement and evaluation are of utmost importance, so developing the analytical infrastructure is as important as the algorithm. But it takes time and adds overhead.
  • Difficult to determine the expected level of accuracy for statistical techniques and machine learning, and to define clear exit criteria
  • Iterative modeling is performed: shorten the iteration times!!!
• So far, there is no field-tested methodology for managing data science projects!
Project Group
• Group 1: Hadoop Ecosystem
  • Shubham Puri
  • Ksenia Venevtseva
  • Zhengzheng Li
• Group 2: NoSQL (MongoDB, …)
  • Robert Brasso
  • Kevin Chu
  • Simi Akinmuda
• Group 3: Elastic or Splunk
  • Hyun Park
  • Dooil Kim
  • Muhammad Saifullah
• Group 4: Hadoop Spark
  • Abhay Kaushai
  • Angad Maggo
  • Pengfei Ran
Applicable Cases: Predictive Analytics
• Anomaly Detection
• Alerting
• Metrics and Log Analysis
• Operational Analytics
• Behavior Analysis
• Business Analytics
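As one possible starting point for the anomaly detection and alerting cases, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The z-score rule, the threshold of 2.0, the metric name, and the sample numbers are all assumptions for illustration; project groups may well choose a different technique.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    if len(values) < 2:
        return []  # a standard deviation needs at least two points
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical metric: web requests per minute taken from a log file.
requests_per_minute = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]
anomalies = find_anomalies(requests_per_minute)
print(anomalies)
```

Each flagged pair could then trigger an alert. Note that a single extreme value also inflates the mean and standard deviation, so more robust statistics may be preferable on real log data.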
Infrastructures You May Choose …
• Hadoop (HDFS, MapReduce)
• R, Python scikit-learn, Spark MLlib, etc.
• Amazon EC2
• NoSQL Database: MongoDB, CouchDB, Cassandra
• Elasticsearch: ELK Stack + X-Pack (or Cloud)
and more …
Open Datasets for the Course Projects
• Twitter
  • Public Streams using streaming APIs (https://dev.twitter.com/streaming/public)
• Sensor-based datasets for IoT (Internet of Things)
  • Energy, urban planning, healthcare, engineering, weather, and transportation sectors (http://www.datasciencecentral.com/profiles/blogs/great-sensor-datasets-to-prepare-yournext-career-move-in-iot-int)
• User-created data
  • Log, web, motion, or location data
• Open datasets
  • U.S. Government's open data:
    • https://www.data.gov/
  • Maryland's open data
    • Maryland open data portal: https://data.maryland.gov/
    • Maryland mapping and GIS data portal: http://imap.maryland.gov/Pages/default.aspx
  • Baltimore datasets available at the following websites:
    • Baltimore Open Data: https://data.baltimorecity.gov/
    • Baltimore Neighborhood Indicators Alliance: http://bniajfi.org
Reading Assignments
• Textbook:
  • Chapters 1 and 2
• Articles:
  1. The Anatomy of Big Data Computing, https://arxiv.org/ftp/arxiv/papers/1509.01331.pdf
  2. Computing Infrastructure for Big Data Processing, http://www.istccc.cmu.edu/publications/papers/2013/LingLiu-FCS.pdf
  3. Scaling Big Data Mining Infrastructure: The Twitter Experience, http://www.kdd.org/exploration_files/V14-02-02-Lin.pdf
  4. Smarter Infrastructure: Thoughts on Big Data and Analytics, http://www.redbooks.ibm.com/redpapers/pdfs/redp5161.pdf
• References
  1. https://en.wikipedia.org/wiki/Big_data
  2. https://en.wikipedia.org/wiki/Data_science
  3. http://scikit-learn.org/stable/tutorial/basic/tutorial.html
  4. https://spark.apache.org/docs/latest/
  5. http://www.rosebt.com/blog/big-data-analytics-infrastructure