Big Data Acknowledgment and Disclaimer: This presentation is supported in part by the National Science Foundation under Grant 1240841. Any opinions, findings, and conclusions or recommendations expressed in these materials are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Big Data We live in the information age, with exponential growth of data. In 2010 Eric Schmidt, the CEO of Google, said, “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” In 2019, the World Economic Forum estimated that "the entire digital universe is expected to reach 44 zettabytes by 2020."
The Digital Revolution
The Big Data Challenge How do we... ● process data sets that are too large for traditional algorithms and software tools? ● extract knowledge from large datasets in a wide variety of domains, from science to medicine to consumer data?
We will explore Big Data through the video “Big Data in 5 Minutes” (https://www.youtube.com/watch?v=bAyrObl7TYE). Watch the first 4:00.
The field of Data Science deals with extracting information from and visualizing the results of manipulating large data sets. Given enough data, we can often extract useful information from it, by looking for patterns in the data. From this information, further analysis may yield knowledge or even wisdom. We often think of data, information, knowledge and wisdom forming a pyramid. Computing enables new methods of deriving information from data, driving monumental change across many disciplines — from art to business to science.
Definitions ● DIKW pyramid ● Data: discrete raw facts, signals, observations. ○ Raw patient data. ● Information: processed, organized, or structured data that is used for some purpose. ○ 14% of Americans are over 65. ● Knowledge and Wisdom: More philosophical. ○ E = mc² ○ Religious knowledge Keep the DIKW pyramid in mind as you watch the short 3-minute video, Learning Revealed: How A Word Is 'Birthed'.
Measuring Data
Review of Bits and Bytes ● A bit is the basic unit of computing. ○ The 0s and 1s (binary). ● A byte is a group of 8 bits. ○ Considered the basic unit of memory. ● A kilobyte is roughly 1,000 bytes. ○ Actually 2¹⁰ = 1,024 bytes. ● A megabyte is roughly 1 million bytes. ○ Actually 2²⁰ = 1,048,576 bytes.
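The "roughly 1,000" vs. "actually 1,024" distinction above can be checked in a couple of lines. This is a quick illustrative sketch (no code appears on the slides; the variable names are ours):

```python
# Base-10 (SI) units, as used on drive packaging:
kilobyte = 10**3        # 1,000 bytes
megabyte = 10**6        # 1,000,000 bytes

# Base-2 units, as memory is actually addressed:
kibibyte = 2**10        # 1,024 bytes
mebibyte = 2**20        # 1,048,576 bytes

# The gap grows with each prefix: ~2.4% at kilo, ~4.9% at mega.
print(kibibyte - kilobyte)   # 24
print(mebibyte - megabyte)   # 48576
```

This growing gap is why a "1 TB" drive shows up as roughly 0.91 TiB in the operating system.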
Data Storage Terminology (Src: Wikipedia) Everyday interpretations of bytes. (Base 10) Actual computer memory sizes. (Base 2)
Very Large Numbers (Src: UCSD Global Information Industry Center) Each unit is 1,000× the previous one. ● Byte (B): 1 byte. 1 character of text. ● Kilobyte (KB): 10³ bytes. 1 page of text. ● Megabyte (MB): 10⁶ bytes. 1 small photo. ● Gigabyte (GB): 10⁹ bytes. 1 hour of HD video. ● Terabyte (TB): 10¹² bytes. Largest consumer hard drive in 2008. ● Petabyte (PB): 10¹⁵ bytes. AT&T carried 18.7 PB of data traffic on an average day in 2008. ● Exabyte (EB): 10¹⁸ bytes. All of the hard drives in Minnesota (pop. 5.1M). ● Zettabyte (ZB): 10²¹ bytes. The world’s 27M servers processed 9.57 ZB in 2008.
Storing Data
History: Storage Capacity (Src: Wikipedia) ● 1950s-60s – ~3-4 MB (refrigerator size) ● 1982 – IBM-PC hard drive, 5 MB ● 2007 – First 1 terabyte (≈0.9095 TiB) hard drive (Hitachi GST) ● 2008 – First 1.5 terabyte (≈1.3642 TiB) hard drive (Seagate) ● 2009 – First 2 terabyte internal 3.5″ hard drive (Western Digital) ● 2010 – First >1 terabyte, 1.5 terabyte (≈1.3642 TiB) commercial tape system ● 2011 – First 4 terabyte drive (Hitachi) ● 2012 – First 1 terabyte USB flash drive, sold by Victorinox.
History: Storage Capacity 1956 - IBM 350 Disk Drive -- 5 MB
History: Storage Capacity 1985 - IBM PC Hard Drive -- 50 MB
History: Storage Capacity 2007 - Hitachi, First Terabyte (10¹² bytes) Drive
History: Storage Capacity ~2010 - 32 GB Toshiba USB Flash Drive
History: Storage Capacity (Src: Wikipedia) 2012 - First 1 terabyte (10¹² bytes) USB flash drive: Victorinox 1 TB flash drive / Swiss Army knife combo
Storage Capacity: Exponential Growth (Src: Wikipedia) Note: The Y-axis uses a logarithmic scale, so the straight-line trend indicates exponential growth.
Processing Data
Sorting and Searching ● Traditional Searching & Sorting Algorithms ○ Data must fit into main memory (RAM) ○ Data are processed sequentially (not in parallel).
Analyzing Data ● Spreadsheets ○ Structured data: columns and rows. ○ Small to large data sets (50 MB - 2 GB). ○ Analysis, visualization. ○ Widely used in business. ○ Google Spreadsheets
Analyzing Data ● Databases ○ Structured data: tables. ○ Moderate to large data sets (2 GB - 2 TB). ○ Storage and retrieval of relational data. ○ Logical searching and analysis of data. ■ Retrieve records for all males > 65.
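The example query on this slide ("retrieve records for all males > 65") can be expressed as a logical search over a table. A minimal sketch using Python's built-in sqlite3 module, with a hypothetical patients table and made-up rows of our own:

```python
import sqlite3

# In-memory database with a small, hypothetical patients table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("Al", "M", 70), ("Bo", "M", 40), ("Cy", "F", 80)],
)

# "Retrieve records for all males > 65" as a relational query.
rows = conn.execute(
    "SELECT name, age FROM patients WHERE sex = 'M' AND age > 65"
).fetchall()
print(rows)  # [('Al', 70)]
```

The point is that the database, not the application, does the searching: the same query works whether the table holds three rows or three billion.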
Google Fusion Tables ● Fusion Tables ○ Free web service (250 MB limit) ○ Data visualization, maps
Processing Large Data Sets
Processing Large Data Sets Sort a petabyte (10¹⁵ bytes) of data ○ 10¹⁵ bytes = 10³ × 10¹² bytes = 1,000 terabytes ○ Quicksort, mergesort, and radix sort assume the data fit in RAM. ○ 1 petabyte would occupy ■ 1,000 1-TB disk drives, or ■ 10,000 100-GB drives
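When the data do not fit in RAM, the classic workaround is external merge sort: sort RAM-sized chunks, spill each sorted run to disk, then do a k-way merge of the runs. A toy sketch (our own illustration, with a tiny chunk size standing in for "available RAM"):

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file; return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{x}\n" for x in sorted_chunk)
    return path

def external_sort(items, chunk_size=3):
    """Sort data too large for memory: sort fixed-size chunks in RAM,
    spill each to disk, then k-way merge the sorted runs."""
    run_files, chunk = [], []
    for x in items:                      # stream the input
        chunk.append(x)
        if len(chunk) == chunk_size:     # "RAM" is full: sort and spill
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(_spill(sorted(chunk)))
    # Merge the runs, reading each file one line at a time.
    streams = [(int(line) for line in open(f)) for f in run_files]
    merged = list(heapq.merge(*streams))
    for f in run_files:
        os.remove(f)
    return merged

print(external_sort([5, 1, 9, 3, 8, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```

At petabyte scale the same idea applies, but the runs live on those 1,000 disk drives, and the merge itself must be distributed across machines.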
The MapReduce Model MapReduce is a programming model for processing large data sets.
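The model has two user-supplied functions: a map function that turns each input record into (key, value) pairs, and a reduce function that combines all values sharing a key; the framework handles the grouping ("shuffle") in between. A single-machine sketch of the standard word-count example (the function names are ours; a real MapReduce framework distributes each phase across many machines):

```python
from collections import defaultdict
from itertools import chain

# Map phase: each document independently emits (word, 1) pairs.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: the framework groups all emitted pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the values for each key.
def reduce_fn(key, values):
    return (key, sum(values))

docs = ["big data", "big ideas"]
pairs = chain.from_iterable(map_fn(d) for d in docs)
result = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 1, 'ideas': 1}
```

Because every map call is independent, and every reduce call sees only one key's values, both phases parallelize trivially across thousands of machines.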
Using Data
Big Data: Government Data ● 2012: Data.gov, 84 programs, six departments. ○ Benefit: Helping government address problems. ○ Tradeoff: Does the government have too much data on us?
Big Data: Web Analytics ● Analytics -- discovery and use of meaningful patterns in data. ○ Benefit: Provide customers with targeted ads. ○ Tradeoff: Loss of privacy and anonymity of web search.
Big Data: Data Mining ● Data Mining -- discovering patterns in large data sets. ○ Benefit: Discovering risk factors in medical data. ○ Tradeoff: Can we keep patient medical data secure? (Chart: normal vs. diabetic patients)
Data Visualization ● IBM chromogram of Wikipedia edits reveals known and new editing patterns.
Data Mining: Neonatal Monitoring ● Data mining real-time data (heart rate, respiratory rate, O₂ saturation) provides a non-invasive way of predicting neonatal health. ● Traditional approaches are invasive, e.g., blood drawn via a heel prick.
Big Data and Mobile Computing
Big Data and Mobile Google Translate: “Ciao mondo!” → “Hello world!” MapReduce (speech recognition) Benefit: Improves our ability to learn a foreign language. Tradeoff: Google knows what we’re thinking about.
Big Data and Mobile Google: Augmented reality MapReduce Benefit: Better awareness of what’s around us. Tradeoff: Google knows where we are and what we’re thinking.
Summary ● The digital era involves large data sets. ● Presents challenges and opportunities. ● Requires new processing and visualization techniques. ● Comes with the promise of benefits. ● Comes with tradeoffs in terms of privacy and security.