Data Mining Exploring Data Lecture Notes for Chapter

  • Slides: 47
Data Mining: Exploring Data
Lecture Notes for Chapter 3
Introduction to Data Mining by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 8/05/2005

What is data exploration?
A preliminary exploration of the data to better understand its characteristics.
  • Key motivations of data exploration include
    – Helping to select the right tool for preprocessing or analysis
    – Making use of humans’ abilities to recognize patterns
      • People can recognize patterns not captured by data analysis tools

Techniques Used In Data Exploration
  • In exploratory data analysis (EDA), as originally defined by Tukey
    – The focus was on visualization
    – Clustering and anomaly detection were viewed as exploratory techniques
  • In our discussion of data exploration, we focus on
    – Summary statistics
    – Visualization
    – Online Analytical Processing (OLAP)

Iris Sample Data Set
  • Many of the exploratory data techniques are illustrated with the Iris Plant data set.
    – Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
    – From the statistician R. A. Fisher
    – Three flower types (classes): Setosa, Virginica, Versicolour
    – Four (non-class) attributes: sepal width and length, petal width and length
Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.

Summary Statistics
  • Summary statistics are numbers that summarize properties of the data
    – Summarized properties include frequency, mean, and standard deviation
    – Most summary statistics can be calculated in a single pass through the data
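
The single-pass claim can be illustrated with a minimal sketch. It uses Welford's running-update method, which is one common choice and not something the slides name; the function name and the sample values are invented for illustration.

```python
import math

def single_pass_stats(values):
    """Return (count, mean, sample standard deviation) after one pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, std

# Hypothetical petal-length readings, invented for illustration.
sample = [1.4, 1.3, 1.5, 4.7, 4.5, 6.0]
count, mean, std = single_pass_stats(sample)
print(count, round(mean, 3), round(std, 3))
```

Each value is read once and only three running numbers are kept, so the same loop works on data too large to hold in memory.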

Frequency and Mode
  • The frequency of an attribute value is the percentage of the time the value occurs in the data set
    – For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
  • The mode of an attribute is the most frequent attribute value
  • The notions of frequency and mode are typically used with categorical data
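
As a small sketch of these two definitions, the following computes frequencies as fractions and picks the mode for a categorical attribute. The helper names and the label list are hypothetical.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the fraction of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical class labels, invented for illustration.
species = ["setosa", "virginica", "setosa", "versicolour", "setosa", "virginica"]
freq = frequencies(species)
print(freq)
print(mode(species))
```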

Percentiles
  • For continuous data, the notion of a percentile is more useful. Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value x_p of x such that p% of the observed values of x are less than x_p.
  • For instance, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%.
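
Several conventions exist for turning this definition into a computation on a finite sample. The sketch below uses the nearest-rank convention (the smallest observed value with at least p% of the observations at or below it), which is an assumption and not necessarily the convention the authors intend; it also excludes p = 0.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest observed value with at least p% of values at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the sorted list
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]
print(percentile_nearest_rank(data, 50))   # 35
print(percentile_nearest_rank(data, 90))   # 50
```

Interpolating conventions can return values that do not occur in the data, so results from different tools may differ slightly on small samples.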

Measures of Location: Mean and Median
  • The mean is the most common measure of the location of a set of points.
  • However, the mean is very sensitive to outliers.
  • Thus, the median or a trimmed mean is also commonly used.
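
The sensitivity to outliers is easy to see on a small example. In this sketch, `trimmed_mean` is a hypothetical helper that drops the same fraction of values from each end before averaging, and the data are invented.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping the lowest and highest trim_fraction of the sorted values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Hypothetical measurements with one outlier (90), invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 90]
print(statistics.mean(data))     # 13.5
print(statistics.median(data))   # 5.5
print(trimmed_mean(data, 0.1))   # 5.5
```

The single outlier pulls the mean well above every typical value, while the median and the 10% trimmed mean stay near the center of the bulk of the data.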

Measures of Spread: Range and Variance
  • Range is the difference between the max and min
  • The variance or standard deviation is the most common measure of the spread of a set of points.
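
A short sketch of these measures using Python's standard `statistics` module. The data are invented, and the slides do not say whether the variance divides by n or n - 1, so both variants are shown.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)            # difference between max and min
sample_variance = statistics.variance(data)    # divides by n - 1
sample_std = statistics.stdev(data)            # square root of the sample variance
population_std = statistics.pstdev(data)       # divides by n

print(value_range, sample_variance, sample_std, population_std)
```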

Visualization
  • Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
  • Visualization of data is one of the most powerful and appealing techniques for data exploration.
    – Humans have a well-developed ability to analyze large amounts of information that is presented visually
    – Can detect general patterns and trends
    – Can detect outliers and unusual patterns

Example: Sea Surface Temperature
  • The following shows the Sea Surface Temperature (SST) for July 1982
    – Tens of thousands of data points are summarized in a single figure

Representation
  • Representation is the mapping of information to a visual format
  • Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
  • Example:
    – Objects are often represented as points
    – Their attribute values can be represented as the position of the points

Visualization Techniques: Histograms
  • Histogram
    – Usually shows the distribution of values of a single variable
    – Divide the values into bins and show a bar plot of the number of objects in each bin.
    – The height of each bar indicates the number of objects
    – Shape of histogram depends on the number of bins
  • Example: Petal Width (10 and 20 bins, respectively)
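
The binning step behind a histogram can be sketched without any plotting library. The function below assumes equal-width bins (the slides do not specify the binning rule), and the petal widths, given in millimetres, are invented.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    counts = [0] * num_bins
    if high == low:
        counts[0] = len(values)  # all values identical: a single occupied bin
        return counts
    width = (high - low) / num_bins
    for v in values:
        # The maximum value goes into the last bin rather than a bin of its own.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical petal widths in millimetres, invented for illustration.
petal_width_mm = [2, 2, 4, 10, 13, 15, 18, 21, 23, 25]
print(histogram_counts(petal_width_mm, 2))   # [5, 5]
print(histogram_counts(petal_width_mm, 4))   # [3, 2, 2, 3]
```

With two bins the data look flat; with four bins a dip in the middle appears, which is the dependence on bin count that the slide points out.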

Two-Dimensional Histograms
  • Show the joint distribution of the values of two attributes
  • Example: petal width and petal length
    – What does this tell us?
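
A two-dimensional histogram is the same binning applied to pairs of values. This sketch assumes equal-width bins on both axes and uses invented petal measurements in millimetres; the function names are hypothetical.

```python
from collections import Counter

def bin_index(value, low, high, num_bins):
    """Index of the equal-width bin containing value; the maximum falls in the last bin."""
    if high == low:
        return 0
    return min(int((value - low) / (high - low) * num_bins), num_bins - 1)

def histogram_2d(xs, ys, num_bins):
    """Count objects per (x-bin, y-bin) cell of a num_bins x num_bins grid."""
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    cells = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, x_low, x_high, num_bins),
                bin_index(y, y_low, y_high, num_bins))
        cells[cell] += 1
    return cells

# Hypothetical petal measurements in millimetres, invented for illustration.
petal_width_mm = [2, 2, 4, 13, 15, 23, 25]
petal_length_mm = [14, 13, 15, 40, 42, 60, 61]
cells = histogram_2d(petal_width_mm, petal_length_mm, 3)
print(sorted(cells.items()))
```

In this toy data every object lands in a cell on the diagonal of the grid, which is what one would see when the two attributes increase together.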

Visualization Techniques: Box Plots
  • Box Plots
    – Invented by J. Tukey
    – Another way of displaying the distribution of data
    – The figure labels the basic parts of a box plot: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile
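
The numbers behind such a box plot can be sketched with the standard library. Two assumptions are made here: percentiles are computed with `statistics.quantiles` using its "inclusive" interpolation, and any point below the 10th or above the 90th percentile is reported as an outlier (other box plot variants use different whisker rules). The data are invented.

```python
import statistics

def box_plot_summary(values):
    """Return the percentiles marked on the box plot plus the points beyond the whiskers."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    summary = {p: cuts[p - 1] for p in (10, 25, 50, 75, 90)}
    outliers = [v for v in values if v < summary[10] or v > summary[90]]
    return summary, outliers

# Hypothetical measurements, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
summary, outliers = box_plot_summary(data)
print(summary)
print(outliers)
```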

Example of Box Plots
  • Box plots can be used to compare attributes

Visualization Techniques: Scatter Plots
  • Scatter plots
    – Attribute values determine the position
    – Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
    – Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
    – Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
      • See example on the next slide

Scatter Plot Array of Iris Attributes

Visualization Techniques: Contour Plots
  • Contour plots
    – Useful when a continuous attribute is measured on a spatial grid
    – They partition the plane into regions of similar values
    – The contour lines that form the boundaries of these regions connect points with equal values
    – The most common example is contour maps of elevation
    – Can also display temperature, rainfall, air pressure, etc.
      • An example for Sea Surface Temperature (SST) is provided on the next slide

Contour Plot Example: SST, Dec 1998 (Celsius)

Visualization Techniques: Matrix Plots
  • Matrix plots
    – Can plot the data matrix
    – This can be useful when objects are sorted according to class
    – Typically, the attributes are normalized to prevent one attribute from dominating the plot
    – Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
    – Examples of matrix plots are presented on the next two slides
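
The normalization step can be sketched as follows. It assumes z-score standardization (each attribute rescaled to mean 0 and standard deviation 1); the slides do not state which normalization is used, and the rows are invented.

```python
import statistics

def standardize_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical rows of (sepal length, petal width) in centimetres, invented for illustration.
rows = [[5.1, 0.2], [7.0, 1.4], [6.3, 2.5]]
scaled = standardize_columns(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After rescaling, both attributes vary over a comparable range, so neither one dominates the colors of a matrix plot.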

Visualization of the Iris Data Matrix (figure label: standard deviation)

Visualization of the Iris Correlation Matrix

Visualization Techniques: Parallel Coordinates
  • Parallel Coordinates
    – Used to plot the attribute values of high-dimensional data
    – Instead of using perpendicular axes, use a set of parallel axes
    – The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
    – Thus, each object is represented as a line
    – Often, the lines representing a distinct class of objects group together, at least for some attributes
    – Ordering of attributes is important in seeing such groupings
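
The geometry of a parallel coordinates plot can be sketched without drawing anything: each object becomes a list of (axis index, position on that axis) vertices. Scaling every attribute to the range 0 to 1 is an assumption made here so that the axes are comparable; the rows are invented.

```python
def parallel_coordinate_lines(rows):
    """Turn each object into a polyline of (axis_index, scaled_value) vertices."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    lines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # A constant attribute is drawn at the middle of its axis.
            scaled = (value - lows[axis]) / span if span else 0.5
            line.append((axis, scaled))
        lines.append(line)
    return lines

# Hypothetical objects with three attributes each, invented for illustration.
rows = [[5.0, 3.0, 1.0], [6.0, 2.0, 5.0], [7.0, 3.0, 3.0]]
for line in parallel_coordinate_lines(rows):
    print(line)
```

Reordering the columns of `rows` changes which axes sit next to each other, which is why the ordering of attributes affects how visible the groupings are.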

Parallel Coordinates Plots for Iris Data

Other Visualization Techniques
  • Star Plots
    – Similar approach to parallel coordinates, but axes radiate from a central point
    – The line connecting the values of an object is a polygon
  • Chernoff Faces
    – Approach created by Herman Chernoff
    – This approach associates each attribute with a characteristic of a face
    – The values of each attribute determine the appearance of the corresponding facial characteristic
    – Each object becomes a separate face
    – Relies on humans’ ability to distinguish faces

Star Plots for Iris Data (panels: Setosa, Versicolour, Virginica)

Chernoff Faces for Iris Data (panels: Setosa, Versicolour, Virginica)

Data Warehouse and OLAP

What is a Data Warehouse?
  • A decision support database that is maintained separately from the organization’s operational database
  • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)

Data Warehouse: Subject-Oriented
  • Organized around major subjects, such as customer, product, sales
  • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

Data Warehouse: Integrated
  • Constructed by integrating multiple, heterogeneous data sources
    – relational databases, flat files, on-line transaction records

Data Warehouse: Time Variant
  • The time horizon for the data warehouse is significantly longer than that of operational systems
    – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Data Warehouse: Nonvolatile
  • A physically separate store of data transformed from the operational environment
  • Operational update of data does not occur in the data warehouse environment
    – Requires only two operations in data accessing: initial loading of data and access of data

OLTP vs. OLAP

What is OLAP
http://openmultimedia.ie.edu/OpenProducts/Business_Intelligence/Business_Intelligence/index.html

Data Warehouse: A Multi-Tiered Architecture
Figure labels, by tier:
  • Data Sources: operational DBs, other sources
  • Extract, Transform, Load, Refresh
  • Data Storage: data warehouse, data marts
  • OLAP Engine: OLAP server (served by the data storage tier)
  • Front-End Tools: analysis, query, reports, data mining

Extraction, Transformation, and Loading (ETL)
  • Data extraction: get data from multiple, heterogeneous, and external sources
  • Data cleaning: detect errors in the data and rectify them when possible
  • Data transformation: convert data from legacy or host format to warehouse format
  • Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
  • Refresh: propagate the updates from the data sources to the warehouse
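
The first four stages can be sketched as a toy in-memory pipeline. Everything here is hypothetical: the source names, the field names, the cleaning rule (drop records whose amount is missing or not numeric), and the dictionary that stands in for a warehouse. A real ETL tool would do far more, and the refresh stage is not shown.

```python
def extract(sources):
    """Gather raw records from several heterogeneous sources."""
    for source_name, records in sources.items():
        for record in records:
            yield source_name, record

def clean(tagged_records):
    """Drop records whose amount is missing or not numeric."""
    for source_name, record in tagged_records:
        raw = record.get("amount", record.get("amt"))
        try:
            amount = float(raw)
        except (TypeError, ValueError):
            continue
        yield source_name, record, amount

def transform(cleaned):
    """Convert each source's field names to one warehouse format."""
    for source_name, record, amount in cleaned:
        yield {
            "source": source_name,
            "item": record.get("item", record.get("product")),
            "amount": amount,
        }

def load(rows):
    """Store the rows and build a per-item summary."""
    warehouse = {"facts": [], "total_by_item": {}}
    for row in rows:
        warehouse["facts"].append(row)
        totals = warehouse["total_by_item"]
        totals[row["item"]] = totals.get(row["item"], 0.0) + row["amount"]
    return warehouse

# Two hypothetical sources with different field names and some bad values.
sources = {
    "pos_system": [{"item": "TV", "amount": "300"}, {"item": "PC", "amount": None}],
    "legacy_file": [{"product": "TV", "amt": 250.0}, {"product": "VCR", "amt": "n/a"}],
}
warehouse = load(transform(clean(extract(sources))))
print(warehouse["total_by_item"])
```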

From Tables to Data Cubes
  • A data warehouse is based on a multidimensional data model which views data in the form of a data cube
  • A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
    – Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
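
One way to picture the step from a table to a cube is to aggregate a measure for every combination of dimension values, including an "all values" marker per dimension. The sketch below is a minimal illustration under that assumption; the fact rows, the dimension names, and the `ALL` marker are invented.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker meaning "summed over every value of this dimension"

def build_cube(facts, dimensions, measure):
    """Aggregate the measure for every combination of concrete value or ALL per dimension."""
    cube = defaultdict(float)
    for fact in facts:
        choices = [(fact[d], ALL) for d in dimensions]
        for key in product(*choices):
            cube[key] += fact[measure]
    return dict(cube)

# Hypothetical sales facts, invented for illustration.
facts = [
    {"product": "TV", "quarter": "Q1", "country": "U.S.A.", "sales": 100.0},
    {"product": "TV", "quarter": "Q2", "country": "U.S.A.", "sales": 150.0},
    {"product": "PC", "quarter": "Q1", "country": "Canada", "sales": 200.0},
    {"product": "VCR", "quarter": "Q1", "country": "Mexico", "sales": 50.0},
]
cube = build_cube(facts, ["product", "quarter", "country"], "sales")
print(cube[("TV", ALL, "U.S.A.")])   # 250.0
print(cube[(ALL, ALL, ALL)])         # 500.0
```

Precomputing every combination grows quickly with the number of dimensions, so this only shows the idea, not how an OLAP server would store a cube.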

View of Warehouses and Hierarchies

SQL Server Analysis Services OLAP Operations
http://www.youtube.com/watch?v=ctUiHZHr-5M

A Sample Data Cube
Figure: a sales cube with three dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). One labeled cell is the total annual sales of TVs in U.S.A.

Typical OLAP Operations
  • Roll up (drill-up): summarize data
    – by climbing up hierarchy or by dimension reduction
  • Drill down (roll down): reverse of roll-up
    – from higher level summary to lower level summary or detailed data, or introducing new dimensions
  • Slice and dice: project and select
  • Pivot (rotate):
    – reorient the cube, visualization, 3D to series of 2D planes
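
These operations can be sketched on a tiny cube stored as a dictionary keyed by (product, quarter, country). The cell values and function names are invented. Roll-up is shown only as dimension reduction, not as climbing a hierarchy, and drill-down is not shown because it needs the more detailed data that roll-up discards.

```python
from collections import defaultdict

# Hypothetical cells keyed by (product, quarter, country); values are sales.
cells = {
    ("TV", "Q1", "U.S.A."): 100,
    ("TV", "Q2", "U.S.A."): 150,
    ("PC", "Q1", "Canada"): 200,
    ("PC", "Q2", "U.S.A."): 80,
    ("VCR", "Q1", "Mexico"): 50,
}
DIMS = ("product", "quarter", "country")

def roll_up(cells, dims, drop):
    """Summarize by removing one dimension (dimension reduction)."""
    keep = [i for i, d in enumerate(dims) if d != drop]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

def slice_cube(cells, dims, dim, value):
    """Select the cells where one dimension equals a single value."""
    i = dims.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, dims, allowed):
    """Select a sub-cube; allowed maps a dimension name to its permitted values."""
    return {
        k: v for k, v in cells.items()
        if all(k[dims.index(d)] in vals for d, vals in allowed.items())
    }

def pivot(cells, dims, new_order):
    """Reorient the cube by reordering its dimensions."""
    order = [dims.index(d) for d in new_order]
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

print(roll_up(cells, DIMS, "quarter"))
print(slice_cube(cells, DIMS, "quarter", "Q1"))
print(dice(cells, DIMS, {"product": {"TV", "PC"}, "country": {"U.S.A."}}))
print(pivot(cells, DIMS, ("country", "product", "quarter")))
```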

Fig. 3.10 Typical OLAP Operations

Browsing a Data Cube