Data
CSLU 2850. Lo 1, Spring 2008
Cameron McInally, mcinally@fordham.edu
Fordham University
May contain work from the Creative Commons.

• Data
  § Factual information.
  § Numerical or other information represented in a form suitable for processing by a computer.
  § Values derived from scientific experiments.

• Random Event
  § Something that happens, with a random outcome.
  § Example: if we have a normal deck of cards, B, and we turn the top card over, what is the probability that it is an Ace, A?
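The card example can be worked out by counting outcomes. The sketch below assumes that a "normal deck" means a well-shuffled 52-card deck with no jokers; the variable names are illustrative and not part of the slides.

```python
from fractions import Fraction

# Build a standard 52-card deck: 13 ranks in each of 4 suits (assumption: no jokers).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Favorable outcomes: the top card is an Ace.
aces = [card for card in deck if card[0] == "A"]

# Every card is equally likely to be on top of a shuffled deck,
# so the probability is (favorable outcomes) / (all outcomes).
p_ace = Fraction(len(aces), len(deck))
print(p_ace)  # 1/13
```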

• Variable
  § A characteristic of an object or event.
  § There are two types of variables:
    § Quantitative: values are measurable numbers.
    § Qualitative: values fall into categories.

• Quantitative Variable
  § Discrete Variables
    § Can only take a finite (or countable) number of possible values.
  § Continuous Variables
    § Can take on any value within a range, so there are infinitely many possible values.

• Qualitative Variable
  § Ordinal Variables
    § Categories that can be assigned some natural order.
    § E.g. ranking a service [not satisfied, very satisfied].
  § Nominal Variables
    § Categories that cannot be put into a natural order.
    § E.g. gender.
  § All qualitative variables are discrete.

• Data
  § Scalar
    § Formal Definition: an atomic quantity that only has magnitude.
    § Our Definition: a variable (or field) that can only hold one value at a time.
    § Can also be called a cell.

• Organizing Data
  § There are many different structures to hold data. We will focus on these two:
    1. Vector
       § A chain of scalar values.
       § Can also be called a row, column, 1-dimensional array, or list.
    2. Matrix
       § A vector of vectors.
       § Can also be called a table, spreadsheet*, or worksheet*.

• Vectors
  § Also called a 1-dimensional array.
  § A group of elements accessed by a value.
  § Example:

    Position   0  1  2  3  4  5
    Value      4  3  8  5  7  3

• Vectors
  § What value is in position x2? = 8
  § What value is in position x0? = 4
  § What value is in position x5? = 3

    Position   0  1  2  3  4  5
    Value      4  3  8  5  7  3
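The same lookups can be tried with a Python list, which is also indexed from position 0. This is only a sketch; the name `x` is borrowed from the slide's notation.

```python
# The example vector from the slide, stored as a Python list.
x = [4, 3, 8, 5, 7, 3]

# Square brackets access an element by its position (starting at 0).
print(x[2])  # 8
print(x[0])  # 4
print(x[5])  # 3
```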

• Matrices
  § Also called a 2-dimensional array or spreadsheet.
  § A group of elements accessed by 2 values.
  § Example:

    Position   0  1  2  3  4  5
       0       4  3  8  5  7  3
       1       3  2  9  3  2  8
       2       4  2  8  8  3  2
       3       2  8  6  8  2  3

• Matrices (RC-Convention: A(row, col))
  § What value is in position A(2, 3)? = 8
  § What value is in position A(3, 4)? = 2
  § What value is in position A(0, 0)? = 4

    Position   0  1  2  3  4  5
       0       4  3  8  5  7  3
       1       3  2  9  3  2  8
       2       4  2  8  8  3  2
       3       2  8  6  8  2  3
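A matrix as a "vector of vectors" maps directly onto a list of lists. The sketch below reads the slide's table as 4 rows by 6 columns (the layout under which all three answers on the slide hold) and uses the row-first, column-second convention.

```python
# Each inner list is one row of the matrix; the outer list holds the rows.
A = [
    [4, 3, 8, 5, 7, 3],  # row 0
    [3, 2, 9, 3, 2, 8],  # row 1
    [4, 2, 8, 8, 3, 2],  # row 2
    [2, 8, 6, 8, 2, 3],  # row 3
]

# RC-convention: the first index picks the row, the second picks the column.
print(A[2][3])  # 8
print(A[3][4])  # 2
print(A[0][0])  # 4
```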

• Matrices (Excel-Convention)
  § What value is in position B0? = 3
  § What value is in position D3? = 8
  § What value is in position C1? = 9

    Position   A  B  C  D  E  F
       0       4  3  8  5  7  3
       1       3  2  9  3  2  8
       2       4  2  8  8  3  2
       3       2  8  6  8  2  3
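The Excel-style reference names the column (a letter) first and the row (a number) second, the reverse of the RC-convention. A minimal sketch of the translation follows; `cell` is a hypothetical helper, it handles single-letter columns only, and it keeps the slide's rows numbered from 0 (real spreadsheets number rows from 1).

```python
table = [
    [4, 3, 8, 5, 7, 3],  # row 0
    [3, 2, 9, 3, 2, 8],  # row 1
    [4, 2, 8, 8, 3, 2],  # row 2
    [2, 8, 6, 8, 2, 3],  # row 3
]

def cell(table, ref):
    """Look up a reference such as "C1": column letter first, then row number."""
    col = ord(ref[0].upper()) - ord("A")  # "A" -> 0, "B" -> 1, ...
    row = int(ref[1:])                    # rows start at 0, as on the slide
    return table[row][col]

print(cell(table, "B0"))  # 3
print(cell(table, "D3"))  # 8
print(cell(table, "C1"))  # 9
```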

• Descriptive Statistics
  § A branch of Statistics focused on summarizing data.
  § Charts
    § XY (Scatter) Plots
    § Bubble Plots
    § Histograms and Bar Charts
    § Box Plots
  § Summary Statistics
    § Mean
    § Mode
    § Median
    § Standard Deviation
    § Range

• Presenting Data
  § Cartesian Plane
    § Two values represent a data point.
    § One value plots against the x-axis (abscissa).
    § One value plots against the y-axis (ordinate).

• Presenting Data
  § Scatter Plot
    § Data is represented as points on a Cartesian coordinate system.
    § A unit of data consists of two values. One is plotted against the x-axis and the other is plotted against the y-axis.

• Presenting Data
  § Bubble Plot
    § Used to show three dimensions of data on a Cartesian coordinate system.
    § A unit of data consists of three values. One is plotted against the x-axis, another is plotted against the y-axis, and the last is the size of the marker.

• Presenting Data
  § Bar Charts
    § The height of the bar represents the value.
    § Bars can be oriented horizontally or vertically.

• Presenting Data
  § Histograms
    § Visual representation of a frequency distribution.
    § The area of each bar represents the frequency of the values it covers.
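A frequency distribution is just a count of how often each value occurs. The sketch below builds one for a small illustrative data set (not from the slides) and prints a rough text histogram. With bars of equal width, as here, bar height and bar area are proportional.

```python
from collections import Counter

# Illustrative data: the vector used elsewhere in this lecture.
values = [4, 3, 8, 5, 7, 3]

# Count how many times each distinct value occurs.
frequencies = Counter(values)

# Print one bar per value; the bar length is the frequency.
for value in sorted(frequencies):
    print(value, "#" * frequencies[value])
```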

• Distribution
  § Describes how observations are spread out over a range of values.
  § The values must cover all possible values of the event.
  § Distributions can be continuous or discrete.

• Normal Distribution Example
  § Suppose the CS 2850 final grades fall into this distribution. This means:
    § Most of the students receive a grade in the middle (i.e. C to B+).
    § Few students fail (i.e. C- and below).
    § Few students get top scores (i.e. A- and A).

• Uniform Distribution Example
  § Suppose the CS 2850 final grades fall into this distribution. This means:
    § Students are equally likely to receive any grade (i.e. F through A).

• Shapes of Distributions
  § Skewness
    § A frequency distribution is said to be skewed when its mean and median are different.
    § When a distribution is skewed, one tail has more values than expected.
    § If the distribution is not skewed, then it is called symmetric.
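Comparing the mean and the median gives a quick, informal check for skew. The two data sets below are made up for illustration: one symmetric, one with a single large value stretching the right tail.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value pulls the mean upward

# Symmetric data: mean and median agree.
print(mean(symmetric), median(symmetric))        # 3 3

# Skewed data: the mean moves toward the long tail, the median does not.
print(mean(right_skewed), median(right_skewed))  # 12 3
```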

• Shapes of Distributions
  § Kurtosis
    § The kurtosis of a frequency distribution is the concentration of scores at the mean.
    § Graphically, kurtosis is how peaked the distribution is.

• Landmark Summaries
  § pth percentile
    § Approximately p% of the values fall below the pth percentile.
    § Conversely, approximately (100-p)% of the values are above the pth percentile.
  § Quartiles
    § Values located at the 25th, 50th, and 75th percentiles.
    § These are often called the first, second, and third quartiles, respectively.
    § Can also be defined as the subsets of all the data points that fall within each part.
  § Interquartile range
    § The difference between the first and third quartiles.
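Quartiles and the interquartile range can be computed with Python's standard library. This is a sketch on illustrative data; note that several interpolation rules for percentiles are in common use, so other tools (or the course's statistical package) may report slightly different quartiles. The choice of `method="inclusive"` here is an assumption, not something the slides specify.

```python
from statistics import quantiles

data = [4, 3, 8, 5, 7, 3]

# n=4 splits the data into four parts and returns the three cut points:
# the first, second (median), and third quartiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Interquartile range: the difference between the third and first quartiles.
iqr = q3 - q1

print(q1, q2, q3)  # 3.25 4.5 6.5
print(iqr)         # 3.25
```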

• Mean
  § The average of all the values.
  § I.e. the sum of the values divided by the number of values.
  § Represented as x̄ = (x1 + x2 + ... + xn) / n.

• Mean
  § Trimmed mean
    § Outliers are excluded from this average, since they can greatly affect the outcome.
  § Geometric mean
    § The n numbers are multiplied together and then the nth root of the product is taken.
    § Think about this geometrically! The geometric mean, for two values, is the side length of the square with area equal to that of the rectangle with side lengths x1 and x2.
    § How would you find the geometric mean for 3 values?
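The geometric mean definition translates into a few lines of Python. This is a minimal sketch with made-up inputs; `geometric_mean` is written out by hand here to mirror the slide's wording (multiply, then take the nth root). One way to extend the geometric picture to three values is the side length of a cube whose volume equals that of a box with sides x1, x2, and x3.

```python
import math

def geometric_mean(values):
    """Multiply the n values together, then take the nth root of the product."""
    n = len(values)
    return math.prod(values) ** (1 / n)

# Two values: a 2-by-8 rectangle has area 16, the same as a 4-by-4 square.
print(geometric_mean([2, 8]))  # 4.0

# Three values: a 1-by-3-by-9 box has volume 27, the same as a cube of side 3
# (the printed result may differ from 3 by floating-point rounding).
print(geometric_mean([1, 3, 9]))
```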

• Mean
  § Harmonic mean
    § Used to find the average of rates (e.g. the average velocity of a car on a trip).
    § The number of values divided by the sum of the reciprocals of the values.
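A short worked example shows why the harmonic mean suits rates. The trip below is hypothetical: a car covers two legs of equal distance (60 km each), the first at 30 km/h and the second at 60 km/h. The equal-distance assumption matters; with equal time per leg, the ordinary mean would be the right average instead.

```python
import math
from statistics import harmonic_mean

speeds = [30, 60]   # km/h on each leg (hypothetical trip)
leg_distance = 60   # km per leg; both legs are the same length

# Direct calculation: total distance divided by total time.
total_time = sum(leg_distance / s for s in speeds)       # 2 h + 1 h
average_speed = (leg_distance * len(speeds)) / total_time

# Harmonic mean: number of values / sum of their reciprocals.
by_definition = len(speeds) / sum(1 / s for s in speeds)

print(average_speed)          # 40.0
print(harmonic_mean(speeds))  # 40.0
```

The arithmetic mean of the two speeds would be 45 km/h, which overstates the true average because the car spends more time at the slower speed.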

• Working with Data
  § Median
    § The value at the 50th percentile.
    § The value that separates the higher ½ of the values from the lower ½.
    § If we have an even number of values, the median is not unique. We often take the average of the two middle values.
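The median rule, including the even-count case, can be sketched as follows. The function is written by hand for illustration (Python's `statistics.median` behaves the same way), and the inputs are illustrative.

```python
def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle pair

print(median([4, 3, 8, 5, 7]))     # 5   (sorted: 3 4 5 7 8)
print(median([4, 3, 8, 5, 7, 3]))  # 4.5 (sorted: 3 3 4 5 7 8)
```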

• Working with Data
  § Mode
    § Most frequently occurring value in a distribution.
    § If no value occurs more than once, there is no mode.
    § If there are exactly two values that occur with the same, highest frequency, we say the data is bimodal.
    § If there are two or more values that occur with the same, highest frequency, we say the data is multimodal.
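The mode rules above can be sketched with a frequency count. `modes` is a hypothetical helper that follows the slide's convention of reporting no mode when every value occurs only once (Python's built-in `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([4, 3, 8, 5, 7, 3]))  # [3]
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  (bimodal)
print(modes([1, 2, 3]))           # []      (no mode)
```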

• Working with Data
  § Range
    § The difference between the maximum and minimum values in a distribution.

• Working with Data
  § Deviation
    § The difference of a value from the average.
  § Variance
    § Square each deviation (why?) and then sum them together.
    § Divide by n-1 (why not n?).
  § (Sample) Standard Deviation
    § Measures variability.
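The steps above can be sketched directly in code. The data is illustrative, and the function names are not from the slides. The sample standard deviation is the square root of the sample variance, which puts it back in the same units as the data. The sketch also hints at the first "why?": raw deviations always sum to zero, so they must be squared before they can measure spread.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    return sum(d * d for d in deviations) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [4, 3, 8, 5, 7, 3]
mean = sum(data) / len(data)

print(sum(x - mean for x in data))  # 0.0: positive and negative deviations cancel
print(sample_variance(data))        # 4.4
print(sample_std(data))
```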

§ Degrees of Freedom
  § So, when calculating Variance, why do we divide by n-1, and not by n?
    § Let's say we have n values that must add up to x.
    § We are free to pick n-1 values at will.
    § The nth number must be chosen so that all the numbers sum to x.
    § So, we say there are n-1 degrees of freedom.
  § Dividing by n-1 in the sample standard deviation corrects for bias. The deviations are measured from the sample mean rather than the true population mean, so dividing by n would usually give a value smaller than the population standard deviation.
  § This is often difficult to understand and even more difficult to describe in a high-level way. Try your best!
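One way to see the effect of the n-1 divisor is to enumerate every possible sample from a tiny population and average the results. The sketch below uses a made-up population {1, 2, 3} and all ordered samples of size 2 drawn with replacement; exact fractions avoid rounding noise. It is an illustration of the bias, not a proof.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
mu = Fraction(sum(population), len(population))
# True population variance: average squared deviation from the population mean.
population_variance = sum((x - mu) ** 2 for x in population) / len(population)

def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over the given divisor."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))  # all 9 ordered samples of size 2

avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance)  # 2/3
print(avg_div_n)            # 1/3: dividing by n underestimates on average
print(avg_div_n_minus_1)    # 2/3: dividing by n-1 matches the population variance
```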

• Outliers
  § A value that is unusually numerically distant from the remaining values.
  § Outliers should be examined! These may suggest an error in how the sample was drawn.

• Outliers
  § Here are some common definitions:
    § Moderate (Mild) Outlier
    § Extreme Outlier
  § Where Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range.
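The slide's formulas are not reproduced in this text. A widely used convention, assumed here and possibly different from what the slide showed, treats a value as a mild outlier if it lies more than 1.5 × IQR below Q1 or above Q3, and as an extreme outlier if it lies more than 3 × IQR beyond them. The sketch below applies that convention to made-up data; `classify_outliers` is a hypothetical helper, and the quartile method is also an assumption.

```python
from statistics import quantiles

def classify_outliers(values):
    """Split values into (mild, extreme) outliers using 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append(x)   # beyond the outer fences
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append(x)      # beyond the inner fences only
    return mild, extreme

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
mild, extreme = classify_outliers(data)
print(mild)     # [20]
print(extreme)  # [40]
```

For this data Q1 = 3.5, Q3 = 8.5, and IQR = 5, so the upper fences sit at 16 and 23.5.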

• Box Plots
  § May indicate which values are outliers.
  § In one image, Box Plots show the:
    § First and third quartiles
    § Median
    § Interquartile range
    § Minimum and maximum values
    § Moderate and extreme outliers
  § We will learn how to calculate these in the Statistical Package Project 1.

Homework (Always Due in One Week)
• Skim Chapters 1-3. Read Chapter 4.
• Complete Chapter 4, pg. 161: 1(a-e), 2, 4, 5, 6 and 7.