Statistical Methods in Computer Science
Data 1: Frequency Distributions
Ido Dagan
Statistical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan

Concrete Theory: Relates Variables to Each Other
  Examples:
  • Mathematically accurate
      • Memory = 2*sizeof(input) + 3
      • Runtime = 500 + 30*sizeof(input) + 20
  • Asymptotically correct
      • Memory = O(sizeof(input)) in the worst case
      • Runtime = O(log(sizeof(input))) in the best case
  • Accuracy is proportional to run-time
  • Qualitative
      • User performance increases with reduced cognitive load
      • The number of bugs discovered is monotonically decreasing, but positive, if the same programmer is used; otherwise, it increases

Behavior Parameters/Variables (typical of Computer Science)
  • Hardware parameters
      • CPU model and organization, cache organization, latencies in the system
  • System parameters
      • Memory availability, usage
      • CPU running time (sometimes approximated by wall-clock time)
      • Communication bandwidth, usage
  • Program characteristics
      • Requires floating-point, heavy disk usage, integer math, graphics
      • Large heap, large stack, uses non-local information, ...

Additional Behavior Variables
  • Algorithm parameters
      • Algorithm choice, correctness/accuracy of results
      • Performance curves (accuracy vs. run-time)
      • Size of input
      • Worst case, best case, average case (!!)
  • Other
      • Development person-hours
      • User (programmer) satisfaction, productivity
      • Lines of code, number of components, ...
      • Robotics: speed of movement, accuracy of positioning
      • Learning: precision and recall

Scales of Measurements
  • Nominal (also called categorical): No order, just labels
      • e.g., “Algorithm Name”
  • Ordinal (also called rank): Order, but not numerical
      • Difference between ranks is not necessarily the same
      • e.g., ranks in a (hierarchical/military) organization
  • Interval: Difference between values has the same meaning everywhere
      • e.g., temperature in Celsius (a rise of 10 degrees is the same everywhere)
      • But 100 C is not twice as hot as 50 C, and 0 C is not lack of heat
  • Ratio: Interval + fixed zero point
      • e.g., temperature in Kelvin, robot position, memory usage, run-time

Scale Hierarchy
  • Nominal < Ordinal < Interval < Ratio (Interval and Ratio are the “numerical” scales)
  • Propositions that are true for some level are true above it, but not necessarily the other way around
      • e.g., we can calculate the most frequent value for all variables
      • e.g., we can calculate the mean (average) value for numerical variables, but not for nominal and ordinal ones
  • http://en.wikipedia.org/wiki/Levels_of_measurement
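As a small illustration of the hierarchy, the sketch below uses Python's standard statistics module: the most frequent value is computed for both a nominal and a ratio-scale variable, while the mean is computed only for the ratio-scale one. The data and variable names are made up for illustration and do not come from the slides.

```python
from statistics import mean, mode

# Hypothetical measurements (not from the slides).
algorithm_names = ["quicksort", "mergesort", "quicksort", "heapsort"]  # nominal
run_times_ms = [12.0, 15.0, 12.0, 21.0]                                # ratio

# The most frequent value is defined at every scale level.
most_common_algorithm = mode(algorithm_names)
most_common_run_time = mode(run_times_ms)

# The mean is meaningful only for numerical (interval/ratio) variables.
average_run_time = mean(run_times_ms)

print(most_common_algorithm, most_common_run_time, average_run_time)
```

Nothing in the language stops a mean from being computed over numeric codes assigned to nominal labels; the restriction comes from the scale of measurement, not from the data type.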

Variables
  • Discrete: Can take on only certain values: symbols, exact numbers
      • For ordinal, interval and ratio scales, this means there will be gaps
      • e.g., user satisfaction surveys, memory usage
  • Continuous: Can take on any value within its range: no gaps
      • e.g., run-time, CPU temperature, robot velocity and position
      • In practice: limited by measurement accuracy
      • Up to the researcher to determine the needed accuracy and approximate carefully

Data
  • The collection of values that a variable X took during the measurement

Describing Data
  • Our task: Describe the data we have collected
      • Find ways to characterize it, represent it
      • Find properties that are true of the data
      • So that we can relate the values to those of other variables

Frequency Distribution
  • Examine the frequency of values
  • f(x) = number of times the variable took on value x
  • Convention (ordinal/numerical): sort by value
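A minimal sketch of building a frequency distribution, assuming a small list of made-up scores (the table from the slide is not available here). It counts f(x) with collections.Counter and then sorts by value, following the convention for ordinal/numerical data.

```python
from collections import Counter

# Hypothetical scores (not taken from the slides).
scores = [85, 90, 85, 70, 90, 85, 100, 70]

# f(x) = number of times the variable took on value x.
f = Counter(scores)

# Convention for ordinal/numerical data: list the values in sorted order.
frequency_table = sorted(f.items())

for value, count in frequency_table:
    print(value, count)
```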

Grouped Frequency Distributions
  • In ordinal/numerical variables, it is possible to group values together
  • Create Grouped Frequency Distributions
  • Warning: Loss of Information
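Grouping can be sketched as follows, assuming integer scores and equal-width intervals aligned to multiples of the width. The function name grouped_frequencies and the data are hypothetical choices made for this example.

```python
from collections import Counter

def grouped_frequencies(values, width):
    """Count integer values per interval [k*width, k*width + width - 1]."""
    counts = Counter((v // width) * width for v in values)
    return [((low, low + width - 1), counts[low]) for low in sorted(counts)]

# Hypothetical scores (not taken from the slides).
scores = [85, 90, 86, 70, 93, 88, 100, 72]
table = grouped_frequencies(scores, width=5)

for (low, high), count in table:
    print(f"{low}-{high}: {count}")
```

The loss of information is visible in the result: the scores 85, 86 and 88 all land in the interval 85-89 and can no longer be told apart.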

Real and Apparent Limits
  • Continuous values are more difficult to divide into intervals
      • A score of 95 falls within 95-99, not within 90-94
      • But what about a temperature of 94.87? 94 < 94.87 < 95!
  • By convention, the real limits of a score are within ½ the measurement resolution
      • If our resolution is 0.1, then the limits are within 0.05
      • If our resolution is 100, then the limits are within 50
  • We break the convention only in exceptional cases
      • e.g., age: “I am 35” is true of 35.0..36.0 (not including 36)

Real/Apparent Limits
  • For example, with a resolution of 0.01, the interval 95..99 really covers values 94.995 to 99.005
      • Apparent limits: 95..99
      • Real limits: 94.995 to 99.005
  • With a resolution of 10, the interval 740-800 really covers values 735 to 805
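The convention can be written as a one-line rule: widen each apparent limit by half the measurement resolution. The helper name real_limits is an assumption made for this sketch.

```python
def real_limits(apparent_low, apparent_high, resolution):
    """Widen the apparent interval limits by half the measurement resolution."""
    half = resolution / 2
    return apparent_low - half, apparent_high + half

# The two examples from the slide.
fine = real_limits(95, 99, 0.01)     # approximately (94.995, 99.005)
coarse = real_limits(740, 800, 10)   # (735.0, 805.0)

print(fine, coarse)
```

The first result is subject to ordinary floating-point rounding, so it should be compared with a tolerance rather than for exact equality.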

Relative Frequency Distributions
  • A frequency count can be misleading
      • Algorithm X was fastest on 60,000 trials: Is this good?
      • 100,000 people voted for candidate A: Is she the winner?
  • We need a way to compare values, i.e., relate them to each other
  • Relative frequency distributions: translate f into a proportion or a percentage
      • rel f (proportion) = f/N
      • rel f (%) = 100 * f/N
  • Warning: can be misleading if the magnitude of the counts is ignored
      • 50% of all test cases succeeded (with only two cases…)
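A short sketch of the two formulas, using invented vote counts for three hypothetical candidates. Only the figure of 100,000 votes for candidate A echoes the slide; the other counts and the function name are assumptions.

```python
def relative_frequencies(freq):
    """Map each value's count f to (f/N, 100*f/N), where N is the total count."""
    n = sum(freq.values())
    return {x: (f / n, 100 * f / n) for x, f in freq.items()}

# Hypothetical vote counts; N = 500,000.
votes = {"A": 100_000, "B": 300_000, "C": 100_000}
rel = relative_frequencies(votes)

for candidate, (proportion, percent) in rel.items():
    print(candidate, proportion, percent)
```

With these made-up totals, 100,000 votes amount to only 20% of the ballots, so the raw count alone says nothing about who won.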

Relative Frequency Distributions
  • Example: f/N

Cumulative Frequency Distribution
  • For ordinal/numerical variables
  • Shows where values stand with respect to others: how many are below or above
  • Cumulative frequency distribution

Cumulative Frequency Distribution
  • Based on the cumulative distribution, we can answer questions such as:
      • What percentage of scores fall below 80?
      • How many scores are below 95?
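Cumulative counts are running sums over the sorted frequency table. The sketch below assumes a small made-up grouped table (the one on the slide is not reproduced) and uses itertools.accumulate.

```python
from itertools import accumulate

# Hypothetical grouped table: (apparent interval, frequency f), sorted by value.
table = [((70, 74), 2), ((75, 79), 5), ((80, 84), 8), ((85, 89), 4), ((90, 94), 1)]

freqs = [f for _, f in table]
n = sum(freqs)
cum_f = list(accumulate(freqs))          # scores at or below each interval
cum_pct = [100 * c / n for c in cum_f]   # the same, as a percentage of N

for ((low, high), f), c, pct in zip(table, cum_f, cum_pct):
    print(f"{low}-{high}: f={f} cum f={c} cum %={pct}")
```

In this invented table, the percentage of scores that fall below 80 is read from the row for 75-79, which gives 35%.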

Percentiles, Percentile Ranks
  • Percentile X: the value below which X percent of the values fall
      • e.g., baby height
      • We use P_X to denote the Xth percentile, e.g., P98 is in the range 90-94
  • Percentile rank of X: the percentage of values that fall below X
      • e.g., the percentile rank of the interval 65-69 is 12

Computing Percentiles, P. Ranks
  • How do we compute percentiles and percentile ranks from grouped data?
  • What is the score which defines the top 20% of scores?
      • Is it between 84 and 85?

Computing Percentiles
  • We want to compute P80
  • 80% of 50 cases = 40 cases
  • We look under the cum f heading: 32 of the 40 scores are less than 84.5 (real limit). We need 8 more.
  • The interval 85-89 contains 47 - 32 = 15 cases
  • These are spread over a width of 5 (= 89.5 - 84.5), starting at the real limit 84.5
  • Assume scores are evenly distributed within the interval
  • 8 more cases ==> 8/15 * 5 = 2.67 (linear interpolation)
  • P80 = 84.5 + 2.67 = 87.17
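The interpolation steps above can be sketched as a small function. The numbers passed in (N = 50, 32 cases below the real limit 84.5, 15 cases in the interval, width 5) are the ones from the slide; the function and parameter names are assumptions made for this example.

```python
def percentile_from_grouped(p, n, lower_real_limit, width, cum_below, f_interval):
    """Linear interpolation inside the interval that contains the p-th percentile."""
    target = p * n / 100            # number of cases that must lie below P_p
    needed = target - cum_below     # cases still needed from this interval
    return lower_real_limit + needed / f_interval * width

p80 = percentile_from_grouped(p=80, n=50, lower_real_limit=84.5, width=5,
                              cum_below=32, f_interval=15)
print(round(p80, 2))  # 87.17
```

The function assumes the caller has already located the interval containing the percentile by scanning the cum f column, and that scores are spread evenly within that interval.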

Computing Percentile Ranks
  • We want to compute the percentile rank of 86
  • It lies in the interval 85-89, real limits 84.5 to 89.5
  • 86 - 84.5 = 1.5 score points; width of interval = 5
  • 1.5/5 = 0.3 ==> 30% of the scores in the interval lie below 86 (0.3 * 15 = 4.5)
  • So we have 32 scores up to 84.5, plus 4.5 scores from 84.5 to 86
  • Total: 4.5 + 32 = 36.5 scores
  • 36.5/50 = 73%. This is the percentile rank of 86.
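The percentile-rank computation is the same interpolation run in the opposite direction. This sketch reuses the slide's numbers (N = 50, 32 cases below 84.5, 15 cases in the interval 85-89); the function and parameter names are assumptions.

```python
def percentile_rank_from_grouped(x, n, lower_real_limit, width, cum_below, f_interval):
    """Percentage of cases below x, interpolating linearly inside x's interval."""
    fraction = (x - lower_real_limit) / width        # share of the interval below x
    cases_below = cum_below + fraction * f_interval  # e.g. 32 + 0.3 * 15
    return 100 * cases_below / n

rank = percentile_rank_from_grouped(x=86, n=50, lower_real_limit=84.5, width=5,
                                    cum_below=32, f_interval=15)
print(round(rank, 1))  # 73.0
```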

Frequency Distributions and Scales

Displaying Frequency Distributions: Nominal Data

Displaying Frequency Distributions: Ordinal/Numerical Data
  • Histogram

Displaying Frequency Distributions: Ordinal/Numerical Data
  • Histogram: Different Grouping

Lying with Visuals

Characteristics of Distributions
  • Shape, Central Tendency, Variability
  • Figure panels: Different Central Tendency, Different Variability