Inferential Spatial Statistics: Introduction to Concepts

Briggs, Henan University, 2010

[Diagram: from a sample we infer about the population.]

Today:
• Review standard statistical inference.
• Examine the concept of spatial randomness.
• Define a random point pattern.

Next time:
• Using inferential spatial statistics to analyze point patterns

Spatial Analysis: successive levels of sophistication
1. Spatial data description: classic GIS capabilities
  – Spatial queries and measurement
  – Buffering, map layer overlay
2. Exploratory Spatial Data Analysis (ESDA)
  – Searching for patterns and possible explanations
  – GeoVisualization through data graphing and mapping
  – Descriptive spatial statistics
3. Spatial statistical analysis and hypothesis testing
  – Are the data “to be expected” or are they “unexpected” relative to some statistical model, usually of a random process?
4. Spatial modeling or prediction
  – Constructing models (of processes) to predict spatial outcomes (patterns)

Descriptive & Inferential Statistical Analysis
Last time we discussed descriptive statistics for spatial analysis.
• Concerned with obtaining summary measures to describe a set of data
• For example, the mean and the standard deviation, the centroid and the standard distance
This time we will discuss inferential statistics.
• Begin by reviewing standard (non-spatial) inferential statistics
• Then look at inferential spatial statistics

Standard Statistical Inference
Inferential statistics
– Concerned with making inferences:
  • from a sample(s) about a population(s)
  • from observed patterns about underlying processes
I hope you are already familiar with standard (non-spatial) inferential statistics. I will quickly review the main ideas.

Populations and Samples
Population: all occurrences of a particular phenomenon.
Sample: a part (subset) of the population for which we have data. You are a sample of the population of all people in the world.
The sample is used to make inferences about the population: we draw conclusions about the population from the sample.

Process, Pattern and Analysis (from Lecture #2 on Spatial Analysis)
• Often, we cannot observe the process, so we have to infer the process by observing the pattern.
• From the sample, we infer the process in the population.
[Diagram: processes in the population create the patterns seen in the sample; from the sample patterns we infer the population processes.]

The Importance of the Sample
How “good” (or “accurate” or “true”) are our inferences or conclusions? It depends upon the sample!
• If the sample is representative of the population, the conclusions are good.
• If the sample is not representative of the population, the conclusions are not good.

The Requirement of a Random Sample
• All statistical inference is based on the assumption (requirement) that you have a random sample.
• What is a random sample?
  – A sample chosen such that every member of the population has an equal chance (probability) of being included
  – Doesn’t guarantee a representative sample
  – Could be really unlucky and get …

Some Definitions
• Population: all occurrences
• Parameters: numbers calculated from the population
• Sample: subset of the population for which we have data
• Statistics: numbers calculated from the sample
Statistics are estimates of parameters. We can calculate the statistic because we have data for samples. We cannot calculate the parameter because we do not have data for the entire population.

Example: Are girls more intelligent than boys?
• Sample of boys: IQ* = 115
• Sample of girls: IQ* = 130
*IQ = Intelligence Quotient
“Ha! Girls are more intelligent than boys. Here is the proof!”
“No! It depends on the samples we have. The sample statistics are different, but the population parameters may be the same!”
Who is correct?

How do we decide who is correct? The Null Hypothesis and the Alternative Hypothesis
Assume that, in the population, the average (mean) IQ of girls is the same as the average IQ of boys.
• This is called the Null Hypothesis: there is no difference between girls and boys in the population.
• The Alternative Hypothesis: in the population, girls are smarter than boys.

Choosing between Null and Alternative
• In our two samples:
  – The difference between the sample means was 15
• Ask the question: if the population means are the same, how probable is it that, from sampling variation alone, I would get a difference of 15 points between sample means?
• If this is reasonably probable (or likely), accept (strictly speaking, fail to reject) the Null Hypothesis.
• If this is highly improbable (highly unlikely), reject the Null and accept the Alternative Hypothesis.

How do I calculate the probability of getting a difference of 15? We use the sampling distribution. What is this?

[Diagram: random samples are drawn repeatedly from all girls (the population of girls) and from all boys (the population of boys).]
For every pair of samples, calculate the mean of each, and then the difference between these means.

The Sampling Distribution
If we have a thousand sample pairs, we have a thousand values for the difference between the sample means. We can draw a frequency distribution showing how often (how frequently) different values occur.
[Figure: frequency distribution with axis marks at -1.96, 0 and 1.96, and 2.5% of the values in the tail.]
The sampling distribution is simply the frequency distribution for some value calculated each time from many, many samples. The calculated value is called the test statistic.
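The idea of building a sampling distribution from a thousand sample pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population mean of 100, the standard deviation of 15, and the sample size of 50 are hypothetical values chosen here, not figures from the lecture, and both populations are given the same mean so that the null hypothesis is true by construction.

```python
import random
import statistics


def simulate_mean_differences(n_pairs=1000, sample_size=50,
                              mean=100.0, sd=15.0, seed=42):
    """Draw n_pairs pairs of random samples from two populations that
    share the same mean (the null hypothesis is true) and return the
    difference between the two sample means for every pair."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_pairs):
        girls = [rng.gauss(mean, sd) for _ in range(sample_size)]
        boys = [rng.gauss(mean, sd) for _ in range(sample_size)]
        diffs.append(statistics.fmean(girls) - statistics.fmean(boys))
    return diffs


diffs = simulate_mean_differences()

# Share of simulated differences at least as large (in size) as 15.
share_extreme = sum(abs(d) >= 15 for d in diffs) / len(diffs)

print("mean of differences:", round(statistics.fmean(diffs), 2))
print("std. dev. of differences:", round(statistics.stdev(diffs), 2))
print("share with |difference| >= 15:", share_extreme)
```

A histogram of `diffs` is the simulated sampling distribution; `share_extreme` is the simulated probability of seeing a difference of 15 or more from sampling variation alone.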

Using the Sampling Distribution
The probability should be less than 5% (.05) to reject the null hypothesis. This probability is called the statistical significance of the test.
[Figure: two sampling distributions with marks at -1.96, 0 and 1.96, each with the sample difference of 15 marked.]
• Here, a sample difference of 15 is quite likely. Conclusion: accept the Null. Boys and girls are the same.
• Here, a sample difference of 15 is very unlikely. Conclusion: reject the Null and accept the Alternative. Girls are smarter than boys.

Calculating a Test Statistic
• To find the exact probability of getting a difference of 15 between the girls and boys we calculate a test statistic.
• A test statistic is a number, calculated from a sample statistic, whose sampling distribution is known.
  – That is, we know the shape of the frequency distribution of the test statistic when multiple samples are taken.
• In the case of the difference between two sample means the test statistic is:
  $$Z = \frac{\bar{X}_g - \bar{X}_b}{\sqrt{S_g^2 / N_g + S_b^2 / N_b}}$$
  where $\bar{X}_g$ and $\bar{X}_b$ are the sample means, $S_g^2$ = variance for girls, $S_b^2$ = variance for boys, and $N_g$ and $N_b$ are the sample sizes. It is a Normal Frequency Distribution if the sample sizes are greater than 30.
• Note: test statistics always have “degrees of freedom” which are calculated from the sample size (N)
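As a rough worked example, the test statistic can be computed directly, assuming the usual two-sample form Z = (mean of girls - mean of boys) / sqrt(S²g/Ng + S²b/Nb). The sample means 130 and 115 come from the example above; the variances (225) and sample sizes (40) are hypothetical values picked only to make the arithmetic concrete.

```python
import math


def two_sample_z(mean_g, var_g, n_g, mean_b, var_b, n_b):
    """Z statistic for the difference between two sample means."""
    standard_error = math.sqrt(var_g / n_g + var_b / n_b)
    return (mean_g - mean_b) / standard_error


def two_sided_p_value(z):
    """Probability, under the standard Normal distribution, of a value
    at least as far from zero as z (both tails added together)."""
    return math.erfc(abs(z) / math.sqrt(2))


# Means are from the example; variances and sample sizes are assumed.
z = two_sample_z(mean_g=130, var_g=225, n_g=40,
                 mean_b=115, var_b=225, n_b=40)
p = two_sided_p_value(z)

print("Z =", round(z, 3))
print("two-sided p-value =", p)
print("reject the Null at the 5% level:", p < 0.05)
```

With these assumed inputs Z is well beyond 1.96, so the Null would be rejected; with smaller samples or larger variances the same difference of 15 could fail to reach significance.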

Test Statistic for Normal Frequency Distribution
[Figure: Normal distribution with marks at -1.96, 0 and 1.96, and 2.5% in the tail.]
To reject the Null Hypothesis, the Z test statistic should have a value greater than 1.96 (or less than -1.96). If the population means were the same, there would be less than a 5% chance of getting a value this extreme.
Conclusion: reject the Null and accept the Alternative. Girls are smarter than boys.

Standard Error: Standard Deviation of the Sampling Distribution
[Figure: two sampling distributions with marks at -1.96, 0 and 1.96 and 2.5% in each tail, one with a smaller standard error and one with a larger standard error.]
The standard error for the difference between two means is the denominator of the test statistic for the difference between two means:
  $$SE = \sqrt{S_g^2 / N_g + S_b^2 / N_b}$$
  where $S_g^2$ and $S_b^2$ are the sample variances and $N_g$ and $N_b$ are the sample sizes.
• The standard error is very important.
• Approximately, it tells you how far, on average, the sample statistic is away from the population parameter.
  – Thus, it is a measure of sampling variability or error.
• The larger the standard error, the more difficult it is to reject the Null Hypothesis.
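The effect of the standard error on the decision can be sketched numerically. The code below assumes the standard error of a difference between two means takes the usual form sqrt(S²g/Ng + S²b/Nb); the variance of 225 and the sample sizes are hypothetical. The 1.96 cut-off is applied to every sample size only to keep the illustration simple, although very small samples would normally call for a t distribution rather than the Normal.

```python
import math


def standard_error_of_difference(var_g, n_g, var_b, n_b):
    """Standard deviation of the sampling distribution of the
    difference between two sample means."""
    return math.sqrt(var_g / n_g + var_b / n_b)


difference = 15.0   # observed difference between the sample means
variance = 225.0    # assumed variance in each group (hypothetical)

results = {}
for n in (5, 30, 100):
    se = standard_error_of_difference(variance, n, variance, n)
    z = difference / se
    results[n] = (se, z, abs(z) > 1.96)
    print(f"n={n:3d}  standard error={se:6.3f}  Z={z:6.3f}  "
          f"reject Null: {abs(z) > 1.96}")
```

The same difference of 15 is not significant when the samples are tiny (large standard error) but becomes clearly significant as the sample size grows and the standard error shrinks.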

Reporting the Results of a Statistical Significance Test: many ways to say the same thing!
• When we use a test statistic and its sampling distribution we say that we are conducting a statistical significance test.
• We reject the null hypothesis if there are fewer than 5 chances in 100 of getting a result this extreme when the null hypothesis is true.
• We say the results are “statistically significant at the 5% level”.
• Or we say the results are “significant at the 95% confidence level”.

The Normal or Gaussian Probability Distribution
[Figure: Normal distribution with marks at -1.96, 0 and 1.96, and 2.5% in the tail.]
This is the sampling distribution for tests involving differences between means. Why is it this shape?
If the null hypothesis is true:
– What would be the average value of the differences between the sample means?
  • It would be zero (0)
– We expect many small difference values and few big differences
  • Values would be concentrated around the mean
– We expect as many negative differences as positive differences
  • Symmetrical: same on each side of the mean

How do we find the Sampling Distribution and Test Statistic?
Two methods:
1. By mathematical theory
  • Test statistics and sampling distributions are already known through theory
  • Common distributions are the Z (Normal), Chi-square, and F distributions
2. By computer simulation
  • The computer is used to “simulate” multiple samples, and we use these to draw a frequency distribution
    – As with our “boys and girls” example
  • Very common in spatial statistics

Spatial Statistical Inference

Spatial Statistical Inference: Null and Alternative Hypotheses
• Null Hypothesis:
  – The spatial pattern is random
  – IRP/CSR: independent random process/complete spatial randomness
• Alternative Hypothesis:
  – The spatial pattern is not random
  – It may be clustered or dispersed

What do we mean by spatially random?
[Figure: three point patterns labelled RANDOM, UNIFORM/DISPERSED and CLUSTERED.]
– Random: a point is equally likely to occur at any location, and the position of a point is not affected by the position of any other point.
– Uniform: every point is as far from other points as possible: “likely to be distant”
– Clustered: every point is close to other points: “likely to be close”
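The three kinds of pattern can be generated and compared with a short sketch. Everything specific here is an assumption made for illustration: the study area is taken to be a unit square, the dispersed pattern is a regular grid, the clustered pattern scatters points around a few random centres, and the mean nearest-neighbour distance is used only as a convenient summary of how close points are to each other.

```python
import math
import random


def random_pattern(n, rng):
    """Each point is placed independently and uniformly in the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def dispersed_pattern(side):
    """A regular side x side grid: points are kept apart from each other."""
    step = 1.0 / side
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(side) for j in range(side)]


def clustered_pattern(n_clusters, per_cluster, spread, rng):
    """Points scattered tightly around a few randomly placed centres,
    clamped so that they stay inside the unit square."""
    points = []
    for _ in range(n_clusters):
        cx, cy = rng.random(), rng.random()
        for _ in range(per_cluster):
            x = min(1.0, max(0.0, cx + rng.gauss(0.0, spread)))
            y = min(1.0, max(0.0, cy + rng.gauss(0.0, spread)))
            points.append((x, y))
    return points


def mean_nearest_neighbour_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


rng = random.Random(1)
patterns = {
    "random": random_pattern(100, rng),
    "dispersed": dispersed_pattern(10),
    "clustered": clustered_pattern(5, 20, 0.02, rng),
}
nn = {name: mean_nearest_neighbour_distance(pts)
      for name, pts in patterns.items()}
for name, value in nn.items():
    print(name, round(value, 3))
```

With the same number of points in each pattern, the dispersed pattern should show the largest mean nearest-neighbour distance and the clustered pattern the smallest, with the random pattern in between.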

Is it Spatially Random? Difficult to know!
• Fact: twice as many people sit “on the corners” as sit opposite each other at tables in a restaurant
  – Conclusion: psychological preference for nearness
• In actuality: an outcome to be expected from a random process: there are two ways to sit opposite, but four ways to sit on the corners
From O’Sullivan and Unwin p. 69
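The count of “two ways opposite, four ways on the corners” can be checked by listing every possible pair of seats. The sketch assumes a square table with one seat on each side and two diners choosing seats at random; the seat names are labels invented for this example.

```python
from itertools import combinations

# One seat per side of a square table (an assumption for illustration).
seats = ["north", "east", "south", "west"]
opposite_pairs = {frozenset(("north", "south")), frozenset(("east", "west"))}

pairs = list(combinations(seats, 2))
n_opposite = sum(frozenset(pair) in opposite_pairs for pair in pairs)
n_corner = len(pairs) - n_opposite

print("possible pairs of seats:", len(pairs))
print("sitting opposite:", n_opposite)
print("sitting on the corners:", n_corner)
print("ratio corner : opposite =", n_corner / n_opposite)
```

If every pair of seats is equally likely, corner seating is expected twice as often as opposite seating, so the observed 2:1 ratio needs no psychological explanation.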

High Peak district biomass index: ratio of remotely sensed data spectral bands B3 and B4
[Figure: two maps, one labelled “Spatially clustered” and one labelled “Geographically random”.]

Why Processes differ from Random
Processes differ from random in two primary ways:
• Variation in the study area
  – Diseases cluster because people cluster (e.g. cancer)
  – Cancer cases cluster because chemical plants cluster
  – First order effect
• Interdependence of the points themselves
  – Diseases cluster because people catch them from others who have the disease (colds)
  – Second order effect
In practice, it is very difficult to distinguish these two effects merely by the analysis of spatial data.

Bank Robberies: First Order or Second Order effect?
[Figure: map of banks and bank robberies.]
– Bank robberies are clustered
– First order: because banks are clustered
In the lecture on Spatial Analysis we called this the effect of “non-uniformity of space”.
Could there also be a second order effect?

Remember our data on software and telecommunications industries in Dallas?
We can think of this data as a sample. We can use statistical inference to test whether the spatial pattern is clustered or “random” (no pattern). We will look at the actual tests later.

Spatial Statistical Hypothesis Testing: Simulation Approach
• Because of the complexity of spatial processes, it is often difficult to derive theoretically a test statistic with a known probability distribution.
• Instead, we often use computer simulations.
• We take multiple samples from a random spatial pattern, the spatial statistic we are using is calculated for each sample, and then a frequency distribution is drawn.
• This simulated sampling distribution is used to measure the probability of obtaining our actual observed spatial statistic.
[Figure: empirical frequency distribution from 500 random patterns (“samples”), with our observed value marked.]
Our observed value:
– highly unlikely to have occurred if the process was random
– conclude that the process is not random
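The simulation approach can be sketched as a small Monte Carlo test. The choices below are assumptions made for illustration, not the lecture's own procedure: the study area is a unit square, the spatial statistic is the mean nearest-neighbour distance, the “observed” pattern is a made-up clustered pattern of 40 points, and the pseudo p-value is computed as (number of simulated values at least as small as the observed one, plus 1) divided by (number of simulations, plus 1).

```python
import math
import random


def mean_nearest_neighbour_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def simulate_random_statistic(n_points, n_sims, rng):
    """Statistic values for n_sims completely random patterns."""
    values = []
    for _ in range(n_sims):
        pattern = [(rng.random(), rng.random()) for _ in range(n_points)]
        values.append(mean_nearest_neighbour_distance(pattern))
    return values


def pseudo_p_value_clustering(observed, simulated):
    """Share of patterns (simulated plus the observed one) whose
    statistic is at least as small as the observed value."""
    as_small = sum(1 for s in simulated if s <= observed)
    return (as_small + 1) / (len(simulated) + 1)


rng = random.Random(7)

# Hypothetical "observed" pattern: 4 tight clusters of 10 points each.
observed_points = []
for _ in range(4):
    cx, cy = rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)
    for _ in range(10):
        observed_points.append((cx + rng.gauss(0.0, 0.02),
                                cy + rng.gauss(0.0, 0.02)))

observed_value = mean_nearest_neighbour_distance(observed_points)
simulated = simulate_random_statistic(len(observed_points), 500, rng)
p_value = pseudo_p_value_clustering(observed_value, simulated)

print("observed statistic:", round(observed_value, 4))
print("smallest simulated statistic:", round(min(simulated), 4))
print("pseudo p-value:", round(p_value, 4))
```

A histogram of `simulated` plays the role of the empirical frequency distribution from 500 random patterns; an observed value far out in its lower tail suggests the pattern is more clustered than a random process would produce.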

Software for Spatial Statistics
• ArcGIS 9: the most common GIS software, but $$$$!
  – Spatial Statistics Tools for point and polygon analysis
  – Spatial Analyst tools for kernel density
  – Geostatistical Analyst tools for interpolation of continuous surface data
• CrimeStat III: download from http://www.icpsr.umich.edu/NACJD/crimestat.html
  – Standalone package, free for government and education use
  – Calculates values for spatial statistics but no GIS graphics
  – Good documentation and explanation of measures and concepts
• OpenGeoDA: Geographic Data Analysis, by Luc Anselin, now at Arizona State
  – Download from: http://geodacenter.asu.edu/
  – Runs on Vista and Windows 7 (also Mac and UNIX)
  – Earlier version, called GeoDA, runs only on XP (0.9.5i_6)
  – Easy to use and has good graphic capabilities
• R: open source statistical package
  – Originally on UNIX but now has an MS Windows version
  – Has the most extensive set of spatial statistical analyses
  – Difficult to use
  – Need to learn it if you are going to do major work in this area
• S-Plus: the only commercial statistical package with extensive support for spatial statistics
  – www.insightful.com

References
• O’Sullivan and Unwin, Geographic Information Analysis. New York: John Wiley, 1st ed. 2003, 2nd ed. 2010
• Jay Lee and David Wong, Statistical Analysis with ArcView GIS. New York: Wiley, 1st ed. 2001 (all page references are to this book), 2nd ed. 2005
  – Unfortunately, these books are based on old software (Avenue scripts used with ArcView 3.x), which no longer works in the current versions of ArcGIS (9 or 10).
• Ned Levine and Associates, CrimeStat III. Washington: National Institute of Justice, 2010
  – Available as pdf; download from: http://www.icpsr.umich.edu/NACJD/crimestat.html
• Arthur J. Lembo at http://www.css.cornell.edu/courses/620/css620.html (no longer active)

Next time: Inferential Statistics for Point Pattern Analysis

Software for Spatial Statistics: Examples
Planned as a separate lecture, but we couldn’t meet last Friday, so I will look at some examples after today’s lecture, and again after the next lecture.

1. Using ArcGIS to find the Population Centroid of China
• Open ArcGIS
• Add data files: China.shp and ChinaProvinceData.xls
• Join ChinaProvinceData.xls to China.shp
  – Right click China and select Joins...
  – Use GMI_Admin as the join field
• Open ArcToolbox by clicking on its icon
• Go to Spatial Statistics Tools>Measuring Geographic Distribution>Mean Center
  – Input Feature Class: China
  – Output: China_MeanCenter.shp
  – Weight Field: Population 2008
• Note the warning: we should have projected the data first!
  WARNING 000916: The input feature class does not appear to contain projected data.
The mean center is in southern Henan province!

2. Calculate Population Centroid using a Spreadsheet Program (e.g. Excel)
• Make a copy of ChinaProvinceData.xls and open this copy (ChinaProvinceData Copy.xls).
  – It contains centroids for each province obtained from GeoDA. (You need the very expensive ArcInfo version to get centroids for all polygons from ArcGIS and I do not have it!)
• Calculate:
  – XCentroid * Weight (Population 2008), and then Sum
  – YCentroid * Weight (Population 2008), and then Sum
• Divide each sum by the Sum of the Weights (Total Population 2008).
• These are the X and Y coordinates for the China Population Centroid: 113.4696704, 32.3797596
• Copy these values into a new worksheet, and create a very simple data table:
  ID  X         Y
  1   113.4697  32.3798
• Save the spreadsheet and close Excel.
• Read this table into ArcGIS.
• Right click on the table name and select Display XY Data. This displays X, Y coordinates from a table on the map.
The results are very similar to the value calculated by ArcGIS itself!
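The spreadsheet steps amount to a weighted mean of the centroid coordinates, which can be sketched in Python. The three records below are invented numbers used only to show the arithmetic; they are not the provincial centroids or populations from the lecture data, and the function name is a label chosen for this example.

```python
def weighted_mean_center(records):
    """records: iterable of (x_centroid, y_centroid, weight) tuples.
    Returns the weighted mean X and Y coordinates."""
    records = list(records)
    total_weight = sum(weight for _, _, weight in records)
    sum_xw = sum(x * weight for x, _, weight in records)
    sum_yw = sum(y * weight for _, y, weight in records)
    return sum_xw / total_weight, sum_yw / total_weight


# Hypothetical (x, y, population) values, not the lecture's data.
provinces = [
    (110.0, 30.0, 2.0),
    (120.0, 40.0, 1.0),
    (100.0, 20.0, 1.0),
]

center_x, center_y = weighted_mean_center(provinces)
print("population centroid:", center_x, center_y)
```

Each coordinate is multiplied by its weight, the products are summed, and each sum is divided by the total weight, which mirrors the three spreadsheet steps described above.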

3. Use ArcGIS to Calculate Standard Deviation Ellipse for Population and for Illiterate Population
• SDE for Population
  – Go to Spatial Statistics Tools>Measuring Geographic Distribution>Directional Distribution
  – Input Feature Class: China
  – Output: SDE_Population.shp
  – Weight Field: Data$.Pop2008
• Mean Center for Illiterate Percent
  – Go to Spatial Statistics Tools>Measuring Geographic Distribution>Mean Center
  – Input Feature Class: China
  – Output: MC_Illit_PerCent.shp
  – Weight Field: Data$.Illiterate_Prcnt
• SDE for Illiterate Percent
  – Go to Spatial Statistics Tools>Measuring Geographic Distribution>Directional Distribution
  – Input Feature Class: China
  – Output: SDE_Illit_PerCent.shp
  – Weight Field: Data$.Illiterate_Prcnt

4. Use GeoDA to find the Centroids of the Provinces of China
(Need ArcInfo to do this in ArcGIS, which is expensive. GeoDA is free.)
• The GeoDA program is on my Web site at www.utdallas.edu/~briggs, or go to http://geodacenter.asu.edu/
• Download, unzip, and click the file OpenGeoDA.exe to start the software
  – It does have some “bugs”, so some things may not work or it may crash!
• Input the provinces shapefile: File>Open Shape File, China.shp
• Open the data table: Table>Promotion to see what is there
• Create centroids for each province: Options>Add Centroids to Table
  – Place a check mark in the X coordinates and Y coordinates boxes, click OK
• Go to Table>Promotion to open the table: it has the X and Y centroid coordinates
• Save as a new shapefile: Table>Save to Shapefile as China_Centroids.shp
I then opened the China_centroids.dbf file (part of the shapefile) with Excel and copied the centroid values into the ChinaProvincesData.xls spreadsheet.