Numerical Analysis
EE, NCKU
Tien-Hao Chang (Darby Chang)

Correlation coefficient
■ What we need is a single summary number that answers the following questions:
  – Does a relationship exist?
  – If so, is it a positive or a negative relationship?
  – Is it a strong or a weak relationship?
■ Correlation coefficient: a single summary number that gives you a good idea of how closely one variable is related to another variable

Correlation coefficient: Two-way scatter plot
■ Suppose that we are interested in a pair of continuous random variables
  – for example, the relationship between the percentage of children who have been immunized against DPT (diphtheria, pertussis, and tetanus) and the under-five mortality rate
■ Data for a random sample of 20 countries are shown in the next slide
  – X: the percentage of children immunized by age one year
  – Y: the under-five mortality rate
■ Before we do any analysis, we should create a two-way scatter plot of the data
  – does a relationship exist between x and y?
■ The mortality rate tends to decrease as the percentage of children immunized increases

Pearson's CC
■ In the underlying population from which the sample of points (xi, yi) is selected, the population correlation between the variables X and Y is denoted ρ
■ The correlation quantifies the strength of the linear relationship between the outcomes x and y
■ The estimator of ρ, denoted r, is known as Pearson's coefficient of correlation, or simply the correlation coefficient

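The formula on this slide did not survive extraction. As a minimal sketch, assuming the standard sample definition r = Sxy / sqrt(Sxx · Syy) (sums of products and squares of deviations from the means), r can be computed as follows. The function name `pearson_r` and the toy data are hypothetical and are not taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) when either variable is constant.
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

The common 1/(n − 1) factors in the covariance and the two standard deviations cancel, which is why the sketch works directly with the raw sums.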
■ The correlation coefficient is a dimensionless number; it has no units of measurement
  – -1 ≤ r ≤ 1
  – the values r = 1 and r = -1 occur when there is an exact linear relationship between x and y
  – if y tends to increase as x increases, r is greater than 0; x and y are said to be positively correlated
  – if y tends to decrease as x increases, r is less than 0 and the two variables are negatively correlated
  – if r = 0, there is no linear relationship between x and y and the variables are uncorrelated
■ http://cclearn.npue.edu.tw/tuition/ccchen-web/教育統計學/7.pdf

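The properties listed above can be checked numerically. The sketch below uses made-up data and a hypothetical helper `pearson_r` (the standard sample formula, not code from the slides): an exact increasing line, an exact decreasing line, and a symmetric curve y = x² whose linear correlation is zero even though y depends entirely on x.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (standard textbook formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]                              # illustrative values
r_up = pearson_r(x, [3 * v + 1 for v in x])        # exact increasing line
r_down = pearson_r(x, [-2 * v + 5 for v in x])     # exact decreasing line
r_curve = pearson_r(x, [v * v for v in x])         # y = x^2, symmetric about 0

print(r_up, r_down, r_curve)
```

The last case is a reminder that r = 0 rules out only a linear relationship, not every relationship.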
http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png

CC is not a percent
■ In addition to telling you
  – whether two variables are related to one another,
  – whether the relationship is positive or negative, and
  – how large the relationship is,
■ the correlation coefficient tells you one more important bit of information: how much variation in one variable is related to changes in the other variable
■ A correlation coefficient is a "ratio", not a percent
  – many students tend to think that r = .90 means that 90% of the changes in one variable are accounted for by or related to the other variable
  – even worse, some think that this means that any predictions you make will be 90% accurate
  – neither is correct!

Correlation: Coefficient of determination
■ However, it is very easy to translate the correlation coefficient into a percentage
■ All you have to do is "square the correlation coefficient", which means that you multiply it by itself
■ So, if the symbol for a correlation coefficient is "r", then the symbol for this new statistic is simply "r²", which can be called "r squared"
■ r², also called the "Coefficient of Determination", tells you how much variation in one variable is directly related to (or accounted for by) the variation in the other variable

The correlation coefficient is r = 0.80. By squaring r to get r², you find that fully 64% of the variation in scores on Variable B is directly related to scores on Variable A.

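The arithmetic is small enough to check directly. This sketch squares the r = 0.80 from the example and the r = .90 from the "not a percent" slide; the variable names are illustrative only.

```python
# Coefficient of determination: square r, then read the result as a percentage.
r_example = 0.80
r_squared = r_example ** 2
percent_explained = 100 * r_squared
print(f"r = {r_example:.2f} -> r^2 = {r_squared:.2f} ({percent_explained:.0f}% of variation)")

# The common misreading: r = .90 does not mean 90%.
r_misread = 0.90
percent_for_090 = 100 * r_misread ** 2
print(f"r = {r_misread:.2f} -> {percent_for_090:.0f}% of variation, not 90%")
```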
Statistical test

Correlation coefficient: Statistical inference
■ To test for a significant correlation between two variables
  – H0: ρ = 0
  – H1: ρ ≠ 0
■ The test statistic (under H0):
  – t = r √((n − 2) / (1 − r²)), with n − 2 degrees of freedom
■ http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 9-14)

■ Test the significance of the correlation coefficient for the age and blood pressure data
  – suppose that n = 6, r = 0.897 and α = 0.05
■ Step 1: State the hypotheses
  – H0: ρ = 0
  – H1: ρ ≠ 0
■ Step 2: Find the critical values
  – since α = 0.05 and there are 6 − 2 = 4 degrees of freedom, the critical values are t = +2.776 and t = −2.776
■ Step 3: Compute the test value
  – t = 4.059
■ Step 4: Make the decision
  – reject the null hypothesis, since the test value falls in the critical region (4.059 > 2.776)
■ Step 5: Summarize the results
  – there is a significant relationship between the variables of age and blood pressure

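Steps 2 to 4 can be reproduced with a few lines. This is a sketch under stated assumptions: it uses the usual statistic t = r √((n − 2) / (1 − r²)), which reproduces the slide's t = 4.059, and it takes the critical value 2.776 from the slide rather than computing it from the t distribution. The function name is hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

n, r = 6, 0.897            # values from the age / blood pressure example
critical_value = 2.776     # two-sided, alpha = 0.05, 4 d.f. (from the slide)

t_value = correlation_t_statistic(r, n)
reject_h0 = abs(t_value) > critical_value

print(f"t = {t_value:.3f}, reject H0: {reject_h0}")
```

With a statistics package available, the critical value could be obtained from the t distribution's inverse CDF instead of a table; the hard-coded constant keeps the sketch dependency-free.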
Correlation coefficient: Limitations
■ It quantifies only the strength of the linear relationship between two variables
■ Care must be taken when the data contain any outliers, or pairs of observations that lie considerably outside the range of the other data points
■ A high correlation between two variables does not imply a cause-and-effect relationship

http://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/2000px-Anscombe%27s_quartet_3.svg.png
Four sets of data with the same correlation of 0.816

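The caption can be checked numerically. The slide shows only the image, so the numbers below are an assumption: they are typed in from the published Anscombe quartet, not taken from the deck. The helper `pearson_r` is a hypothetical name for the standard sample formula.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (standard textbook formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (sets I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, value in correlations.items():
    print(f"set {name}: r = {value:.4f}")
```

All four values agree to about three decimal places even though the scatter plots look nothing alike, which is why the plot should be inspected before r is interpreted.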
Spearman's Rank CC
■ Pearson's correlation coefficient is very sensitive to outlying values
■ We may be interested in calculating a measure of association that is more robust
■ One approach is to rank the two sets of outcomes x and y separately and compute the correlation of the ranks, known as Spearman's rank correlation coefficient
  – it is Pearson's formula applied to xri and yri, the ranks associated with the i-th subject, rather than to the actual observations

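A minimal sketch of the ranking approach, assuming tied values receive the average of their ranks (a common convention that the slide does not specify). The function names and the toy data, which contain one outlying y value, are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (standard textbook formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 60]       # increasing, with an outlying last value
r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman r_s = {r_spearman:.3f}")
```

Because y increases with x throughout, the ranks agree exactly and r_s is 1, while the outlier pulls Pearson's r well below 1.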
Any Questions?
About Correlation Coefficient

Statistical inference
■ Basic tests
  – tests about proportions
  – tests about one mean
  – tests of the equality of two means
  – tests for variances
  – references
    • http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 27-33)
    • http://www.math.isu.edu.tw/finance/course/sta/ch8.ppt
    • http://www.tnb.org.tw/Image/ttest.ppt
    • http://www.mis.ncyu.edu.tw/course/download/cftai/Chapter%206.%20Continuous%20Probability%20Distribution.PPT
■ More advanced tests
  – ANOVA (analysis of variance)
  – goodness of fit (Wilcoxon test, Kolmogorov-Smirnov test, …)

Multivariate analysis
■ Statistics
  – ANOVA
  – Multiple linear regression
    • http://www.sjsu.edu/faculty/gerstman/biostat-text/Gerstman_PP15.ppt
    • http://www.stat.nuk.edu.tw/Ray-Bing/regression/Chapter3.ppt
  – PCA (principal component analysis)
  – ICA (independent component analysis)
  – LDA (linear discriminant analysis)
■ So far, all techniques belong to statistics. You can find them in most statistical software, such as MATLAB, R (http://www.r-project.org/), SPSS…
■ Machine learning
  – Naïve Bayes (http://zoro.ee.ncku.edu.tw/mlb2009/res/11-ch4.pdf pp. 13-27)
  – LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
  – RVKDE (http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde)