The following lecture has been approved for University Undergraduate Students.
This lecture may contain information, ideas, concepts and discursive anecdotes that may be thought provoking and challenging. It is not intended for the content or delivery to cause offence. Any issues raised in the lecture may require the viewer to engage in further thought, insight, reflection or critical evaluation.

Background to Statistics for non-statisticians
Craig Jackson, Prof. Occupational Health Psychology
Faculty of Education, Law & Social Sciences, BCU
craig.jackson@bcu.ac.uk

Keep it simple
“Some people hate the very name of statistics but... their power of dealing with complicated phenomena is extraordinary. They are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the science of man.”
Sir Francis Galton, 1889

How Many Make a Sample?

“8 out of 10 owners who expressed a preference said their cats preferred it.”
How confident can we be about such statistics?
8 out of 10? 80 out of 100? 800 out of 1000? 80,000 out of 100,000?
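One way to put numbers on that question is to compute an approximate 95% confidence interval for the same 80% proportion at each sample size. The sketch below assumes a Wilson score interval with z = 1.96; the slide does not name a method, so this choice is an illustration only.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (assumed method)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# The four sample sizes quoted on the slide, all with the same 80% proportion
for successes, n in [(8, 10), (80, 100), (800, 1000), (80_000, 100_000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes} out of {n}: 95% CI {low:.3f} to {high:.3f}")
```

The proportion is identical in every case, but the interval narrows as the sample grows, which is why "8 out of 10" deserves far less confidence than "80,000 out of 100,000".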

Types of Data / Variables
Continuous: BP, Height, Weight, Age
Discrete: Children, Age last birthday, Colds in last year
Ordinal: Grade of condition, Positions (1st, 2nd, 3rd), “Better-Same-Worse”, Height groups, Age groups
Nominal: Sex, Hair colour, Blood group, Eye colour

Conversion & Re-classification
• Ordinal / Nominal data are easier to summarise
• Cut-off points (who decides this?) allow Continuous variables to be changed into Nominal variables
  BP > 90 mmHg = Hypertensive
  BP ≤ 90 mmHg = Normotensive
• Easier clinical decisions
• Categorisation reduces quality of data
• Statistical tests may be more “sensational”, e.g. BMI: Obese vs Underweight
• Good for summaries, bad for “accuracy”
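The cut-off rule on this slide can be sketched in a few lines. The 90 mmHg threshold comes from the slide; the function name and the readings are hypothetical and only there to show what information is lost.

```python
def classify_bp(bp_mmhg, cutoff=90):
    """Turn a continuous BP reading into a nominal category using the slide's cut-off."""
    return "Hypertensive" if bp_mmhg > cutoff else "Normotensive"

# Hypothetical readings, chosen to sit on either side of the cut-off
readings = [72, 88, 90, 91, 104]
labels = [classify_bp(r) for r in readings]

for reading, label in zip(readings, labels):
    print(f"{reading} mmHg -> {label}")
```

After conversion, 91 mmHg and 104 mmHg carry the same label, while 90 and 91 fall into different groups despite differing by 1 mmHg. That is the sense in which categorisation is good for summaries and bad for "accuracy".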

Types of statistics / analyses
DESCRIPTIVE STATISTICS: Describing a phenomenon
• Frequencies: How many…
• Basic measurements: Meters, seconds, cm³, IQ
INFERENTIAL STATISTICS: Inferences about phenomena
• Hypothesis Testing: Proving or disproving theories
• Confidence Intervals: If sample relates to the larger population
• Correlation: Associations between phenomena, e.g. diet and health
• Significance testing

Multiple Measurement or... why statisticians and love don’t mix
Measurements: 25 cells, 22 cells, 24 cells, 21 cells
Total = 92 cells
Mean = 23 cells
SD = 1.8 cells

Small samples spoil research

Sample 1 (N = 10)
Age: 20 for every participant
IQ: 100 for every participant
Summary: Age Total 200, Mean 20, SD 0; IQ Mean 100, SD 0

Sample 2 (N = 10)
Age: 18, 20, 22, 24, 26, 21, 19, 25, 20, 21
IQ: 100, 119, 101, 105, 113, 120, 119, 114, 101
Summary: Age Total 216, Mean 21.6 ± 4.2; IQ Mean 110.2 ± 19.2

Sample 3 (N = 10)
Age: 18, 20, 22, 24, 26, 21, 19, 25, 20, 45
IQ: 100, 119, 101, 105, 113, 120, 119, 114, 156
Summary: Age Total 240, Mean 24 ± 8.5; IQ Total 1157, Mean 115.7 ± 30.2
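The effect of a single unusual participant in a sample of ten can be checked directly. The sketch below uses the ten ages listed for the second and third samples, which differ only in the last value (21 versus 45), and compares the mean with the median. The median comparison is an addition for illustration and is not on the slide.

```python
from statistics import median

ages = [18, 20, 22, 24, 26, 21, 19, 25, 20, 21]
ages_with_outlier = ages[:-1] + [45]  # same sample, last participant aged 45

def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)

for label, sample in [("without outlier", ages), ("with outlier", ages_with_outlier)]:
    print(f"{label}: total={sum(sample)}, mean={mean(sample)}, median={median(sample)}")
```

One participant out of ten moves the mean from 21.6 to 24, while the median barely changes. In a larger sample the same individual would carry far less weight.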

Central Tendency

Patient comfort rating   Frequency
10                       31
9                        27
8                        70
7                        121
6                        140
5                        129
4                        128
3                        90
2                        80
1                        62

The Mode, Median and Mean are marked on the chart.
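The three measures can be computed from a frequency table by expanding it back into individual ratings. The sketch below assumes the frequencies on this slide pair with ratings 10 down to 1 in the order they appear; that pairing is a reading of the flattened chart, not something the slide states.

```python
from statistics import mean, median, mode

# Assumed pairing of rating -> frequency, read off the slide in order
frequency = {10: 31, 9: 27, 8: 70, 7: 121, 6: 140, 5: 129, 4: 128, 3: 90, 2: 80, 1: 62}

# Expand the table into one entry per patient
ratings = [rating for rating, count in frequency.items() for _ in range(count)]

print("patients:", len(ratings))
print("mode:", mode(ratings))
print("median:", median(ratings))
print("mean:", round(mean(ratings), 2))
```

Under that assumption the three measures land close together but not on the same value, which is typical when a distribution is only roughly symmetrical.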

Dispersion
Range: Spread of data
Mean: Arithmetic average
Median: Location
Mode: Frequency
SD: Spread of data about the mean

Example: Range 50-112 mmHg, Mean 82 mmHg, Median 82 mmHg, Mode 82 mmHg, SD ± 10 mmHg

Dispersion
Each individual score therefore has a deviation (its distance from the mean), which can be positive or negative depending on which side of the mean the score lies.
If the positive and negative deviations are added together, the sum is zero (the positives and negatives cancel out).
[Figure: central value (mean) with a negative deviation on one side and a positive deviation on the other]

Dispersion
Range: The interval between the highest and lowest measures. Limited value as it involves the two most extreme (likely faulty) measures.
Percentile: The value below / above which a particular percentage of values fall (the median is the 50th percentile), e.g. 5th percentile - 5% of values fall below it, 95% of values fall above it.
A series of percentiles (1st, 5th, 25th, 50th, 75th, 95th, 99th) gives a good general idea of the scatter and shape of the data.
[Figure: range and percentiles marked on a height scale from 5’ 6” to 6’ 4”]
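A series of percentiles can be read off any sorted list of measurements. The sketch below uses the nearest-rank definition, which is one of several conventions in use, and a hypothetical set of 20 heights in inches (66 is 5’ 6”, 76 is 6’ 4”).

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical heights in inches, for illustration only
heights_in = [66, 67, 67, 68, 68, 68, 69, 69, 69, 69,
              70, 70, 70, 71, 71, 72, 72, 73, 74, 76]

for pct in (5, 25, 50, 75, 95):
    print(f"{pct}th percentile: {percentile(heights_in, pct)} in")

print("range:", min(heights_in), "to", max(heights_in), "in")
```

The range depends entirely on the two extreme values, whereas the percentiles in between describe where the bulk of the data sits.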

Standard Deviation
To get around this, we square each of the deviations. This makes all the values positive (a minus times a minus...).
Then sum all those squared deviations and divide by N - 1 to get their average. This gives the variance, where every deviation is squared.
Take the square root of the variance to get the standard deviation:
$$SD = \sqrt{\frac{\sum x^2 - (\sum x)^2 / N}{N - 1}}$$
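The steps can be followed by hand on the four cell counts from the earlier slide (25, 22, 24, 21). The sketch below computes the SD twice, once from the squared deviations and once from the shortcut formula with Σx² and (Σx)², to show that the two routes agree. Variable names are illustrative.

```python
import math

counts = [25, 22, 24, 21]
n = len(counts)
mean = sum(counts) / n

# Deviations from the mean cancel out when summed
deviations = [x - mean for x in counts]
print("sum of deviations:", sum(deviations))

# Route 1: square the deviations, average over N - 1, take the square root
variance = sum(d ** 2 for d in deviations) / (n - 1)
sd_from_deviations = math.sqrt(variance)

# Route 2: shortcut formula using the sum of x squared and the squared sum of x
sum_x = sum(counts)
sum_x_squared = sum(x * x for x in counts)
sd_shortcut = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

print("SD from deviations:", round(sd_from_deviations, 1))
print("SD from shortcut:  ", round(sd_shortcut, 1))
```

Both routes give an SD of about 1.8 cells, matching the figure quoted for those four measurements.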

Grouped Data: Normal Distribution
SD is useful because of the shape of many distributions of data: the symmetrical, bell-shaped / normal / Gaussian distribution.
Some distributions fail to be symmetrical.
If the tail on the left is longer than the right, the distribution is negatively skewed (to the left).
If the tail on the right is longer than the left, the distribution is positively skewed (to the right).

Normal Distributions
The Standard Normal Distribution has a mean of 0 and a standard deviation of 1.
The total area under the curve amounts to 100% (unity) of the observations.
[Figure: normal curve centred on the central value (mean), with the horizontal axis marked from 3 SD below to 3 SD above the mean]
Proportions of observations within any given range can be obtained from the distribution by using statistical tables of the standard normal distribution.
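The proportions that statistical tables provide can also be computed from the cumulative distribution function. The sketch below uses `NormalDist` from Python's standard `statistics` module as a stand-in for the printed tables; the helper name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of observations within k SD either side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {proportion_within(k):.1%}")
```

Roughly 68% of observations fall within 1 SD of the mean, about 95% within 2 SD, and nearly all within 3 SD.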

Quincunx machine (1877)
Balls dropped through a succession of metal pins should form a normal distribution. The balls do not have a normal distribution here. Why?

Normal & Non-normal distributions
The distribution derived from the quincunx is not perfect. It was only made from 18 balls.

Distributions
Sir Francis Galton (1822-1911)
• Alumnus of Birmingham University
• 9 books and > 200 papers (fingerprints, correlational calculus, twins, neuropsychology, blood transfusions, travel in undeveloped countries, criminality and meteorology)
• Deeply concerned with improving standards of measurement
[Figure: % of population by height, 5’ 6” to 6’ 4”]

Normal & Non-normal distributions
When Galton’s quincunx machine was run with hundreds of balls, it produced a more “perfect” normal distribution.
This has obvious implications for the size of samples of populations used.
The more lead shot runs through the quincunx machine, the smoother the distribution.
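The quincunx is easy to imitate in software. The sketch below assumes 10 rows of pins and an equal chance of bouncing left or right at each pin; neither detail is given on the slides, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

def run_quincunx(n_balls, n_rows=10, seed=0):
    """Drop n_balls through n_rows of pins; return the ball count in each bin."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(n_balls):
        # Each pin sends the ball right (True) or left (False) with equal chance
        position = sum(rng.random() < 0.5 for _ in range(n_rows))
        bins[position] += 1
    return [bins[b] for b in range(n_rows + 1)]

for n_balls in (18, 1000):
    counts = run_quincunx(n_balls)
    print(f"\n{n_balls} balls")
    for bin_index, count in enumerate(counts):
        # Scale the bar so the tallest bin is at most 40 characters wide
        bar = "#" * round(40 * count / max(counts))
        print(f"bin {bin_index:2d} | {bar} {count}")
```

With 18 balls the printed histogram is usually lumpy and lopsided; with 1000 balls it settles into the familiar bell shape.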

Presentation of data: Table of means

             Exposed (n=197)   Controls (n=178)   T      P
Age (yrs)    45.5 (9.4)        48.9 (7.3)         2.19   0.07
I.Q.         105 (10.8)        99 (8.7)           1.78   0.12
Speed (ms)   115.1 (13.4)      94.7 (12.4)        3.76   0.04

Presentation of data: Category tables

           Exposed   Controls   Total
Healthy    50        150        200
Unwell     147       28         175
Total      197       178        375

Chi square (test of association) shows: Chi square = 7.2, P = 0.02

Bar Charts
A set of measurements can be presented either as a table or as a figure. Graphs are not always as accurate as tables, but portray trends more easily.
[Figure: anatomy of a bar chart, labelling the title of graph, legend key, y-axis (ordinate) with its label and scale, data display area, groups, and x-axis (abscissa)]

Bar Charts: Some Real Data
A combination of distributions is acceptable to facilitate comparisons.
[Figure: movie goers’ ratings for both movies (Vacation and Empire); x-axis: User rating, 1 to 10; y-axis: Votes, 0 to 7000]

Correlation and Association
With a scatter diagram, each individual observation becomes a point on the scatter plot, based on two coordinates, measured on the abscissa and the ordinate.
Two perpendicular lines are drawn through the medians, dividing the plot into quadrants. If the two variables are unrelated, each quadrant should contain about 25% of all observations.

Correlation and Association
Correlation is a numerical expression between 1 and -1 (extending through all points in between), properly called the Correlation Coefficient. It is a decimal measure of association (not necessarily causation) between variables.
• Correlation of 1 (perfect +ve): maximal; any value of one variable precisely determines the other.
• Correlation of -1 (perfect -ve): any value of one variable precisely determines the other, but in the opposite direction to a correlation of 1. As one value increases, the other decreases.
• Correlation of 0 (“nothing”): no linear relationship between the variables.
• Correlation of 0.5 (medium +ve): a moderate relationship; the squared correlation is 0.25, so about a quarter of the variation in one variable is accounted for by the other.
Correlations between 0 and 0.3 are weak, between 0.4 and 0.7 moderate, and between 0.8 and 1 strong.
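The coefficient described here is usually computed as Pearson's r. The sketch below implements that formula directly; the slide does not say which coefficient it has in mind, so Pearson's r is an assumption, and the small data sets are hypothetical, chosen to produce tidy values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
examples = {
    "perfect +ve": [2, 4, 6, 8, 10],   # y rises in step with x
    "perfect -ve": [10, 8, 6, 4, 2],   # y falls in step with x
    "medium +ve": [3, 1, 4, 2, 5],     # y tends to rise with x, loosely
}

for label, y in examples.items():
    r = pearson_r(x, y)
    print(f"{label}: r = {r:.2f}, r squared = {r * r:.2f}")
```

Squaring r gives the share of variation in one variable accounted for by the other, which is why r = 0.5 corresponds to a quarter rather than a half.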

Correlation and Association
How can the above variables be correlated?

Sampling Keywords
POPULATIONS: Can be mundane or extraordinary
SAMPLE: Must be representative
INTERNAL VALIDITY OF SAMPLE: Sometimes validity is more important than generalizability
SELECTION PROCEDURES: Random, Opportunistic, Conscriptive, Quota

Sampling Keywords
THEORETICAL: Developing, exploring, and testing ideas
EMPIRICAL: Based on observations and measurements of reality
NOMOTHETIC: Rules pertaining to the general case (nomos - Greek)
PROBABILISTIC: Based on probabilities
CAUSAL: How causes (treatments) affect the outcomes

Clinical Research
Types of clinical research:
• Experimental vs. Observational
• Longitudinal vs. Cross-sectional
• Prospective vs. Retrospective

Experimental, Longitudinal, Prospective: Randomised Controlled Trial
Observational, Longitudinal, Prospective: Cohort studies
Observational, Longitudinal, Retrospective: Case control studies
Observational, Cross-sectional: Survey

Experimental Designs
Between subjects studies:
  patients → Treatment group → Outcome measured
  patients → Control group → Outcome measured
Within subjects studies:
  patients → Outcome measured #1 → Treatment → Outcome measured #2

Observational studies
Cohort (prospective): cohort → prospectively measure risk factors → end point measured
  Used for: aetiology, prevalence, development, odds ratios
Case-Control (retrospective): cases → retrospectively measure risk factors → start point measured
  Used for: aetiology, odds ratios, prevalence, development

Case-Control Study – Smoking & Cancer
• “Cases” have Lung Cancer
• “Controls” could be other hospital patients (other disease) or “normals”
• Matched Cases & Controls for age & gender
• Option of 2 Controls per Case

Smoking years of Lung Cancer cases and controls (matched for age and sex)

                 Cases (n=456)   Controls (n=456)   F     P
Smoking years    13.75 (± 1.5)   6.12 (± 2.1)       7.5   0.04

Cohort Study: Methods
• Volunteers in 2 groups, e.g. exposed vs non-exposed
• All complete health survey every 12 months
• End point at 5 years: groups compared for Health Status

Comparison of general health between users and non-users of mobile phones

                    ill    healthy   Total
mobile phone user   292    108       400
non-phone user      89     313       402
Total               381    421       802
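A 2 x 2 table like this one is commonly summarised with a relative risk and an odds ratio. The slide reports neither, so the sketch below is an illustration only, and it assumes the cell layout implied by the row and column totals (292 ill and 108 healthy among phone users, 89 ill and 313 healthy among non-users).

```python
# Cell counts, assumed from the totals in the table
ill_users, healthy_users = 292, 108
ill_non_users, healthy_non_users = 89, 313

# Risk of being ill in each group
risk_users = ill_users / (ill_users + healthy_users)
risk_non_users = ill_non_users / (ill_non_users + healthy_non_users)

# Relative risk: how many times more likely illness is among users
relative_risk = risk_users / risk_non_users

# Odds ratio: odds of illness among users divided by odds among non-users
odds_ratio = (ill_users * healthy_non_users) / (healthy_users * ill_non_users)

print(f"risk in phone users: {risk_users:.3f}")
print(f"risk in non-users:   {risk_non_users:.3f}")
print(f"relative risk:       {relative_risk:.2f}")
print(f"odds ratio:          {odds_ratio:.2f}")
```

The two measures differ noticeably here because illness is common in this table; they only approximate each other when the outcome is rare.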

Randomized Controlled Trials in GP & Primary Care
• 90% of consultations take place in GP surgery
• 50 years old
• Potential problems in 2 key areas: Recruitment Bias, Randomisation Bias
• Over-focus on failings of RCTs

RCT Deficiencies
• Trials too small
• Trials too short
• Poor quality
• Poorly presented
• Address wrong question
• Methodological inadequacies
• Inadequate measures of quality of life (changing)
• Cost-data poorly presented
• Ethical neglect
• Patients given limited understanding
• Poor trial management
• Politics
• Marketeering
Why still the dominant model?

Quantitative Data Summary
• What data is needed to answer the larger-scale research question?
• Combination of quantitative and qualitative?
• Cleaning, re-scoring, re-scaling, or re-formatting
• Measurement of both IVs and DVs is complex but can be simplified
• Binary measurement makes analysis easier but less meaningful
• Binary data needs clear parameters, e.g. exposed vs controls

Quantitative Data Summary
• Continuous & Discrete data can also be converted into Binary data
• Normal distribution of participants / data points desirable
• Means - age, height, weight, BMI, IQ, attitudes
• Frequencies / Classifications - job type, sick vs. healthy, dead vs alive
• Means must be followed by Standard Deviation (SD or ±)
• Presentation of data must enhance understanding or be redundant

If you or anyone you know has been affected by any of the issues covered in this lecture, you may need a statistician’s help: www.statistics.gov.uk

Further Reading
Abbott, P., & Sapsford, R. J. (1988). Research methods for nurses and the caring professions. Buckingham: Open University Press.
Altman, D. G. (1991). Designing Research. In D. G. Altman (ed.), Practical Statistics For Medical Research (pp. 74-106). London: Chapman and Hall.
Bland, M. (1995). The design of experiments. In M. Bland (ed.), An introduction to medical statistics (pp. 5-25). Oxford: Oxford Medical Publications.
Bowling, A. (1994). Measuring Health. Milton Keynes: Open University Press.
Daly, L. E., & Bourke, G. J. (2000). Epidemiological and clinical research methods. In L. E. Daly & G. J. Bourke (eds.), Interpretation and uses of medical statistics (pp. 143-201).
Jackson, C. A. (2002). Planning Health and Safety Research Projects. Health and Safety at Work Special Report 62 (pp. 1-16).
Jackson, C. A. (2003). Analyzing Statistical Data in Occupational Health Research. Management of Health Risks Special Report 81 (pp. 2-8).
Kumar, R. (1999). Research Methodology: a step by step guide for beginners. London: Sage.
Polit, D., & Hungler, B. (2003). Nursing research: Principles and methods (7th ed.). Philadelphia: Lippincott, Williams & Wilkins.