Reliability and Validity
Introduction to Study Skills & Research Methods (HL 10040)
Dr James Betts
Lecture Outline: • Definition of Terms • Types of Validity • Threats to Validity • Types of Reliability • Threats to Reliability • Introduction to Measurement Error.
Commonly used terms… “She has a valid point” “My car is unreliable” …in science… “The conclusion of the study was not valid” “The findings of the study were not reliable”.
Some definitions… • Validity “The soundness or appropriateness of a test or instrument in measuring what it is designed to measure” (Vincent 1999)
Some definitions… • Validity “Degree to which a test or instrument measures what it purports to measure” (Thomas & Nelson 1996)
Some definitions… • Reliability “…the degree to which a test or measure produces the same scores when applied in the same circumstances…” (Nelson 1997)
Some definitions… • Objectivity “…the degree to which different observers agree on measurements…” (Atkinson & Nevill 1998)
Types of Experimental Validity • Internal – Is the experimenter measuring the effect of the independent variable on the dependent variable? • External – Can the results be generalised to the wider population?
Overview diagram: Validity • Logical (Face, Content) • Statistical, AKA Criterion (Concurrent, Predictive) • Construct. Reliability, AKA Consistency • Objectivity
Logical Validity • Face Validity – Implies that a test is valid by definition – It is clear that the test measures what it is supposed to, e.g. if you want to assess reaction time, measuring how long it takes an individual to react to a given stimulus would have face validity. Externally valid?
Logical Validity • Face Validity – Implies that a test is valid by definition – It is clear that the test measures what it is supposed to, i.e. would assessing 15 m sprint time be a valid means of assessing reaction time? Assessing face validity is therefore a subjective process.
Logical Validity • Content Validity – Implies that the test measures all aspects contributing to the variable of interest, e.g. Who is the most physically fit? – VO2 max test? – Wingate test? – 1RM? …also a subjective process.
Overall: A logically valid test simply appears to measure the right variable in its entirety?
Statistical Validity • Concurrent Validity – Implies that the test produces similar results to a previously validated test, e.g. VO2 max: Incremental Treadmill Protocol with expired gas analysis versus Multi-Stage Fitness (Beep) Test
Statistical Validity • Predictive Validity – Implies that the test provides a valid reflection of future performance using a similar test, e.g. can performance during test A be used to predict future performance in test B? (Video: http://www.youtube.com/watch?v=vdPQ3QxDZ1s)
Overall: A statistically valid test produces results that agree with other similar tests?
Logical/Statistical Validity • Construct Validity – Implies not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically – Therefore relates to hypothetical or intangible constructs, e.g. Team Rivalry, Sportsmanship.
Logical/Statistical Validity • Construct Validity – Implies not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically – Therefore relates to hypothetical or intangible constructs – This makes assessment difficult, i.e. if what should exist cannot be detected, this could mean: a) Test Invalid? b) Theory Incorrect? c) Sensitivity/Specificity Issues?
Interesting Example: Breast Cancer
• Incidence: ~1 % (0.8 %) (i.e. a positive result should be detected for approximately 1 in every 100 women tested)
• Sensitivity: ~90 % (87 %) (the mammogram is sensitive enough that approximately 90 in every 100 breast cancer patients will receive a positive result)
• Specificity: ~90 % (93 %) (the mammogram is specific enough that approximately 90 in every 100 healthy patients will receive a negative result)
Data from Kerlikowske et al. (1996)
Quick Test • What is the probability that a patient receiving a positive result actually has breast cancer?
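A minimal worked sketch of the Quick Test, assuming the intended route is Bayes' theorem applied to the incidence, sensitivity and specificity figures on the previous slide. The function name is illustrative and does not come from the lecture.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of disease given a positive test, via Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Rounded figures from the slide: incidence 1 %, sensitivity 90 %, specificity 90 %
ppv_rounded = positive_predictive_value(0.01, 0.90, 0.90)
# Bracketed figures from the slide: 0.8 %, 87 %, 93 %
ppv_reported = positive_predictive_value(0.008, 0.87, 0.93)

print(f"Rounded figures:  {ppv_rounded:.1%}")   # about 8 %
print(f"Reported figures: {ppv_reported:.1%}")  # about 9 %
```

With either set of figures, fewer than 1 in 10 positive results correspond to an actual case, because the healthy group is so much larger that its false positives outnumber the true positives.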
Threats to Validity (and possible solutions?)
Threats to Internal Validity • Maturation – Changes in the DV over time irrespective of the IV
Threats to Internal Validity • Maturation e.g. One Group Pre-test Post-test: O1 T O2
Threats to Internal Validity • Maturation (possible solution) Time series: O1 O2 O3 T O4 O5 O6
Threats to Internal Validity • Maturation (possible solution) Pre-test Post-test Randomised Group Comparison (n.b. RCT):
R  O1  T  O2
R  O3  P  O4
Threats to Internal Validity • Maturation (possible solution) Repeated measures designs can occasionally be an inappropriate solution, even when randomised and counterbalanced, e.g. Muscle Damage (repeated bout effect), Vitamin Supplementation (wash-out period). In which case independent measures designs could be used.
Threats to Internal Validity • History – Unplanned events between measurements
Threats to Internal Validity • History: O1 T O2, e.g. exercise? Therefore, solution = control extraneous variables!
Threats to Internal/External Validity • Pre-testing – Interactive effects due to the pre-test (e.g. learning, sensitisation, etc.) – Also influences External Validity
Threats to Internal/External Validity • Pre-testing e.g.
R  O1  T  O2
R  O3  P  O4
Assessing muscle mass here (at the pre-test) could make them train harder in both trials… …but then respond better to the T than the P… …so it is actually T+O1 that is better than P, not T alone.
Threats to Internal/External Validity • Pre-testing (possible solution) Solomon Four-Group Design:
R  O1  T  O2
R  O3  P  O4
R      T  O5
R      P  O6
Threats to Internal Validity • Statistical Regression – AKA regression to the mean – An initial extreme score is likely to be followed by less extreme subsequent scores, e.g. Sophomore Slump & SI ‘Cover Jinx’; training has the greatest effect on untrained individuals. Therefore, solution = effective sampling.
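Regression to the mean can be illustrated with a small simulation. This is a sketch under stated assumptions, not data from the lecture: each person has a stable true score plus independent random noise on each test, and the means and standard deviations below are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

N = 10_000
TRUE_SD, NOISE_SD = 5.0, 5.0  # hypothetical spread of true scores and of noise

true_scores = [random.gauss(50, TRUE_SD) for _ in range(N)]
test1 = [t + random.gauss(0, NOISE_SD) for t in true_scores]
test2 = [t + random.gauss(0, NOISE_SD) for t in true_scores]

# Select the 10 % of people with the most extreme (highest) first score
cutoff = sorted(test1)[int(0.9 * N)]
selected = [i for i in range(N) if test1[i] >= cutoff]

mean1 = sum(test1[i] for i in selected) / len(selected)
mean2 = sum(test2[i] for i in selected) / len(selected)

print(f"Top group, test 1: {mean1:.1f}")
print(f"Top group, test 2: {mean2:.1f}")  # lower, although no treatment was given
```

The selected group scores lower on the second test with no intervention at all, which is why a group chosen for its extreme baseline scores can appear to respond to a treatment that has no effect.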
Threats to Internal Validity • Instrumentation – A difference in the way 2 comparable variables were measured, e.g. uncalibrated equipment. Therefore, solution = calibrate!
Threats to Internal Validity • Selection Bias – The groups for comparison are not equivalent
Threats to Internal Validity • Selection Bias e.g. Groups not randomly assigned, i.e. Static Group Comparison:
T  O1
P  Oa
Group T were resistance trained to start with.
Threats to Internal Validity • Selection Bias (possible solution)
T  O1
P  Oa
Either: – Randomise group assignment, – Pre-test and post-test difference, – Repeated Measures Design.
Threats to Internal/External Validity • Experimental Mortality – Missing data due to subject drop-out – Reduced n = reduced statistical power – Not only challenges quality of data gathered (Internal Validity) but also our ability to generalise (External Validity). Therefore, solution = recruit sufficient (young?) participants.
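The link between drop-out and power can be sketched numerically. The example below assumes a two-group comparison of means and uses a normal approximation; the effect size and group sizes are hypothetical, and the function name is illustrative.

```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means
    (normal approximation; the negligible opposite tail is ignored)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(noncentrality - z_crit)


EFFECT_SIZE = 0.5  # hypothetical standardised difference between groups

for n in (64, 48, 32, 16):
    print(f"n = {n:2d} per group -> power ~ {approx_power(n, EFFECT_SIZE):.2f}")
```

Under these assumptions, a study recruited to give roughly 80 % power at 64 per group loses power steadily as participants drop out, which is the argument for recruiting more participants than the minimum required.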
Threats to External Validity • Inadequate description – 5th characteristic of research… …should be replicable. If nobody can replicate the methods of a given study, then it is irrefutable and therefore lacks external validity. Therefore, solution = comprehensive methodology.
Threats to External Validity • Biased sampling – Linked to statistical regression – Sample does not reflect target population – n ≠ N, e.g. results generalised across gender. Therefore, solution = random sample (of target population).
Threats to External Validity • Hawthorne Effect – DV is influenced by the fact that it is being recorded, e.g. fastest sprint when professor enters lab. Therefore, solution = control the lab environment.
Threats to External Validity • Demand Characteristics – Participants detect the purpose of the study and behave accordingly, e.g. Sports Science students already know that the carbohydrate drink is supposedly superior. Therefore, solution = double or single blinding (e.g. CHO versus H2O).
Threats to External Validity • Operationalisation – AKA Ecological Validity – The DV must have some relevance in the ‘real world’, e.g. TTE has no Olympic equivalent. Therefore, solution = choose your DV carefully.
Reliability • Reliability is a pre-requisite of validity, e.g. Direct versus Indirect measures of VO2 max. Direct: – Gold Standard (i.e. valid and reliable) – Expensive, Complex. Indirect: – Predictive – Cheap, Easy.
Reliability
Subject 1: 60 ml.kg-1.min-1
Subject 2: 55 ml.kg-1.min-1
Subject 3: 70 ml.kg-1.min-1
Valid and Reliable
Reliability
Subject 1: 60 ml.kg-1.min-1, 65 ml.kg-1.min-1
Subject 2: 55 ml.kg-1.min-1, 60 ml.kg-1.min-1
Subject 3: 70 ml.kg-1.min-1, 75 ml.kg-1.min-1
Not Valid but Reliable. 5 ml.kg-1.min-1 correction?
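The suggested 5 ml.kg-1.min-1 correction can be checked with a short sketch. It assumes the first value for each subject is the criterion (valid) score and the second is the score from the test being evaluated; that reading of the slide, and the variable names, are assumptions.

```python
from statistics import mean, stdev

criterion = [60, 55, 70]  # first value per subject on the slide (ml.kg-1.min-1)
measured = [65, 60, 75]   # second value per subject on the slide

differences = [m - c for m, c in zip(measured, criterion)]
bias = mean(differences)     # systematic error: constant offset from the criterion
spread = stdev(differences)  # random error: variation around that offset

# Removing a purely systematic error recovers the criterion scores
corrected = [m - bias for m in measured]

print(f"bias = {bias}, SD of differences = {spread}")
print("corrected:", corrected)
```

Every subject is over-estimated by the same amount, so the error is entirely systematic: the test is consistent (reliable) and becomes valid once the offset is subtracted.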
Reliability
Subject 1: 60 ml.kg-1.min-1, 72 ml.kg-1.min-1, 57 ml.kg-1.min-1
Subject 2: 55 ml.kg-1.min-1, 61 ml.kg-1.min-1, 52 ml.kg-1.min-1
Subject 3: 70 ml.kg-1.min-1, 40 ml.kg-1.min-1, 84 ml.kg-1.min-1
Not Valid and not Reliable, i.e. a test can never be valid without being reliable?
Types of Reliability • Relative • Absolute • Rater reliability (Objectivity) – Intrarater reliability – Interrater reliability.
Relative Reliability
Subject 1: 60 ml.kg-1.min-1, 63 ml.kg-1.min-1, 57 ml.kg-1.min-1
Subject 2: 55 ml.kg-1.min-1, 56 ml.kg-1.min-1, 48 ml.kg-1.min-1
Subject 3: 70 ml.kg-1.min-1, 65 ml.kg-1.min-1, 66 ml.kg-1.min-1
Relatively Reliable, i.e. individuals maintain position in the group
Absolute Reliability
Subject 1: 60 ml.kg-1.min-1, 63 ml.kg-1.min-1, 57 ml.kg-1.min-1
Subject 2: 55 ml.kg-1.min-1, 56 ml.kg-1.min-1, 48 ml.kg-1.min-1
Subject 3: 70 ml.kg-1.min-1, 65 ml.kg-1.min-1, 66 ml.kg-1.min-1
Not Absolutely Reliable, i.e. test-retest within individuals
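The difference between the two slides can be made concrete with the same nine values. The sketch below uses rank order as a simple indicator of relative reliability and the within-subject range and standard deviation as simple indicators of absolute reliability; these particular statistics are an illustrative choice, not necessarily the ones used later in the course.

```python
from statistics import stdev

# Three trials per subject, copied from the slide (ml.kg-1.min-1)
trials = {
    "Subject 1": [60, 63, 57],
    "Subject 2": [55, 56, 48],
    "Subject 3": [70, 65, 66],
}


def ranking(trial_index):
    """Subjects ordered from highest to lowest score on one trial."""
    return sorted(trials, key=lambda s: trials[s][trial_index], reverse=True)


rankings = [ranking(i) for i in range(3)]
rank_order_maintained = all(r == rankings[0] for r in rankings)

# Within-subject variation across the three trials
within_range = {s: max(v) - min(v) for s, v in trials.items()}
within_sd = {s: stdev(v) for s, v in trials.items()}

print("Rank order maintained:", rank_order_maintained)
for s in trials:
    print(f"{s}: range = {within_range[s]}, SD = {within_sd[s]:.1f}")
```

The ranking is identical on every trial, yet each individual's score moves by 5 to 8 ml.kg-1.min-1 between trials, so the same data are relatively reliable but not absolutely reliable.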
Rater Reliability • Intrarater reliability – The consistency of a given observer or measurement tool on more than one occasion
Rater Reliability • Interrater reliability – The consistency of a given measurement from more than one observer or measurement tool, e.g. Score for the American Gymnast: British Judge = 9.9, French Judge = 4.4, Japanese Judge = 7.0
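The disagreement between the three judges can be summarised with a few descriptive statistics. This is only a sketch of one possible summary (mean, standard deviation and range of the scores from the slide), not a formal interrater reliability coefficient.

```python
from statistics import mean, stdev

# Scores for the same performance, copied from the slide
scores = {"British": 9.9, "French": 4.4, "Japanese": 7.0}

values = list(scores.values())
score_range = max(values) - min(values)

print(f"mean = {mean(values):.1f}, SD = {stdev(values):.2f}, range = {score_range:.1f}")
```

A range of 5.5 points for a single performance indicates poor agreement between observers, i.e. low objectivity.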
Threats to Reliability • Fatigue Subject 1: 8 am 60 ml.kg-1.min-1, 9 am 55 ml.kg-1.min-1, 10 am 50 ml.kg-1.min-1. Therefore, solution = increase time between tests.
Threats to Reliability • Habituation Subject 1: 60 ml.kg-1.min-1, 65 ml.kg-1.min-1, 70 ml.kg-1.min-1. Therefore, solution = familiarise prior to test.
Threats to Reliability • Standardisation of Procedures – Control of extraneous variables • Precision of Measurements – i.e. if we are happy to measure VO2 max to the nearest 10 ml.kg-1.min-1, then it could probably be reliably predicted from your training volume and age.
Measurement Errors • Ultimately, reliability is dependent on the degree of measurement error in a given study • The overall error in any measurement is comprised of both systematic and random error • We will address measurement error further next week…
Literature Search Assignment • The handout lists 8 questions which can be answered through retrieving the corresponding source articles • Answer as many as possible and bring them to next week’s lecture • DO NOT contact author or order articles.
Selected Reading
• Atkinson, G. and A. M. Nevill. Statistical methods for assessing measurement error (Reliability) in variables relevant to sports medicine. Sports Medicine. 26: 217-238, 1998.
• Holmes, T. H. Ten categories of statistical errors: a guide for research in endocrinology and metabolism. American Journal of Physiology. 286: E495-501.
• Thomas J. R. & Nelson J. K. (2001) Research Methods in Physical Activity, 4th edition. Champaign, Illinois: Human Kinetics.
J.Betts@bath.ac.uk