Lecture 3: Reliability and validity of scales
• Reliability:
  – internal consistency
  – test-retest
  – inter- and intra-rater
  – alternate form
• Validity:
  – content, criterion, and construct validity
  – responsiveness
Multi-item scales
• Measure constructs without a gold standard
  – e.g., depression, satisfaction, quality of life
• Items are intended to sample the content of the underlying construct
• Items summarized in various ways:
  – sum or average of responses to individual items
  – item weighting or other algorithm
  – profiles/sub-scale scores
Example: Reliability and validity of a measure of severity of delirium
Source: McCusker et al, Internat Psychogeriatrics 1998; 10: 421-33
• Delirium - acute confusion
• Common in older hospitalized patients
• Diagnosis of delirium is based on the following symptoms:
  – acute onset, fluctuations
  – inattention, disorganized thinking
  – altered consciousness, disorientation
  – memory impairment, perceptual disturbances
  – psychomotor agitation or retardation
Requirements of new scale
• Administered by interviewer at bedside
• Not using patient chart (to maintain blinding)
• Brief (avoid patient burden)
• Responsive to within-patient changes over time
Delirium Index (DI)
• Assesses severity of 7 symptoms of delirium (excl. acute onset, fluctuations, sleep disorder):
  – inattention, disorganized thinking
  – altered consciousness, disorientation
  – memory impairment, perceptual disturbances
  – psychomotor agitation or retardation
Administration and scoring
• Administered in conjunction with first 5 questions of Mini-Mental State Exam (MMSE)
• Each symptom rated on 4-point scale:
  – 0 = absent
  – 1 = mild
  – 2 = moderate
  – 3 = severe
• Operational definition of each symptom
Scoring
• Score is sum of 7 item scores
• Scoring of symptoms that could not be assessed:
  – patient non-responsive: items 1, 2, 4, 5 coded as “severe”
  – coding instructions provided for items 3, 6, 7
  – patient refuses: scores for items 1, 2, 4, 5 replaced by score of item 3
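The scoring rules on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' scoring program: the function name `di_total` and the two flags are hypothetical, and the separate coding instructions for items 3, 6, and 7 are not reproduced on the slide, so the sketch assumes those three items have already been rated.

```python
SEVERE = 3
# Items that the slide says are filled in by rule when they cannot be assessed.
RULE_CODED_ITEMS = (1, 2, 4, 5)
ALL_ITEMS = range(1, 8)


def di_total(scores, non_responsive=False, refused=False):
    """Return the DI total (0-21) as the sum of the 7 item scores.

    scores maps item number (1-7) to a rating 0-3. Items 1, 2, 4, 5 may be
    left out when one of the two flags supplies their value.
    """
    filled = dict(scores)
    if non_responsive:
        # Non-responsive patient: items 1, 2, 4, 5 are coded as "severe".
        for item in RULE_CODED_ITEMS:
            filled[item] = SEVERE
    elif refused:
        # Patient refuses: items 1, 2, 4, 5 take the score of item 3.
        for item in RULE_CODED_ITEMS:
            filled[item] = filled.get(3)
    for item in ALL_ITEMS:
        if filled.get(item) not in (0, 1, 2, 3):
            raise ValueError(f"item {item} needs a rating from 0 to 3")
    return sum(filled[item] for item in ALL_ITEMS)


print(di_total({1: 1, 2: 0, 3: 2, 4: 1, 5: 0, 6: 0, 7: 1}))
print(di_total({3: 3, 6: 0, 7: 1}, non_responsive=True))
print(di_total({3: 1, 6: 0, 7: 2}, refused=True))
```

The three calls cover a fully assessed patient, a non-responsive patient, and a patient who refuses; any item left unrated raises an error rather than being silently counted as 0.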
Reliability
• Internal consistency
• Test-retest reliability
• Inter-rater and intra-rater reliability
Internal consistency
• Relevant to additive scales (that sum or average items)
• Split-half reliability:
  – correlation between scores on an arbitrary half of the measure and scores on the other half
• Coefficient alpha (Cronbach)
  – estimates the average split-half correlation over all possible ways of dividing the scale
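As an illustration of coefficient alpha, the sketch below uses the usual computing formula: alpha = k/(k-1) × (1 - sum of item variances / variance of the total score), where k is the number of items. The function name and the tiny data set are made up for the example and are not from the DI study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items, same order)."""
    k = len(rows[0])                               # number of items
    items = list(zip(*rows))                       # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])   # variance of the summed score
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical 2-item scale answered by 4 respondents.
rows = [[0, 1], [1, 0], [2, 2], [3, 3]]
print(round(cronbach_alpha(rows), 3))
```

With these made-up scores alpha comes out at about 0.89; when all items rise and fall together exactly, the formula gives 1.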
Internal consistency of DI
• Cronbach’s alpha (overall) = 0.74
• After exclusion of perceptual disturbance: 0.82
• In sub-groups of patients:
  – delirium and dementia: 0.69, 0.79
  – delirium alone: 0.67, 0.79
  – dementia alone: 0.55, 0.59
  – neither: 0.44, 0.52
Test-retest reliability (stability)
• Scale is repeated
  – short-term
    • for constructs that fluctuate, 2 weeks often used to reduce effects of memory and true change
  – long-term
    • for constructs that should not fluctuate (e.g., personality traits)
• Correlation between 2 scores is computed
• Also important to look at systematic increase or decrease in score
Test-retest reliability of DI
• Delirium is marked by fluctuations
• Variability over time is expected
Mean within-patient standard deviation in DI score during 1st week in hospital
Inter- and intra-rater reliability
Inter-rater reliability
• For scales requiring rater skill, judgment
• 2 or more independent raters of same event
Intra-rater reliability
• Independent rating by same observer of same event
Measures of inter- and intra-rater reliability: categorical data
• Percent agreement
  – can be used for di- and polychotomous scales
  – limitation: value is affected by prevalence (higher if prevalence is very low or very high)
• Kappa statistic
  – takes chance agreement into account
  – defined as fraction of observed agreement not due to chance
Kappa statistic
Kappa = [p(obs) - p(exp)] / [1 - p(exp)]
• p(obs): proportion of observed agreement
• p(exp): proportion of agreement expected by chance
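A minimal sketch of this formula for two raters (Cohen's kappa), with p(exp) taken from each rater's marginal frequencies. The function name and the ten made-up ratings are for illustration only.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return (p_obs, p_exp, kappa) for two raters rating the same events."""
    n = len(rater_a)
    # p(obs): proportion of events on which the two raters agree.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p(exp): agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)


# Made-up ratings: 4 yes/yes, 3 no/no, 2 yes/no, 1 no/yes.
rater_a = ["yes"] * 4 + ["no"] * 3 + ["yes"] * 2 + ["no"] * 1
rater_b = ["yes"] * 4 + ["no"] * 3 + ["no"] * 2 + ["yes"] * 1
p_obs, p_exp, kappa = cohen_kappa(rater_a, rater_b)
print(round(p_obs, 2), round(p_exp, 2), round(kappa, 2))
```

Here the raters agree on 70% of events, 50% agreement would be expected by chance, and kappa is 0.4.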
Interpretation of kappa
• Various suggested interpretations
• Example: Fleiss (1981)
  – excellent: 0.75 and above
  – fair to good: 0.40 - 0.74
  – poor: less than 0.40
• Limitations
  – depends on prevalence (see Szklo & Nieto)
  – do not use as only measure of agreement
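The dependence on prevalence can be seen with two made-up 2x2 tables that have the same percent agreement. The function name and counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kappa_2x2(both_pos, only_a_pos, only_b_pos, both_neg):
    """Return (percent agreement, kappa) from the four cells of a 2x2 table."""
    n = both_pos + only_a_pos + only_b_pos + both_neg
    p_obs = (both_pos + both_neg) / n
    a_pos = (both_pos + only_a_pos) / n      # proportion rated positive by rater A
    b_pos = (both_pos + only_b_pos) / n      # proportion rated positive by rater B
    p_exp = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Same 90% agreement, different prevalence of the positive rating.
balanced = kappa_2x2(45, 5, 5, 45)   # about half the events rated positive
rare = kappa_2x2(5, 5, 5, 85)        # about 10% of the events rated positive
print(balanced)
print(rare)
```

Both tables show 90% agreement, but kappa is 0.80 in the balanced table and about 0.44 when the positive rating is rare, which is why kappa should not be the only measure of agreement reported.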
Measures of inter- and intra-rater reliability: continuous data
• Measures of correlation
  – Correlation graph (scatter diagram)
  – Correlation coefficients
• Measures of pairwise comparison
Correlation coefficients
• Pearson’s r
  – assesses linear association, not systematic differences between 2 sets of observations
  – sensitive to range of values, especially outliers
• Spearman r
  – ordinal or rank order correlation
  – less influenced by outliers
  – doesn’t assess systematic differences
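To illustrate the point about outliers, the sketch below computes both coefficients on made-up data, once as is and once with the last value replaced by an extreme one. Spearman's coefficient is computed as Pearson's r on the ranks; all names and numbers are illustrative.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_r(x, y):
    """Spearman correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y_clean = [2, 1, 4, 3, 5]
y_outlier = [2, 1, 4, 3, 100]   # same rank order, one extreme value
print(pearson_r(x, y_clean), spearman_r(x, y_clean))
print(pearson_r(x, y_outlier), spearman_r(x, y_outlier))
```

The single extreme value changes Pearson's r (from 0.80 to about 0.72) while Spearman's coefficient stays at 0.80, because the rank order is unchanged.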
Correlation coefficients
• Intra-class correlation coefficient (ICC)
  – Estimates the proportion of total measurement variability that is due to differences between individuals (vs error variance)
  – Analogous to kappa (equivalent to weighted kappa under certain conditions), with the same range of values
  – Reflects true agreement, including systematic differences
  – Affected by range of values: if there is less variation between individuals, ICC will be lower
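A sketch of one common form of the ICC, the one-way random-effects version, computed from the between-subject and within-subject mean squares. The slides do not say which form of ICC was used in the DI study, so this choice is an assumption, and the ratings are made up.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way ICC. ratings: one list per subject holding k ratings of that subject."""
    n = len(ratings)
    k = len(ratings[0])
    grand = mean([x for row in ratings for x in row])
    subj_means = [mean(row) for row in ratings]
    # Between-subject mean square: how much subjects differ from each other.
    ms_between = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    # Within-subject mean square: disagreement between ratings of the same subject.
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, subj_means) for x in row)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Second rater always scores 2 points higher than the first (Pearson's r = 1).
wide = [[10, 12], [12, 14], [14, 16], [16, 18]]
# Same 2-point difference, but the subjects are more alike.
narrow = [[12, 14], [13, 15], [15, 17], [16, 18]]
print(round(icc_oneway(wide), 3), round(icc_oneway(narrow), 3))
```

The systematic 2-point difference pulls the ICC below 1 (about 0.74) even though the two raters are perfectly correlated, and the ICC falls further (about 0.54) when the subjects vary less.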
Inter-rater reliability of DI
• Intraclass correlation coefficient (ICC):
  – n = 26 patients (39 pairs of ratings)
  – ICC = 0.98 (SD 0.06)
Alternate form reliability
• Agreement between alternate forms of same instrument:
  – longer vs shorter version
  – alternate method of administration:
    • face-to-face vs telephone
    • subject vs proxy (see Magaziner paper)
Validity
• Content and face validity
• Criterion validity: concurrent and predictive
• Construct validity
Validity
• Depends on purpose:
  – screening: discrimination
  – outcome of treatment: responsiveness, sensitivity to change
  – prognosis: predictive validity
Content and face validity
• Judgment of “experts” and/or members of target population
• Does measure adequately sample domain being measured?
• Does it appear to measure what it is intended to measure? (eyeball test)
Content validity of DI
• Based on Confusion Assessment Method (CAM)
  – based on accepted diagnostic criteria (DSM)
  – widely used
Criterion validity
• Criterion (“gold” standard)
• Concurrent criterion validity
  – e.g., screening test vs diagnostic test
• Predictive criterion validity
  – e.g., cancer staging test vs 5-year survival
Criterion validity of DI
• Correlation between psychiatrist-scored DI (based only on patient observation) and Delirium Rating Scale (using all available information)
  – original scale
  – adjusted scale, omitting 4 items not assessed by DI
Criterion validity of DI: results
• Spearman correlation coefficient (and 95% CI) between DI and adjusted DRS (using multiple observations):
  – at one point in time: 0.84 (0.75, 0.89)
  – within-subject change over time: 0.71 (0.53, 0.82)
Delirium severity and survival
• Proportional hazards regression of delirium severity in delirium cohort
• Mean of first 2 DI scores
• Results
  – significant interaction: DI predicted survival in patients with delirium alone, not in those with dementia
Construct validity
• Is theoretical construct underlying the measure valid?
• Development and testing of hypotheses
• Requires multiple data sources and investigations:
  – Convergent validity: measure is correlated with other measures of similar constructs
  – Discriminant validity: measure is not correlated with measures of different constructs
Construct validity (cont)
• Multitrait-multimethod:
  – Convergent validity: measure is correlated with other measures of similar constructs
  – Discriminant validity: measure is not correlated with measures of different constructs
• Factorial method:
  – factor analysis or principal components analysis to identify underlying dimensions
Spearman correlation coefficients between Delirium Index and 3 baseline measures of current status
Spearman correlation coefficients between Delirium Index and 3 baseline measures of prior status
Responsiveness of measures
• Ability to detect clinically important change over time or differences between treatments
• Requirement of evaluative measures
Some sources of bias in scales
• “Response sets”
  – Social desirability
  – Acquiescence
Social desirability
• Tendency to give answers to questions that are perceived to be more socially desirable than the true answer
• Different from deliberate distortion (“faking good”)
• Depends on:
  – Individual characteristics (age, sex, cultural background)
  – Specific question
Social desirability
• Measures of social desirability (SD)
  – SD scales (e.g., Jackson SD scale, Crowne & Marlowe SD scale)
  – individual tendency to SD bias
• Prevention
  – phrasing of questions
  – questionnaire mode
  – training of interviewers
Acquiescent response set
• Tendency to agree with Likert-type questions
• Can be prevented by mix of positively and negatively-phrased questions, e.g.:
  – My health care is just about perfect
  – There are serious problems with my health care
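When positively and negatively phrased items are mixed, the negatively phrased ones are usually reverse-scored before items are combined. The sketch below assumes a 5-point agreement scale (1 = strongly disagree, 5 = strongly agree); the item keys and function name are hypothetical.

```python
def balanced_score(responses, negative_items, low=1, high=5):
    """Mean item score after reverse-scoring the negatively phrased items."""
    scored = [
        (low + high - value) if item in negative_items else value
        for item, value in responses.items()
    ]
    return sum(scored) / len(scored)


negative_items = {"serious_problems"}

# A respondent who agrees strongly with everything, whatever the content.
yea_sayer = {"care_is_perfect": 5, "serious_problems": 5}
# A respondent who is consistently satisfied.
satisfied = {"care_is_perfect": 5, "serious_problems": 1}

print(balanced_score(yea_sayer, negative_items))
print(balanced_score(satisfied, negative_items))
```

The respondent who agrees with both statements ends up at the scale midpoint (3.0) rather than at the top, while the consistently satisfied respondent scores 5.0.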
Measurement of quality of life (QoL)
• Definition
  – “individuals’ perception of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards, and concerns” (WHO QOL group, 1995)
• Domains
  – physical, psychological, level of independence, social relationships, environment, and spirituality/religion/personal beliefs
Health-related quality of life (HRQoL)
• Dimensions of QoL related to health
• Related terms:
  – health status
  – functional status
• Usually includes:
  – physical health/function
  – mental health/function
  – social health/function
Evaluative HRQoL instruments
• Purpose
  – evaluate within-individual change over time
• Reliability:
  – responsiveness
• Construct validity:
  – correlations of changes in measures during period of time, consistent with theoretically derived predictions
Discriminative HRQoL instruments
• Purpose
  – evaluate differences between individuals at point in time
• Reliability:
  – reproducibility
• Construct validity:
  – correlations between measures at point in time, consistent with theoretically derived predictions
How is HRQoL measured?
• Mode
  – Interviewer
    • face-to-face
    • telephone
  – Self-completed
• Completed by
  – self
  – proxy/surrogate
Types of HRQoL measures
• Generic (global)
  – Health profiles
  – Utility measures
• Specific
Generic vs specific
• Generic
  – comparisons across populations and problems
  – robust and generalizable
  – measurement properties better understood
• Disease-specific
  – shorter
  – more relevant and appropriate
  – sensitive to change
Appropriateness
• Purpose:
  – describe health of population
  – evaluate effects of interventions (change over time)
  – compare groups at point in time
  – predict outcomes
• Areas of function covered
• Level of health
• Generic/global or specific