FRCEM Critical Appraisal

Format
• 90 minutes
• Diagnostic or therapeutic papers
• SAQ
  – Summary / abstract
  – Design: good / bad points
  – Definitions
  – What do the results mean?
  – How do the results relate to practice?
  – Implement or not?

STATS YOU NEED TO KNOW!

Types of Data
• Continuous
  – Normal distribution
    • parametric tests such as t test, ANOVA
    • mean
  – Non-normal distribution
    • non-parametric tests such as Mann-Whitney U, Wilcoxon rank sum, Kruskal-Wallis
    • median
• Categorical
  – Nominal
    • chi-squared test
    • Fisher’s exact test
    • mode
  – Ordinal

P Value
• Probability that the result (difference) you see, or one more extreme, has arisen purely by chance if the null hypothesis is true
• Arbitrary level of 0.05 (1 in 20) set as the level of statistical significance
• This is not the same as clinical significance!

If the coin comes down heads, does that mean it is loaded? What if the coin had more metal on the tail side?
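
The coin question can be made concrete with an exact binomial calculation. The sketch below is illustrative only: the function name and the toss counts are assumptions, not part of the slides. It shows that one head tells you nothing, whereas 9 heads in 10 tosses would be unusual for a fair coin.

```python
from math import comb


def binomial_two_sided_p(heads: int, tosses: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a coin: the total probability, under the
    null hypothesis, of every outcome no more likely than the one observed."""

    def pmf(k: int) -> float:
        return comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)

    observed = pmf(heads)
    # The small tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(tosses + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical experiments (assumed numbers, for illustration only).
one_toss = binomial_two_sided_p(heads=1, tosses=1)     # a single head
ten_tosses = binomial_two_sided_p(heads=9, tosses=10)  # 9 heads out of 10

print(f"1 head in 1 toss:    p = {one_toss:.3f}")
print(f"9 heads in 10 tosses: p = {ten_tosses:.3f}")
```

A single toss cannot give a p value below 0.05 whatever the result, which is one reason sample size matters.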

Confidence Interval
• Usually quoted as 95%
• We can be 95% sure / confident / certain that the actual value lies within the range quoted
• (There is a 5% chance that the actual value lies outside of this range of values)
• NOT that 95% of the values lie within the range

Randomisation
• Subjects are randomly assigned to a particular (treatment) group
  – Random number generator
  – Sealed envelopes
  – Batch randomisation
  – Cluster randomisation
• Tries to ensure each group is similar (table 1 demographics) apart from the treatment
• Some studies do not need randomisation! Diagnostic studies are cohorts where every subject should have the test and the gold standard

Blinding
• Hawthorne, Rosenthal / Pygmalion, John Henry effects, self-fulfilling prophecy
• Allows for human behaviours that might affect subjective measures / outcomes
• Not all studies need to be blinded! Objective measures

Blinding
• Similar medication appearances
• Sham surgery
• Data collectors unaware of treatment group
• Gold standard unaware of results of test

Inter-Observer Agreement
• Do you get similar results with the same test when read by different people?
• Kappa value
• -1 (complete disagreement) to +1 (complete agreement)
• 0 = agreement purely by chance
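
As a rough illustration of where a kappa value comes from, the sketch below computes Cohen's kappa for two observers giving a positive / negative reading. The function name and the counts are hypothetical, chosen only to show the arithmetic: observed agreement is compared with the agreement expected by chance.

```python
def cohens_kappa(both_pos: int, a_only: int, b_only: int, both_neg: int) -> float:
    """Cohen's kappa for two observers rating each case positive or negative.

    both_pos: both observers positive     a_only: only observer A positive
    b_only:   only observer B positive    both_neg: both observers negative
    """
    n = both_pos + a_only + b_only + both_neg
    observed = (both_pos + both_neg) / n          # proportion of cases agreed on
    a_pos = (both_pos + a_only) / n               # how often A says positive
    b_pos = (both_pos + b_only) / n               # how often B says positive
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)  # agreement by chance
    return (observed - expected) / (1 - expected)


# Hypothetical counts for 50 patients read by two observers.
kappa = cohens_kappa(both_pos=20, a_only=5, b_only=10, both_neg=15)
print(f"kappa = {kappa:.2f}")
```

Here the observers agree on 70% of cases, but 50% agreement would be expected by chance alone, so kappa is well below the raw agreement.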

Power
• The ability of a study to find a difference should a difference exist
• Determined by:
  – Size of difference
  – Level of accepted statistical significance (alpha, usually 0.05)
  – Desired chance / ability to detect the difference (power = 1 - beta, usually set at 80%, i.e. beta = 0.2)
  – Sample size
• ‘Sample size required is N for an 80% power to detect a difference of x at the p = 0.05 level’
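
The way these ingredients interact can be sketched with a normal approximation for a comparison of two group means. This is an assumed, simplified model (two-sided z-test, equal group sizes, known standard deviation), and the difference of 3 and standard deviation of 4 are hypothetical values, not taken from any paper.

```python
from statistics import NormalDist


def power_two_means(delta: float, sd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two group means.

    delta: true difference between the means
    sd:    standard deviation within each group (assumed equal and known)
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    se = sd * (2 / n_per_group) ** 0.5     # standard error of the difference
    # Ignores the negligible chance of a significant result in the wrong direction.
    return z.cdf(abs(delta) / se - z_crit)


# Hypothetical scenario: difference of 3 points, SD of 4 points.
for n in (10, 28, 60):
    print(f"n = {n:>2} per group -> power = {power_two_means(3.0, 4.0, n):.2f}")
```

Under these assumptions roughly 28 patients per group give about 80% power; fewer patients, a smaller difference or a stricter alpha all reduce it.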

Intention to Treat
• Preserves effects of randomisation
• Mirrors real world activity: withdrawals, incomplete treatments, using additional treatments

Test Characteristics
• Sensitivity
• Specificity
• Predictive values
• Likelihood ratios
• ROC curve

2 x 2 Table

                     Gold standard
                     Disease present    Disease absent
Test positive        a (TP)             b (FP)
Test negative        c (FN)             d (TN)

Sensitivity = a/(a+c)
Specificity = d/(b+d)
Note ‘SpIn’ vs ‘SnOut’

What Do the Sensitivity and Specificity Not Tell You?
• Sensitivity & specificity are derived from comparison with the gold standard
  – Implies you already know the diagnosis
• They don’t tell you what a particular test result means for your patient
  – So does my patient have the disease or is the result a false positive?

Most Tests Provide a Continuous Score. Selecting a Cut-Point
[Figure: overlapping distributions of test scores for a healthy population and a sick population, with possible cut-points marked between the healthy and pathological scores. Moving the cut-point towards the healthy scores increases sensitivity (includes more of the sick group); moving it towards the pathological scores increases specificity (excludes healthy people).]
Crucial issue: changing the cut-point can improve sensitivity or specificity, but never both

2 x 2 Table for Testing a Test

                     Gold standard
                     Disease present    Disease absent
Test +ve             a (TP)             b (FP)             PPV = a/(a+b)
Test -ve             c (FN)             d (TN)             NPV = d/(c+d)

Sensitivity = a/(a+c)
Specificity = d/(b+d)
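
The four formulas in the table can be checked with a few lines of code. The class name and the counts below are hypothetical and serve only to illustrate the a / b / c / d layout.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from a 2 x 2 table of test result against gold standard."""
    tp: int  # a: test positive, disease present
    fp: int  # b: test positive, disease absent
    fn: int  # c: test negative, disease present
    tn: int  # d: test negative, disease absent

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)   # a / (a + c)

    @property
    def specificity(self) -> float:
        return self.tn / (self.fp + self.tn)   # d / (b + d)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)   # a / (a + b)

    @property
    def npv(self) -> float:
        return self.tn / (self.fn + self.tn)   # d / (c + d)


# Hypothetical study of 200 patients.
table = TwoByTwo(tp=90, fp=30, fn=10, tn=70)
print(f"Sensitivity {table.sensitivity:.0%}, specificity {table.specificity:.0%}")
print(f"PPV {table.ppv:.0%}, NPV {table.npv:.1%}")
```

Sensitivity and specificity read down the columns (disease status known); predictive values read across the rows (test result known).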

Positive and Negative Predictive Values
• Given a test result, what is the probability the patient has / doesn’t have the disease?
• But very dependent on prevalence
• As prevalence goes down, PPV goes down (it’s harder to find the smaller number of cases) and NPV rises
• May not be applicable to your population if local prevalence is different

Prevalence and Predictive Values

A. Specialist referral hospital
         D+      D-
T+       50      10
T-        5     100
Sensitivity = 50/55 = 91%
Specificity = 100/110 = 91%
Prevalence = 55/165 = 33%
PPV = 50/60 = 83%
NPV = 100/105 = 95%

B. Primary care
         D+      D-
T+       50     100
T-        5    1000
Sensitivity = 50/55 = 91%
Specificity = 1000/1100 = 91%
Prevalence = 55/1155 = 5%
PPV = 50/150 = 33%
NPV = 1000/1005 = 99.5%
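
The same point can be reproduced from sensitivity, specificity and prevalence alone. The sketch below uses the counts from the two settings on this slide; the function name is an assumption made for illustration.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence                 # true positives per patient tested
    fp = (1 - specificity) * (1 - prevalence)     # false positives
    fn = (1 - sensitivity) * prevalence           # false negatives
    tn = specificity * (1 - prevalence)           # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Same test in both settings: sensitivity 50/55, specificity 100/110.
sens, spec = 50 / 55, 100 / 110

hospital_ppv, hospital_npv = predictive_values(sens, spec, prevalence=55 / 165)
primary_ppv, primary_npv = predictive_values(sens, spec, prevalence=55 / 1155)

print(f"Referral hospital: PPV {hospital_ppv:.0%}, NPV {hospital_npv:.0%}")
print(f"Primary care:      PPV {primary_ppv:.0%}, NPV {primary_npv:.1%}")
```

The test itself is unchanged between the two settings; only the prevalence differs, and that alone moves the PPV from 83% to 33%.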

Likelihood Ratios
• How much more likely a given test result is in a patient with the disease than in a patient without
• Advantages:
  – Combines sensitivity and specificity into one number
  – Can be calculated for many levels of the test
  – Not dependent on prevalence
  – Can calculate probabilities of disease (Bayesian theory)
• LR for positive test = Sensitivity / (1 - Specificity)
• LR for negative test = (1 - Sensitivity) / Specificity
• Relationship to ROC curve
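
A short sketch of the Bayesian step: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. The sensitivity, specificity and pre-test probability used here are hypothetical, as are the function names.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-) for a test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability: float, lr: float) -> float:
    """Apply a likelihood ratio to a pre-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Hypothetical test: sensitivity 90%, specificity 80%; pre-test probability 20%.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
after_positive = post_test_probability(0.20, lr_pos)
after_negative = post_test_probability(0.20, lr_neg)

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")
print(f"Probability of disease after a positive test: {after_positive:.0%}")
print(f"Probability of disease after a negative test: {after_negative:.0%}")
```

Because the likelihood ratios do not change with prevalence, the same pair can be applied to whatever pre-test probability fits your own patient.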

ROC Curve
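
A ROC curve plots sensitivity against (1 - specificity) at every possible cut-point, and the area under it summarises overall discrimination. The sketch below builds the points and the area from a small, entirely hypothetical set of scores; the function names are also assumptions.

```python
def roc_points(scores, has_disease):
    """Return (1 - specificity, sensitivity) pairs, one per cut-point.

    A patient is called test positive when their score is >= the cut-point.
    """
    n_pos = sum(has_disease)
    n_neg = len(has_disease) - n_pos
    points = [(0.0, 0.0)]  # cut-point above every score: nobody tests positive
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, has_disease) if s >= cut and d)
        fp = sum(1 for s, d in zip(scores, has_disease) if s >= cut and not d)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points) -> float:
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for eight patients (True = disease on the gold standard).
scores = [2, 3, 4, 5, 6, 7, 8, 9]
has_disease = [False, False, False, True, False, True, True, True]

points = roc_points(scores, has_disease)
auc = area_under_curve(points)
print(f"AUC = {auc:.4f}")
```

An area of 0.5 corresponds to a test no better than chance and 1.0 to perfect discrimination; the cut-point nearest the top-left corner gives the best trade-off between sensitivity and specificity.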

Stats Summary
• Types of data
• P value
• Confidence intervals
• Randomisation
• Blinding
• Interobserver agreement
• Power
• Intention to treat
• Test characteristics
  – Sensitivity / specificity
  – Predictive values
  – Likelihood ratios
  – ROC curves

Summary / Abstract
• Aim / objective
  – What was the main point they were looking at?
• Methods
  – Who, where, when, how
  – Randomised? Blinded? (if relevant)
• Results
  – Main points: think about the aim. Don’t get caught up with cramming in all of the secondary analyses
• Conclusion
  – The authors’, not yours! Link back to the aim
• 200 word limit, use bullet points

Design
• What are the good things about the design?
• Are there any aspects that mean the patients may not be entirely the type of patients you see?
• Highly selected: lots of exclusions? Restricted inclusion criteria?
• Randomisation / blinding where appropriate: were these done well?
• Did they use the correct statistical tests?
• Look at the limitations (usually a separate section or at the beginning of the Discussion)

Definitions
• Types of data
• P value
• Confidence intervals
• Randomisation
• Blinding
• Interobserver agreement
• Power
• Intention to treat
• Test characteristics
  – Sensitivity / specificity
  – Predictive values
  – Likelihood ratios

Results
• Was there a difference?
• If so, can the findings be put into clinical practice?
• What is the size of the difference?
• How does it relate to current or future practice?

Relevance to Clinical Practice
• Would you implement the findings in the study?
• Look at the limitations of the study
• Do these limitations mean the results can’t be generalised to the population we treat?

Tips
• Read the questions before reading the paper
• Don’t worry too much about the numbers / stats; this is a comprehension exercise
• KISS: don’t use technical jargon (unless you really know what it means, REALLY)
• Answer the question: correct but irrelevant statements don’t score
• Look at the size of the answer box and the marks awarded to guide how much to write

Example Papers
Prospective Validation of the Pediatric Appendicitis Score in a Canadian Pediatric Emergency Department
Maala Bhatt, MD, MSc, Lawrence Joseph, PhD, Francine M. Ducharme, MD, MSc, Geoffrey Dougherty, MD, MSc, and David McGillivray, MD
ACADEMIC EMERGENCY MEDICINE 2009; 16: 591–596

Q1
Provide a summary of this paper, of no more than 200 words, in the box provided. Only the first 200 words will be considered; short bullet points are acceptable. Maximum of 7 marks available.

Q1
• Many candidates did not appear to read the title (i.e. “validation”) and therefore did not use it in the summary
• Many candidates did not use all 200 words
• Candidates spent time counting their words. This is not useful: at standard-size writing, 200 words will fit on one side of paper
• Candidates did not state obvious aspects, i.e. that this was a prospective diagnostic observational study
• Candidates commonly did not appear to realise it was a diagnostic study, and many tried to apply a therapeutic appraisal framework including outcomes and intention to treat
• Candidates did not appear to realise that any validation of a diagnostic test needs a gold or reference standard, and most commonly referred to this as a “primary outcome”. Simply mentioning the word “standard” or “reference” would have gained marks

Q1
• A summary needs to stand alone. Candidates failed to say what the cut-off was, referring only to another paper (Samuel), so the summary did not stand alone
• There is no need, in the summary of the paper, to summarise the background to the paper
• The summary needs actual results: numbers with some headline statistics
• You don’t have to put headings into the summary, but if you do, don’t put results into the conclusion
• Use the conclusions the authors use. They will have stated them somewhere, so this is an easy mark to pick up. Don’t make up your own conclusions
• The summary should not include your opinion of the paper. The authors will not have written their own critique in the abstract!
• The easiest way to get marks is to learn the headings for the appraisal of a diagnostic and a therapeutic paper, then write them down first in the exam and fill in the blanks

Q2
The primary objective of this study was to determine the diagnostic properties of the Pediatric Appendicitis Score cut-point of 6 for diagnosing appendicitis.
List four strengths of the study DESIGN in this paper.

Q2
• Candidates did not list strengths of the design but of the paper in general
• Many candidates wrote a series of “buzz words” in no relevant order, or failed to explain what they meant. E.g. “pragmatic so generalisable” does not demonstrate understanding of the fact that the study was done with normal staff, using normal processes, with nothing unusual required
• In a study such as this, it is a given that there will be ethics approval and consent, as well as data analysis such as a ROC curve. Don’t state routine aspects as strengths
• Many candidates wrote correct statements, but they were not relevant to the question
• Some candidates did not pay attention to detail. Some stated that measuring inter-observer reliability decreases the error. This is incorrect: it just describes / quantifies it

Q2
• Candidates put results in as strengths of design, e.g. no loss to follow-up. A more suitable answer would be: “it was designed that all patients who were not operated on would have a telephone follow-up to ensure no missed diagnoses”
• Candidates simply stated the stats used (sensitivity and specificity) rather than indicating how the authors set out to analyse the data in a particular way (i.e. designed the study) so that they could identify the reliability of the score in diagnosing appendicitis. This question needs an explanation of why elements of the design, including the choice of stats, enhance the study
• The fact that the issue being investigated is clinically relevant is not a strength of the design of the study

Q3
The paper does not mention whether those ascertaining the outcome diagnosis (‘appendicitis’ or ‘no appendicitis’) were blinded to the Pediatric Appendicitis Score.
(a) Explain why a lack of such blinding may introduce possible bias into the results. (2 marks)

Q3
• Blinding is an essential part of all research and you must be able to discuss who might be blinded (all assessors, reviewers and those doing follow-up)
• You should also be able to articulate the impact of a lack of blinding, both in a subjective assessment and where the measurement is more objective, e.g. automated outcome, alive / dead
• Some candidates wrongly believed that pathology reports could not be influenced by prior case knowledge and / or knowledge of the PAS components

Q3
• Candidates often failed to recognise that bias may work in both directions. It was common to read answers suggesting that bias could only over-diagnose appendicitis
• Candidates failed to recognise all components of the gold standard in this study
• There were specific types of bias appropriate to this paper that candidates should be aware of, i.e. selection, sampling or attrition bias

Q4
(a) The results section of the paper reports that a Pediatric Appendicitis Score cut-point of 6 or more had a sensitivity of 92.8% and a specificity of 69.3% for the diagnosis of appendicitis. Comment on the utility of this cut-point in ruling out appendicitis. (2 marks)
(b) With reference to the discussion section of the paper, what is the probability that a child with a Pediatric Appendicitis Score of 8 or more does not have appendicitis? (2 marks)

Q5
Figure 2 in the paper presents a receiver operating characteristic (ROC) curve.
(a) List 2 ways by which ROC curves add to the understanding of diagnostic tests. (2 marks)

Q6
Table 2 of the paper reports that 45% of those with appendicitis and 37% with no appendicitis had imaging investigations. The difference (95% CI) is 12% (-1 to 24).
(a) Is this a statistically significant difference? (1 mark)
(b) Explain your answer. (1 mark)

Q7
The following is a quote from the results section of the paper:
‘Interobserver scores were obtained in 37 (14.6%) of the 246 patients. The kappa coefficient was 0.65 (95% CI = 0.48 to 0.81) …’
(The kappa coefficient is used to express the level of agreement between observers.)
Comment on the level of agreement between observers in terms of the point estimate (0.65) and the 95% confidence interval (0.48 to 0.81). (2 marks)

Stats
• Specificity and sensitivity in ruling in and ruling out (SPIN and SNOUT). Candidates should understand the difference between sensitivity and specificity and be able to relate this to the performance of a test in clinical practice.
• Positive predictive value as a way of expressing probability. Candidates should understand what a PPV or NPV means for a given population and for the result from an individual patient.
• ROC curves. Candidates should be able to articulate their understanding of ROC curves. They should be able to differentiate test performance using a ROC curve. They should be familiar with the concept of area under the curve analysis using ROC curves.
• Interpreting confidence intervals. Candidates should be able to give a concise explanation of the meaning and usefulness of confidence intervals. Candidates should be able to demonstrate how confidence intervals may influence their thinking about the precision of a result.
• Candidates should understand the principles of the kappa statistic and its magnitude, and general features of the analysis of interobserver reliability.

Q8
Give four reasons why you would not adopt this test in your Emergency Department.

Q8
• Candidates stated that the test used different practice to current practice. That is not an acceptable reason for not adopting the test
• Candidates stated it was too expensive. There was no evidence of cost assessment, so this could not be stated
• You have to fully explain the statements made. You cannot just say “not specific enough”; you have to explain why that matters
• This question effectively asks the candidate to list the weaknesses / limitations of the study and its validity, applicability and importance to EM in the UK

Summary of Diagnostic Studies
• Derivation vs validation
• Usually prospective cohort
• Test vs gold / reference standard
• All the patients receive the test and all have the gold / reference standard
• Randomisation is not a feature
• May need blinding
• Are these your patients, your staff, your department?
• Know your test characteristics

Example Papers
A Randomized Trial of Nebulized 3% Hypertonic Saline With Epinephrine in the Treatment of Acute Bronchiolitis in the Emergency Department
Simran Grewal, MD; Samina Ali, MD; Don W. McConnell, MD; Ben Vandermeer, MSc; Terry P. Klassen, MSc, MD
ARCH PEDIATR ADOLESC MED / VOL 163 (NO. 11), NOV 2009

Q1
Provide a summary of this paper, of no more than 200 words, in the box provided. Only the first 200 words will be considered; short bullet points are acceptable. Maximum of 7 marks available.

Q1
• Objective: To determine whether nebulised 3% hypertonic saline with epinephrine is more effective than nebulised 0.9% saline with epinephrine in the treatment of bronchiolitis in the emergency department.
• Design: Randomised double-blind controlled trial.
• Setting: Single-centre urban paediatric emergency department in Canada.
• Participants: Infants younger than 12 months with mild to moderate bronchiolitis.
• Interventions: Patients were randomised to receive epinephrine in either hypertonic or normal saline.
• Outcome measures: The primary outcome measure was the change in respiratory distress, as measured by the Respiratory Assessment Change Score (RACS) from baseline to 120 minutes. Change in oxygen saturation was also determined. Secondary outcome measures included rates of hospital admission and unbooked return to the ED following discharge.
• Results: 46 patients were enrolled. The two groups had similar baseline characteristics. RACS from baseline to 120 minutes demonstrated no improvement in respiratory distress in the hypertonic saline group (mean 4.39, 95% CI 2.64 to 6.13) when compared to the normal saline group (mean 5.13, 95% CI 3.71 to 6.55). The change in oxygen saturations in the hypertonic group was also no different to that of the normal saline group (difference 1.78, 95% CI -0.5 to 1.78). Rates of admission and unplanned return to the ED were similar between the two groups.
• Conclusion: In this study hypertonic saline with epinephrine did not improve clinical outcome in acute bronchiolitis when compared to normal saline with epinephrine.

Q2
Give 3 strengths and 3 weaknesses of the study design. (3 marks)

Q2 (strengths)
• Done in a paediatric ED
• Patients defined quite tightly in terms of clinical features and RDAI score. Patients are thus likely to have bronchiolitis
• Demographic and clinical data collected by research assistants using a standard data collection form
• Excellent allocation concealment. Pharmacy made up identical-looking syringes and retained the randomisation list until the end of the study
• Blinding also good. Neither staff nor patients were aware of their treatment
• Outcomes are clearly defined and seem relevant and important

Q2 (weaknesses)
• Limited hours of enrolment (4 pm to 2 am): possible selection bias
• Only conducted if a research assistant was available
• Whilst the scoring system is well defined, it seems quite complex and open to interobserver variability (although the authors state not)
• It’s unclear who is assigning the RDAI score
• Only 2 doses of nebuliser solution available
• Physicians could give any other treatment they thought appropriate, with no indication of who needed what

Q3
What is block randomisation? (1 mark)
What are the benefits and pitfalls of this method? (2 marks)

Q3
• Randomisation occurs within small blocks of patients so that there is an equal number of subjects in each study arm within each block. This keeps the number of subjects in each study arm very similar
• Useful where sample sizes are small and small random variations can have a proportionately large effect
• Towards the end of each block there may be the possibility of researchers predicting what comes next and affecting subjective assessments
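
A minimal sketch of how a permuted-block allocation list might be generated. The block size of 4, the arm labels and the function name are assumptions for illustration, not details from the paper.

```python
import random


def block_randomise(n_blocks: int, block_size: int = 4, arms=("A", "B"), seed=None):
    """Build an allocation list in which every block contains each arm equally often."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    allocation = []
    for _ in range(n_blocks):
        block = list(arms) * (block_size // len(arms))  # e.g. A, B, A, B
        rng.shuffle(block)                              # random order within the block
        allocation.extend(block)
    return allocation


# Hypothetical schedule for 12 patients in blocks of 4.
schedule = block_randomise(n_blocks=3, block_size=4, seed=1)
print(schedule)
```

After every fourth patient the arms are exactly balanced. The pitfall is visible too: once three allocations in a block are known, the fourth can be predicted, which is why allocation concealment matters.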

Q4
What do you understand by the term “intention-to-treat”? (1 mark)
What are the advantages of this? (1 mark)
What is the opposite approach and what advantages does this have? (2 marks)

Q4
• Intention to treat: analysing all subjects in the study arm they were randomised to, irrespective of drop-out, non-completion of treatment, etc.
• Advantage: ‘real world’ evaluation of treatment effect, as not all patients will have the treatment in the full and perfect way of the study protocol
• Opposite approach: analysis ‘per protocol’. Gives a better assessment of actual treatment effect (efficacy versus effectiveness)

Q5
The authors used Fisher’s exact test for analysis of some of their data. What type of data can be analysed in this way and when is this used? (2 marks)

Q5
• Categorical data
• To compare proportions of a variable across 2 different categories
• Better test than chi-squared if sample size is small
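
For a 2 x 2 table, Fisher's exact test adds up hypergeometric probabilities rather than relying on a large-sample approximation. The sketch below is a plain-Python illustration with hypothetical counts and an assumed function name; in practice a statistics package would be used.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    total = sum(prob(x) for x in range(lowest, highest + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical small trial: 3 of 4 admitted in one arm, 1 of 4 in the other.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p = {p_value:.3f}")
```

With only 8 patients even a 75% versus 25% split is far from significant, which illustrates why small samples need an exact test and why they rarely show a difference.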

Q6
The authors state that a change in RACS of anything less than 3 would not be clinically important. Why is it important to decide on the minimal clinically important effect, and how does this affect power and sample size? (3 marks)

Q6
• There is no point making a change to practice if it does not produce an improvement in outcome that is meaningful to the patient
• A smaller difference would mean that a larger sample size is required or that the power of the study is reduced
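
The link between the minimal important difference and sample size can be sketched with the usual normal-approximation formula for comparing two means. The standard deviation of 4 is a hypothetical value, not taken from the paper, and the function name is an assumption.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed per arm to detect a mean difference of delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Assumed SD of 4 points on the score (hypothetical).
for delta in (3.0, 1.5):
    print(f"Difference of {delta}: {n_per_group(delta, sd=4.0)} patients per group")
```

Halving the difference you want to detect roughly quadruples the number of patients required, because sample size scales with 1 / difference squared.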

Q7
The paper states the change in RACS is 0.74 (95% CI -1.45 to 2.93).
Define 95% confidence interval. (1 mark)
What clinical relevance does the quoted interval have? (1 mark)

Q7
• The range of values within which we are 95% certain the true difference lies
• The quoted interval crosses 0, meaning the actual difference may favour either treatment, i.e. there is no statistically significant difference between the two
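
The rule used in this answer can be written as a one-line check: a 95% confidence interval that includes the "no effect" value corresponds to p > 0.05. The function name is an assumption; the interval is the one quoted in the question, and the ratio example is hypothetical.

```python
def excludes_no_effect(lower: float, upper: float, no_effect: float = 0.0) -> bool:
    """True if the confidence interval does not contain the 'no effect' value.

    Use no_effect=0 for differences and no_effect=1 for ratios
    (relative risk, odds ratio).
    """
    return not (lower <= no_effect <= upper)


# Difference in RACS quoted in the question: 0.74 (95% CI -1.45 to 2.93).
racs_significant = excludes_no_effect(-1.45, 2.93)

# Hypothetical relative risk of 0.70 (95% CI 0.55 to 0.90).
ratio_significant = excludes_no_effect(0.55, 0.90, no_effect=1.0)

print(f"RACS difference significant: {racs_significant}")
print(f"Hypothetical relative risk significant: {ratio_significant}")
```

The width of the interval also matters: a wide interval signals an imprecise estimate even when it excludes the no-effect value.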

Summary of Therapeutic Studies
• Double blind RCT best
• Sample size, power calculation
• Allocation concealment
• Has randomisation worked?
• Are all the patients accounted for?
• Appropriate follow-up?
• What is the primary outcome? Secondary outcomes? Side effects?
• Intention to treat analysis
• Tests used appropriate for data type?
• Are these patients similar to mine?