Study Designs in Molecular Epidemiology
Ioanna Tzoulaki, itzoulak@cc.uoi.gr
With thanks to David C. Muller and Marc Gunter for developing much of this material

Learning Objectives
• To understand the major types of study designs that are employed in molecular epidemiology:
  • Cross-sectional and short-term longitudinal studies with biomarker endpoints
  • Case-control studies
  • Prospective cohort studies
  • Case-only and other study designs

Observational Studies vs. Clinical Trials
Common opinion:
1. Randomized clinical trials are the gold standard
2. Observational studies are an approximation of the clinical trial you would conduct if you could, if not for:
  • Ethical concerns (toxic exposure)
  • Practical concerns (can’t randomize genes or race)
  • etc.

Systematic Comparison of Observational and Clinical Trial Results (Benson. N Engl J Med 2000; 342: 1878-86).

Women’s Health Initiative: Clinical Trial vs Observational Study (Petitti & Freedman. AJE 2005; 162: 415-8).

Evolving Perspective
• Even when clinical trials are possible, observational studies are an important complementary source of data, especially given the real-world limitations of clinical trials:
  • Carefully selected patients (e.g., no comorbidities) and controlled environments may not reflect typical populations or common issues faced in clinical practice
  • Clinical trials tend to have short follow-up due to high costs
  • Clinical trials are also subject to variability in results (as shown above)
• Observational studies cannot replace trials, nor do trials make observational studies unnecessary. Both designs are susceptible to particular biases, so neither provides perfect information.
(Soerensen et al., Hepatology 2006; 44: 1075-1082)

Study Designs in Molecular Epidemiology
• Cross-sectional and short-term longitudinal studies with biomarker endpoints
• Case-control studies
• Prospective cohort studies
• Case-only and other study designs

Cross-Sectional and Short-Term Longitudinal Studies
• Has the population been exposed to a particular compound (level, range, internal/external determinants of exposure)?
• Evaluate intermediate biologic effects (exposure-biomarker relationships) and the effect of changes in exposure on intermediate endpoints
• Early biologic effects caused by new exposures

Whole Grain-Fibre Measurement
Assessment of whole-grain intake
• From food-frequency questionnaires
• Subject to measurement error
Alternative measurement using alkylresorcinols as biomarkers
• Found in the bran of wheat and rye
• Validated in both intervention and observational studies
• The sum of all 5 homologues (C17:0, C19:0, C21:0, C23:0, C25:0) is used as an estimate of whole-grain intake
• Wheat or rye? C17:0/C21:0 ratio
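
The two alkylresorcinol summaries named on this slide can be sketched in a few lines. This is only an illustration: the concentrations are invented, the function names are hypothetical, and no wheat/rye cut point is applied because the slide does not give one.

```python
# Minimal sketch: total alkylresorcinols and the C17:0/C21:0 ratio.
# The sample concentrations (nmol/L) are made up for illustration.
HOMOLOGUES = ("C17:0", "C19:0", "C21:0", "C23:0", "C25:0")


def total_alkylresorcinols(conc):
    """Sum of the five homologues, used as the estimate of whole-grain intake."""
    return sum(conc[h] for h in HOMOLOGUES)


def c17_c21_ratio(conc):
    """C17:0 / C21:0 ratio, used to indicate whether the source is wheat or rye."""
    if conc["C21:0"] == 0:
        raise ValueError("C21:0 concentration is zero; ratio is undefined")
    return conc["C17:0"] / conc["C21:0"]


sample = {"C17:0": 4.0, "C19:0": 12.0, "C21:0": 20.0, "C23:0": 8.0, "C25:0": 6.0}
total = total_alkylresorcinols(sample)
ratio = c17_c21_ratio(sample)
print(f"total = {total} nmol/L, C17:0/C21:0 = {ratio:.2f}")
```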

Plasma Alkylresorcinol Concentrations in 10 European Countries
Median plasma total alkylresorcinol concentrations (nmol/L) by country (by centre for UK) in 2845 participants from the European Prospective Investigation into Cancer and Nutrition (EPIC). A. For all participants (n=2845). P<0.001

Case-Control Design

Case-Control Study Case-Control Design

Case-Control Studies: Necessary Assumptions
• Cases are representative of all cases
• Controls are representative of all individuals who, had they developed disease, would have been selected as cases (not necessarily individuals representative of the entire non-diseased population).
  • These people may differ vastly from the general population, including in regard to prevalence of exposure.
• For both cases and controls, sampling must be independent of exposure.
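
When these assumptions hold, the usual effect measure from a case-control study is the exposure odds ratio. The slides do not spell out the calculation, so the following is a hedged sketch using the standard 2x2 formula with a Woolf (log-based) confidence interval. The counts and the function name are hypothetical.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this formula")
    estimate = (a * d) / (b * c)
    # Woolf standard error of log(OR)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Illustrative counts only: 100 cases and 100 controls
or_hat, ci_low, ci_high = odds_ratio(40, 60, 20, 80)
print(f"OR = {or_hat:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

If sampling of cases or controls depends on exposure, this ratio no longer estimates the association in the study base, which is why the last assumption above matters.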

Principles of Comparability & Selection of Controls
1. Study Base (target or source population: roughly, the subset of the general population that is at risk of the exposure and of developing the disease); “cases & controls should be representative of the same base experience”.
• Primary base (e.g., population-based studies)
• Secondary base (e.g., convenience-based enumeration)
• Studies are representative of the study base, not always the general population
• Membership in the base is dynamic
• Often start by defining a source of cases, then defining the study base and controls.
Additional Notes:
• Nested case-control: the study base is explicitly indicated; it is the cohort itself.
• An exclusion rule that applies equally to cases and controls is valid because it simply refines the scope of the study base (a rule that applies to only one group violates the study base principle).
• The patients of a tertiary hospital are often not representative of the local community, so community controls would be inappropriate for cases drawn from it.

• The major challenge with a primary base is complete case identification (e.g., underascertainment of asymptomatic cases that do not come to medical attention).
• The major challenge with a secondary base is defining the study base.

Principles of Comparability & Selection of Controls
2. Deconfounding – need to control for bias
• Individual matching
• Frequency matching
• Restriction
• Statistical adjustment

Matching
• Matching can reduce the variance of the factors under study, as well as of other factors of interest, resulting in inefficiency.
• Finding matches can make enrollment difficult and time-consuming.
• It is generally not possible to study the effects of the variables used in matching, possibly making the study less informative.
• Matching on a factor in the causal pathway can result in inefficiency and (statistical) bias.
Therefore, matching should only be used when absolutely necessary (e.g., when a potential confounder is of such strength and importance that the ability to assess the impact of the factor under study would otherwise be compromised).
Example: Matching or selection by HPV status to study the effects of smoking on cervical cancer risk.

Principles of Comparability & Selection of Controls
3. Comparable Accuracy
• All sources of measurement error should be similar for cases and controls
• Example: poor sensitivity/specificity might suggest 50% exposure in controls, whereas greater accuracy in cases might show the true exposure rate in the study base = 80%.
4. Efficiency Principle
• It is helpful if there is a broad distribution of values for the major exposure variable(s), and either a narrow range of values for nuisance variables or at least a balance in cases and controls.
• Matching should be used cautiously.
• Methods of selecting and enrolling cases and controls should be practical.
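
The 50% versus 80% example under comparable accuracy can be reproduced with the standard relation between true and apparent prevalence under imperfect classification. The sensitivity (60%) and specificity (90%) used for controls are assumptions chosen to match the slide's numbers; they are not given in the source.

```python
def apparent_prevalence(true_prev, sensitivity, specificity):
    """Observed exposure prevalence when exposure is measured with error.

    True positives (sensitivity * true_prev) plus false positives
    ((1 - specificity) * (1 - true_prev)).
    """
    for p in (true_prev, sensitivity, specificity):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all inputs must be proportions between 0 and 1")
    return sensitivity * true_prev + (1 - specificity) * (1 - true_prev)


true_prev = 0.80  # true exposure prevalence in the study base (from the slide)

# Assumed: poorly measured controls, perfectly measured cases
controls_observed = apparent_prevalence(true_prev, sensitivity=0.60, specificity=0.90)
cases_observed = apparent_prevalence(true_prev, sensitivity=1.0, specificity=1.0)

print(f"controls: {controls_observed:.2f}, cases: {cases_observed:.2f}")
```

With identical true exposure in both groups, the difference in measurement accuracy alone produces an apparent case-control difference.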

Advantages of Case-Control Method
• Efficient and informative
• No need to follow patients over a long period of time
• Ability to deal with rare outcomes
• Able to test the effect and interaction of a large number of factors as they relate to the outcome
• Inexpensive
• Able to obtain detailed data (e.g., expensive assay data) on mostly informative individuals

Limitations of Case-Control Method
• Inability to assess temporality in any direct way
• Length-biased sampling – overrepresentation of cases with long duration (e.g., less severe disease). If an exposure lessens severity, it could even be elevated among cases despite having no true effect on risk of disease.
• Inaccurate measure of exposure, as it is measured at time of disease (reverse causality and inaccuracy)
• Concern regarding non-participation bias if differential in relation to exposure and case-control status
  • Sick cases may not want to participate
  • Controls may not participate since they are less motivated
  • Participants tend to be healthier, better educated, less likely to smoke, etc.
• Also problems with (i) identifying the study base, (ii) deconfounding, and (iii) comparable accuracy

Prospective Cohort Design

Cohort Design

Cohort Studies
• A group of individuals who are free of one or more disease endpoints, but who vary in exposure to various factors, is followed over time to determine differences in the rate at which disease develops in relation to exposure.
• Participants are followed over an interval defined by the study’s beginning and end, and observations are made on outcome measures of interest.
• In observational studies, the dynamic nature of many risk factors and their relation in time to disease occurrence can only be fully captured in the cohort design.
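
As a rough illustration of "differences in the rate at which disease develops in relation to exposure", incidence rates can be computed per unit of person-time and compared as a ratio. The event counts and person-years below are invented, and the function name is hypothetical.

```python
def incidence_rate(events, person_years):
    """Events per person-year of follow-up."""
    if person_years <= 0:
        raise ValueError("person-years must be positive")
    return events / person_years


# Illustrative numbers only
rate_exposed = incidence_rate(events=30, person_years=10_000)
rate_unexposed = incidence_rate(events=15, person_years=15_000)
rate_ratio = rate_exposed / rate_unexposed

print(f"exposed: {rate_exposed * 1000:.1f} per 1000 person-years")
print(f"unexposed: {rate_unexposed * 1000:.1f} per 1000 person-years")
print(f"rate ratio: {rate_ratio:.1f}")
```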

Cohort Studies
Prospective cohorts provide an ideal context in which to bring together the best laboratory science, epidemiology, biostatistics and bioinformatics to investigate disease risk factors. Realizing this potential, however, requires a commitment to develop and adapt laboratory tools for application to the biospecimens being collected. In addition, emphasis is needed on collecting and processing biological specimens in a manner that is, as far as can be predicted, consistent with future laboratory analyses and that avoids biases at the time of sampling.

Hierarchy in Cohort Populations
1. General Population – all individuals in the community
• Examples: “population-based” registries (e.g., cancer registries)
• However, defining the “general population” is not always straightforward. Is Framingham, MA representative of the US?
• Absolute risk versus relative risk.
• There are many relevant subpopulations or target groups within any given general population
• Principle of disaggregation – division of the population into meaningful subgroups (e.g., high-risk men and women) for study efficiency, or other subpopulations (e.g., race, SES) for public health reasons.

Hierarchy in Cohort Populations
2. Target Population – total selected group to be studied, as defined by inclusion and exclusion criteria; who you would enroll if you could.
• Can be defined by geography, membership in organizations, occupation, specific known exposure and demographic factors.
• Advantage: disaggregation provides greater efficiency to the design (e.g., populations with high levels of the exposure or disease of interest) and may have particular public health relevance.
• Disadvantage: the target population may be unrepresentative and the results not generalizable.
• Example: male Indian immigrants to the U.S. aged >45, to study the effects of betel nut chewing on risk of cancer (it would be inefficient to study the entire population).

Hierarchy in Cohort Populations
3. Sampling Frame (the operational definition) – group of potential enrollees from the general population who have been enumerated (e.g., by census, occupational records, random digit dialing, clinic population, etc.), i.e., those who might be enrolled, pending informed consent and ignoring missed opportunities.
• All sequential women, 25-33 years of age, presenting to a prenatal clinic.
• All men & women, >65 years of age, living in London from 2006-2007, with complete phone and address information
The sampling frame is only partially representative of the target population
• Not all men & women >65 in London are registered with phone number/address information (immigrant & other LES populations will be under-represented)

Hierarchy in Cohort Populations
4. Enrolled Subjects
• Those who were enumerated, successfully contacted, and actually enrolled.
• The enrolled subjects are only partially representative of the sampling frame.
• Fast progressors, particularly those who rapidly become sick & die, are unlikely to be enrolled, whereas healthy individuals who perceive themselves to be at risk will be especially likely to give consent.
Note: only a fraction of the sampling frame consented and participated in the Nurses’ Health Study and the Women’s Health Initiative, two of the most important and often cited cohort studies of health in women.

Types of Cohorts
• Prospective – group assembled in the present & followed over time.
  • Advantage: ability to collect the exact data wanted
  • Disadvantage: time and cost
• Retrospective
  • e.g., enumerated through use of historical records (usually because of certain shared characteristics, such as employees of a type of industry)
  • The disease experience of the group is reconstructed between a defined period in the past & the present (e.g., with company records).
  • Advantage: immediate availability of data
  • Disadvantage: data available may not be of the type or quality desired (since they were collected for other reasons); frequent missing data; unavailability of biologic specimens, etc.
In both, the salient feature is that individuals comprising the cohort are identified & information on exposure is obtained before disease experience is ascertained.

Types of Cohorts
• Fixed/Closed – no new members are added following the initial enrollment period, even when members are lost to follow-up or death.
  • Advantage: population well defined; lower costs and easier logistically; baseline is clear and relatively unaffected by cohort & period effects.
  • Disadvantage: over time the cohort becomes less and less representative of the target population, both because the target population may change over time and because the cohort itself is altered by loss to follow-up.
• Open/Dynamic – enrollment is ongoing, with the possibility of establishing a steady state of in- and out-migration and death.
  • Advantage: cohort remains full size; possibly representative of even a changing target population.
  • Disadvantage: expensive & complicated to maintain; in a changing population it may be difficult to meaningfully analyze data due to period effects and varying date of enrollment.

Types of Cohorts
• Incident – disease free at baseline.
  • Adv: able to describe the full natural history of disease; less likely to be biased than a prevalent cohort (see below).
  • Disadv: unless the disease is very common in the population, the study will be costly and lengthy; for diseases of long duration, late stages may not be observed and few deaths may occur during follow-up.
• Prevalent – disease present at baseline (e.g., HIV-positive women with various CD4 counts)
  • Adv: able to study multiple stages of disease; time/cost efficient.
  • Disadv: fast progressors are likely to be missed (better able to assess factors associated with long survival than with fast progression); less able to describe early stages of disease or risk factors for acquiring disease (except cross-sectionally, in comparison to an unaffected subcohort of “individuals at risk”)

General Concerns and Design Issues: Cohort Studies
• Changes in exposure status over time – an individual can contribute follow-up time to several different exposure-specific strata.
• Summary Exposure – e.g., pack-years of smoking: a composite exposure variable assumes that 2 packs/day for 10 years is equivalent to 1 pack/day for 20 years, which may not be accurate. A supplemental analysis with two exposure variables (duration and intensity) would address the inadequacies of pack-years. Other options include age at smoking onset and maximum or median exposure intensity.
• Repeated Observations – time-dependent covariates (the standard is to assume that the latest exposure status represents the category, but exposure can be defined more precisely; e.g., one year prior instead of at the last 6-month visit).
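
The first two points can be made concrete with a small sketch. It shows that pack-years collapses two different smoking histories to the same value, and how one participant's follow-up can be split across exposure-specific strata. The episode format and function names are hypothetical.

```python
from collections import defaultdict


def pack_years(packs_per_day, years):
    """Composite exposure: intensity multiplied by duration."""
    return packs_per_day * years


def person_time_by_stratum(episodes):
    """Total follow-up time per exposure category for one participant.

    episodes: list of (exposure_category, start_year, end_year) tuples.
    """
    totals = defaultdict(float)
    for category, start, end in episodes:
        if end < start:
            raise ValueError("episode ends before it starts")
        totals[category] += end - start
    return dict(totals)


# Two different histories give the same pack-years value
heavy_short = pack_years(packs_per_day=2, years=10)
light_long = pack_years(packs_per_day=1, years=20)
print(heavy_short, light_long)

# One participant contributes follow-up to three exposure strata
history = [("never", 0.0, 3.0), ("current", 3.0, 8.5), ("former", 8.5, 12.0)]
print(person_time_by_stratum(history))
```

Keeping intensity and duration as separate variables, as the slide suggests, avoids the equivalence that `pack_years` builds in.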

General Concerns and Design Issues Contd.
• Induction Period – interval between exposure and the development of disease (including pre-clinical disease?).
  • Could be zero for smoking and MI, or years for cancer.
  • For unknown induction periods, some assumption regarding a reasonable induction time is better than not addressing the issue at all.
  • Alternatively, can assess the impact of using multiple different assumed induction times in the analysis.
  • Example: if, in a study of hormone therapy and cancer, the analyzed data included rates of disease beginning the day of hormone therapy initiation, the early cancer events could not biologically be related to therapy, and the effect would be underestimated.
• Lagging Exposure – only assess exposure up until some specified time before the current time.
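
Lagging exposure, and repeating the analysis under several assumed induction times, might look like the following. This is a sketch under simple assumptions: exposure is a list of dated doses, and the record format, lag values and function name are all hypothetical.

```python
def lagged_cumulative_exposure(records, current_time, lag):
    """Cumulative exposure counted only up to (current_time - lag).

    records: list of (time, dose) tuples for one participant.
    lag: assumed induction period, in the same time units.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    cutoff = current_time - lag
    return sum(dose for time, dose in records if time <= cutoff)


records = [(1, 2.0), (3, 1.5), (6, 4.0), (9, 3.0)]

# Sensitivity analysis over several assumed induction times
by_lag = {lag: lagged_cumulative_exposure(records, current_time=10, lag=lag)
          for lag in (0, 2, 5)}
print(by_lag)
```

Comparing effect estimates across the assumed lags shows how sensitive a result is to the induction-time assumption.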

General Concerns and Design Issues Contd.
• Informative Censoring – departure from the Independent Censoring Assumption, i.e., that censored patients have a similar experience to (are representative of) those who remain under observation.
• Competing Risks
  • A form of informative censoring. Specifically, an event that precludes or modifies the probability of the onset of the event of interest (Satagopan et al., BMJ, 2005)
  • e.g., a study of smoking and ovarian cancer will underestimate this association if all smokers susceptible to the smoking-ovarian cancer effect die first of heart disease. Undergoing an oophorectomy is also a competing risk.
  • The alternative, requiring survival/follow-up through a certain period, would miss effects among those who die quickly.
• Interval Censoring – uncertainty regarding the timing of events (or exposures) that occurred (e.g., between clinical visits). If these events occurred earlier in exposed than in unexposed individuals during the interval, then the HR would be underestimated using standard assumptions (e.g., that all events occurred halfway through the period).

General Concerns and Design Issues Contd.
• Participation: collection of biospecimens can affect participation rates. Generalizability?
• Exposure assessment: a major strength of the prospective design, particularly for biomarker studies
  • Caveats? Pre-clinical disease?
  • Reverse causality: methods to overcome this?
• Misclassification of exposures
  • Random intra-individual variation (diurnal variation, e.g., melatonin; seasonal variation, e.g., vitamin D; dietary intake/supplements)
  • Many believe that non-differential measurement error leads to bias toward the null. This is not true in general.
• Methods to correct for misclassification
  • Take multiple measurements (repeat sub-sample): may not be possible for the entire cohort (cost etc.), but sub-sample and obtain a correction factor to adjust for
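
One common way to obtain a correction factor from a repeat sub-sample is to estimate a reliability coefficient from duplicate measurements and divide the observed coefficient by it (regression dilution correction). The slide does not say which method was intended, so this is an assumption. It applies to the simple case of a single continuous exposure with classical random error, which is exactly where attenuation toward the null is expected. The data are invented.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_beta(beta_observed, reliability):
    """De-attenuate a regression coefficient by the reliability coefficient."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return beta_observed / reliability


# Hypothetical repeat biomarker measurements in a sub-sample of 5 participants
first = [10, 12, 14, 16, 18]
second = [11, 11, 15, 15, 19]

reliability = pearson(first, second)  # correlation between replicates
beta_adj = corrected_beta(0.30, reliability)
print(f"reliability = {reliability:.3f}, corrected beta = {beta_adj:.3f}")
```

With multiple error-prone covariates or differential error the direction of bias is not guaranteed, consistent with the caution on the slide.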

Useful Oversimplification?
• Cross-sectional, case-control – efficient for studying the relationship of multiple exposures and a single outcome
• Cohort studies – efficient for investigation of multiple outcomes from a given exposure, and for studying natural history

Theoretical example: a biomarker of preclinical effect (TEL-AML1), DNA methylation, and gene or protein expression is linked to a specific environmental exposure within a prospective cohort study.
Vineis P, Perera F. Cancer Epidemiol Biomarkers Prev 2007; 16: 1954-1965

Hybrid Designs Nested Case-Control

Nested Case-Control
• Cases of a disease that occur within a cohort are identified and, for each case, an individual or individuals are identified as controls from among other participants in the cohort who at that same period of follow-up were disease free; i.e., individually matched to the case based on time.
• Other matching criteria can also be added (e.g., age).
• Controls can become cases.
• Frequently used study design for biomarker analyses (cheaper, more efficient than cohort design)

Nested Case-Control
• Selection of controls from the at-risk population is typically done using random sampling without replacement of appropriate matches.
• The design can result in substantially reduced costs (e.g., in relation to measurement of biomarkers or collection of data from charts) with little loss of statistical power compared with analysis of the entire cohort.
• The analysis is relatively simple, typically involving the use of multivariate conditional logistic regression methods.
• Other Adv: controls are explicitly from the same population as cases; data/specimens collected prospectively and possibly repeatedly; efficient versus full cohort study; analysis simpler than for case-cohort design and possibly more powerful statistically than case-cohort (for a single outcome).
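
Time-matched control selection can be sketched as risk-set sampling: for each case, controls are drawn without replacement from cohort members still under follow-up and disease free at the case's event time. The record layout and function name are hypothetical, and additional matching factors (e.g., age) are left out for brevity.

```python
import random


def sample_risk_set_controls(cohort, n_controls=1, seed=0):
    """Pick controls for each case from those at risk at the case's event time.

    cohort: list of dicts with keys "id", "exit_time" (time of diagnosis or
    end of follow-up) and "is_case".
    """
    rng = random.Random(seed)
    matched_sets = []
    cases = sorted((p for p in cohort if p["is_case"]), key=lambda p: p["exit_time"])
    for case in cases:
        t = case["exit_time"]
        # At risk at time t: still followed after t. Future cases are eligible,
        # which is why controls can later become cases.
        risk_set = [p for p in cohort if p["id"] != case["id"] and p["exit_time"] > t]
        chosen = rng.sample(risk_set, min(n_controls, len(risk_set)))
        matched_sets.append({"case": case["id"], "time": t,
                             "controls": [c["id"] for c in chosen]})
    return matched_sets


cohort = [
    {"id": 1, "exit_time": 2.0, "is_case": True},
    {"id": 2, "exit_time": 5.0, "is_case": True},
    {"id": 3, "exit_time": 1.0, "is_case": False},
    {"id": 4, "exit_time": 6.0, "is_case": False},
    {"id": 5, "exit_time": 4.0, "is_case": False},
]
sets = sample_risk_set_controls(cohort, n_controls=2)
for s in sets:
    print(s)
```

Each matched set would then enter a conditional logistic regression as its own stratum.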

Case-Cohort Design
Subjects
• Cases – all cases or a random subset
• Comparison group – a subset of all subjects in the cohort, defined at baseline
• Sampling fractions – because the sampling fractions are known, it is possible to stratify the selection of subjects (e.g., using stratified random sampling) and to over-sample on relevant characteristics (e.g., sex or race), in order to have adequate power to address relevant questions.
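
Subject selection for a case-cohort study can be sketched as a simple random subcohort drawn at baseline plus all cases, whether or not they fall inside the subcohort. The 20% sampling fraction, the record layout and the function name are assumptions for illustration; stratified sampling is omitted.

```python
import random


def case_cohort_sample(cohort, fraction, seed=0):
    """Return (subcohort ids, case ids, ids to be assayed).

    cohort: list of dicts with keys "id" and "is_case".
    fraction: known sampling fraction for the baseline subcohort.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    ids = [p["id"] for p in cohort]
    # Subcohort is sampled from everyone at baseline, regardless of later outcome
    subcohort = set(rng.sample(ids, round(fraction * len(ids))))
    cases = {p["id"] for p in cohort if p["is_case"]}
    return subcohort, cases, subcohort | cases


# Hypothetical cohort of 100 with 10 cases
cohort = [{"id": i, "is_case": i % 10 == 0} for i in range(100)]
subcohort, cases, assayed = case_cohort_sample(cohort, fraction=0.2)
print(len(subcohort), len(cases), len(assayed))
```

Because the subcohort is chosen without reference to outcome, the same subcohort can serve as the comparison group for several different case groups.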

Case-Cohort Design (figure; the legend symbol marks the random subcohort)

Case-Cohort Design
Efficient
1. Test a small subset of all individuals in the cohort
2. The same comparison group can be used for multiple studies, since it is a representative sample of all subjects
3. In theory, can begin testing (or obtaining data) on the subcohort even before all cases have accrued (though there is concern re: laboratory drift with biomarker data). Especially useful when all cases cannot be defined in advance (e.g., in a genetics study without concern re: laboratory drift).
Informative
1. Directly estimate relative risk
2. Can estimate incidence, prevalence, and the effects of various parameters from a truly representative subset of the whole cohort, based on the comparison group
3. Can examine time-dependent covariates (this can also be done in nested case-control)

Case-Cohort Design
Analysis
1. Proportional hazards analysis / time to failure
2. A case outside the subcohort is considered not at risk until just before failure & is not included in earlier risk sets
3. Different ways of weighting the data are possible
The analysis is very complex and not all of the analytic issues have been well worked out. Appropriate software is sometimes unavailable.

Case-Cohort vs. Nested Case-Control
• The main advantage of case-cohort over nested case-control is the ability to use a single representative subsample of the cohort as the comparison group for multiple case groups (e.g., different types of disease outcomes).
• However, if testing of biomarkers is not conducted in a single period, one must be concerned about laboratory drift. One cannot readily study one disease one year and another disease a year later, except when the assays are not susceptible to laboratory drift (e.g., a genetics study), in which case this could be a strong advantage.

Other Study Designs
Case-Series
• Only cases with the disease of interest are enrolled (no controls)
• Evaluate aetiologic heterogeneity of disease end-point (e.g., tumour markers)
• Studies of prognosis, treatment effects, etc.
Clinical Trials
• Cannot directly address aetiologic questions (no control population); can be useful for case-series analyses, exposure-biomarker comparisons, etc.

Reading
• A Rundle, P Vineis and H Ahsan. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiology Biomarkers and Prevention. 2005; 14: 1899-1904.
• N Caporaso. Integrative study designs: next step in the evolution of molecular epidemiology? Cancer Epidemiology Biomarkers and Prevention. 2007; 16: 365.
• N Holland, L Pfleger, E Berger, A Ho and M Bastaki. Molecular epidemiology biomarkers: sample collection and processing considerations. Toxicology and Applied Pharmacology. 2005; 261-268.