Matching
Beate Ritz, MD, PhD
EPI 200B, Winter 2010
NOTE: Many of the following slides are based on the lecture notes provided by Dr. Hal Morgenstern (Epi Methods I and II); some of the examples are taken from ME3, Chapter 11.
Matching: partial restriction in the selection of study subjects
Matching is usually seen as one part of a strategy to control for confounders in observational studies. It amounts to a partial restriction in the selection of subjects: in most observational studies, matching restricts the eligibility of comparison subjects by choosing them to be similar to index subjects with respect to one or more matching variables.
Matching Example: Target population (ME3 p. 172)
Stratum-specific RR = 10 (males) and 10 (females).
Both exposure and male sex are risk factors for the disease:
- Within sex, the exposed have 10 times the risk of the unexposed (0.005/0.0005 = 0.001/0.0001 = 10).
- Within exposure level, males have 5 times the risk of females (0.005/0.001 = 0.0005/0.0001 = 5).
There is also substantial confounding, since 90% of the exposed individuals are male and only 10% of the unexposed are male. The crude and sex-specific risk ratios differ (33 versus 10).
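The arithmetic on this slide can be checked with a short sketch. The risks and the sex distribution come from the slide; the group sizes (1,000,000 exposed and 1,000,000 unexposed) are an assumption made here for illustration, chosen because they reproduce the 4740 cases quoted for the case-control example.

```python
# Risks and sex distribution are from the slide.
# ASSUMPTION: group sizes of 1,000,000 each (not stated on the slide).
N_EXPOSED = 1_000_000
N_UNEXPOSED = 1_000_000

risk = {
    ("male", "exposed"): 0.005,
    ("male", "unexposed"): 0.0005,
    ("female", "exposed"): 0.001,
    ("female", "unexposed"): 0.0001,
}

# 90% of the exposed and 10% of the unexposed are male.
persons = {
    ("male", "exposed"): 0.9 * N_EXPOSED,
    ("female", "exposed"): 0.1 * N_EXPOSED,
    ("male", "unexposed"): 0.1 * N_UNEXPOSED,
    ("female", "unexposed"): 0.9 * N_UNEXPOSED,
}

# Expected number of cases in each sex-by-exposure cell.
cases = {key: persons[key] * risk[key] for key in risk}


def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Risk in the exposed divided by risk in the unexposed."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


rr_by_sex = {
    sex: risk_ratio(
        cases[(sex, "exposed")], persons[(sex, "exposed")],
        cases[(sex, "unexposed")], persons[(sex, "unexposed")],
    )
    for sex in ("male", "female")
}

crude_rr = risk_ratio(
    cases[("male", "exposed")] + cases[("female", "exposed")], N_EXPOSED,
    cases[("male", "unexposed")] + cases[("female", "unexposed")], N_UNEXPOSED,
)
total_cases = sum(cases.values())

print(rr_by_sex, round(crude_rr, 1), round(total_cases))
```

The crude risk ratio does not depend on the assumed group sizes as long as the two groups are equally large; only the case count does.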
Matched cohort (Example ME3 p. 172)
Suppose we conduct a cohort study that draws from the exposed target population and matches the unexposed cohort to the exposed cohort by sex:
- First, we select 10% of the exposed, independent of sex.
- Then, we draw a comparison group of unexposed subjects from the target population, matching by sex.
- This prevents an association between sex and exposure in the study cohort; i.e., the expected RR (= 10) is the same within sex strata and overall.
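A minimal sketch of the matched cohort, under the same illustrative assumption of 1,000,000 exposed persons (900,000 male, 100,000 female) and a 1:1 matching ratio, which the slide does not state. It shows that the expected crude risk ratio equals the sex-specific value.

```python
# Risks are from the target-population slide.
risk = {
    ("male", "exposed"): 0.005,
    ("male", "unexposed"): 0.0005,
    ("female", "exposed"): 0.001,
    ("female", "unexposed"): 0.0001,
}

# ASSUMPTION: 10% sample of an exposed population of 900,000 males
# and 100,000 females.
exposed = {"male": 90_000, "female": 10_000}

# Matching by sex (ASSUMED 1:1): the unexposed cohort copies the sex
# distribution of the exposed cohort.
unexposed = dict(exposed)

exposed_cases = sum(n * risk[(sex, "exposed")] for sex, n in exposed.items())
unexposed_cases = sum(n * risk[(sex, "unexposed")] for sex, n in unexposed.items())

crude_rr = (exposed_cases / sum(exposed.values())) / (
    unexposed_cases / sum(unexposed.values())
)
print(round(crude_rr, 2))
```

Because sex is no longer associated with exposure in the study cohort, no stratification is needed to recover the risk ratio of 10.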
Matched case-control study: crude results (Example ME3 p. 173)
In a case-control study one would
- First, identify (all) cases (N = 4740) from the source population.
- Then, sample 4740 controls from the source population, 1:1 matched to cases by sex.
NOTE: The crude odds ratio is much less than the true risk ratio (5 rather than 10)!
Matched case-control study: stratified results (Example ME3 p. 173)
Stratifying by sex in the matched case-control study gives the correct result, OR = 10. Thus, unlike cohort matching, case-control matching has not eliminated confounding by sex in the point estimate of the risk ratio. The discrepancy between the crude and the stratum-specific results arises from a bias that is introduced by selecting controls according to a factor related to exposure (sex, the matching factor). The bias behaves like confounding: the crude estimate is biased and stratification removes it. However, this bias is not a reflection of the confounding by sex in the source population, and it differs in direction from that confounding: the crude RR in the source population (33) overstates the effect, whereas the crude OR in the matched study (5) understates it. Case-control matching superimposed a selection bias in place of the initial confounding in the source population.
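The crude and stratified odds ratios can be reproduced with a short sketch. The case counts below follow from the slide's risks under the illustrative assumption of 1,000,000 exposed and 1,000,000 unexposed persons (which gives the 4740 cases on the slide); controls are assumed to carry the exposure prevalence of their sex in the source population (90% of males and 10% of females exposed under that assumption).

```python
# Expected cases by sex and exposure (ASSUMED population sizes, see lead-in).
cases = {
    "male": {"exposed": 4500, "unexposed": 50},
    "female": {"exposed": 100, "unexposed": 90},
}

# Exposure prevalence by sex in the source population (under the same assumption).
prevalence_exposed = {"male": 0.9, "female": 0.1}

# 1:1 matching on sex: as many controls as cases within each sex.
controls = {}
for sex, counts in cases.items():
    n_controls = counts["exposed"] + counts["unexposed"]
    n_exposed = n_controls * prevalence_exposed[sex]
    controls[sex] = {"exposed": n_exposed, "unexposed": n_controls - n_exposed}


def odds_ratio(case_counts, control_counts):
    """Cross-product ratio of a 2x2 table."""
    return (case_counts["exposed"] * control_counts["unexposed"]) / (
        case_counts["unexposed"] * control_counts["exposed"]
    )


or_by_sex = {sex: odds_ratio(cases[sex], controls[sex]) for sex in cases}

# Collapse over sex to get the crude table.
crude_cases = {k: sum(cases[s][k] for s in cases) for k in ("exposed", "unexposed")}
crude_controls = {k: sum(controls[s][k] for s in controls) for k in ("exposed", "unexposed")}
crude_or = odds_ratio(crude_cases, crude_controls)

print(or_by_sex, round(crude_or, 2))
```

The sex-specific odds ratios come out at 10 and the crude odds ratio at about 5, so the matched design needs the stratified analysis to recover the effect.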
Why case-control matching introduces bias (Example ME3 p. 176)
Controls are supposed to provide an estimate of the distribution of exposure in the source population. Matching on a factor associated with exposure makes the control series more similar to the cases with respect to exposure; this biases the crude estimate toward the null, no matter what the direction of the association between the matching factor and exposure.
Matching: logistically convenient and cost-efficient
Matching variables are usually known risk factors for the disease (and thus potential confounders) or "logistical" factors related to the selection of comparison subjects, e.g.,
- time of disease diagnosis,
- place of residence,
- the facility where a diagnosis is made or medical care is sought.
Although the latter type of matching variable may be a proxy confounder, matching on such variables is usually done to facilitate the process of obtaining eligible comparison subjects. Thus, matching on these non-confounders is logistically convenient and cost-efficient. In addition, matching on the time of the case's diagnosis is the most common method of density sampling of controls in a case-control study.
Matching: done differently in case-control and cohort studies
- In case-control studies, controls (the comparison group) are matched to cases (the index group).
- In cohort and cross-sectional studies, unexposed subjects (the comparison group) are usually matched to exposed subjects (the index group).
- In randomized trials, matching is also known as blocked randomization: matching is done before randomization, then randomization is done within each matched set (block).
Matching: reasons
Matching is more common in case-control studies and randomized trials than in cohort or cross-sectional studies. The reasons are both statistical and practical: matching can often enhance the cost efficiency of a case-control study, although it is not guaranteed to do so, and it increases the efficiency of typical randomized trials. The most common strategies for matching in an observational study are individual matching and frequency matching.
Types of Matching: Individual and frequency matching
- Individual matching: One or more comparison subjects are selected separately for each index subject, selecting only those comparison subjects that are similar to the corresponding index subject on one or more matching variables.
- Frequency matching: The number of comparison subjects selected in each matching category is made proportional to the number of index subjects in that category.
Types of Matching: Fixed and variable ratio matching
Matching may also be classified by the ratio of comparison to index subjects in each matched set:
- Fixed-ratio: The ratio of comparison to index subjects is the same for all matched sets, e.g., pairwise or 1:1 matching, 1:2 matching, etc.
- Variable-ratio: The ratio of comparison to index subjects varies among matched sets (by design or, more commonly, because of nonparticipation of certain selected subjects).
Individual Matching Methods
There are two common methods for individual matching:
- Category matching: Each comparison subject is selected from the matching category (stratum) to which the index subject belongs (e.g., white males, ages 30-34).
- Caliper matching: Each set of comparison subjects is selected to have values on the matching variable that are "close" to the corresponding value of the index subject.
There are two common methods of caliper matching:
1) Fixed caliper: The tolerance for eligibility is the same for all matched sets (e.g., age of index subject ± 2 years).
2) Variable caliper: The tolerance for eligibility varies among matched sets (e.g., nearest-neighbor matching, i.e., selecting an available comparison subject whose value on the matching variable is closest to that of the index subject). This is done to avoid failing to find a match for certain index subjects, or to get the best possible match.
It is possible to use both category and caliper matching on different variables in the same study.
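The difference between the two caliper methods can be illustrated with a small sketch. The ages, the function names, and the greedy one-pass selection without replacement are all assumptions made for illustration; real studies may select matches differently (for example, at random among eligible subjects).

```python
def fixed_caliper_match(index_ages, pool_ages, caliper=2):
    """Give each index subject the first unused pool member within +/- caliper years."""
    available = dict(enumerate(pool_ages))  # pool position -> age
    matches = {}
    for i, age in enumerate(index_ages):
        eligible = [j for j, a in available.items() if abs(a - age) <= caliper]
        if eligible:
            matches[i] = eligible[0]
            del available[eligible[0]]
        else:
            matches[i] = None  # no eligible comparison subject
    return matches


def nearest_neighbor_match(index_ages, pool_ages):
    """Give each index subject the closest unused pool member (variable caliper)."""
    available = dict(enumerate(pool_ages))
    matches = {}
    for i, age in enumerate(index_ages):
        if not available:
            matches[i] = None
            continue
        nearest = min(available, key=lambda j: abs(available[j] - age))
        matches[i] = nearest
        del available[nearest]
    return matches


# Hypothetical ages, for illustration only.
index_ages = [34, 50, 71]
pool_ages = [33, 47, 52, 60]

fixed = fixed_caliper_match(index_ages, pool_ages, caliper=2)
nearest = nearest_neighbor_match(index_ages, pool_ages)
print(fixed)
print(nearest)
```

With these hypothetical ages the fixed caliper leaves the 71-year-old index subject unmatched, while nearest-neighbor matching assigns the 60-year-old, a match that is 11 years away. This is the trade-off described above between always finding a match and keeping matches close.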
Individual Matching
- Direct matching: Matched sets are created from direct measurement of the matching variables, e.g., matching on age and sex.
- Natural matching: Matched sets are created from family or social-network relationships, e.g., comparison subjects are twins, siblings, friends, or neighbors of the index subjects. With this method, it is often unclear on which factors the members of matched sets are made similar.
Individual Matching
A special form of natural matching is matching subjects to themselves, i.e., index subjects serve as their own comparison subjects in one of these ways:
1) making within-person comparisons of the outcome over time, e.g., before vs. after first exposure in a cohort study involving recurrent outcomes (e.g., asthma attacks) within individuals;
2) comparing similar anatomical parts in the same subjects, e.g., right eye vs. left eye, where one eye is diseased and the other is not (case-control sampling), or one eye is treated and the other is not (randomly chosen);
3) comparing exposure frequency between cases (acute events) shortly before they became cases and prior to or after the disease period, i.e., a case-crossover design, in which cases serve as their own controls.
Individual Matching
In case-control studies with caliper or natural matching there may be overlap among categories of the matching variables, so that certain potential controls are represented in more than one category. This overlap can result in selection bias if exposure status is associated with inclusion in multiple categories. E.g., a problem with friend matching (selecting controls from the friends named by cases) is that certain individuals belong to multiple friendship networks and, therefore, have a higher probability of being selected as controls. Bias arises if people with many friends also have a higher exposure probability (e.g., alcohol consumption or social-support level). The bias from overlapping calipers tends to be small in real examples, however, so caliper matching remains common.
Frequency Matching
- Frequency matching of comparison subjects cannot be completed until all index subjects (cases) are identified (or until the expected distribution of the matching variables among the index subjects is known).
- The end result of frequency matching is essentially equivalent to category matching.
- The main difference is that frequency matching, unlike category matching, may involve any selection ratio of comparison to index subjects, even a ratio less than one (i.e., fewer total comparison subjects than index subjects).
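A minimal sketch of frequency matching, assuming the category of every index subject is already known and that a pool of candidate comparison subjects is available. The function name, the category labels, the pool, and the 0.5 selection ratio are hypothetical.

```python
import random
from collections import Counter


def frequency_match(index_categories, pool, ratio=1.0, seed=0):
    """Select comparison subjects so that each category gets ratio * (index count).

    pool is a list of (subject_id, category) pairs.
    """
    rng = random.Random(seed)
    quotas = {cat: round(ratio * n) for cat, n in Counter(index_categories).items()}
    selected = []
    for cat, quota in quotas.items():
        candidates = [subject for subject in pool if subject[1] == cat]
        if len(candidates) < quota:
            raise ValueError(f"not enough candidates in category {cat!r}")
        selected.extend(rng.sample(candidates, quota))
    return quotas, selected


# Hypothetical index group: 6 men and 4 women in the same age band.
index_categories = ["male 30-34"] * 6 + ["female 30-34"] * 4

# Hypothetical pool of candidate comparison subjects.
pool = [(f"m{i}", "male 30-34") for i in range(20)]
pool += [(f"f{i}", "female 30-34") for i in range(20)]

# A selection ratio below one: half as many comparison as index subjects.
quotas, selected = frequency_match(index_categories, pool, ratio=0.5)
selected_counts = Counter(category for _, category in selected)
print(quotas, dict(selected_counts))
```

The comparison group ends up with the same category proportions as the index group (60% and 40% here) even though it is smaller, which is what distinguishes frequency matching from pairwise category matching.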
Countermatching
A method called "countermatching" may have certain statistical advantages in population-based case-control studies in which we have a surrogate E* for exposure status for the entire source population. E.g., in occupational studies, we might know the location of employment for all workers in the source population (E* = 1 if in a location with exposure, E* = 0 otherwise), but we do not have the resources to determine the amount or timing of exposure for each worker in the source population. With countermatching, for each case with E* = 1 we select a control with E* = 0, and for each case with E* = 0 we select a control with E* = 1. In other words, controls get matched to cases with the opposite value of E*. The researcher then collects more detailed information on exposure status (and possibly confounders) for all subjects in the case-control study.
Countermatching may at first seem counterintuitive. How can we benefit by making comparison subjects different from index subjects? To understand how countermatching can improve statistical efficiency, we must consider more closely how matching can improve statistical efficiency.
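The selection step of countermatching can be sketched as follows. Only the control selection is shown, not the analysis; the function name, worker identifiers, and E* values are hypothetical.

```python
import random


def countermatch(cases, pool, seed=0):
    """Pair each case with an unused control whose surrogate E* is the opposite value.

    cases and pool are lists of (subject_id, e_star) with e_star in {0, 1}.
    """
    rng = random.Random(seed)
    available = {
        0: [subject_id for subject_id, e_star in pool if e_star == 0],
        1: [subject_id for subject_id, e_star in pool if e_star == 1],
    }
    pairs = []
    for case_id, e_star in cases:
        candidates = available[1 - e_star]  # opposite surrogate value
        if not candidates:
            pairs.append((case_id, None))
            continue
        control_id = candidates.pop(rng.randrange(len(candidates)))
        pairs.append((case_id, control_id))
    return pairs


# Hypothetical workers: E* = 1 if employed at a location with exposure.
cases = [("case1", 1), ("case2", 1), ("case3", 0)]
pool = [("w1", 0), ("w2", 0), ("w3", 1), ("w4", 1), ("w5", 0)]

pairs = countermatch(cases, pool)
e_star_of = dict(cases + pool)
print([(c, k, e_star_of[c], e_star_of[k]) for c, k in pairs])
```

Every matched set produced this way is discordant on the surrogate E*, which is the property the design relies on.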
Why Match?
The methodologic issues involved in matching are more complex than they first appear, which has led to widespread misunderstanding of matching. To appreciate the real statistical advantage of matching, we must consider how matched data are analyzed. As we see throughout this section, the analysis of matched data should usually take the matching variables into consideration through some form of stratification. Thus, matching of index and comparison subjects may not be sufficient to control for confounding due to the matching variables; stratified analysis or analogous methods may also be needed to complete the control of matching factors. Although matching is used as part of a strategy to control for confounding, confounding control is not the major reason for matching, because we can always control for measured confounders in the analysis without matching, via stratification or model fitting.
Major Statistical Advantage of Matching
In both case-control and cohort studies we aim to reduce the variance of adjusted estimators at a given sample size. This goal is especially important when there is a limited number of index subjects. The gain in precision occurs when the matching variable (M) is associated with both exposure status (E) and disease occurrence (D) in the source population, so that we would need to control for M as a confounder even if matching were not done. Thus, the major statistical reason for matching is not to control for confounders, which can be done in the analysis, but to produce a more efficient study (one that yields an estimator with a smaller variance for a given sample size) than if we had not matched.
Major Statistical Advantage of Matching
This gain in statistical efficiency (variance reduction), when it occurs, is obtained by equalizing (matching or "balancing") the ratio of comparison to index subjects across strata of the matching variable. However, such equalization does not always result in greater efficiency (especially in case-control studies), so it is important to understand when it helps and when it does not. Natural matching also offers logistical advantages in that it is often easier to find controls from among sibs, friends, and neighbors of cases.
Example: Source Population
Stable source population: Relation between low dietary beta-carotene (the exposure) and the incidence of lung cancer (D), by smoking status (the covariate).

Beta-carotene     Smokers            Nonsmokers         Total
                  D     P-Yrs        D     P-Yrs        D     P-Yrs
Low (E=1)         30    12,000       6     20,000       36    32,000
High (E=0)        10    12,000       6     60,000       16    72,000
Total             40    24,000       12    80,000       52    104,000

IR (rate ratio) = 3.00 within each smoking stratum; 5.06 crude (total).
Example: Unmatched Case-Control Study
Unmatched (population-based) density case-control study: Expected results taking all 52 cases and 52 randomly selected controls. Note that 12/52 = 23% of controls are smokers, reflecting the distribution of person-time in the source population (24,000/104,000). The proportions of smoking and nonsmoking controls that are exposed are also equal to the corresponding proportions of person-years in the source population.
Example: Unmatched Case-Control Study

Beta-carotene     Smokers               Nonsmokers            Total
                  Cases   Controls      Cases   Controls      Cases   Controls
Low (E=1)         30      6             6       10            36      16
High (E=0)        10      6             6       30            16      36
Total             40      12            12      40            52      52

OR = 3.00 within each smoking stratum; 5.06 crude (total).
mOR = 3.00, 95% CL = (1.16, 7.73)
Comment: Note the confounding by smoking in the source population and in the unmatched case-control study. This confounding is controlled using stratified analysis, e.g., the M-H method for estimating the common effect.
Example: Matched Case-Control Study
Matched case-control study: All 52 cases and 52 controls matched on smoking (expected results of pairwise matching). Note the same 1:1 ratio of controls to cases for smokers and nonsmokers in the total sample.

Beta-carotene     Smokers               Nonsmokers            Total
                  Cases   Controls      Cases   Controls      Cases   Controls
Low (E=1)         30      20            6       3             36      23
High (E=0)        10      20            6       9             16      29
Total             40      40            12      12            52      52

OR = 3.00 within each smoking stratum; 2.84 crude (total).
Example: Matched Case-Control Study: Conclusion
Although matching eliminated most of the confounding due to smoking, it introduced a selection bias (the crude estimator is still biased: 2.84 versus 3.00) that is controllable by controlling for smoking. It also produced a more precise estimate of the common odds ratio, with the same sample size, than did the unmatched design. Note that in the matched study, the ratio of cases to controls is the same for smokers and nonsmokers, i.e., the strata are balanced with respect to disease. In the unmatched study, however, the strata are quite unbalanced.
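The precision comparison can be checked with a short sketch that computes the Mantel-Haenszel odds ratio for both designs from the expected counts on the slides. The interval uses the Robins-Greenland-Breslow variance for the log of the M-H odds ratio; this is an assumption about how the slide's interval was computed, though it does reproduce the (1.16, 7.73) limits quoted for the unmatched design.

```python
import math


def mh_odds_ratio(strata, z=1.959964):
    """Mantel-Haenszel OR with a Robins-Greenland-Breslow 95% interval.

    Each stratum is (a, b, c, d) = (exposed cases, exposed controls,
    unexposed cases, unexposed controls).
    """
    R = S = PR = PS_QR = QS = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        R += r
        S += s
        PR += p * r
        PS_QR += p * s + q * r
        QS += q * s
    or_mh = R / S
    var_log = PR / (2 * R * R) + PS_QR / (2 * R * S) + QS / (2 * S * S)
    half = z * math.sqrt(var_log)
    return or_mh, var_log, (or_mh * math.exp(-half), or_mh * math.exp(half))


def crude_odds_ratio(strata):
    """Collapse the strata into one 2x2 table and take the cross-product ratio."""
    a, b, c, d = (sum(column) for column in zip(*strata))
    return a * d / (b * c)


# Strata: smokers, nonsmokers (expected counts from the slides).
unmatched = [(30, 6, 10, 6), (6, 10, 6, 30)]
matched = [(30, 20, 10, 20), (6, 3, 6, 9)]

or_u, var_u, ci_u = mh_odds_ratio(unmatched)
or_m, var_m, ci_m = mh_odds_ratio(matched)

print("unmatched:", round(crude_odds_ratio(unmatched), 2), round(or_u, 2), ci_u)
print("matched:  ", round(crude_odds_ratio(matched), 2), round(or_m, 2), ci_m)
```

Both designs give a Mantel-Haenszel odds ratio of 3.00, but the variance of its logarithm is smaller in the matched design, which is the efficiency gain described in the conclusion.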
Example: Matched Case-Control Study: Comments
In observational studies, a gain in statistical efficiency due to matching tends to occur when
1. the matching factor is one that must be controlled in the unmatched design (a confounder), and
2. the matching factor is strongly associated with
   - the outcome in a case-control study, or
   - the exposure in a cohort study.
Nonetheless, matching can lead to a loss of efficiency, although when there are few matching categories, the gains and the losses are usually not dramatic.
Analysis of Matched Data
When analyzing matched data, we must take the matching into consideration through some form of stratification. This objective is achieved by one of two methods, depending on the type of matching:
1) Ordinary stratified analysis: With category or frequency matching, the most efficient analytic strategy for estimating the effect is to conduct a general stratified analysis as described in ME2 Ch. 15. That is, we ignore matched sets (even if the comparison subjects are individually matched to index subjects) and re-stratify on all matching variables in the analysis (as in the previous example).
Analysis of Matched Data
2) Matched analysis: With caliper or natural matching, or a mixture of caliper and category matching, we usually conduct a stratified analysis by treating each matched set as a separate stratum ("matched analysis"). This type of analysis preserves the matched sets and usually yields small numbers within strata, demanding the use of sparse-data methods such as Mantel-Haenszel methods or conditional maximum likelihood.
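For 1:1 matched pairs, treating each pair as its own stratum makes the Mantel-Haenszel odds ratio reduce to the ratio of the two kinds of exposure-discordant pairs. The sketch below illustrates this with hypothetical pairs; the data and function names are assumptions for illustration only.

```python
def matched_pair_or(pairs):
    """Ratio of discordant pairs: (case exposed, control not) / (control exposed, case not)."""
    case_only = sum(1 for case, control in pairs if case == 1 and control == 0)
    control_only = sum(1 for case, control in pairs if case == 0 and control == 1)
    return case_only / control_only


def mh_or_over_pairs(pairs):
    """Mantel-Haenszel OR with every matched pair treated as a stratum of size 2."""
    numerator = denominator = 0.0
    for case, control in pairs:
        a, b = case, control          # exposed case, exposed control
        c, d = 1 - case, 1 - control  # unexposed case, unexposed control
        numerator += a * d / 2
        denominator += b * c / 2
    return numerator / denominator


# Hypothetical pairs coded as (case exposed?, control exposed?).
pairs = [(1, 1)] * 3 + [(1, 0)] * 4 + [(0, 1)] * 2 + [(0, 0)] * 1

print(matched_pair_or(pairs), mh_or_over_pairs(pairs))
```

Concordant pairs contribute nothing to either sum, so only the discordant pairs carry information about the odds ratio.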
Analysis of Matched Data cont.
We could use matched analysis with (individual) category matching, but ordinary stratified analysis yields more precise estimates of effect. Similarly, we could use ordinary stratified analysis with caliper matching (ignoring the matched sets), but this strategy might leave residual confounding from the matching variables if the categories are too wide (since mutually exclusive strata were not used to select matched comparison subjects), and it might lead to a loss in precision. In any case, so-called "matched analysis methods" are nothing more than analysis methods for sparse data.
Selection of Matching Variables: Statistical Considerations
- The selection of specific matching variables when designing a study depends on several statistical considerations as well as the study design.
- As noted previously, the statistical purpose of matching is to improve statistical precision when controlling for confounders.
- Although we can also control for confounders in the analysis without matching, the efficiency loss from stratification is particularly troublesome when the confounder is a nominal variable with many categories, e.g., occupation or neighborhood.
Selection of Matching Variables: Statistical Considerations
- Matching on such a confounder helps to prevent strata in which there are no cases or no controls. Such strata get discarded by most analysis methods, so that any subject in those strata contributes nothing to the analysis (uninformative strata).
- Without matching, therefore, it might not be possible to control adequately in the analysis for this type of variable. We would have to combine strata, possibly producing residual confounding.
Selection of Matching Variables: Statistical Considerations cont.
In certain situations, however, it may be counterproductive to match on a given variable:
1) Unlike in a cohort study, matching on a variable M in a case-control study usually precludes estimating its effect on disease occurrence, because the M-D association is altered artificially by matching.
   - This is one reason for matching only on known risk factors for the disease in case-control studies.
2) In both case-control and cohort studies, matching on an intermediate variable or on a factor that is affected by both the exposure and the disease will lead to a bias in both matched and crude analyses.
Selection of Matching Variables: Statistical Considerations
3) Sometimes matching can result in a loss of statistical efficiency, relative to not matching, instead of a gain.
   - This problem, called overmatching, occurs under different conditions in cohort versus case-control designs, and more readily in case-control designs.
Overmatching
In case-control studies, statistical overmatching is most likely to occur when the matching variable is not a risk factor for the disease but is strongly associated with exposure status in the source population. Furthermore, if the matching variable is not taken into account in the analysis, the effect estimate will usually be biased, as illustrated in the next example. Often, however, logistical considerations (how to recruit controls) may override concerns about statistical overmatching; for example, neighborhood matching is usually done for logistical reasons, without regard to its statistical impact.
Unfortunately, if too many factors are chosen for matching, it may become difficult or impossible to find a match for each case; e.g., there may often be no cooperative potential control in a neighborhood who is of the same sex and race as the case and also close in age. Fortunately, it is possible to partially match controls (e.g., loosen age or race matching requirements for some controls), as long as one controls the matching factor closely in the analysis.
Example: Overmatching In A Case-Control Study
Source population: Expected relation between neuroleptic (antipsychotic) drug exposure and the incidence (D) of tardive dyskinesia (TD), by psychiatric diagnosis (C, which is not a risk factor for TD among the unexposed).

Neuroleptics      Schizophrenic      Other              Total
                  D     P-Yrs        D     P-Yrs        D     P-Yrs
Exposed           80    4,000        20    1,000        100   5,000
Unexposed         10    1,000        40    4,000        50    5,000
Total             90    5,000        60    5,000        150   10,000

IR (rate ratio) = 2.00
Example: Overmatching In A Case-Control Study cont.
Unmatched case-control study: All 150 cases and 150 randomly selected controls (expected results).

Neuroleptics      Schizophrenic         Other                 Total
                  Cases   Controls      Cases   Controls      Cases   Controls
Exposed           80      60            20      15            100     75
Unexposed         10      15            40      60            50      75
Total             90      75            60      75            150     150

E(OR) = 2.00
Comment: Because diagnosis is not a risk factor for TD in the unexposed base population, it does not confound the exposure effect. Thus, the expected estimate of the cOR is unbiased in an unmatched case-control study.
Example: Overmatching In A Case-Control Study cont.
Matched case-control study: All 150 cases and 150 controls matched on diagnosis (expected results).

Neuroleptics      Schizophrenic         Other                 Total
                  Cases   Controls      Cases   Controls      Cases   Controls
Exposed           80      72            20      12            100     84
Unexposed         10      18            40      48            50      66
Total             90      90            60      60            150     150

E(OR) = 2.00 within each diagnosis stratum; 1.57 crude (total).
Example: Overmatching In A Case-Control Study cont.
Conclusion: By matching on diagnosis (a non-risk factor), diagnosis becomes associated with disease status in the unexposed study population (even though these two variables are not associated in the total study population). That is, matching on diagnosis introduces a selection bias. Since the crude estimate of effect (1.57) is biased, we must stratify on diagnosis in the analysis; but by forcing us to stratify, the matching results in a loss of statistical efficiency (statistical overmatching), because diagnosis is strongly associated with exposure status.
Note that the 95% confidence interval is wider in the matched study than in the unmatched study of the same size.
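The efficiency loss can be checked numerically from the expected counts in this example. The sketch below compares the variance of the log odds ratio across designs; the choice of variance formulas (Woolf for the crude estimate, Robins-Greenland-Breslow for the Mantel-Haenszel estimate) is an assumption made here, since the slides do not state which interval method was used.

```python
import math


def mh_log_or_variance(strata):
    """Mantel-Haenszel OR and Robins-Greenland-Breslow variance of its log.

    Each stratum is (a, b, c, d) = (exposed cases, exposed controls,
    unexposed cases, unexposed controls).
    """
    R = S = PR = PS_QR = QS = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        R += r
        S += s
        PR += p * r
        PS_QR += p * s + q * r
        QS += q * s
    variance = PR / (2 * R * R) + PS_QR / (2 * R * S) + QS / (2 * S * S)
    return R / S, variance


def crude_log_or_variance(strata):
    """Crude OR and Woolf variance of its log, after collapsing the strata."""
    a, b, c, d = (sum(column) for column in zip(*strata))
    return a * d / (b * c), 1 / a + 1 / b + 1 / c + 1 / d


# Strata: schizophrenic, other (expected counts from the slides).
unmatched = [(80, 60, 10, 15), (20, 15, 40, 60)]
matched = [(80, 72, 10, 18), (20, 12, 40, 48)]

crude_u, var_crude_u = crude_log_or_variance(unmatched)  # valid: no confounding
crude_m, _ = crude_log_or_variance(matched)              # biased by the matching
mh_u, var_mh_u = mh_log_or_variance(unmatched)
mh_m, var_mh_m = mh_log_or_variance(matched)

for label, variance in [("unmatched crude", var_crude_u),
                        ("unmatched M-H", var_mh_u),
                        ("matched M-H", var_mh_m)]:
    print(label, round(variance, 4), round(1.959964 * math.sqrt(variance), 3))
```

Under these assumptions the matched design has the largest variance: the unmatched study can use the crude estimate, while the matched study must stratify to remove the bias and pays for it in precision.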
Overmatching In A Case-Control Study cont.
Comment: Overmatching also results in efficiency loss when doing a matched analysis of case-control data, because the proportion of discordant matched sets is reduced.
Implication: In a case-control study, we do not want controls to be similar (matched) to cases on all factors other than the exposure. Such a strategy could result in severe overmatching, since certain matching variables may be strongly associated with exposure status but not be risk factors.
In summary, making controls similar to cases in a case-control study does not eliminate the need to control a factor, and may force us to control a factor that we would not otherwise have had to control. Making controls similar to cases is therefore a poor heuristic for controlling confounding. Matching needs to be done with some thought to the likely gains and losses that result from matching on each candidate variable, including both statistical and logistical considerations.
Statistical Overmatching and Study Design
If a particular variable is not a risk factor for the disease (or is a weak risk factor) but is associated with exposure status in the base population, matching on this variable can result in a loss of statistical efficiency in a case-control study but not in a cohort study. This difference in overmatching between case-control and cohort designs can be depicted as a type of confounding by the matching variable.
Case-control study: Matching on a factor related to exposure in a case-control study introduces a source of association between disease status and the matching variable, conditional on exposure. Suppose this factor was unrelated to disease to begin with, except through its association with exposure. After matching, it will be related to both exposure and disease, and thus become an "induced confounder". Because we now have to control for this factor, and this control usually increases variance, we suffer a loss of statistical efficiency compared with the expected results if matching had not been used. The stronger the M-E association, the greater the power loss from overmatching and, with rare exceptions, the greater the efficiency loss (variance increase) as well.
Statistical Overmatching and Study Design
Cohort study: In a cohort study, fixed-ratio matching does not introduce an M-D association conditional on exposure. Instead, fixed-ratio cohort matching eliminates any association between exposure status and the matching variable in the total cohort (source population). Thus, the matching variable is not a confounder in this type of cohort study. If the matching variable is not a risk factor, the unmatched variance estimator will be unbiased as well, so we don't have to control for it and there will be no loss of efficiency from matching on it.
Statistical Overmatching and Study Design
As in a case-control study, matching in a cohort study can result in a loss of statistical efficiency (i.e., statistical overmatching) when estimating the risk or rate ratio, even if the matching variable is a confounder in the source population (and hence, without matching, a stratified analysis must be done to get an unbiased point estimate). Usually, however, this efficiency loss does not occur when estimating the risk or rate difference. Thus, the conditions for overmatching in a cohort study are very different from the conditions for overmatching in a case-control study.
While it may be difficult in practice to predict when cohort matching on a factor will increase variance, the increase is most likely to occur when the confounding (in the common RR or IDR) by the factor in the source population is toward the null value, i.e., when the ratio measure is biased toward one in the source population if the factor is not controlled. For example, we might expect a loss in statistical efficiency by matching on a confounder when cIR = 1.57 and mIR = 2.00, or when cIR = 0.70 and mIR = 0.50.
Example: Overmatching in a Cohort Study
Source population: Relation between neuroleptic drug exposure and the incidence (D) of TD, by psychiatric diagnosis (which is a negative confounder in this population because schizophrenics are at lower risk of TD, unlike the two previous examples).

Neuroleptics      Schizophrenic      Other              Total
                  D     Persons      D     Persons      D     Persons
Exposed           60    30,000       160   20,000       220   50,000
Unexposed         20    20,000       120   30,000       140   50,000
Total             80    50,000       280   50,000       360   100,000

RR = 2.00 within each diagnosis stratum; 1.57 crude (total).
Example: Overmatching in a Cohort Study
Unmatched cohort study: Randomly sample 10% of all exposed persons and unexposed persons in the source population (expected results).

Neuroleptics      Schizophrenic      Other              Total
                  D     Persons      D     Persons      D     Persons
Exposed           6     3,000        16    2,000        22    5,000
Unexposed         2     2,000        12    3,000        14    5,000
Total             8     5,000        28    5,000        36    10,000

E(RR) = 2.00 within each diagnosis stratum; 1.57 crude (total).
E(RRMH) = E(RRML) = 2.00
Expected 95% CI(RRMH) = (1.01, 3.94)
Expected 95% CI(RRML) = (1.02, 3.93)
Comment: The crude risk ratio is confounded toward the null by diagnosis, because schizophrenics are less likely than other patients to get TD (among the unexposed) but more likely to be exposed to neuroleptics.
Example: Overmatching in a Cohort Study
Matched cohort study: Randomly sample 10% of all exposed persons and select an equal number of unexposed subjects matched on diagnosis (expected results).

Neuroleptics      Schizophrenic      Other              Total
                  D     Persons      D     Persons      D     Persons
Exposed           6     3,000        16    2,000        22    5,000
Unexposed         3     3,000        8     2,000        11    5,000
Total             9     6,000        24    4,000        33    10,000

E(RR) = 2.00 within each diagnosis stratum; 2.00 crude (total).
E(RRMH) = E(RRML) = E(crude RR) = 2.00
Expected 95% CL(cRR) = Expected 95% CL(RRMH) = Expected 95% CL(RRML) = (0.97, 4.12)
Comments: Even though matching on diagnosis made diagnosis a nonconfounder, it resulted in some loss of statistical efficiency (a larger variance and hence a wider 95% confidence interval). This efficiency loss occurred because of the smaller number of cases in the matched design (i.e., 33 vs. 36).
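The intervals in the two cohort designs can be reproduced approximately with a short sketch. It assumes the Mantel-Haenszel risk ratio with the Greenland-Robins variance for its logarithm; the slides do not name the variance formula, so this is an assumption, although it gives limits close to those quoted.

```python
import math


def mh_risk_ratio(strata, z=1.959964):
    """Mantel-Haenszel risk ratio with a Greenland-Robins 95% interval.

    Each stratum is (a, n1, b, n0) = (exposed cases, exposed persons,
    unexposed cases, unexposed persons).
    """
    numerator = denominator = variance_numerator = 0.0
    for a, n1, b, n0 in strata:
        n = n1 + n0
        numerator += a * n0 / n
        denominator += b * n1 / n
        variance_numerator += ((a + b) * n1 * n0 - a * b * n) / n**2
    rr = numerator / denominator
    var_log = variance_numerator / (numerator * denominator)
    half = z * math.sqrt(var_log)
    return rr, var_log, (rr * math.exp(-half), rr * math.exp(half))


def crude_risk_ratio(strata):
    """Collapse the strata and take the ratio of the two overall risks."""
    a, n1, b, n0 = (sum(column) for column in zip(*strata))
    return (a / n1) / (b / n0)


# Strata: schizophrenic, other (expected counts from the slides).
unmatched = [(6, 3000, 2, 2000), (16, 2000, 12, 3000)]
matched = [(6, 3000, 3, 3000), (16, 2000, 8, 2000)]

rr_u, var_u, ci_u = mh_risk_ratio(unmatched)
rr_m, var_m, ci_m = mh_risk_ratio(matched)

print("unmatched:", round(crude_risk_ratio(unmatched), 2), round(rr_u, 2), ci_u)
print("matched:  ", round(crude_risk_ratio(matched), 2), round(rr_m, 2), ci_m)
```

Matching removes the confounding (the crude and adjusted risk ratios agree in the matched cohort), but the variance of the log risk ratio is larger than in the unmatched cohort, consistent with the wider interval on the slide.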
Matching in Experiments
Matching in experiments (more often called blocking) is done before exposure. It affects exposure assignment but does not affect the marginal matching-factor distribution in the experimental cohort, except insofar as the investigator discards potential subjects during this process (e.g., because they have no match, or because the investigator wants a constant number within blocks). Exposure plays no role in this selection process because no one is exposed until after selection. In contrast, in nonexperimental cohort studies, exposure precedes selection, and when the unexposed are matched to the exposed, exposure affects the matching-factor distribution in the cohort. In randomized experiments, matching does NOT lead to a loss in large-sample statistical efficiency, whereas, as shown above, it can lead to efficiency loss in a nonexperimental study.
Matching and Cost Efficiency: Trade-off between statistical considerations and study costs
Typically, the process of matching in an observational study adds to the cost of data collection because of the need to identify eligible matches for index subjects by collecting information on the matching variables for persons who do not get into the study. Thus, the decision to match on a specific variable (M) depends on whether the gain in statistical efficiency from matching on a risk factor is worth the added cost.
To judge whether one method of estimating a parameter is more cost efficient than a second method, we assess whether the first method yields a more precise estimate for the same study cost, or whether it costs less to obtain the same degree of precision. Matching, therefore, is more cost efficient than not matching if it costs less to achieve a given level of precision by matching than by simply increasing the sample size without matching, provided the latter strategy is feasible. For example, if a matching strategy increases efficiency by 10% but adds over 10% to the total cost of data collection, the matching strategy is not cost efficient.
Matching and Cost Efficiency: Trade-off between statistical considerations and study costs
The extra cost of matching tends to be least when data on the matching variables are readily available for all persons in the source population before they are selected, e.g., in computer files. In fact, matching on such variables as place of residence or place of diagnosis, especially in case-control studies, may even reduce data-collection costs.
E.g., suppose we wish to conduct a population-based case-control study of lung cancer among adults in Los Angeles County, where the county tumor registry is used to identify new cases. It is not very practical to select controls randomly from all adult residents of Los Angeles County because there is no convenient list of eligible residents. One commonly used alternative is to individually match controls to cases on residential neighborhood and possibly other factors, i.e., controls are selected from the same neighborhoods as cases. Even if "neighborhood" is not a risk factor for lung cancer (conditional on exposure status and other covariates), the matching can facilitate the selection of controls in such a way as to reduce the likelihood of response bias (e.g., if the probability of cooperation depends on SES).
Matching and Cost Efficiency: Advantages outweighing the costs The statistical advantage of matching on one or more risk factors is most likely to outweigh the added costs when at least one of the following conditions holds: 1) We match on a confounder that is a nominal variable with many categories. Without matching, the analysis controlling for this confounder would be inefficient because of the sparse data resulting from many strata of the matching variable, or there might be residual confounding due to collapsing strata. 2) The cost per subject of collecting additional data after matching is high (e.g., an expensive lab test to measure the exposure in a case-control study). Thus, if it improves efficiency, matching on a confounder reduces total data-collection costs by reducing the sample size needed to obtain a desired level of precision. 3) There is a very limited supply of available index subjects in the source population, i.e., exposed persons in a cohort study or cases in a case-control study. Thus, we want to maximize the information obtained from additional comparison subjects when index subjects are in limited supply. One reason, therefore, that matching is more common in case-control studies than in cohort studies is that case-control designs are often used in settings in which the disease is rare and cases are in short supply.
Matching and Cost Efficiency: Matching is most counterproductive when it is done so tightly, or on so many variables, that we cannot find matched comparison subjects for several index subjects. In this situation, we might even be left with some index subjects without any matched comparison subjects. To deal with this problem, we must either drop these unmatched index subjects from the matched analysis or ignore the matching and adjust for all matching variables in the analysis. In all these scenarios, however, the net result of not finding enough eligible matches is to reduce the sample size and therefore to lose precision, sometimes substantially.
Summary: Matching 1) Matching (partial restriction) in the selection of comparison subjects differs for cohort and case-control studies, but in both designs it should be followed by an analytic method that takes the matching variables into consideration, either by stratifying on the matching variables (especially with category matching) or by some form of matched analysis (stratifying down to the original matched sets, especially with natural matching). Failure to deal with the matching in the analysis tends to introduce bias in a case-control study (and sometimes in a cohort study with variable-ratio matching or density analysis), and it tends to overestimate variances (lose precision) in a cohort study. 2) The major statistical reason for matching is to gain statistical efficiency, not to reduce bias (the latter can be done without matching). Specifically, matching on a confounder often allows the investigator to control for this confounder more efficiently than if matching had not been used.
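The bias from ignoring the matching in a case-control analysis can be reproduced from the sex-matched example earlier in these slides (ME 3 p. 173). The sketch below uses expected counts and assumes a source population of 1,000,000 exposed and 1,000,000 unexposed persons; that size is an assumption, chosen because it reproduces the 4740 cases and the stated risks. The variable and function names are illustrative.

```python
from fractions import Fraction

# Expected cases by sex and exposure, from the risks quoted in the example
# (0.005, 0.0005, 0.001, 0.0001), ASSUMING 1,000,000 exposed (90% male)
# and 1,000,000 unexposed (10% male) persons in the source population.
cases = {
    "male":   {"exposed": 4500, "unexposed": 50},
    "female": {"exposed": 100,  "unexposed": 90},
}
# Exposure prevalence within each sex in the assumed source population.
exposure_prevalence = {"male": Fraction(9, 10), "female": Fraction(1, 10)}


def odds_ratio(a, b, c, d):
    """OR for exposed cases a, unexposed cases b,
    exposed controls c, unexposed controls d."""
    return Fraction(a * d, b * c)


# 1:1 matching on sex: as many controls as cases within each sex, with the
# exposure distribution of that sex in the source population.
controls = {}
for sex, n in cases.items():
    n_controls = n["exposed"] + n["unexposed"]
    exposed = n_controls * exposure_prevalence[sex]
    controls[sex] = {"exposed": exposed, "unexposed": n_controls - exposed}

stratum_or = {
    sex: odds_ratio(cases[sex]["exposed"], cases[sex]["unexposed"],
                    controls[sex]["exposed"], controls[sex]["unexposed"])
    for sex in cases
}


def total(table, level):
    # Collapse over sex, i.e. ignore the matching variable.
    return sum(table[sex][level] for sex in table)


crude_or = odds_ratio(total(cases, "exposed"), total(cases, "unexposed"),
                      total(controls, "exposed"), total(controls, "unexposed"))

print({sex: float(v) for sex, v in stratum_or.items()})  # {'male': 10.0, 'female': 10.0}
print(round(float(crude_or), 1))                         # 5.0
```

Stratifying on the matching variable recovers the odds ratio of 10 in both sexes, while the crude analysis that ignores the matching gives about 5.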
Summary: Matching 3) Matching in an observational study can also result in a loss in statistical efficiency. The conditions in which such statistical overmatching occurs differ for case-control and cohort designs. In case-control studies, it is likely to occur when the matching variable is not a risk factor for the disease but is strongly associated with exposure status in the source population. In cohort studies, overmatching is difficult to predict, but it is more likely to occur when the matching variable biases the risk/rate ratio toward the null value in the source population. 4) It is usually not advantageous to match on a variable if it is neither a known risk factor for the disease (conditional on exposure) nor a mechanism for reducing study costs (e.g., place of residence). In case-control (but not cohort) studies, matching on a variable usually precludes estimation of its effect on disease occurrence. We should also not match on a variable if it is affected by exposure status, e.g., as an intermediate variable in the causal pathway between exposure and disease.
Summary: Matching 5) Matching often adds to the cost of data collection by making it harder to find reference subjects. Thus, matching is cost efficient only if it is less costly to gain precision by matching than by increasing the sample size without matching. Matching tends to be most cost efficient when: 1) we match on a confounder that is a nominal variable with many categories; 2) the extra cost of matching is small, and the cost per subject of collecting additional data after matching is very high; or 3) there is a very limited supply of available index subjects, such as a case-control study in which there are fewer than 50 eligible cases. Matching tends to be most counterproductive when eligibility criteria for comparison subjects are set so restrictively (e.g., by matching on several variables) that matches cannot be found for some of the index subjects, thereby reducing the effective sample size and precision.
Example from the literature: CYP2D6 polymorphism, pesticide exposure, and Parkinson's disease. Elbaz et al. 2003 Ann Neurol; 55: 430-4 Case-control study of Parkinson's disease (PD). Goal: assess gene-environment interactions between the xenobiotic-metabolizing cytochrome P450 gene (CYP2D6) and pesticides. PD cases (N=247) and controls (N=676) were identified from among subjects enrolled in the French health insurance organization for farmers (Mutualité Sociale Agricole, MSA). Covers ages 18-75 years; cases submitted a claim for PD-related health care within an 18-month period; controls submitted claims for other care. Up to 3 controls were matched to each case by age, sex, and residency. Of 190 PD cases, 123 were examined by a study neurologist; for 56, only the treating neurologist provided information; for 11, no additional information was available.
Example from the literature: CYP2D6 polymorphism, pesticide exposure, and Parkinson's disease. Elbaz et al. 2003 Ann Neurol; 55: 430-4 Pesticide exposure was assessed by occupational health physicians using individual expert evaluation procedures. Participants were classified as never users, users for gardening, or occupational users. Design (partial) restrictions: restricted to subjects of European descent; matched 190 cases to 419 controls (83 quadruplets, 63 triplets, 44 pairs).
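The matched-set counts on this slide can be checked against the totals. The sketch assumes each set contains exactly one case, so a quadruplet holds 3 controls, a triplet 2, and a pair 1.

```python
# Matched-set composition reported on the slide.
# name: (number of sets, controls per set), ASSUMING one case per set.
matched_sets = {"quadruplet": (83, 3), "triplet": (63, 2), "pair": (44, 1)}

n_cases = sum(n_sets for n_sets, _ in matched_sets.values())
n_controls = sum(n_sets * k for n_sets, k in matched_sets.values())

print(n_cases, n_controls)             # 190 419
print(round(n_controls / n_cases, 2))  # 2.21 controls per case on average
```

The totals agree with the 190 cases and 419 controls reported, and the average of about 2.2 controls per case shows how far the realized ratio fell below the intended 3.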
Example from the literature: CYP2D6 polymorphism, pesticide exposure, and Parkinson's disease. Elbaz et al. 2003 Ann Neurol; 55: 430-4 Controls were individually matched to cases on age of the case at the time of the study (+/- 5 years), gender, and region of residency (according to the administrative division into "Départements"). We could not use the complete list of MSA affiliates to select controls because, unfortunately, at that time there was no such computerized list in a "clean" and usable format. During the study, they implemented an electronic system, and for the new study we are now using a complete electronic database, with controls randomly selected from all affiliates after accounting for the matching variables.
Example from the literature: CYP2D6 polymorphism, pesticide exposure, and Parkinson's disease. Elbaz et al. 2003 Ann Neurol; 55: 430-4 Here, instead, we used persons making health claims for a variety of medical reasons; I have not been able to obtain an exact estimate of how well this covers the underlying population. What I have been told several times is that over 80% of affiliates aged 40 years or older make at least one health claim every 2 years, and that this percentage increases with age. The physicians were instructed to match a maximum of three controls to each case, and for the majority of cases three controls were included. Some MSA physicians included only 2 or 1 controls to limit the amount of work... Because of the matching, we lost some sets of cases and controls due to missing variables (this is what the figure tries to show); for instance, if one case had no DNA, we lost the entire case-control set when using conditional logistic regression; alternatively, if a case had no controls with available information, we lost that case. We preferred to stick to conditional logistic regression (in particular, cases and controls were also matched on the occupational health physician by design, since they lived in the same area, so that cases and controls were also matched on the person assessing pesticide exposure), but we always checked that unconditional logistic regression with the appropriate adjustments led to similar results.
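The loss of matched sets described above can be sketched as a simple filter. This is an illustration only: the data structure, the function name, and the example sets are hypothetical and are not taken from the study.

```python
def usable_sets(matched_sets):
    """Keep only matched sets that can contribute to a conditional analysis.

    matched_sets maps a set id to {"case": bool, "controls": [bool, ...]},
    where each bool says whether that subject has complete data (e.g. DNA).
    """
    kept = {}
    for set_id, s in matched_sets.items():
        complete_controls = [c for c in s["controls"] if c]
        # A set needs its case AND at least one control with complete data.
        if s["case"] and complete_controls:
            kept[set_id] = {"case": True, "controls": complete_controls}
    return kept


# Hypothetical sets, for illustration only.
example = {
    "A": {"case": True,  "controls": [True, True, True]},
    "B": {"case": False, "controls": [True, True, True]},  # case has no DNA
    "C": {"case": True,  "controls": [False, False]},      # no usable controls
    "D": {"case": True,  "controls": [True, False, True]}, # one control dropped
}

kept = usable_sets(example)
print(sorted(kept))                # ['A', 'D']
print(len(kept["D"]["controls"]))  # 2
```

Set B is lost even though all three of its controls have complete data, which is the price of a set-based analysis that the slide describes; an unconditional analysis adjusting for the matching variables could still use those controls.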