Influence of control selection in genome-wide association studies: the example of diabetes in the Framingham Heart Study

Epidemiologic study designs represent a major challenge for genome-wide association studies. Most such studies to date have selected controls from the pool of participants without the disease of interest at the end of the study period. This choice can lead to biased estimates of exposure effects. Using data from the Framingham Heart Study (Genetic Analysis Workshop 16 Problem 2), we compare genetic association estimates from designs in which controls are selected according to their status at the end of the study (case-exclusion (CE) sampling) with estimates from designs that use incidence density (ID) sampling, in which controls are selected from the pool of participants who are disease-free at the time a case is diagnosed. Cases are defined as those diagnosed with type 2 diabetes (T2D). We estimated odds ratios for 18 previously confirmed T2D variants using 189 cases selected by ID sampling and using 231 cases selected by CE sampling. We found none of these single-nucleotide polymorphisms to be significantly associated with T2D using either design. Because these empirical analyses were based on a small number of cases and on single-nucleotide polymorphisms with likely small effect sizes, we supplemented this work with simulated data sets of 500 cases for each strategy across a variety of allele frequencies and effect sizes. In our simulated data sets, ID sampling was less biased than CE sampling, although CE sampling showed an apparent increase in power due to the upward bias of its point estimates. We conclude that ID sampling is an appropriate option for genome-wide association studies.


Background
The genetic architecture of type 2 diabetes (T2D) appears to be composed of several genes, each of which has a modest impact on disease risk. Despite significant advances in our understanding of the genetic determinants of the monogenic forms of diabetes, the definitive identification of genes that increase risk of common T2D in the general population has been far more elusive. However, a string of recent genome-wide association studies (GWAS) has given promising clues to additional genes involved in common T2D risk.
GWAS offer an approach to gene discovery that is unbiased with regard to presumed functions or locations in the genome. The common method of control selection used for many GWAS is to form a single pool of potential controls consisting of subjects who were not cases by the end of the study period. However, this method has been shown by Greenland and Thomas [1] and Lubin and Gail [2] to lead to biased estimates of the rate ratio. This bias has been termed "case-exclusion bias". Moreover, differences in the origin of populations of cases and controls can arise if the two groups are recruited independently or have different inclusion criteria, and the presence of population stratification can lead to a type I error rate greater than the nominal level.
Another method of control selection, termed "incidence density sampling", uses subjects who survived to the time of case occurrence to make a pool of potential controls for each case. The pool of potential controls may include subjects who later become cases and subjects who develop other diseases. This nested case-control design can be a very efficient approach to obtain unbiased estimates of relative risks associated with genetic variants.
In this study, we use the GWAS data from the Framingham Heart Study (FHS, Genetic Analysis Workshop 16 Problem 2) to compare the influence of control selection on the results for T2D.

FHS
The FHS is a community-based, multigenerational, longitudinal study of cardiovascular disease and its risk factors, including diabetes. The FHS began in 1948 to investigate the causes of heart disease. Men and women between the ages of 28 and 62 years were recruited and followed prospectively over time. Beginning in 1971, offspring of the Original Cohort were recruited as part of the Framingham Offspring Study. There are a total of 6752 subjects. There are 765 pedigrees with 2 to 301 genotyped subjects: 134 pedigrees with 2, 123 with 3, 98 with 4, 85 with 5, 177 with 6 to 10, 72 with 11 to 15, 30 with 16 to 20, and 46 with more than 20.

Case-control definitions
Cases were defined as people with a diagnosis of type 2 diabetes (T2D) during follow-up of the FHS cohort. Cases belonged to the first, second, or third generation of the FHS. Age at diagnosis for the 231 unrelated male and female cases ranged from 20 to 80 years.
In our nested incidence density case-control approach, 10 individually matched controls per case were randomly selected with replacement from members of the cohort who did not have a T2D diagnosis at the time the case was identified. Because age is a strong risk factor for T2D, controls were always selected from participants whose age at enrollment was within ± 5 years of that of the case. Controls were additionally matched on sex and on body mass index (BMI) at enrollment (± 2 kg/m²). Cases and controls were not members of the same family. In our case-exclusion approach, controls were selected from members of the FHS who never received a T2D diagnosis during the recorded follow-up. We then adjusted for the age, sex, and BMI matching criteria used in our nested case-control approach.
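The matching rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run on the FHS data: the record keys (`id`, `family`, `sex`, `age`, `bmi`, `dx_time`) and the function name are hypothetical, and "with replacement" is read here as allowing the same person to serve as a control for more than one case while being drawn at most once per case.

```python
import random

def incidence_density_sample(cohort, n_controls=10, age_tol=5, bmi_tol=2.0, seed=0):
    """Risk-set sampling sketch. Returns {case_id: [control_ids]}.

    cohort: list of dicts with hypothetical keys
        id, family, sex, age (at enrollment), bmi (at enrollment),
        dx_time (time of T2D diagnosis, or None if never diagnosed).
    """
    rng = random.Random(seed)
    matched = {}
    cases = [p for p in cohort if p["dx_time"] is not None]
    for case in sorted(cases, key=lambda p: p["dx_time"]):
        # Eligible controls are still disease-free when the case is diagnosed;
        # they may become cases later on.
        risk_set = [
            p for p in cohort
            if p["id"] != case["id"]
            and (p["dx_time"] is None or p["dx_time"] > case["dx_time"])
            and p["family"] != case["family"]
            and p["sex"] == case["sex"]
            and abs(p["age"] - case["age"]) <= age_tol
            and abs(p["bmi"] - case["bmi"]) <= bmi_tol
        ]
        # Assumption: no repeats within one case, reuse allowed across cases.
        chosen = rng.sample(risk_set, min(n_controls, len(risk_set)))
        matched[case["id"]] = [p["id"] for p in chosen]
    return matched
```

A case-exclusion pool would instead keep only the records with `dx_time is None`, which is exactly what removes future cases from the comparison group.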

Statistical analyses
As a quality control measure, we tested for Hardy-Weinberg disequilibrium in controls using an exact test. All markers were in Hardy-Weinberg equilibrium in the observed FHS data and in all simulated samples. All individuals had complete data for sex, age, BMI, and diabetes except 15 controls in the incidence density (ID) sample and 28 in the case exclusion (CE) sample for whom BMI at enrollment was not available. All SNPs had no more than 10.4% missing data, which we judged to be acceptable.
BMC Proceedings 2009, 3(Suppl 7):S113 http://www.biomedcentral.com/1753-6561/3/S7/S113
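For readers unfamiliar with exact Hardy-Weinberg tests, the following is a minimal sketch of one common formulation, which conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one. The text does not say which exact test or software was used, so the function name and this particular formulation are assumptions.

```python
from math import exp, lgamma, log

def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact Hardy-Weinberg p-value for one biallelic SNP (illustrative sketch)."""
    n = n_aa + n_ab + n_bb          # genotyped individuals
    n_a = 2 * n_aa + n_ab           # copies of allele A
    n_b = 2 * n - n_a               # copies of allele B

    def log_prob(het):
        # Log-probability of `het` heterozygotes given n, n_a and n_b.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # Heterozygote counts must share the parity of the allele count.
    hets = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {h: exp(log_prob(h)) for h in hets}
    observed = probs[n_ab]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs.values() if p <= observed * (1 + 1e-9)))
```

Applied to control genotype counts, a small p-value would flag a marker for exclusion or closer inspection.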
Genetic associations with T2D (odds ratios, confidence intervals, and statistical tests) were estimated and tested using conditional logistic regression under the additive model for the ID sampling approach and using logistic regression, adjusted for matching variables, in the case-exclusion approach. These analyses were carried out in SAS software using the PHREG procedure.
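As a reminder of what the conditional analysis estimates, the conditional likelihood for individually matched sets under an additive genetic model can be written as follows. The notation is introduced here for illustration and does not appear in the original analysis: $i$ indexes the $n$ matched sets, $x_{i0}$ is the minor-allele count (0, 1, or 2) of the case, $x_{ij}$ for $j = 1, \dots, m_i$ are the minor-allele counts of its sampled controls, and $\exp(\beta)$ is the per-allele odds ratio.

```latex
L(\beta) \;=\; \prod_{i=1}^{n}
  \frac{\exp(\beta\, x_{i0})}
       {\sum_{j=0}^{m_i} \exp(\beta\, x_{ij})}
```

Because each factor compares the case only with members of its own matched set, the matching variables (age, sex, and BMI) cancel out of the likelihood and need no separate coefficients.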

Simulations
Simulations were used to investigate control selection effects in a larger sample of individuals than that in the observed FHS data, and with SNPs having larger effect sizes. We simulated 11 sets of 100 replicates with varying minor allele frequencies and generating hazard ratios. These simulations were used to compare bias and power between the control sampling designs. A SAS program was used to simulate diabetes as a function of SNP genotype. We generated data sets of 10,000 individuals with SNP genotypes assigned probabilistically according to allele frequencies of 0.10, 0.30, or 0.50. We then assigned diabetes status and time of onset using an exponential model based on SNP hazard ratios from 1.3 to 3 (see Table 1). We selected five controls for each case according to the ID (risk set) sampling scheme and set a 5:1 control:case ratio for the CE sampling at the end of follow-up. We then estimated odds ratios (ORs) and confidence intervals (CIs) and performed tests of association for each SNP. We repeated this 100 times to report the average bias and the estimated power for each SNP (defined as the proportion of replicates with a statistically significant association (p < 0.05)). For each method, the bias ratio was computed as the estimated OR divided by the generating hazard ratio used in the simulation.
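The simulation design can be illustrated with a short, self-contained Python sketch of a single replicate. It is not the SAS program used for the study. The baseline rate, follow-up length, function names, and the hand-written Newton-Raphson fits are assumptions chosen so that a cohort of 10,000 yields roughly 500 cases; power estimation and confidence intervals are omitted for brevity.

```python
import random
from math import exp

def simulate_cohort(n, maf, hr, base_rate, follow_up, rng):
    """Return (genotype, onset_time or None if disease-free at end) per person."""
    cohort = []
    for _ in range(n):
        g = int(rng.random() < maf) + int(rng.random() < maf)  # minor-allele count
        t = rng.expovariate(base_rate * hr ** g)               # exponential onset time
        cohort.append((g, t if t <= follow_up else None))
    return cohort

def clogit_beta(sets, iters=25):
    """One-parameter conditional logistic fit; sets = [(case_x, [control_x, ...])]."""
    b = 0.0
    for _ in range(iters):
        grad = info = 0.0
        for case_x, controls in sets:
            xs = [case_x] + controls
            ws = [exp(b * x) for x in xs]
            tot = sum(ws)
            m1 = sum(w * x for w, x in zip(ws, xs)) / tot
            m2 = sum(w * x * x for w, x in zip(ws, xs)) / tot
            grad += case_x - m1
            info += m2 - m1 * m1
        b += grad / info
    return b

def logistic_beta(xs, ys, iters=25):
    """Slope of an unconditional logistic regression with intercept."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a, b = a + (h11 * g0 - h01 * g1) / det, b + (h00 * g1 - h01 * g0) / det
    return b

def run_replicate(maf, hr, seed, n=10000, k=5, base_rate=0.003, follow_up=10.0):
    """Return (number of cases, ID bias ratio, CE bias ratio) for one replicate."""
    rng = random.Random(seed)
    cohort = simulate_cohort(n, maf, hr, base_rate, follow_up, rng)
    cases = [i for i, (_, t) in enumerate(cohort) if t is not None]
    sets = []
    for i in cases:  # ID sampling: controls must be disease-free at the case's onset
        g_case, t_case = cohort[i]
        controls = []
        while len(controls) < k:
            j = rng.randrange(n)
            t_j = cohort[j][1]
            if j != i and (t_j is None or t_j > t_case):
                controls.append(cohort[j][0])
        sets.append((g_case, controls))
    or_id = exp(clogit_beta(sets))
    never = [g for g, t in cohort if t is None]  # CE sampling: never diagnosed
    ce_controls = rng.sample(never, k * len(cases))
    xs = [cohort[i][0] for i in cases] + ce_controls
    ys = [1] * len(cases) + [0] * len(ce_controls)
    or_ce = exp(logistic_beta(xs, ys))
    return len(cases), or_id / hr, or_ce / hr
```

Calling `run_replicate(0.3, 2.0, seed=1)` gives one draw; averaging the two bias ratios over 100 seeds per scenario mirrors the comparison described above. A single replicate is too noisy to show the direction of the bias on its own.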

Results and discussion
In order to maximize precision, we chose a ratio of 10 controls per case for both sampling strategies in the FHS data. Because we did not have exact dates and BMI at onset of diabetes, we used the age at enrollment, i.e., the age at Visit 1, and BMI at enrollment to match cases and controls. To accommodate the effect of random ID control selection, we repeated random sampling and conditional logistic regression 10 times. The distribution of OR estimates obtained in each analysis showed wide variability across replicates, with a coefficient of variation from 14% to 20% per SNP among the ID sampling replicates. We report the average OR from these 10 replicates in Table 2, along with confidence limits based on the method of Rubin [5] that takes within-replicate and across-replicate variation into account. We also show the average p-value per SNP to indicate whether statistical significance was achieved in any replicate.
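The pooling step can be sketched as follows. This is a simplified illustration of Rubin's combining rules, assuming they are applied on the log-OR scale and using a normal quantile in place of the t reference distribution. Neither detail is stated in the text, and the function name is hypothetical.

```python
from math import exp, sqrt
from statistics import mean, variance

def rubin_pool(log_ors, ses, z=1.96):
    """Pool replicate log-ORs and standard errors into (OR, lower, upper).

    Assumes at least two replicates and a normal approximation for the
    confidence limits (a simplification of Rubin's t-based interval).
    """
    m = len(log_ors)
    q_bar = mean(log_ors)                     # pooled point estimate
    within = mean(se ** 2 for se in ses)      # average within-replicate variance
    between = variance(log_ors)               # across-replicate variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance
    half = z * sqrt(total)
    return exp(q_bar), exp(q_bar - half), exp(q_bar + half)
```

When the replicates disagree, the across-replicate term widens the interval beyond what any single control draw would suggest, which is the behavior described above.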
We failed to find a significant association for any of the 18 previously reported SNPs using ID sampling or CE sampling in FHS (Table 2). We included 18 SNPs with convincing association evidence; however, two important SNPs were missing from our genotyping data (rs757210 in TCF2 and rs13266634 in SLC30A8) and could not be considered in the FHS. One drawback of our study is the limited number of T2D cases, despite the very large database. With only 189 incident cases and 231 total cases, our study had low power to detect genetic association between SNPs and T2D, especially considering the expected magnitudes of association based on previous reports. Owing to the wide CIs of the ORs in our two scenarios, our results are less conclusive than those of previous studies conducted in larger samples (>1000 cases). An alternative explanation for the low power is that we considered each SNP separately rather than a combination of variants acting additively on risk, which may have a large effect.
Because the empirical data are hard to interpret due to the small number of cases and small effect sizes, we further addressed differences between control sampling methods via simulation with larger sample sizes and effect sizes. For each simulation scenario, we simulated 100 cohort data sets, each with approximately 500 cases, as described in the Methods section (Table 1). These simulations show that when more precision can be obtained and larger effect sizes are considered, ID sampling does indeed have less bias, while CE methods have a slight upward bias, leading to the appearance of increased power. We suggest that this increased power should be interpreted with caution given the bias, and we recommend ID sampling as the appropriate strategy for case-control analyses nested in cohorts.