Propensity score analysis in the Genetic Analysis Workshop 17 simulated data set on independent individuals
© Lin et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Volume 5 Supplement 9
Genetic Analysis Workshop 17 provided simulated phenotypes and exome sequence data for 697 independent individuals (209 case subjects and 488 control subjects). The disease liability in these data was influenced by multiple quantitative traits. We addressed the lack of statistical power in this small data set by limiting the genomic variants included in the study to those with potential disease-causing effect, thereby reducing the problem of multiple testing. After this adjustment, we could readily detect two common variants that were strongly associated with the quantitative trait Q1 (C13S523 and C13S522). However, we found no significant associations with the affected status or with any of the other quantitative traits, and the relationship between disease status and genomic variants remained obscure. To address the challenge of the multivariate phenotype, we used propensity scores to combine covariates with genetic risk factors into a single risk factor and created a new phenotype variable, the probability of being affected given the covariates. Using the propensity score as a quantitative trait in the case-control analysis, we again could identify the two common single-nucleotide polymorphisms (C13S523 and C13S522). In addition, this analysis captured the correlation between Q1 and the affected status and reduced the problem of multiple testing. Although the propensity score was useful for capturing and clarifying the genetic contributions of common variants to the disease phenotype and the mediating role of the quantitative trait Q1, the analysis did not increase power to detect rare variants.
Although genome-wide association studies in population samples may provide information on disease associations with relatively common genetic polymorphisms, they give only a tiny glimpse of the underlying functional variants and networks that might contribute to disease processes. In particular, rare variants in gene coding regions that are not in linkage disequilibrium with common variants are not well represented in these studies. Exome resequencing projects have therefore emerged to fill this knowledge gap; these studies focus on the detection of rare variants in coding regions of the genome. In addition, phenotypes are often influenced by many underlying quantitative traits and environmental exposures that might contribute to disease risk. These factors might be present in case subjects as well as in control subjects, thereby complicating the detection of risk factors. Covariates might introduce bias into the analysis if they are not evenly distributed between case subjects and control subjects. As confounding factors they might also obscure the relationship between risk factors and disease. New statistical approaches are clearly needed to address these analytical challenges.
Propensity score analysis is a relatively novel statistical approach to dimension reduction, bias detection, and risk estimation in studies where multiple covariates are present. In clinical trials, the propensity score is often understood as the conditional probability of being assigned to a “treatment” group given the subject’s observed characteristics. It can be used to balance confounding covariates in the treatment and control groups, thereby reducing the selection bias in observational studies.
Three major applications of the propensity score have emerged. In the first approach, the propensity score is used to reduce bias in observational case-control analyses when covariates are not evenly distributed between the case and control groups. In this context, case and control subjects can be matched one to one on the propensity score to create more homogeneous groups for comparison. This approach often leads to a significant reduction in sample size through elimination of unmatched cases. In the second approach, strata based on the propensity score are created to allow for multiple matches and retention of larger sample sizes when perfect matches cannot be found. Finally, in the most common application, the propensity score is used to summarize information on confounding covariates into a single score, and the propensity score itself is then included as a covariate in a logistic regression model predicting the outcome.
In this study, we explore the use of the propensity score as a means to reduce the multiple dimensions of the phenotype in a case-control design using the exome sequencing data provided in the framework of the Genetic Analysis Workshop 17 (GAW17). Because the outcome is directly influenced by multiple quantitative traits, we use the propensity score to summarize the information on the covariates and to create a new quantitative trait or outcome variable, the probability of being affected given the covariates. Then, we compare this multivariate approach with the univariate approach in a case-control association analysis after reducing the multiple-testing problem even further by selecting only potential disease-causing genomic variants.
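The propensity score described above, the probability of being affected given the covariates, is commonly estimated with logistic regression. The following is a minimal sketch under stated assumptions, not the authors' implementation: the two covariates are simulated toy values (not GAW17 data), and the function names `fit_logistic` and `propensity_scores`, the learning rate, and the iteration count are illustrative choices.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(w, row):
    """Intercept w[0] plus the weighted sum of the covariates in row."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))


def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Fit logistic regression by batch gradient ascent on the mean log-likelihood.

    Returns the weights as [intercept, coefficient_1, ...].
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, yi in zip(X, y):
            err = yi - sigmoid(linear_predictor(w, row))
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def propensity_scores(X, w):
    """Predicted probability of being affected for every subject."""
    return [sigmoid(linear_predictor(w, row)) for row in X]


# Toy data (assumption): two standardized covariates drive the affected status.
rng = random.Random(17)
X, y = [], []
for _ in range(300):
    c1, c2 = rng.gauss(0, 1), rng.gauss(0, 1)
    liability = 1.2 * c1 + 0.5 * c2 - 1.0
    y.append(1 if rng.random() < sigmoid(liability) else 0)
    X.append([c1, c2])

weights = fit_logistic(X, y)
scores = propensity_scores(X, weights)

case_scores = [s for s, yi in zip(scores, y) if yi == 1]
control_scores = [s for s, yi in zip(scores, y) if yi == 0]
print("mean score, cases:   ", sum(case_scores) / len(case_scores))
print("mean score, controls:", sum(control_scores) / len(control_scores))
```

The resulting `scores` list holds one value between 0 and 1 per subject and can be used as a single quantitative outcome in place of the separate covariates.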
Table 1. Distribution of covariates between case and control subjects. Columns: case group (N = 209) and control group (N = 488); rows include sex (male).
Table 2. Variable selection by stepwise regression. Probability (chi-square test): 1.22 × 10^−38; 4.67 × 10^−39; 2.50 × 10^−43.
The genotype information was based on data from the pilot3 study of the 1000 Genomes Project provided in the framework of GAW17. Based on autosomal sequence data, 24,487 single-nucleotide polymorphism (SNP) genotypes located in 3,205 known genes were provided. We removed synonymous SNPs and SNPs of unknown function based on the assumption that these SNPs were most likely not disease related. Variants with minor allele frequencies (MAFs) less than 0.001 were also removed, because those SNPs were present in only one individual and disease association would have been difficult to assign under these circumstances. The remaining 8,079 nonsynonymous SNPs were retained for further analysis.
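The filtering rule above can be sketched as a simple predicate. The records, identifiers, and field names (`function`, `maf`) below are hypothetical and serve only to illustrate the rule; they are not the GAW17 annotation format. The sketch also checks the arithmetic behind the MAF cut-off: with 697 diploid individuals there are 1,394 alleles, so a variant observed once has a MAF of 1/1394, which is below 0.001.

```python
# Hypothetical annotation records; identifiers and field names are assumptions.
variants = [
    {"id": "snp_a", "function": "nonsynonymous", "maf": 0.0670},
    {"id": "snp_b", "function": "synonymous", "maf": 0.1200},
    {"id": "snp_c", "function": "unknown", "maf": 0.0300},
    {"id": "snp_d", "function": "nonsynonymous", "maf": 1 / (2 * 697)},  # one copy
    {"id": "snp_e", "function": "nonsynonymous", "maf": 2 / (2 * 697)},  # two copies
]

MAF_THRESHOLD = 0.001  # cut-off stated in the text


def keep_variant(variant, maf_threshold=MAF_THRESHOLD):
    """Keep nonsynonymous variants whose MAF is not below the threshold."""
    return variant["function"] == "nonsynonymous" and variant["maf"] >= maf_threshold


retained = [v["id"] for v in variants if keep_variant(v)]

# A variant carried once among 697 individuals (1,394 alleles).
singleton_maf = 1 / (2 * 697)

print("retained:", retained)
print("singleton MAF:", round(singleton_maf, 5))
```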
Initially, we performed a univariate genome-wide case-control association analysis, first on the binary affected status and then on the quantitative traits Q1, Q2, and Q4 individually using the correlation trend test under the additive genetic model as implemented in the software program SVS, version 7, from Golden Helix. Population stratification present in this data set was corrected with 10 principal components, as indicated by the scree plot. Then, we used the propensity score, defined as the probability of being affected given the contributing covariates, as the outcome variable in the case-control association analysis. The Bonferroni approach was used to correct for multiple testing. In addition, we performed 10,000 single-value permutations and full-scan permutations in SVS, version 7, to confirm our results.
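A trend test under the additive model can be illustrated with a short sketch. It assumes the common form in which the test statistic is N·r², where r is the Pearson correlation between the additive genotype coding (0, 1, or 2 minor alleles) and the phenotype, referred to a chi-square distribution with 1 degree of freedom; for a binary trait this is the Cochran-Armitage trend statistic. The exact computation in SVS and the principal component correction are not reproduced here, and the function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_trend_test(genotypes, phenotype):
    """Return (statistic, p-value) with statistic = N * r^2 on 1 degree of freedom."""
    n = len(genotypes)
    r = pearson_r(genotypes, phenotype)
    stat = n * r * r
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy example: a quantitative phenotype that rises with the minor-allele count.
genotypes = [0, 0, 1, 1, 2, 2]
phenotype = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
stat, p_value = correlation_trend_test(genotypes, phenotype)
print("statistic:", stat, "p-value:", p_value)

# Toy example with no trend: a binary phenotype balanced across genotypes.
null_stat, null_p = correlation_trend_test([0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1])
print("statistic:", null_stat, "p-value:", null_p)
```

Because the phenotype argument is simply a list of numbers, the same function accepts a binary affected status, a quantitative trait, or a propensity score.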
Table 3. Genome-wide association analysis on univariate and multivariate phenotypes. Correlation trend test p-values: 8.8 × 10^−13; 7.1 × 10^−9; 6.0 × 10^−11; 4.8 × 10^−7; 2.3 × 10^−9; 1.8 × 10^−5; 2.3 × 10^−7.
Case-control association analysis with the propensity score as the quantitative trait correctly identified the association with C13S523 (p = 2.3 × 10^−9, Bonferroni-corrected p = 1.8 × 10^−5) and C13S522 (p = 2.3 × 10^−7, Bonferroni-corrected p = 0.0018) mediated through Q1 (Table 3). Permutation analysis with 1,000 permutations revealed a permutation p = 0.001 for both associations. Both SNPs were located in the gene FLT1. Even though we were able to capture the correlation between Q1 and the affected status and to detect true associations with common variants (no false-positive signals were found at the genome-wide level of significance), our approach was unable to detect additional rare variants. The analysis missed 69 true disease-related SNPs with MAFs between 0.17 and 0.00071, and β values between 1 and 0.03, including nine additional SNPs in the gene FLT1.
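The Bonferroni-corrected values can be checked with a few lines of arithmetic. The sketch assumes that the number of tests equals the 8,079 retained SNPs, which reproduces the reported corrected p-values to within rounding of the raw p-values (2.3 × 10^−9 × 8,079 is about 1.86 × 10^−5). The significance level of 0.05 is an assumption used only to illustrate a genome-wide threshold.

```python
N_TESTS = 8079  # assumption: one test per retained nonsynonymous SNP
ALPHA = 0.05    # assumption: conventional family-wise significance level


def bonferroni(p_value, n_tests=N_TESTS):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Raw and corrected p-values as reported for the propensity score analysis.
reported = {
    "C13S523": (2.3e-9, 1.8e-5),
    "C13S522": (2.3e-7, 0.0018),
}

threshold = ALPHA / N_TESTS  # per-test threshold under Bonferroni

corrected = {}
for snp, (p_raw, p_reported) in reported.items():
    corrected[snp] = bonferroni(p_raw)
    print(f"{snp}: computed {corrected[snp]:.3g}, reported {p_reported:.3g}, "
          f"passes threshold: {p_raw < threshold}")
```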
A common approach to rare genomic variants is the selection of variants that are present only in case subjects. Taking this approach, we found 421 SNPs that fulfilled this criterion; 16 of those were present in only one case subject. Out of 405 SNPs present in at least two case subjects, only 5 SNPs in five different genes were associated with disease according to the model and 400 SNPs were false positives. The most obvious contributing factor to the inflation of rare variants is ethnic admixture in this data set. Even though we attempted to correct for this problem using principal components analysis, it could not be completely eliminated.
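The case-only selection rule and the counts reported above can be sketched as follows. The sample identifiers, variant names, and the `carriers` data structure are hypothetical; only the counts 421, 16, and 5 come from the text.

```python
def case_only_variants(carriers, is_case, min_case_carriers=2):
    """Select variants carried by no control and by at least min_case_carriers cases.

    carriers maps a variant name to the set of sample ids carrying the minor allele.
    """
    selected = []
    for variant, samples in carriers.items():
        n_case = sum(1 for s in samples if is_case[s])
        n_control = len(samples) - n_case
        if n_control == 0 and n_case >= min_case_carriers:
            selected.append(variant)
    return selected


# Toy data (assumption): three case subjects and two control subjects.
is_case = {"s1": True, "s2": True, "s3": True, "s4": False, "s5": False}
carriers = {
    "v1": {"s1", "s2"},        # two cases only: selected
    "v2": {"s1"},              # single case: excluded
    "v3": {"s2", "s4"},        # also seen in a control: excluded
    "v4": {"s1", "s2", "s3"},  # three cases only: selected
}
selected = case_only_variants(carriers, is_case)
print("selected:", selected)

# Counts reported in the text.
case_only_total = 421
single_case = 16
candidates = case_only_total - single_case   # SNPs in at least two case subjects
true_positives = 5
false_positives = candidates - true_positives
print("candidates:", candidates, "false positives:", false_positives)
print("fraction false positive:", round(false_positives / candidates, 4))
```

On the reported counts, roughly 99% of the case-only candidates were unrelated to disease in the simulation model.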
Composite phenotypes that are strongly influenced by multiple quantitative traits with specific genetic risk factors are a frequently encountered phenomenon in genetic studies of common complex disorders. Often these traits are present in case subjects as well as in control subjects, and they might introduce bias or even be confounding factors in the estimates of disease associations. A joint estimate of these covariates is rarely included in genome-wide association studies.
Propensity score analysis has emerged as an approach to dimensionality reduction. Because the contributing quantitative traits are present in both case and control subjects, a probabilistic approach that summarizes the multiple risk factors is appropriate, particularly because the genetic risk factors predominantly influence the affected status through the quantitative traits. Using propensity score analysis, we were able to detect two SNPs associated with the affected phenotype that were obscured when only the affected phenotype was used as the outcome, and we were able to clarify the relationship between the quantitative traits and the phenotype.
The application of the propensity score in the more traditional sense as a means to reduce bias was limited in this data set. After all, potentially confounding covariates were not the focus of this simulation. In fact, most covariates were directly and causally related to the affected status. Attempts to match case and control subjects on the propensity score either by one-to-one matching or by stratification resulted in a significant reduction in sample size as a result of a large number of unmatched observations. The resulting loss of power made the detection of significant associations impossible. Using the propensity score as a covariate after inclusion of all known contributing variables eliminated all the genetic contributions to the affected status mediated by the quantitative traits.
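The loss of sample size under one-to-one matching can be illustrated with a small sketch. The matching algorithm and caliper used in the study are not stated, so the greedy nearest-neighbor rule, the caliper of 0.05, and the toy scores below are all assumptions. When the covariates are causally related to the affected status, case subjects tend to have high scores for which no comparable control exists.

```python
def greedy_match(case_scores, control_scores, caliper=0.05):
    """Greedy one-to-one nearest-neighbor matching within a caliper.

    Returns a list of (case_index, control_index) pairs; each control is used once.
    """
    available = list(range(len(control_scores)))
    pairs = []
    for i, score in enumerate(case_scores):
        best, best_dist = None, None
        for j in available:
            dist = abs(score - control_scores[j])
            if dist <= caliper and (best is None or dist < best_dist):
                best, best_dist = j, dist
        if best is not None:
            pairs.append((i, best))
            available.remove(best)
    return pairs


# Toy propensity scores (assumption): cases cluster at high values.
case_scores = [0.90, 0.85, 0.60, 0.30]
control_scores = [0.10, 0.20, 0.32, 0.58, 0.15, 0.25]

pairs = greedy_match(case_scores, control_scores)
n_before = len(case_scores) + len(control_scores)
n_after = 2 * len(pairs)
print("matched pairs:", pairs)
print("subjects before matching:", n_before, "after matching:", n_after)
```

In this toy example only two of the four case subjects find a control within the caliper, so the matched sample retains 4 of 10 subjects.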
Still, using the propensity score as a dimensionality reduction tool has several advantages over multivariate regression. Multivariate regression models are often concerned with finding parsimonious models using only a limited number of covariates to avoid overparameterization. In the propensity score estimation, the number of covariates that can be included is not limited by the model. Interactions and nonlinear terms can easily be incorporated.
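As an illustration of how interactions and nonlinear terms can be added to a propensity score model, the sketch below expands each subject's covariates with squared terms and pairwise products before the score is estimated. The helper name `expand_covariates` and the choice of second-order terms are assumptions made for this example only.

```python
from itertools import combinations


def expand_covariates(row):
    """Return the covariates followed by their squares and pairwise products."""
    squares = [x * x for x in row]
    interactions = [a * b for a, b in combinations(row, 2)]
    return list(row) + squares + interactions


# Toy covariate vector (assumption): three covariates for one subject.
row = [2.0, 3.0, 0.5]
expanded = expand_covariates(row)
print(expanded)
```

For p covariates this produces 2p + p(p − 1)/2 terms, all of which are still summarized by a single score per subject.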
A commonly encountered problem in case-control association studies is false-positive association resulting from nonrandom differences between case and control subjects that are not related to the presence of the disease itself [9, 10]. Population stratification resulting from admixture of different ethnic groups with differences in allele frequencies or uneven distribution of sex and other confounding covariates can introduce biases that are frequently not addressed in the study design. In the GAW17 data set, ethnicity was such a confounding factor. Ethnicity itself was not related to the disease status; however, the presence of seven different ethnicities introduced a large number of rare and private mutations in the data set. The overwhelming number of rare and private variants that were found only in the case group yet were not related to disease cautions against the assumption, in studies involving independent individuals and complex disorders, that presence in case subjects and absence in control subjects is evidence of pathogenicity. This data set demonstrates that ignoring design issues, particularly population admixture and unbalanced covariates between case and control subjects, can introduce noise into the data and can complicate the discovery of true disease associations. Post hoc statistical analysis cannot always correct for these design issues.
Sequencing data sets still have a relatively small sample size that limits the power of a study to detect associations with rare variants in a traditional case-control design. Therefore, it might be useful to identify genomic variants that are more likely to cause disease, such as nonsynonymous variants and truncating and nonsense mutations in coding regions, through resequencing approaches. Focusing on those variants would reduce the multiple-testing problem and increase the power to detect disease-associated variants with large effect. However, realistic expectations should be in place when dealing with these data. Case-control association designs might not be the appropriate approach to rare variants, particularly under genetic heterogeneity. Family-based resequencing approaches might be more appropriate under these conditions.
Propensity score analysis could be a useful tool in genetic case-control association analyses. Even though we admit that this simulated data set had limitations for the meaningful use of this method, our study demonstrates an application to the dimensionality reduction of phenotypes that are influenced by multiple correlated traits with strong genetic risk factors. This approach might give some advantage in settings in which issues related to multiple testing arise. Potential problems include the selection of covariates. Our approach did not increase the power to detect rare variants, which remains a problem that is difficult to address in case-control studies.
This work was supported by National Institutes of Health (NIH) grant 1R01 MH085744-01A1 through the National Institute of Mental Health, awarded to Berit Kerner. We would like to thank Ingrid Munch for helpful advice and lively discussions. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.