Visualizing genotype × phenotype relationships in the GAW15 simulated data
© Qin et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
We have developed a graphical display tool called SIMLAPLOT for visualizing different ways in which continuous covariates may influence the genotype-specific risk for complex human diseases. The purpose of our study was to examine continuous covariates in the Genetic Analysis Workshop 15 simulated data set using our novel graphical display tool, with knowledge of the answers. The generated plots provide information about genetic models for the simulated continuous covariates and may help identify the single-nucleotide polymorphisms associated with the underlying quantitative trait loci.
One of the most challenging aspects of complex genetic traits is developing intuition regarding genotype × phenotype relationships across the distribution of a continuous covariate. Standard family-based and case-control association tests do not directly examine the role of continuous disease-related covariates in genetic models. Such covariates may themselves have a genetic basis in the form of a quantitative trait locus (QTL), or they may interact statistically with one or more susceptibility genes (gene × environment (G × E) interaction). A third possibility is that they may define more homogeneous subgroups of patients or families, in which the main effect of a particular susceptibility gene is more easily detected. While family-based designs offer protection against spurious associations as a result of population stratification, they are known to be less efficient than case-control designs for some disease models. Case-control designs may also have advantages over family-based designs in terms of distinguishing QTL models from G × E interaction models. To improve our understanding of a variety of complex genetic models used in simulation studies, we developed a novel graphical display tool, SIMLAPLOT, which produces plots of the relationship between affection status, continuous covariate values, and marker genotypes. SIMLAPLOT provides a way to examine genetic model parameters for continuous traits and to evaluate models by comparing plots from observed data to theoretical model plots. By applying SIMLAPLOT to the simulated Genetic Analysis Workshop 15 (GAW15) data, our goal was to identify the SNPs in highest linkage disequilibrium (LD) with simulated QTLs underlying measured continuous covariates and to characterize the corresponding genetic models qualitatively.
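The definitions below are consistent with a logistic regression model for affection status. A reconstruction under that assumption, in notation that may differ from the authors', is:

```latex
\[
\ln\frac{P(\mathrm{AFF}=1 \mid G, E)}{1 - P(\mathrm{AFF}=1 \mid G, E)}
  = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E)
\]
```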
where AFF = 1 if affected and AFF = 0 otherwise. G codes for the three possible genotypes (dd, Dd, DD) at a bi-allelic susceptibility locus or nearby marker based on the user-specified mode of inheritance (additive, dominant, or recessive). β1 is the log-transformed odds ratio for the susceptibility locus. E is a continuous, normally distributed covariate; it can be an environmental risk factor, an endophenotype, or a quantitative trait, which depends on an underlying QTL. β2 is the log-transformed odds ratio for a user-specified one-unit increase of the continuous covariate. G × E is defined as the product of G and E, and β3 is the log-transformed odds ratio for this interaction term. β0 adjusts for the user-specified disease prevalence in the population of simulated individuals.
SIMLAPLOT evaluates QTL models, G × E interaction models, and genetic main effect models with covariate-defined heterogeneity. It produces four types of plots to explore different aspects of the relationship between affection status, continuous covariate values and marker genotypes in each model.
Plot type 1: Genotype-specific penetrance values as a function of covariate values
Three penetrance curves, one for each genotype, are produced. These curves display changes in penetrance as a function of E, if E is a risk factor for the simulated disease phenotype, either alone or in combination with genetic susceptibility.
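As an illustration of how such curves could be computed, the following minimal sketch evaluates genotype-specific penetrance over a grid of covariate values. It assumes a logistic model with parameters β0 to β3 and a genotype coding of 0, 1, or 2 copies of the D allele, recoded by mode of inheritance. The function names (`genotype_code`, `penetrance`, `penetrance_curves`) and the parameter values are hypothetical and are not taken from SIMLAPLOT.

```python
import math

# Assumed coding: number of D alleles (dd=0, Dd=1, DD=2), recoded
# according to the user-specified mode of inheritance.
def genotype_code(n_d_alleles, mode="additive"):
    if mode == "additive":
        return float(n_d_alleles)
    if mode == "dominant":
        return 1.0 if n_d_alleles >= 1 else 0.0
    if mode == "recessive":
        return 1.0 if n_d_alleles == 2 else 0.0
    raise ValueError(f"unknown mode of inheritance: {mode}")

def penetrance(n_d_alleles, e, b0, b1, b2, b3, mode="additive"):
    """P(AFF = 1 | G, E) under the assumed logistic model."""
    g = genotype_code(n_d_alleles, mode)
    eta = b0 + b1 * g + b2 * e + b3 * g * e  # linear predictor (log odds)
    return 1.0 / (1.0 + math.exp(-eta))

def penetrance_curves(e_grid, b0, b1, b2, b3, mode="additive"):
    """Return one penetrance curve per genotype over the covariate grid."""
    labels = ("dd", "Dd", "DD")
    return {
        label: [penetrance(k, e, b0, b1, b2, b3, mode) for e in e_grid]
        for k, label in enumerate(labels)
    }

if __name__ == "__main__":
    # Illustrative parameter values only.
    curves = penetrance_curves([-2.0, 0.0, 2.0], b0=-3.0, b1=0.5, b2=0.3, b3=0.2)
    for label, values in curves.items():
        print(label, [round(v, 4) for v in values])
```

With β2 = β3 = 0 the three curves are flat, which corresponds to E not being a risk factor for the disease phenotype.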
Plot type 2: Conditional genotype probability as a function of covariate values and affection status
Three frequency curves, one for each genotype, are produced. At each point on the x-axis, the sum of the three frequencies is 1.0. The respective frequencies change as a function of E if the genotypes correspond to a QTL, if there is interaction with an environmental covariate, or if E is an indicator of genetic heterogeneity.
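One way to obtain such curves is Bayes' rule: the probability of genotype g given E = e and affection status is proportional to the genotype frequency, times the density of E given g, times the probability of the affection status given g and e. The sketch below assumes Hardy-Weinberg genotype frequencies, normally distributed E within each genotype, an additive genotype coding, and a logistic penetrance model. All function names are hypothetical and this is not SIMLAPLOT's own code.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def hwe_frequencies(p):
    """Frequencies of dd, Dd, DD for D-allele frequency p (HWE assumed)."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]

def conditional_genotype_prob(e, affected, p, means, sds, betas):
    """P(G = g | E = e, AFF) for g = dd, Dd, DD (additive coding 0, 1, 2).

    means/sds: genotype-specific mean and SD of E (equal values = no QTL).
    betas: (b0, b1, b2, b3) of the assumed logistic penetrance model.
    """
    b0, b1, b2, b3 = betas
    weights = []
    for g, freq in enumerate(hwe_frequencies(p)):
        pen = 1.0 / (1.0 + math.exp(-(b0 + b1 * g + b2 * e + b3 * g * e)))
        status_prob = pen if affected else 1.0 - pen
        weights.append(freq * normal_pdf(e, means[g], sds[g]) * status_prob)
    total = sum(weights)
    return [w / total for w in weights]  # sums to 1.0 at every value of e

if __name__ == "__main__":
    # Pure QTL model: genotype shifts the mean of E, no direct effect on risk.
    for e in (-1.0, 1.0, 3.0):
        probs = conditional_genotype_prob(
            e, True, 0.3, [0.0, 1.0, 2.0], [1.0, 1.0, 1.0], (-2.0, 0.0, 0.0, 0.0)
        )
        print(e, [round(x, 3) for x in probs])
```

When the genotype-specific means are equal and β1 = β3 = 0, the curves reduce to the flat population genotype frequencies, so any slope in the plot reflects a QTL, an interaction, or heterogeneity.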
Plot type 3: Covariate distribution for each genotype in affected individuals
Plot type 4: Covariate distribution for each genotype in unaffected individuals
The covariate distributions are plotted for each genotype, separately for affected and unaffected individuals. The comparison of the two plots reflects the main effect of E, or the strength of G × E interaction.
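A theoretical version of these two plots can be sketched by weighting the genotype-specific normal density of E by the probability of the affection status and renormalizing numerically. The code below is a minimal illustration under an assumed logistic penetrance model with additive genotype coding. The names `covariate_density` and `trapezoid` are hypothetical helpers, not part of SIMLAPLOT.

```python
import math

def trapezoid(y, x):
    """Trapezoidal-rule integral of y over the grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def covariate_density(e_grid, g, affected, mean, sd, betas):
    """Density of E given genotype g (0, 1, 2) and affection status.

    f(e | g, AFF) is proportional to f(e | g) * P(AFF | g, e); the
    normalizing constant is computed numerically on e_grid.
    """
    b0, b1, b2, b3 = betas
    unnormalized = []
    for e in e_grid:
        pen = 1.0 / (1.0 + math.exp(-(b0 + b1 * g + b2 * e + b3 * g * e)))
        status_prob = pen if affected else 1.0 - pen
        z = (e - mean) / sd
        dens = math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
        unnormalized.append(dens * status_prob)
    area = trapezoid(unnormalized, e_grid)
    return [u / area for u in unnormalized]

if __name__ == "__main__":
    grid = [-8.0 + 0.01 * i for i in range(1601)]
    betas = (-3.0, 0.0, 1.0, 0.0)  # illustrative: main effect of E only
    aff = covariate_density(grid, 1, True, 0.0, 1.0, betas)
    una = covariate_density(grid, 1, False, 0.0, 1.0, betas)
    print("mean of E, affected:  ", round(trapezoid([e * d for e, d in zip(grid, aff)], grid), 3))
    print("mean of E, unaffected:", round(trapezoid([e * d for e, d in zip(grid, una)], grid), 3))
```

In this sketch a positive β2 shifts the distribution of E upward in affected individuals and downward in unaffected individuals for every genotype, whereas a nonzero β3 makes the size of that shift genotype-dependent.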
SIMLAPLOT plots the theoretical conditional distributions for the different models (model-based plots) given the following input parameters: the mean and standard deviation of E, which may or may not be genotype-dependent; the allele frequency for the susceptibility locus, QTL, or nearby marker; all relevant odds ratios; the mode of inheritance; and the type of model (QTL, G × E, or heterogeneity). Some parameters, such as genotype-specific means and variances, can be estimated from an existing data set, and some are approximated based on the assumed model, e.g., QTL. SIMLAPLOT also produces the same types of plots from the observed data (data-based plots). Comparison of the observed to the theoretical distributions may suggest an appropriate model for the observed data set. To produce these plots, SIMLAPLOT uses a kernel density estimate with a choice of kernel and width b. Kernel options include Gaussian (the default), rectangular, triangular, and cosine. Because the visual appearance of the plots depends on the smoothing parameters, it is important to evaluate how robust that appearance is to their choice. SIMLAPLOT determines the optimal degree of smoothing either by minimizing the mean squared error (the default) or by minimizing the mean distance to the center-matched Gaussian predictions.
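For orientation, a kernel density estimate in its standard textbook form is f(x) = (1 / (n b)) Σ K((x − x_i) / b), where K is the kernel and b the width. The sketch below implements that form with the four kernel shapes named above. The kernel scalings (compact kernels supported on [−1, 1]) are a common convention and an assumption here; SIMLAPLOT's actual scaling and its bandwidth-selection criteria are not reproduced.

```python
import math

# Standard kernel shapes; each integrates to 1. The scaling is an assumed
# textbook convention and may differ from the one used by SIMLAPLOT.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "rectangular": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
    "triangular": lambda u: max(0.0, 1.0 - abs(u)),
    "cosine": lambda u: (math.pi / 4.0) * math.cos(math.pi * u / 2.0) if abs(u) <= 1.0 else 0.0,
}

def kde(x, sample, b, kernel="gaussian"):
    """Kernel density estimate at x with width b."""
    k = KERNELS[kernel]
    return sum(k((x - xi) / b) for xi in sample) / (len(sample) * b)

def kde_curve(grid, sample, b, kernel="gaussian"):
    return [kde(x, sample, b, kernel) for x in grid]

if __name__ == "__main__":
    sample = [0.2, 0.5, 0.9, 1.4, 1.5, 2.8]  # made-up covariate values
    grid = [0.5 * i for i in range(-2, 9)]
    # Simple robustness check: compare curves across kernels and widths.
    for kernel in KERNELS:
        for b in (0.3, 0.6, 1.2):
            curve = kde_curve(grid, sample, b, kernel)
            print(kernel, b, [round(v, 3) for v in curve])
```

Overlaying the curves obtained for several widths and kernels is a simple way to check whether a feature of the plot, such as a second mode, is an artifact of undersmoothing.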
We applied SIMLAPLOT to the GAW15 simulated data sets using the quantitative covariates IgM, anti-CCP (anti-cyclic citrullinated peptide), and severity of RA (rheumatoid arthritis). We analyzed all SNP markers on chromosomes 9, 11, and 18. Because covariate values exist only for affected individuals, we specified a relative risk of 1.0 and focused on two types of plots: the conditional genotype probability (plot type 2) and the covariate distribution for each genotype in affected individuals (plot type 3). The input parameters for an assumed QTL model (plots labeled "model-based"), such as genotype-specific mean and variance, were estimated from the observed data for the specified SNP. We demonstrate SIMLAPLOT with data from Replicate 1. To evaluate our qualitative conclusions, we performed quantitative trait association analysis using the Monks-Kaplan method as implemented in the QTDT program. We obtained p-values and their ranks for all 100 simulated replicates.
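The model-based step described here can be sketched as follows: estimate the genotype-specific mean and standard deviation of the covariate from the observed data for one SNP, then compute the genotype probabilities implied by a QTL model with a relative risk of 1.0, in which case affection status carries no information. Estimating the genotype frequencies from the same sample is an assumption of this sketch, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, stdev

def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_qtl_model(genotypes, covariate):
    """Per-genotype (frequency, mean, SD) of the covariate.

    Requires at least two observations per genotype for the SD.
    """
    n = len(genotypes)
    params = {}
    for g in sorted(set(genotypes)):
        values = [e for gi, e in zip(genotypes, covariate) if gi == g]
        params[g] = (len(values) / n, mean(values), stdev(values))
    return params

def model_based_genotype_prob(e, params):
    """P(G = g | E = e) under a QTL model with relative risk 1.0."""
    weights = {g: f * normal_pdf(e, mu, sd) for g, (f, mu, sd) in params.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

if __name__ == "__main__":
    # Toy data: genotype coded as number of copies of one SNP allele.
    genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    covariate = [0.9, 1.0, 1.1, 1.0, 1.9, 2.0, 2.1, 2.0, 2.9, 3.0, 3.1, 3.0]
    params = fit_qtl_model(genotypes, covariate)
    for e in (1.0, 2.0, 3.0):
        probs = model_based_genotype_prob(e, params)
        print(e, {g: round(p, 3) for g, p in probs.items()})
```

Plotting these model-based probabilities next to kernel-smoothed genotype frequencies from the same data gives the kind of observed versus theoretical comparison described above.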
Continuous covariate: IgM
QTDT-Monks Kaplan results
Range of p-values over 100 replicates
1.00 × 10^-42 to 9.00 × 10^-29
5.00 × 10^-9 to 0.4048
4.00 × 10^-14 to 6.00 × 10^-4
2.00 × 10^-13 to 7.00 × 10^-4
1.00 × 10^-3 to 0.99
1.00 × 10^-9 to 0.3366
Continuous covariate: anti-CCP
Discrete covariate: severity
It is a challenge to identify the role of a continuous covariate in complex human diseases. We developed SIMLAPLOT as a visualization tool to explore different models by which continuous covariates may influence disease risk and to estimate parameters of interest. Our applications of SIMLAPLOT suggest that SNPs in strong LD with QTLs may be apparent when observed and expected (theoretical) plots of conditional genotype distributions across covariate values are compared. SIMLAPLOT may also help differentiate QTL models from interaction and heterogeneity models involving continuous covariates by comparing plots for affected and unaffected individuals.
We gratefully acknowledge support for this research from NIH (NEI R03 EY015216, NIMH R01 MH595228, NIA R01 AG20135) and the Neurosciences Education and Research Foundation.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
- Schmidt M, Hauser E, Martin E, Schmidt S: Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction. Stat Appl Genet Mol Biol. 2005, 4: Article15.
- Scott DW: Multivariate Density Estimation. Theory, Practice, and Visualization. 1992, New York: John Wiley and Sons.
- Monks SA, Kaplan NL: Removing the sampling restrictions from family-based tests of association for a quantitative-trait locus. Am J Hum Genet. 2000, 66: 576-592. 10.1086/302745.
- Abecasis G, Cardon L, Cookson W: A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000, 66: 279-292. 10.1086/302698.
- Van Hulle MM, Gautama T: Optimal smoothing of Kernel-based topographic maps with application to density-based clustering of shapes. J VLSI Signal Process. 2004, 37: 211-222. 10.1023/B:VLSI.0000027486.56120.e7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.