- Proceedings
- Open access
- Published:

# Evaluating aggregate effects of rare and common variants in the 1000 Genomes Project exon sequencing data using latent variable structural equation modeling

*BMC Proceedings*
**volume 5**, Article number: S47 (2011)

## Abstract

Methods that can evaluate aggregate effects of rare and common variants are limited. Therefore, we applied a two-stage approach to evaluate aggregate gene effects in the 1000 Genomes Project data, which contain 24,487 single-nucleotide polymorphisms (SNPs) in 697 unrelated individuals from 7 populations. In stage 1, we identified potentially interesting genes (PIGs) as those having at least one SNP meeting Bonferroni correction using univariate, multiple regression models. In stage 2, we evaluate aggregate PIG effects on trait, Q1, by modeling each gene as a latent construct, which is defined by multiple common and rare variants, using the multivariate statistical framework of structural equation modeling (SEM). In stage 1, we found that PIGs varied markedly between a randomly selected replicate (replicate 137) and 100 other replicates, with the exception of *FLT1*. In stage 1, collapsing rare variants decreased false positives but increased false negatives. In stage 2, we developed a good-fitting SEM model that included all nine genes simulated to affect Q1 (FLT1, KDR, ARNT, ELAV4, FLT4, HIF1A, HIF3A, VEGFA, VEGFC) and found that *FLT1* had the largest effect on Q1 (*β*_{std} = 0.33 ± 0.05). Using replicate 137 estimates as population values, we found that the mean relative bias in the parameters (loadings, paths, residuals) and their standard errors across 100 replicates was on average, less than 5%. Our latent variable SEM approach provides a viable framework for modeling aggregate effects of rare and common variants in multiple genes, but more elegant methods are needed in stage 1 to minimize type I and type II error.

## Background

The 1000 Genomes Project is an international public-private consortium aiming to build the most detailed map of human genetic variation with the overarching goal to improve our understanding of the genetic contribution to common human diseases. Initially launched in 2008, three pilot studies have been completed to test multiple sequencing methods. Pilot Project 3 involved sequencing the coding regions (exons) of 3,205 genes in 697 individuals from 7 populations, which revealed 24,487 rare and common genetic variants. The sequencing data from Pilot Project 3 were used for Genetic Analysis Workshop 17 (GAW17), and details of this data set, including how the phenotypes were simulated, can be found in Almasy et al. [1].

Although strategies have been developed to evaluate the contribution of rare variants to disease susceptibility in nonfamilial data, including collapsing methods, which are reviewed by Dering et al. [2], approaches that evaluate the combined or aggregate effects of rare and common variants together are limited. Thus in this paper we aim to evaluate the aggregate effects of rare and common single-nucleotide polymorphisms (SNPs) in genes on the simulated quantitative trait Q1 using the Pilot Project 3 data (unrelated subjects). In stage 1 we use multiple regression methods (with and without collapsing rare variants) to identify potentially interesting genes (PIGs); in stage 2, we use a latent variable structural equation modeling (SEM) approach to evaluate aggregate effects of rare and common variants in PIGs on Q1. During our initial analyses, we were blinded to the “answers” of the simulated model. In *post hoc* analyses, we used knowledge that 39 SNPs in 9 genes, primarily in the vascular endothelial growth factor (VEGF) pathway, were simulated to be associated with Q1.

## Methods

### Data cleaning and preparation: phenotype and genotype variables

We first examined the distribution of Q1, which we arbitrarily chose from the three simulated quantitative phenotypes available (see Almasy et al. [1]), in a randomly chosen replicate (replicate 137) of the unrelated individuals from the GAW17 data using SAS, v. 9.1 (SAS Institute Inc., Cary, North Carolina). Visual inspection of histograms and quantile-quantile (Q-Q) plots and Shapiro-Wilk and Kolmogorov-Smirnov tests indicated that Q1 was essentially normally distributed. Summary statistics and Mendelian inheritance errors were evaluated using PLINK, v. 1.07 [3].

### Stage 1: statistical methods for regression-based analyses

We evaluated the association between each SNP as an additive model (0, 1, or 2 copies of the minor allele) and Q1 using linear regression models adjusted for all the covariates provided in the GAW17 data set (Age, Sex, Smoking, population [Pop1]) using PLINK, v. 1.07. In addition, we collapsed rare variants (minor allele frequency [MAF] < 0.05) in each gene using the indicator coding method [2], which assumes equal weighting of each rare variant. We also adjusted the models for population substructure using principal components (PCs). PCs were generated using the centralized scoring matrix method of Qin et al. [4, 5] in MATLAB (MathWorks, Boston, Massachusetts). We adjusted models for multiple PCs and found that adjusting for 10 or 12 PCs minimized the number of false positives (see Results section, Table three, and additional details in Qin et al. [5]).

### Stage 2: statistical methods for latent variable structural equation modeling

Our approach for modeling multiple common variants in genes using latent constructs has been described previously [6]. Essentially, we let a latent variable (ovals in Figure 1, e.g., *FLT1*) represent the overall variation in a gene, which we formally describe by multiple SNPs (rectangles in Figure 1, e.g., C13S522) in that gene. In terms of notation, briefly, in latent variable structural equation modeling (SEM), two general submodels are used: (1) a measurement model that develops the relations (loadings; e.g., the arrow from *FLT1* to C13S522 in Figure 1) between the observed variables and the latent constructs; and (2) a structural model that develops the relations (path coefficients; e.g., the arrow from PopStr to *FLT1* in Figure 1) between the latent variables. The general form of the measurement model is:

**y** = **Λ**_{
y
}**η** + **ε**, (1)

where **y** is the *p* × 1 vector of observed variables, **η** is the *m* × 1 vector of latent random variables, **ε** is the *p* × 1 vector of measurement errors for **y**, and **Λ**_{
y
} is the *p* × *n* matrix of coefficients relating **y** to **η**.

The general form of the structural model imposes constraints such that:

**η** = **Βη** + **ζ**, (2)

where **Β** is the *m* × *m* matrix of path coefficients and **ζ** is the *m* × 1 vector of errors or disturbances in the endogenous (dependent) latent variables.

The structural model can be modified by adding a *q*-dimensional vector of covariates (**x**), an *m* × *q* matrix of regression coefficients (**Γ**), and an *m*-dimensional vector of intercepts (**α**):

**η** = **α** + **Βη** + **Γx** + **ζ**. (3)

Similar to our prior work using common variants to describe overall variation in a gene [6, 7], we used eigenvalues, scree plots, reliability, linkage disequilibrium plots (Haploview, v. 4.2), and association results from stage 1 to help select the most informative SNPs and define parsimonious latent gene constructs. We performed confirmatory factor analysis using a robust maximum-likelihood estimator, which provides test statistics and standard errors robust to nonnormality, using Mplus, v. 5.1 (Muthén and Muthén, Los Angeles, California), to generate and test single latent gene construct models, which included adjustment for covariates (Age, Sex, Smoking) and population structure (modeled using a latent variable defined by Pop1 and the top 12 PCs). To assess the overall model goodness-of-fit, we used the chi-square test, the comparative fit index (CFI), the root mean-square error of approximation (RMSEA), and the standardized root mean-square residual (SRMR) [8].

The chi-square test evaluates whether the covariance matrix is equal to the model-implied covariance matrix predicted by the parameters, but it is sensitive to sample size and complexity. Thus, other fit indexes, including the CFI, RMSEA, and SRMR, have been used to evaluate model fit [8]. The CFI is relatively insensitive to sample size and model complexity, and CFI ≥ 0.95 and CFI ≥ 0.90 suggest good and acceptable fit, respectively [9]. The RMSEA is less sensitive to sample size and favors more parsimonious models. An RMSEA ≤ 0.06 represents good fit, and an RMSEA ≤ 0.10 yields acceptable fit [9]. An SRMR ≤ 0.08 represents a good fit, and an SRMR < 0.10 represents an acceptable fit [8, 9]. We evaluated the performance of the SEM model by calculating the mean relative bias in the parameters and their standard errors across 100 replicates (replicates 99–136 and 138–200) available in the GAW17 data [1]. All *p*-values are from two-sided tests.

## Results

Without knowledge of the underlying simulated model and using a randomly selected replicate (replicate 137), we evaluated potential associations between each SNP and trait Q1 in stage 1. We found that several genes had a least one SNP meeting or exceeding the Bonferroni-corrected level with (*p* ≤ 8.33 × 10^{−6}) and without (*p* ≤ 2.04 × 10^{−6}) collapsing rare variants (MAF < 0.05) (Table 1), but the most significant associations were observed with common (C13S522, C13S523) and rare (C1S3524) variants in *FLT1* (Table 2).

In stage 2, when building the *FLT1* construct using replicate 137, we found that adding rare variants to the common variants improved the model fit (CFI = 0.90, RMSEA = 0.03, and SRMR = 0.08 in Figure 1A vs. CFI = 0.96, RMSEA = 0.02, and SRMR = 0.05 in Figure 1B), improved construct reliability (Cronbach’s *α*: 0.40 (A) vs. 0.53 (B)), and increased the variance explained in Q1 (*R*^{2}: 0.30 ± 0.04 (A) vs. 0.36 ± 0.04 (B)). In a larger SEM (Figure 2) with 6 genes (26 SNPs) and with population structure represented by a latent variable (PopStr), we found that the path coefficient of *FLT1* on Q1 (*β*_{std} = 0.49 ± 0.04) was slightly lower than that in the reduced model (Figure 1B: *β*_{std} = 0.43 ± 0.05), but *FLT1* remained the gene most strongly associated with Q1, followed by *SPHKAP*, *LRRN2*, *C5ORF25*, and *FADS3*. Genes *AKAP13* and *OR2T34* were not associated with Q1. Population structure was not significantly associated with Q1 or with genes where paths are not shown.

In post hoc analyses, we found that the list of PIGs varied markedly across replicates (99–136 and 138–200), with the exception of *FLT1*, which had at least one SNP in all 100 replicates exceeding the Bonferroni-corrected *p*-value in models adjusted for 10 or 12 PCs, and with and without rare variants collapsed. *KDR* was the next most consistent PIG, which was identified in 12 and 20 of the 100 replicates when rare variants were and were not collapsed, respectively; the results were the same when adjusting for 10 and 12 PCs.

We obtained the answers to the GAW17 simulation model to better understand the performance of our stage 1 approach and to develop a stage 2 model that would more closely reflect the simulated model. The answers revealed that 39 SNPs in 9 genes (*FLT1*, *FLT4*, *KDR*, *ARNT*, *ELAVL4*, *HIF1A*, *HIF3A*, *VEGFA*, *VEGFC*), primarily in the VEGF pathway, were simulated to be associated with Q1.

With regard to stage 1 performance, we found that the number of false-positive genes decreased with adjustment for increasing numbers of PCs. As shown in Table 3, the number of false-positive genes was lower when rare variants were collapsed (8 PCs: *μ* [mean] = 1.40, SD = 2.38; 10 PCs: *μ* = 1.25, SD = 2.35; 12 PCs: *μ* = 1.20, SD = 2.26) versus not collapsed (8 PCs: *μ* = 6.46, SD = 11.99; 10 PCs: *μ* = 5.69, SD = 11.38; 12 PCs: *μ* = 5.43, SD = 11.20). The number of true-positive and false-negative genes was similar for models adjusted for 10 and 12 PCs and when rare variants were and were not collapsed (Table 3). Relaxing multiple test criteria to *p* ≤ 1.56 × 10^{−6} (which reflects Bonferroni correction for the total number of genes) did not materially improve the number of true-positive genes when rare variants were collapsed (not shown). Although we were most interested in identifying causal genes in stage 1, we note that the number of false-positive SNPs over the 100 replicates decreased with adjustment for increasing numbers of PCs (Pop1: *μ* = 58.74, SD = 36.34; 8 PCs: *μ* = 6.68, SD = 12.38; 10 PCs: *μ* = 5.89, SD = 11.76; 12 PCs: *μ* = 5.61, SD = 11.57). The numbers of false-negative SNPs were similar when adjusting for 10 PCs (*μ* = 36.31, SD = 1.06) and 12 PCs (*μ* = 36.32, SD = 1.05) with most replicates correctly identifying *FLT1* SNPs C13S522 and C13S523 (not shown).

In regards to building the Stage 2 model, because the GAW17 answers provided only a list of the nine genes simulated to be associated with Q1, we used the pathway database of the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/pathway.html; VEGF Signaling, Cytokine-Cytokine Receptor Interaction, Pathways in Cancer) to better understand the biological relationships between the nine genes. We developed a good-fitting model (CFI = 0.90, RMSEA = 0.04, SRMR = 0.03) that included all nine genes simulated to affect Q1 (Figure 3). The variance explained in Q1 (*R*^{2} = 0.42) was greater than in prior models. *FLT1* remained the gene most strongly associated with Q1, followed by *ARNT*, *VEGFA*, *KDR*, *VEGFC*, *FLT4*, and *HIF3A*. Smoking was simulated to be associated with *KDR*, but we observed only a marginal association (*β*_{std} = 0.05 ± 0.02, *p* = 0.08; not shown) and found that Smoking was more highly associated with *HIF3A* and *ELAVL4*. Modeling all nine genes simultaneously revealed that *HIF1A* was associated with *VEGFC* but that *ELAVL4* and *HIF1A* were not associated with Q1. Removing paths designated by a dashed line (Figure 3) resulted in a slightly improved model fit (CFI = 0.91, RMSEA = 0.04, SRMR = 0.02), but the magnitude of the paths from genes to Q1 remained similar.

To evaluate the performance of the stage 2 SEM model, we used estimates from a randomly selected replicate (replicate 137) to represent population values (because the GAW17 answers did not contain standardized estimates for aggregate gene effects that we could directly compare to) and compared these population values to estimates from 100 replicates (replicates 99–136 and 138–200) available in the GAW17 data [1]. We found that the mean relative bias (MRB) in the parameters and in the standard errors across 100 replicates was 4.27% and 4.69%, respectively. The MRBs in standardized loadings and residuals were 1.47% for **Λ**, 0.09% for **ε**, and 0.57% for **ζ**; the MRBs in the standard errors (SEs) were 0.32% for **Λ**, 0.98% for **ε**, and 2.19% for **ζ**. The MRB was generally similar between common and rare variants. For example, in the *FLT1* common SNP C13S523 (MAF = 0.07) and the *FLT1* rare SNP C13S524 (MAF = 0.004), the MRB in **Λ** and the SE of **Λ** were 0.35% in C13S523, 0.62% in C13S524, and 0.33% in C13S523 vs. 0.98% in C13S524. The MRB of **ε** was 0.31% in C13S523 and 0.22% in C13S524, and the MRB of the SE of **ε** was 0.63% in C13S523 and 0.20% in C13S534. The largest bias was observed in the path coefficients (*β* = 19.9%; SE of *β* = 9.79%), which was quite severe in some cases, such as the *HIF1A* path coefficient, where the bias reached 67.94%. Interestingly, the post hoc analysis revealed that genes represented by private variants, such as *HIF1A*, that were associated with Q1 in replicate 137 were not significantly associated with Q1 in most replicates. Also, the effects of covariates (Age, Smoking, Sex, Pop1, PCs) varied markedly across replicates. Model fit, however, was generally consistent across replicates, with the average CFI, RMSEA, and RMSR being 0.91, 0.04, and 0.02, respectively.

## Discussion

In stage 1, adjusting for 10 or 12 PCs (see also Qin et al. [5]) and collapsing rare variants using the indicator coding method decreased the number of false-positive genes by about 78% (1.2 vs. 5.4), on average, but the number of false-negative genes remained high regardless of whether rare variants were collapsed or not (7.9 vs. 7.8). This is striking because we missed identifying about 87% of the simulated causal genes and correctly identified only one gene (11.1%; *FLT1*) over all 100 replicates (replicates 99–136 and 138–200). Our stage 2 SEM results were able to confirm the importance of *FLT1*, because irrespective of the other genes included in the model (i.e., the false-positive model in Figure 2 and the answer-driven model in Figure 3), the *FLT1* construct consistently had the strongest association with Q1 in replicate 137 and across all other replicates (replicates 99–136 and 138–200). We found that the MRB in the answer-driven stage 2 SEM model’s parameters and standard errors across 100 replicates was less than 5%. In addition, our stage 2 SEM model (Figure 3) revealed relationships between genes (e.g., *ARNT* and *FLT1*) and between covariates and genes (e.g., Smoking and *ELAVL4*) that were not discussed in the GAW17 answers. Thus, we believe that modeling all nine genes simultaneously together with the relevant environmental factors and population structure in a hierarchical manner that better reflects the underlying biology using latent variable SEM provides an improved understanding of each gene’s relevance in the disease pathophysiology compared to standard multiple regression methods.

## Conclusions

Our latent gene construct approach provides a viable framework for evaluating the aggregate effects of rare and common variants in multiple genes on a trait while adjusting for population substructure; however, more elegant methods are needed in stage 1 to minimize false positives and concomitantly improve identification of true-positive genes.

## References

Almasy L, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.

Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, X (suppl X): X-X.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.

Qin H, Morris N, Kang SJ, Li M, Tayo B, Lyon H, Hirschhorn J, Cooper RS, Zhu X: Genetic and population analysis: interrogating local population structure for fine mapping in genome-wide association studies. Bioinformatics. 2010, 26: 2961-2968. 10.1093/bioinformatics/btq560.

Qin H, Elston RC, Zhu X: Interrogating population structure and its impact on association tests. BMC Proc. 2011, 5 (suppl 9): S25-10.1186/1753-6561-5-S9-S25.

Nock NL, Larkin EM, Morris NJ, Li Y, Stein CM: Modeling the complex gene × environment interplay in the simulated rheumatoid arthritis GAW15 data using latent variable structural equation modeling. BMC Proc. 2007, 1 (suppl 1): S118-10.1186/1753-6561-1-s1-s118.

Nock NL, Wang X, Thompson CL, Song Y, Baechle D, Raska P, Stein CM, Gray-McGuire C: Defining genetic determinants of the metabolic syndrome in the Framingham Heart Study using association and structural equation modeling methods. BMC Proc. 2009, 3 (suppl 7): S50-10.1186/1753-6561-3-s7-s50.

Kline RB: Measurement models and confirmatory factor analysis. Principles and Practice of Structural Equation Modeling (by RB Kline). 2005, New York, Guilford Press, 133-145.

Hu LT, Bentler PM: Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equation Modeling. 1999, 6: 1-55. 10.1080/10705519909540118.

## Acknowledgments

The Genetic Analysis Workshops are supported by National Institutes of Health (NIH) grant R01 GM031575 from the National Institute of General Medical Sciences with additional support from NIH/National Cancer Institute (NCI) grant K07 CA129162.

This article has been published as part of *BMC Proceedings* Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

NLN conceived of the study, managed design and coordination, performed stage 2 analyses, and drafted the manuscript. LXZ performed data cleaning and conducted stage 1 analyses. Both authors read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Nock, N., Zhang, L. Evaluating aggregate effects of rare and common variants in the 1000 Genomes Project exon sequencing data using latent variable structural equation modeling.
*BMC Proc* **5**
(Suppl 9), S47 (2011). https://doi.org/10.1186/1753-6561-5-S9-S47

Published:

DOI: https://doi.org/10.1186/1753-6561-5-S9-S47