Evaluating aggregate effects of rare and common variants in the 1000 Genomes Project exon sequencing data using latent variable structural equation modeling

Methods that can evaluate aggregate effects of rare and common variants are limited. Therefore, we applied a two-stage approach to evaluate aggregate gene effects in the 1000 Genomes Project data, which contain 24,487 single-nucleotide polymorphisms (SNPs) in 697 unrelated individuals from 7 populations. In stage 1, we identified potentially interesting genes (PIGs) as those having at least one SNP meeting Bonferroni correction using univariate, multiple regression models. In stage 2, we evaluate aggregate PIG effects on trait, Q1, by modeling each gene as a latent construct, which is defined by multiple common and rare variants, using the multivariate statistical framework of structural equation modeling (SEM). In stage 1, we found that PIGs varied markedly between a randomly selected replicate (replicate 137) and 100 other replicates, with the exception of FLT1. In stage 1, collapsing rare variants decreased false positives but increased false negatives. In stage 2, we developed a good-fitting SEM model that included all nine genes simulated to affect Q1 (FLT1, KDR, ARNT, ELAV4, FLT4, HIF1A, HIF3A, VEGFA, VEGFC) and found that FLT1 had the largest effect on Q1 (βstd = 0.33 ± 0.05). Using replicate 137 estimates as population values, we found that the mean relative bias in the parameters (loadings, paths, residuals) and their standard errors across 100 replicates was on average, less than 5%. Our latent variable SEM approach provides a viable framework for modeling aggregate effects of rare and common variants in multiple genes, but more elegant methods are needed in stage 1 to minimize type I and type II error.


Background
The 1000 Genomes Project is an international publicprivate consortium aiming to build the most detailed map of human genetic variation with the overarching goal to improve our understanding of the genetic contribution to common human diseases. Initially launched in 2008, three pilot studies have been completed to test multiple sequencing methods. Pilot Project 3 involved sequencing the coding regions (exons) of 3,205 genes in 697 individuals from 7 populations, which revealed 24,487 rare and common genetic variants. The sequencing data from Pilot Project 3 were used for Genetic Analysis Workshop 17 (GAW17), and details of this data set, including how the phenotypes were simulated, can be found in Almasy et al. [1].
Although strategies have been developed to evaluate the contribution of rare variants to disease susceptibility in nonfamilial data, including collapsing methods, which are reviewed by Dering et al. [2], approaches that evaluate the combined or aggregate effects of rare and common variants together are limited. Thus in this paper we aim to evaluate the aggregate effects of rare and common single-nucleotide polymorphisms (SNPs) in genes on the simulated quantitative trait Q1 using the Pilot Project 3 data (unrelated subjects). In stage 1 we use multiple regression methods (with and without collapsing rare variants) to identify potentially interesting genes (PIGs); in stage 2, we use a latent variable structural equation modeling (SEM) approach to evaluate aggregate effects of rare and common variants in PIGs on Q1. During our initial analyses, we were blinded to the "answers" of the simulated model. In post hoc analyses, we used knowledge that 39 SNPs in 9 genes, primarily in the vascular endothelial growth factor (VEGF) pathway, were simulated to be associated with Q1.

Data cleaning and preparation: phenotype and genotype variables
We first examined the distribution of Q1, which we arbitrarily chose from the three simulated quantitative phenotypes available (see Almasy et al. [1]), in a randomly chosen replicate (replicate 137) of the unrelated individuals from the GAW17 data using SAS, v. 9.1 (SAS Institute Inc., Cary, North Carolina). Visual inspection of histograms and quantile-quantile (Q-Q) plots and Shapiro-Wilk and Kolmogorov-Smirnov tests indicated that Q1 was essentially normally distributed. Summary statistics and Mendelian inheritance errors were evaluated using PLINK, v. 1.07 [3].

Stage 1: statistical methods for regression-based analyses
We evaluated the association between each SNP as an additive model (0, 1, or 2 copies of the minor allele) and Q1 using linear regression models adjusted for all the covariates provided in the GAW17 data set (Age, Sex, Smoking, population [Pop1]) using PLINK, v. 1.07. In addition, we collapsed rare variants (minor allele frequency [MAF] < 0.05) in each gene using the indicator coding method [2], which assumes equal weighting of each rare variant. We also adjusted the models for population substructure using principal components (PCs). PCs were generated using the centralized scoring matrix method of Qin et al. [4,5] in MATLAB (MathWorks, Boston, Massachusetts). We adjusted models for multiple PCs and found that adjusting for 10 or 12 PCs minimized the number of false positives (see Results section, Table three, and additional details in Qin et al. [5]).

Stage 2: statistical methods for latent variable structural equation modeling
Our approach for modeling multiple common variants in genes using latent constructs has been described previously [6]. Essentially, we let a latent variable (ovals in Figure 1, e.g., FLT1) represent the overall variation in a gene, which we formally describe by multiple SNPs (rectangles in Figure 1, e.g., C13S522) in that gene. In terms of notation, briefly, in latent variable structural equation modeling (SEM), two general submodels are used: (1) a measurement model that develops the relations (loadings; e.g., the arrow from FLT1 to C13S522 in Figure 1) between the observed variables and the latent constructs; and (2) a structural model that develops the relations (path coefficients; e.g., the arrow from PopStr to FLT1 in Figure 1) between the latent variables. The general form of the measurement model is: where y is the p × 1 vector of observed variables, h is the m × 1 vector of latent random variables, ε is the p × 1 vector of measurement errors for y, and Λ y is the p × n matrix of coefficients relating y to h.
The general form of the structural model imposes constraints such that: where Β is the m × m matrix of path coefficients and ζ is the m × 1 vector of errors or disturbances in the endogenous (dependent) latent variables.
The structural model can be modified by adding a qdimensional vector of covariates (x), an m × q matrix of regression coefficients (Γ), and an m-dimensional vector of intercepts (a): Similar to our prior work using common variants to describe overall variation in a gene [6,7], we used eigenvalues, scree plots, reliability, linkage disequilibrium plots (Haploview, v. 4.2), and association results from stage 1 to help select the most informative SNPs and define parsimonious latent gene constructs. We performed confirmatory factor analysis using a robust maximum-likelihood estimator, which provides test statistics and standard errors robust to nonnormality, using Mplus, v. 5.1 (Muthén and Muthén, Los Angeles, California), to generate and test single latent gene construct models, which included adjustment for covariates (Age, Sex, Smoking) and population structure (modeled using a latent variable defined by Pop1 and the top 12 PCs).
To assess the overall model goodness-of-fit, we used the chi-square test, the comparative fit index (CFI), the root mean-square error of approximation (RMSEA), and the standardized root mean-square residual (SRMR) [8].
The chi-square test evaluates whether the covariance matrix is equal to the model-implied covariance matrix predicted by the parameters, but it is sensitive to sample size and complexity. Thus, other fit indexes, including the CFI, RMSEA, and SRMR, have been used to evaluate model fit [8]. The CFI is relatively insensitive to sample size and model complexity, and CFI ≥ 0.95 and CFI ≥ 0.90 suggest good and acceptable fit, respectively [9]. The RMSEA is less sensitive to sample size and favors more parsimonious models. An RMSEA ≤ 0.06 represents good fit, and an RMSEA ≤ 0.10 yields acceptable fit [9]. An SRMR ≤ 0.08 represents a good fit, and an SRMR < 0.10 represents an acceptable fit [8,9]. We evaluated the performance of the SEM model by calculating the mean relative bias in the parameters and their standard errors across 100 replicates (replicates 99-136 and 138-200) available in the GAW17 data [1]. All p-values are from two-sided tests.

Results
Without knowledge of the underlying simulated model and using a randomly selected replicate (replicate 137), we evaluated potential associations between each SNP and trait Q1 in stage 1. We found that several genes had a least one SNP meeting or exceeding the Bonferroni-corrected level with (p ≤ 8.33 × 10 −6 ) and without (p ≤ 2.04 × 10 −6 ) collapsing rare variants (MAF < 0.05) ( Table 1), but the most significant associations were observed with common (C13S522, C13S523) and rare (C1S3524) variants in FLT1 (Table 2).
In stage 2, when building the FLT1 construct using replicate 137, we found that adding rare variants to the common variants improved the model fit (CFI = 0.90, RMSEA = 0.03, and SRMR = 0.08 in Figure 1A vs. CFI = 0.96, RMSEA = 0.02, and SRMR = 0.05 in Figure 1B), improved construct reliability (Cronbach's a: 0.40 (A) vs. 0.53 (B)), and increased the variance explained in Q1 (R 2 : 0.30 ± 0.04 (A) vs. 0.36 ± 0.04 (B)). In a larger SEM ( Figure 2) with 6 genes (26 SNPs) and with population structure represented by a latent variable (PopStr), we found that the path coefficient of FLT1 on Q1 (b std = 0.49 ± 0.04) was slightly lower than that in the reduced model ( Figure 1B: b std = 0.43 ± 0.05), but FLT1 remained the gene most strongly associated with Q1,    followed by SPHKAP, LRRN2, C5ORF25, and FADS3. Genes AKAP13 and OR2T34 were not associated with Q1. Population structure was not significantly associated with Q1 or with genes where paths are not shown. In post hoc analyses, we found that the list of PIGs varied markedly across replicates (99-136 and 138-200), with the exception of FLT1, which had at least one SNP in all 100 replicates exceeding the Bonferroni-corrected p-value in models adjusted for 10 or 12 PCs, and with and without rare variants collapsed. KDR was the next most consistent PIG, which was identified in 12 and 20 of the 100 replicates when rare variants were and were not collapsed, respectively; the results were the same when adjusting for 10 and 12 PCs.
We obtained the answers to the GAW17 simulation model to better understand the performance of our stage 1 approach and to develop a stage 2 model that would more closely reflect the simulated model. The answers revealed that 39 SNPs in 9 genes (FLT1, FLT4, KDR, ARNT, ELAVL4, HIF1A, HIF3A, VEGFA, VEGFC), primarily in the VEGF pathway, were simulated to be associated with Q1.
In regards to building the Stage 2 model, because the GAW17 answers provided only a list of the nine genes simulated to be associated with Q1, we used the pathway database of the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/pathway. html; VEGF Signaling, Cytokine-Cytokine Receptor Interaction, Pathways in Cancer) to better understand the biological relationships between the nine genes. We developed a good-fitting model (CFI = 0.90, RMSEA = 0.04, SRMR = 0.03) that included all nine genes simulated to affect Q1 (Figure 3). The variance explained in Q1 (R 2 = 0.42) was greater than in prior models. FLT1 remained the gene most strongly associated with Q1, followed by ARNT, VEGFA, KDR, VEGFC, FLT4, and HIF3A. Smoking was simulated to be associated with KDR, but we observed only a marginal association (b std = 0.05 ± 0.02, p = 0.08; not shown) and found that Smoking was more highly associated with HIF3A and ELAVL4. Modeling all nine genes simultaneously revealed that HIF1A was associated with VEGFC but that ELAVL4 and HIF1A were not associated with Q1. Removing paths designated by a dashed line (Figure 3) resulted in a slightly improved model fit (CFI = 0.91, RMSEA = 0.04, SRMR = 0.02), but the magnitude of the paths from genes to Q1 remained similar.
To evaluate the performance of the stage 2 SEM model, we used estimates from a randomly selected replicate (replicate 137) to represent population values (because the GAW17 answers did not contain standardized estimates for aggregate gene effects that we could directly compare to) and compared these population values to estimates from 100 replicates (replicates 99-136 and 138-200) available in the GAW17 data [1]. We found that the mean relative bias (MRB) in the parameters and in the standard errors across 100 replicates was 4.27% and 4.69%, respectively. The MRBs in standardized loadings and residuals were 1.47% for Λ, 0.09% for ε, and 0.57% for ζ; the MRBs in the standard errors (SEs) were 0.32% for Λ, 0.98% for ε, and 2.19% for ζ. The MRB was generally similar between common and rare variants. For example, in the FLT1 common SNP C13S523 (MAF = 0.07) and the FLT1 rare SNP C13S524 (MAF = 0.004), the MRB in Λ and the SE of Λ were 0.35% in C13S523, 0.62% in C13S524, and 0.33% in C13S523 vs. 0.98% in C13S524. The MRB of ε was 0.31% in C13S523 and 0.22% in C13S524, and the MRB of the SE of ε was 0.63% in C13S523 and 0.20% in C13S534. The largest bias was observed in the path coefficients (b = 19.9%; SE of b = 9.79%), which was quite severe in some cases, such as the HIF1A path coefficient, where the bias reached 67.94%. Interestingly, the post hoc analysis revealed that genes represented by private variants, such as HIF1A, that were associated with Q1 in replicate 137 were not significantly associated with Q1 in most replicates. Also, the effects of covariates (Age, Smoking, Sex, Pop1, PCs) varied markedly across replicates. Model fit, however, was generally consistent across replicates, with the average CFI, RMSEA, and RMSR being 0.91, 0.04, and 0.02, respectively.

Discussion
In stage 1, adjusting for 10 or 12 PCs (see also Qin et al. [5]) and collapsing rare variants using the indicator coding method decreased the number of false-positive genes by about 78% (1.2 vs. 5.4), on average, but the number of false-negative genes remained high regardless of whether rare variants were collapsed or not (7.9 vs. 7.8). This is striking because we missed identifying about 87% of the simulated causal genes and correctly identified only one gene (11.1%; FLT1) over all 100 replicates (replicates 99-136 and 138-200). Our stage 2 SEM results were able to confirm the importance of FLT1, because irrespective of the other genes included in the model (i.e., the false-positive model in Figure 2 and the answer-driven model in Figure 3), the FLT1 construct consistently had the strongest association with Q1 in replicate 137 and across all other replicates (replicates 99-136 and 138-200). We found that the MRB in the answer-driven stage 2 SEM model's parameters and standard errors across 100 replicates was less than 5%. In addition, our stage 2 SEM model ( Figure 3) revealed relationships between genes (e.g., ARNT and FLT1) and between covariates and genes (e.g., Smoking and ELAVL4) that were not discussed in the GAW17 answers. Thus, we believe that modeling all nine genes simultaneously together with the relevant environmental factors and population structure in a hierarchical manner that better reflects the underlying biology using latent variable SEM provides an improved understanding of each gene's relevance in the disease pathophysiology compared to standard multiple regression methods.

Conclusions
Our latent gene construct approach provides a viable framework for evaluating the aggregate effects of rare and common variants in multiple genes on a trait while adjusting for population substructure; however, more elegant methods are needed in stage 1 to minimize false positives and concomitantly improve identification of true-positive genes. Figure 3 Modeling the aggregate effects of common and rare variants in multiple genes (with knowledge of the answers) using latent variable structural equation modeling. Model of the associations between 9 genes (19 SNPs) simulated to affect Q1 (Q1 R 2 = 0.42, CFI = 0.90, RMSEA = 0.04, SRMR = 0.03). * p < 0.10; ** p ≤ 0.05; *** p ≤ 0.01. Residuals and paths from population structure not shown for clarity.