Application of structural equation models to construct genetic networks using differentially expressed genes and single-nucleotide polymorphisms

Understanding the genetic basis of human variation is an important goal of biomedical research. In this study, we used structural equation models (SEMs) to construct genetic networks to model how specific single-nucleotide polymorphisms (SNPs) from two genes known to cause acute myeloid leukemia (AML) by somatic mutation, runt-related transcription factor 1 (RUNX1) and ets variant gene 6 (ETV6), affect expression levels of other genes and how RUNX1 and ETV6 are related to each other. The SEM approach allows us to compare several candidate models from which an explanatory genetic network can be constructed.


Background
To understand the genetic basis of complex traits, it is often useful to examine intermediate gene expression traits. It is known that many diseases occur because of changes in either patterns or levels of gene expression [1,2]. There are also known cases in which differences in gene expression due to genotype are associated with phenotypic variation [3][4][5][6][7]. Thus, the analysis of gene expression may lead to a better understanding of the genetic basis of complex traits [3]. The study of gene expression variation, both within and between species, is currently an active area of research [8][9][10].
We propose an approach based on structural equation models (SEMs) [11] to construct a genetic network that can provide information about the genetic basis of expression variation in humans. SEMs were originally developed in the early 1970s in the field of social science to fit models with unobserved variables. A key feature of the SEM approach is that it allows one to compare candidate models.
from Genetic Analysis Workshop 15 St. Pete Beach, Florida, USA. 11-15 November 2006 In this study, we used gene expression data and SNP data provided by the Genetic Analysis Workshop 15 (GAW15). The study subjects were selected from 14, three-generation Utah families with ancestry from northern and western Europe (Utah Centre d'Etude du Polymorphisme Humain, or CEPH). Each family consisted of 14 individuals (4 grandparents, 2 parents, and 8 offspring), except for two families that had only 7 offspring (so only 13 individuals). In these families, expression levels of 8500 genes in lymphoblastoid cells were initially measured. Morley et al. [12] found that 3554 genes out of these 8500 genes showed greater variations among individuals than between replicate determinations in the same individual. Expression levels of these 3554 genes were provided to GAW15. The GAW15 genotype data contained 2882 autosomal and Xlinked single-nucleotide polymorphisms (SNPs) for each member of the 14 CEPH Utah families. The genotypes were generated by the SNP Consortium http://snp.cshl.org/.
We are interested in acute myeloid leukemia (AML). In this study, we focused on two known AML-related genes, runtrelated transcription factor 1 (RUNX1) and ets variant gene 6 (ETV6). Although expression levels for these two genes were not included in the GAW15 data set, we used two-way analysis of variance (ANOVA) models to find other genes that were differentially expressed depending on the genotype for SNPs within RUNX1 and ETV6. Using genes selected from the ANOVA, we fitted SEMs and constructed genetic networks.

Selection of differentially expressed genes from SNPs
First, we used dbSNP (Build 126) and the Online Mendelian Inheritance in Man (OMIM) FTP site of the National Center for Biotechnology Information (NCBI) to select SNPs within RUNX1 and ETV6. Next, we used the following model to fit two-way ANOVA models to find differentially expressed genes for each selected SNP: where y ijk is the expression level of the gene in the k th individual with SNP genotype j from family i. FAMID and SNP represent the family and SNP effects, respectively, and ε ijk is a random error term. A gene was selected to be part of the SEM analysis if the fit of the full model was significantly better (p < 0.05) than the model that included only overall mean and family.

Structural equation models for genetic networks
SEMs are comprehensive statistical models that allow us to test relations among observed and latent variables; in this context, we used them to construct a network of differentially expressed genes for specific polymorphisms. Latent variables are unobserved variables that are implied by the covariance among two or more indicators. The expression levels for RUNX1 and ETV6 were not included in the data provided to GAW15. Because of this, and the way we selected our observed variables, we believe it is likely that the latent variables represent the level of expression of these two genes.
Path diagrams were used to describe genetic networks in a graphical manner (Figure 1). There are two types of variables used in path diagrams: observed variables (represented by rectangles) and latent variables (represented by ovals). The relationships defined in SEMs consist of two parts: relationships among the latent variables and relationships between the latent variables and the observed variables. The SEMs for our genetic networks were defined as follows: where y is a vector representing the observed variables (gene expression levels); η is a vector of the latent variables (RUNX1 and ETV6 expression); Λ y is a matrix representing the true relationships between the gene expression levels and the gene functions; and B is a matrix representing the true the relationships among the latent variables. Random errors in the equations are represented by ε and ς. To fit this model, we used AMOS, a SEM software solution provided by SPSS http://www.spss.com/amos/.

SNP selection
From the 2882 SNPs provided in the GAW15 data set, we identified six SNPs located on RUNX1 and ETV6 on the basis of chromosomal location. The RUNX1 SNPs are rs882776, rs933131, and rs1892687 and the ETV6 SNPs are rs1894307, rs1573612, and rs916041.
Although we selected SNPs based solely on physical position, we expect that RUNX1 and ETV6 are co-regulated and are involved in the regulation of additional genes. We constructed the SEMs after identifying genes apparently regulated by these six SNPs.
First, we applied confirmatory factor analysis [11] to validate the use of these 12 differentially expressed genes in our SEM. Then, we added the relation between the latent variables and fitted the general SEMs.
To construct the genetic network, we treated RUNX1 and ETV6 as latent variables and all genes with expression values as observed variables. We considered SEMs where both RUNX1 and ETV6 could control any of the 12 expressed genes.
We fitted various SEMs to the data, as shown in Figure 1. Model 1 assumes that there is a correlation between the two latent variables and Models 2 and 3 assume that there is a one-sided causal relationship between them. Model 4 is a second-order factor model in which the latent variables directly influencing the observed variables may be influenced by other latent variables that need not have direct effects on the observed variables [11]. Finally, Model 5 assumes that there are mutual causal relationships between the two latent variables. Table 1 shows goodness-of-fit measures for the five models. The goodness-of-fit index (GFI) measures the relative differences between data and estimated values obtained from a model, while the adjusted GFI (AGFI) adjusts the GFI according to the degrees of freedom. If these two measures are close to 1, we conclude that the model fits the data well. Smaller root mean square residuals (RMRs) indicate betterfitting models. The Akaike information criterion (AIC) is a well-known measure that can be used for the model comparison. A smaller AIC indicates a better-fitting model.

Diagram of SEMs for constructing genetic structures
For Models 1-5, we tested, using the modification index (MI), many modified SEMs with different covariance structures for the error terms. The MI measures how much the chi-square statistic is expected to decrease if a particular constrained parameter is set free and the model is reestimated [11]. In our model fitting, we first evaluated all pair-wise error connections and chose the connection with the largest MI. Given the first connection, we then chose the best remaining possible connection, and so forth. We continued until the improvement in fit was small. Figure 1 illustrates the models that best fit the covariance structure for the error terms. Table 1 summarizes goodness of fit measures for these models. Model 1 provided the best fit, yielding the largest GFI and AGFI and the smallest AIC. For Models 2, 3, and 5, the estimates of the error variances were negative, indicating that these models represent the so-called "Heywood cases" [11]. Model 4 provided the best result in terms of the RMR but the worst result in terms of other goodness-of-fit measures such as the AIC. Because the AIC and RMR are different measures of goodness-of-fit, they might provide inconsistent results. One advantage of the AIC is that it considers the number of parameters in the model, while the RMR does not. Thus, we selected Model 1 as the best model in terms of the AIC.
The fitted results of Model 1 are summarized in Table 2; the results show that the latent variables we believe represent RUNX1 and ETV6 expression are closely related to each other. The parameter relating these latent variables (Table  2) estimates the covariance between RUNX1 and ETV6.
Of the 12 genes with measured expression, all three that were differentially expressed for ETV6 showed significant connection to the latent variable representing ETV6. Among the nine differentially expressed genes for RUNX1, four -TMED2, IFIT5, AASDHPPT, and TRAM1 -did not show significant control by the latent variable RUNX1. When we excluded these nonsignificant genes, the goodness-of-fit measures improved compared to the full Model 1 (Table 1) while the parameter estimates and significance ( Table 2) remained similar to those of the full model.

Conclusion
In this paper, we proposed the use of SEMs to construct genetic network models. We demonstrated this method using two genes, RUNX1 and ETV6, which are well known to cause AML by somatic mutation [13]. We used SNPs in these genes to select the differentially expressed genes related to these SNPs to use as the observed variables for our SEMs. The best-fitting model indicated that, even Box plot of selected gene expressions Figure 2 Box plot of selected gene expressions.
though we did not have direct measurements of RUNX1 and ETV6 expression, they are likely to be highly correlated with each other.
In summary, SEMs allow us to compare candidate models and to construct a genetic network from the best fitting model. However, the SEM approach has several limitations, such as 1) as the number of genes increases, the number of parameters in the model can increase exponentially; 2) if there are many genes, it is difficult to determine the causality relationship among the genes; 3) it can be difficult to interpret the relationship embedded in SEMs. For these reasons, in order to construct the genetic network efficiently, it is very useful to find a few genes that are correlated with other genes in the proposed network. Another limitation of the SEM approach is that it requires the selection of an initial model that is updated to construct the final SEM. As described above, this process can require many iterations to determine an appropriate error structure. In spite of these limitations, we believe that our application of SEMs to the GAW15 data demonstrates that the SEM approach can be a useful and efficient method for constructing informative genetic networks.