Proceedings | Open | Published:
Gene coexpression network analysis for family studies based on a meta-analytic approach
BMC Proceedingsvolume 10, Article number: 35 (2016)
For a better understanding of the biological mechanisms involved in complex traits or diseases, networks are often useful tools in genetic studies: coexpression networks based on pairwise correlations between genes are commonly used. In case of a family-based design, it can be problematic when there is a large between-family variation in expression levels. We propose here a gene coexpression network analysis for family studies. We build a coexpression network for each family and then combine the results. We applied our approach to data provided for analysis in the Genetic Analysis Workshop 19 and compared it to 2 naïve approaches—ignoring correlations among the expressions and decorrelating the gene expression by using the residuals of a mixed model—and a single-probe analysis. Our approach seemed to better deal with heterogeneity with regard to the naïve approaches. The naïve approaches did not provide any significant results, while our approach detected genes via indirect effects. It also detected more genes than the single-probe analysis.
Weighted gene coexpression network is a widely used method for studying biological networks based on pairwise correlations. This method provides more insight in the underlying biological mechanisms and offers a tool for dimension reduction by summarizing identified modules (clusters) of genes [1, 2]. How to perform such an analysis for family data is an open question. For family data Kraft et al.  noted that testing association between expression levels and traits without taking into account the family structure can lead to spurious results, especially when the number of families is small and in the presence of large between-family variation. In this paper, we propose a novel strategy for network analyses in a small set of relatively large families.
For this family-based approach, we first construct family-specific coexpression networks and test for association between the modules and the traits of interest. A common set of genes for all families was obtained by using the intersection and the union of family-specific modules. We compare this family-based approach with 2 naïve approaches: namely, one using the gene expression of the families directly (ignoring correlation) and one that first decorrelates the gene expressions and then applies the standard approach. We also compare our results with single-probe analyses.
The gene expression data set is composed of 647 individuals from 17 large families. These samples are from the data set described in Almasy et al. . Here, we focus on the largest 5 families: namely families 2, 5, 6, 8, and 10 with 65, 55, 45, 62, and 49 family members, respectively. The total number of individuals is 276 and the total number of probes from which gene expression is available is 20,364. We used the simulated quantitative phenotypes systolic blood pressure (SBP) and the phenotype Q1 at time point 1 as outcome variables. The simulation model of SBP comprises 15 genes and that of Q1 does not contain any of these genes. SBP, Q1, and all probes were corrected for age and sex by regressing out covariates and using residuals.
To decorrelate the gene expressions, we fitted for each probe a linear mixed model: X ij = μ + u ij + v i + ε ij , with X ij the value of the probe for the individual j in family i, u ij a normally distributed random genetic effect: u ij ~ N(0, S) where S = 2 * K * s g with K kinship matrix and s g genetic variance, v i a normally distributed random effect representing shared environmental effects, and ε ij a normally distributed residual. To obtain the residuals X ij * of this model, we used the function lmekin, which fits linear mixed models with specific structure of the variance-covariance matrix from the package coxme  in R.
For the single-probe analysis the following mixed model was used:
with Y ij the value of SBP or Q1 and X ij the value of the probe for individual j of family i. The random effects u ij , v i and ε ij are the genetic effect, the shared environmental effect and residuals respectively. The parameter β represents the effect of the probe on the outcome variable.
Coexpression networks were built on the data set without correction for family structure based on X ij (naïve approach), the data set adjusted for family structure based on X ij * (naïve decorrelated approach), and on the data sets from the 5 families separately.
We used signed coexpression networks. The adjacency matrix A = [a lk ] of each network was computed as follows: a lk = |0.5 + 0.5 × cor(x l , x k )|γ, with cor(x l , x k ) the correlation between x l the values vector of probe l and x k the values vector of probe k. The parameter γ is acting as a soft threshold in the adjacency matrix, when we increase the value γ the coefficient of the adjacency matrix will tend toward zero except for values really close to 1. We used the biweight midcorrelation based on the median, which is more robust than the Pearson correlation. The co-expression networks were constructed with the R package WGCNA (weighted gene correlation network analysis) . For each obtained module, the first principal component (eigengene) was computed.
From all modules and all families, the following models were fitted:
where Y j is the outcome, u j the random genetic effect, and eigengenej k the value of the eigengene of module k of family member j. Let E F2 M to E F10 M be the most significant eigenvalues of the family specific networks (NF2 to NF10) and let E F M be the most significant eigenvalue of these 5 eigenvalues and M F M be the corresponding module. Identify the modules of the family-specific networks, which have the highest overlap with M F M (denoted as M F2 O to M F10 O ). Next, 2 common sets of genes for all families were obtained by taking the intersection (MF = M F2 O ∩ M F5 O ∩ M F6 O ∩ M F8 O ∩ M F10 O ) and the union (MF = M F2 O ∪ M F5 O ∪ M F6 O ∪ M F8 O ∪ M F10 O ) of the family specific modules. The first principal components of the 2 common sets were computed. The principal component that explained most of the variance of the corresponding set of genes was used as the eigengene EF of the family-based approach.
The eigengenes of the naïve approach (EN), the naïve approach after decorrelation (END), and the family-based approach (EF) are tested for association with the 2 phenotypes SBP and Q1. Here, the following mixed model was used:
with Y ij the phenotype value for individual j of family i and eigengene ij k the value of eigengene of module k of individual j of family i. And u ij , v i and ε ij are the genetic effect, the shared environmental effect and residuals, respectively. The parameter β represents the effect of the eigengene k on the outcome variable.
Finally, because spurious associations are especially expected in the presence of large between-family heterogeneity , we also performed a network analysis using the subset of 25 % most heritable probes when performing the network analysis (n = 4911 probes with heritability between 0.33 and 0.88).
To test for significance we used a nominal alpha level of 0.05 and the Bonferroni correction was applied to take into account multiple testing.
Results obtained with all probes
For per family analysis, the module that showed the highest correlation with the SBP was the magenta module obtained in family 8 (M F8 M ) (β = 2.52, p = 0.0011). M F8 M comprises 710 genes. Table 1 gives for each family the number of genes of the module with the highest overlap. The intersection and the union of these 5 family modules, comprises 62 and 1746 probes, respectively. The first principal component (eigengene) of the probes in the intersection set explained more than 50 % of the variance for each family, while for the union set the eigengene explained only between 23 % and 31 % of the variance of the expression levels. Therefore the eigengene of the intersection set was used as summary for the family approach (EF). In Table 2, for each family the effect of EFi on SBP (β of model (2)) is given. For families 2 and 8, the eigengenes (EF2 and EF8) were significantly associated with SBP.
When analyzing all families together, none of the approaches provided significant results. The joint analysis of the families using EF as eigengene in model (3) did not provide a significant association SBP (β = −0.13, p = 0.49). For the naïve approach, the eigengene of the module magenta (EN) had the smallest p value (β = −3.21, p = 0.01). For the naïve approach using the decorrelated data set, the eigengene of the module grey60 (END) had the smallest p value (β = −3.03, p = 0.0061). After multiple-testing correction (between 43 and 50 modules in each network) none of the results were significant. Finally the single-probe analysis preformed in the 5 families by using model (1) provided 1 significantly associated probe with SBP (CRIP2; β = −13.68, p = 1.7e-06).
The intersection module of the family based approach did not contain any of the 15 genes used for the simulation. Also, the identified gene of the single-probe analysis is not among these 15 genes. We hypothesized that correlation might exist between EF2, EF8, and the gene expression of CRIP2 on one hand and the set of 15 genes on the other hand. Indeed, EF2 showed significant correlation with PSMD5 (p = 0.004) and GTF2IRD1 (p = 0.007), and EF8 showed significant correlation with ZNF443 (p = 5e-05), PSMD5 (p = 3e-05), and ABTB1 (p = 6e-05). When the presence of these 15 genes in the modules was investigated, it appeared that they were in different modules (Table 3). The gene CRIP2, which was significant in the single-probe analysis, showed significant correlation with the gene KRTAP11-1 (p = 3.1e-03).
Analysis of Q1
The results of the analysis of Q1 are also given in Table 2. For the family approach, none of the modules obtained in family-specific network analysis was significantly associated with Q1 and no common set could be defined. In Table 2, the estimates of strongest associated modules E F M for each family are given. For the naïve approach, the module red (EN) (β = 2.75, p = 0.00056) was significant and for the naïve approach using the decorrelated data the module green (END) (β = −1.41, p = 0.00094) was significantly associated with Q1.
Results obtained with the 25 % most heritable probes
For the naïve and the family approaches, the results of the network-based analyses using only the gene expressions of the 25 % most heritable probes (n = 4911 probes with heritability between 0.33 and 0.88) are also given in Table 2. None of the 15 genes used in the simulation model for SBP was among this set of most heritable probes. For Family 6 the EF6 was significantly associated with SBP (p = 0.01). The association of EF in the 5 families with SBP was also significant (p = 0.02). For Q1, none of the approaches provided significant results. With regard to the single-probe analysis, no probe other than CRIP2 was significantly associated.
In this paper, we have proposed a novel strategy to perform a coexpression network analysis with family data: building a network for each of the large pedigrees and defining a common module by taking the intersection of family-specific modules. We compared our family-based approach with 2 naïve network approaches and a single-probe analysis. All analyses were performed in a small set of 5 relatively large families. None of the 15 genes in the simulation model was identified in this small data set. However, the family-based approach identified significant associations between the eigengene and SBP in 2 families. This eigengene was significantly correlated with 4 of the 15 genes. When analyzing all families jointly the family eigengene was not significant. Also the naïve network approaches did not provide any significant result. The single-probe analysis provided 1 significant gene, which was correlated with 1 of the 15 genes. To study the performance of the methods with regard to false-positive findings, we also analyzed the trait Q1 for which no gene expressions were included in the simulation model. The family approach did not provide a significant finding, whereas both naïve approaches identified a significant module for Q1. The result in the naïve approach based on gene expression (X ij ) is in line with the findings of Kraft et al. . We did not expect to have a false-positive finding when using the decorrelated data (X ij * ) as input for our network analysis. Possible explanations for this finding are the fact that the correlation based on the kinship coefficient might not be appropriate for gene expressions, and randomness. In addition to the set of all probes, networks were also built using only the 25 % most heritable probes. Especially for these variables that show large between-family variation spurious associations might occur when the family structure is not taken into account. This was not confirmed in our analysis. More research is needed to study the sensitivity of the methods for between-family variation.
We did not know the answers when we developed the family-based approach and analyzed the data. The simulation model used to create the data sets may not be well suited to pick up the 15 genes directly by network analysis. The 15 genes present in the model were in different pathways: they were not correlated. Moreover our approach was able to identify indirect effects: that is, the eigengenes were correlated with the 15 genes. Thus the significant association of the family-based network approach represented the largest number of genes from the simulation model. We expect that especially in the presence of large between-family variation our approach would perform best. A thorough simulation study is required to investigate the performance of our method further.
Network analysis provides a tool to reduce the number of tests by first summarizing the data in sets of genes with correlated gene expressions and summarizing the gene set by the first principle component. Another obvious reduction step is to only consider the heritable probes for the analysis. It appeared that by using the heritable probes the results across the families were less heterogeneous. The family approach, as well as the naïve approach using decorrelated data, provided significant results for SBP.
In this paper we combined the family-specific modules by taking the intersection of the modules which showed most overlap. This approach worked well for the relatively small set of 5 families. When we applied our method to the 6 largest families, similar results were obtained (data not shown). However, intersection might not be the most appropriate approach to combine modules across families, because the intersection set becomes too small. Alternative approaches have to be developed. For example, LASSO (least absolute shrinkage and selection operator)-type methods can be used to select probes from the union sets. Development of a method for constructing a common set from the family specific modules is a topic for future research.
Finally more research is needed to evaluate the performance of our method with regard to false-positive and false-negative findings in relationship to heterogeneity, family size, the number of families, and the heritability of gene expressions.
Plaisier CL, Horvath S, Huertas-Vazquez A, Cruz-Bautista I, Herrera MF, Tusie-Luna T, Aguilar-Salinas C, Pajukanta P. A systems genetics approach implicates USF1, FADS3, and other causal candidate genes for familial combined hyperlipidemia. PLoS Genet. 2009;5(9):e1000642.
de Jong S, Boks MP, Fuller TF, Strengman E, Janson E, de Kovel CG, Ori AP, Vi N, Mulder F, Blom JD, et al. A gene co-expression network in whole blood of schizophrenia patients is independent of antipsychotic-use and enriched for brain-expressed genes. PLoS One. 2012;7(6):e39498.
Kraft P, Schadt E, Aten J, Horvath S. A family-based test for correlation between gene expression and trait values. Am J Hum Genet. 2003;72(5):1323–30.
Blangero J, Teslovich TM, Sim X, Almeida MA, Jun G, Dyer TD, Johnson M, Peralta JM, Manning AK, Wood AR, et al. Omics squared: human genomic, transcriptomic, and phenotypic data for Genetic Analysis Workshop 19. BMC Proc. 2015;9 Suppl 8:S2.
Therneau T. The lmekin function https://cran.r-project.org/web/packages/coxme/vignettes/lmekin.pdf
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2009;9:559.
This work is supported by FP7-Health-F5-2012 under grant agreement No. 305280 (MIMOmics). The Genetic Analysis Workshop is funded by the National Institutes of Health. The authors would like to thank all participants of the gene expression group of Genetic Analysis Workshop 19 for their constructive comments and discussions.
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
RT, HWU, and JH designed the study, interpreted the results and wrote the manuscript. RT carried out the analyses. EvdA and ST helped in developing the algorithm and provided part of the codes. BB helped in data handling and writing. All authors read and approved the final manuscript.
The authors declare that they have no competing interest.