- Open Access
Gene coexpression network analysis for family studies based on a meta-analytic approach
© The Author(s). 2016
- Published: 18 October 2016
For a better understanding of the biological mechanisms involved in complex traits or diseases, networks are often useful tools in genetic studies: coexpression networks based on pairwise correlations between genes are commonly used. In case of a family-based design, it can be problematic when there is a large between-family variation in expression levels. We propose here a gene coexpression network analysis for family studies. We build a coexpression network for each family and then combine the results. We applied our approach to data provided for analysis in the Genetic Analysis Workshop 19 and compared it to 2 naïve approaches—ignoring correlations among the expressions and decorrelating the gene expression by using the residuals of a mixed model—and a single-probe analysis. Our approach seemed to better deal with heterogeneity with regard to the naïve approaches. The naïve approaches did not provide any significant results, while our approach detected genes via indirect effects. It also detected more genes than the single-probe analysis.
- Systolic Blood Pressure
- Network Analysis
- Coexpression Network
- Gene Coexpression Network
- Shared Environmental Effect
Weighted gene coexpression network is a widely used method for studying biological networks based on pairwise correlations. This method provides more insight in the underlying biological mechanisms and offers a tool for dimension reduction by summarizing identified modules (clusters) of genes [1, 2]. How to perform such an analysis for family data is an open question. For family data Kraft et al.  noted that testing association between expression levels and traits without taking into account the family structure can lead to spurious results, especially when the number of families is small and in the presence of large between-family variation. In this paper, we propose a novel strategy for network analyses in a small set of relatively large families.
For this family-based approach, we first construct family-specific coexpression networks and test for association between the modules and the traits of interest. A common set of genes for all families was obtained by using the intersection and the union of family-specific modules. We compare this family-based approach with 2 naïve approaches: namely, one using the gene expression of the families directly (ignoring correlation) and one that first decorrelates the gene expressions and then applies the standard approach. We also compare our results with single-probe analyses.
The gene expression data set is composed of 647 individuals from 17 large families. These samples are from the data set described in Almasy et al. . Here, we focus on the largest 5 families: namely families 2, 5, 6, 8, and 10 with 65, 55, 45, 62, and 49 family members, respectively. The total number of individuals is 276 and the total number of probes from which gene expression is available is 20,364. We used the simulated quantitative phenotypes systolic blood pressure (SBP) and the phenotype Q1 at time point 1 as outcome variables. The simulation model of SBP comprises 15 genes and that of Q1 does not contain any of these genes. SBP, Q1, and all probes were corrected for age and sex by regressing out covariates and using residuals.
To decorrelate the gene expressions, we fitted for each probe a linear mixed model: X ij = μ + u ij + v i + ε ij , with X ij the value of the probe for the individual j in family i, u ij a normally distributed random genetic effect: u ij ~ N(0, S) where S = 2 * K * s g with K kinship matrix and s g genetic variance, v i a normally distributed random effect representing shared environmental effects, and ε ij a normally distributed residual. To obtain the residuals X ij * of this model, we used the function lmekin, which fits linear mixed models with specific structure of the variance-covariance matrix from the package coxme  in R.
Coexpression networks were built on the data set without correction for family structure based on X ij (naïve approach), the data set adjusted for family structure based on X ij * (naïve decorrelated approach), and on the data sets from the 5 families separately.
We used signed coexpression networks. The adjacency matrix A = [a lk ] of each network was computed as follows: a lk = |0.5 + 0.5 × cor(x l , x k )| γ , with cor(x l , x k ) the correlation between x l the values vector of probe l and x k the values vector of probe k. The parameter γ is acting as a soft threshold in the adjacency matrix, when we increase the value γ the coefficient of the adjacency matrix will tend toward zero except for values really close to 1. We used the biweight midcorrelation based on the median, which is more robust than the Pearson correlation. The co-expression networks were constructed with the R package WGCNA (weighted gene correlation network analysis) . For each obtained module, the first principal component (eigengene) was computed.
Finally, because spurious associations are especially expected in the presence of large between-family heterogeneity , we also performed a network analysis using the subset of 25 % most heritable probes when performing the network analysis (n = 4911 probes with heritability between 0.33 and 0.88).
To test for significance we used a nominal alpha level of 0.05 and the Bonferroni correction was applied to take into account multiple testing.
Results obtained with all probes
Module size of M F2 O to M F10 O and overlap size with M F M in the all-probes analysis
M F2 O
M F5 O
M F6 O
M F8 O
M F10 O
Size of the overlap with M F M
Parameter estimates of the association between eigengenes and Q1 and SBP
25 % Most heritable probes
When analyzing all families together, none of the approaches provided significant results. The joint analysis of the families using EF as eigengene in model (3) did not provide a significant association SBP (β = −0.13, p = 0.49). For the naïve approach, the eigengene of the module magenta (EN) had the smallest p value (β = −3.21, p = 0.01). For the naïve approach using the decorrelated data set, the eigengene of the module grey60 (END) had the smallest p value (β = −3.03, p = 0.0061). After multiple-testing correction (between 43 and 50 modules in each network) none of the results were significant. Finally the single-probe analysis preformed in the 5 families by using model (1) provided 1 significantly associated probe with SBP (CRIP2; β = −13.68, p = 1.7e-06).
Analysis of Q1
The results of the analysis of Q1 are also given in Table 2. For the family approach, none of the modules obtained in family-specific network analysis was significantly associated with Q1 and no common set could be defined. In Table 2, the estimates of strongest associated modules E F M for each family are given. For the naïve approach, the module red (EN) (β = 2.75, p = 0.00056) was significant and for the naïve approach using the decorrelated data the module green (END) (β = −1.41, p = 0.00094) was significantly associated with Q1.
Results obtained with the 25 % most heritable probes
For the naïve and the family approaches, the results of the network-based analyses using only the gene expressions of the 25 % most heritable probes (n = 4911 probes with heritability between 0.33 and 0.88) are also given in Table 2. None of the 15 genes used in the simulation model for SBP was among this set of most heritable probes. For Family 6 the EF6 was significantly associated with SBP (p = 0.01). The association of EF in the 5 families with SBP was also significant (p = 0.02). For Q1, none of the approaches provided significant results. With regard to the single-probe analysis, no probe other than CRIP2 was significantly associated.
In this paper, we have proposed a novel strategy to perform a coexpression network analysis with family data: building a network for each of the large pedigrees and defining a common module by taking the intersection of family-specific modules. We compared our family-based approach with 2 naïve network approaches and a single-probe analysis. All analyses were performed in a small set of 5 relatively large families. None of the 15 genes in the simulation model was identified in this small data set. However, the family-based approach identified significant associations between the eigengene and SBP in 2 families. This eigengene was significantly correlated with 4 of the 15 genes. When analyzing all families jointly the family eigengene was not significant. Also the naïve network approaches did not provide any significant result. The single-probe analysis provided 1 significant gene, which was correlated with 1 of the 15 genes. To study the performance of the methods with regard to false-positive findings, we also analyzed the trait Q1 for which no gene expressions were included in the simulation model. The family approach did not provide a significant finding, whereas both naïve approaches identified a significant module for Q1. The result in the naïve approach based on gene expression (X ij ) is in line with the findings of Kraft et al. . We did not expect to have a false-positive finding when using the decorrelated data (X ij * ) as input for our network analysis. Possible explanations for this finding are the fact that the correlation based on the kinship coefficient might not be appropriate for gene expressions, and randomness. In addition to the set of all probes, networks were also built using only the 25 % most heritable probes. Especially for these variables that show large between-family variation spurious associations might occur when the family structure is not taken into account. This was not confirmed in our analysis. More research is needed to study the sensitivity of the methods for between-family variation.
We did not know the answers when we developed the family-based approach and analyzed the data. The simulation model used to create the data sets may not be well suited to pick up the 15 genes directly by network analysis. The 15 genes present in the model were in different pathways: they were not correlated. Moreover our approach was able to identify indirect effects: that is, the eigengenes were correlated with the 15 genes. Thus the significant association of the family-based network approach represented the largest number of genes from the simulation model. We expect that especially in the presence of large between-family variation our approach would perform best. A thorough simulation study is required to investigate the performance of our method further.
Network analysis provides a tool to reduce the number of tests by first summarizing the data in sets of genes with correlated gene expressions and summarizing the gene set by the first principle component. Another obvious reduction step is to only consider the heritable probes for the analysis. It appeared that by using the heritable probes the results across the families were less heterogeneous. The family approach, as well as the naïve approach using decorrelated data, provided significant results for SBP.
In this paper we combined the family-specific modules by taking the intersection of the modules which showed most overlap. This approach worked well for the relatively small set of 5 families. When we applied our method to the 6 largest families, similar results were obtained (data not shown). However, intersection might not be the most appropriate approach to combine modules across families, because the intersection set becomes too small. Alternative approaches have to be developed. For example, LASSO (least absolute shrinkage and selection operator)-type methods can be used to select probes from the union sets. Development of a method for constructing a common set from the family specific modules is a topic for future research.
Finally more research is needed to evaluate the performance of our method with regard to false-positive and false-negative findings in relationship to heterogeneity, family size, the number of families, and the heritability of gene expressions.
This work is supported by FP7-Health-F5-2012 under grant agreement No. 305280 (MIMOmics). The Genetic Analysis Workshop is funded by the National Institutes of Health. The authors would like to thank all participants of the gene expression group of Genetic Analysis Workshop 19 for their constructive comments and discussions.
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
RT, HWU, and JH designed the study, interpreted the results and wrote the manuscript. RT carried out the analyses. EvdA and ST helped in developing the algorithm and provided part of the codes. BB helped in data handling and writing. All authors read and approved the final manuscript.
The authors declare that they have no competing interest.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Plaisier CL, Horvath S, Huertas-Vazquez A, Cruz-Bautista I, Herrera MF, Tusie-Luna T, Aguilar-Salinas C, Pajukanta P. A systems genetics approach implicates USF1, FADS3, and other causal candidate genes for familial combined hyperlipidemia. PLoS Genet. 2009;5(9):e1000642.View ArticlePubMedPubMed CentralGoogle Scholar
- de Jong S, Boks MP, Fuller TF, Strengman E, Janson E, de Kovel CG, Ori AP, Vi N, Mulder F, Blom JD, et al. A gene co-expression network in whole blood of schizophrenia patients is independent of antipsychotic-use and enriched for brain-expressed genes. PLoS One. 2012;7(6):e39498.View ArticlePubMedPubMed CentralGoogle Scholar
- Kraft P, Schadt E, Aten J, Horvath S. A family-based test for correlation between gene expression and trait values. Am J Hum Genet. 2003;72(5):1323–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Blangero J, Teslovich TM, Sim X, Almeida MA, Jun G, Dyer TD, Johnson M, Peralta JM, Manning AK, Wood AR, et al. Omics squared: human genomic, transcriptomic, and phenotypic data for Genetic Analysis Workshop 19. BMC Proc. 2015;9 Suppl 8:S2.Google Scholar
- Therneau T. The lmekin function https://cran.r-project.org/web/packages/coxme/vignettes/lmekin.pdf
- Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2009;9:559.View ArticleGoogle Scholar