Identifying cis- and trans-acting single-nucleotide polymorphisms controlling lymphocyte gene expression in humans
© Hu et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
Assuming that multiple loci play a role in regulating the expression level of a single phenotype, we propose a new approach to identify cis- and trans-acting loci that regulate gene expression. Using the Problem 1 data set made available for Genetic Analysis Workshop 15 (GAW15), we identified many expression phenotypes that show significant evidence of association and linkage to one or more chromosomal regions. In particular, six of the ten phenotypes that we found to be regulated by both cis- and trans-acting loci were also mapped by a previous analysis of these data, in which a total of 27 phenotypes were identified with expression levels regulated by cis-acting determinants. In general, however, the p-values associated with these regulators were larger in our study than in the previous analysis, because we also identified other factors regulating expression. In fact, we found that most of the gene expression phenotypes are influenced by at least one trans-acting locus. Our study also shows that much of the observable heritability in the phenotypes could be explained by simple single-nucleotide polymorphism associations; residual heritability was reduced, and the remaining heritability may represent complex regulation systems with interactions, or noise.
Phenotypes and genotypes
Expression levels for 3554 genes, measured in lymphoblastoid cell lines from 194 members of 14 CEPH Utah families, were made available for GAW15. Of these expression measures, 92 lacked the chromosome number or the start or end position in the Affymetrix annotation table (http://www.affymetrix.com), so we focused on the remaining 3462 gene expression traits. Genotypes for 2819 autosomal SNPs in the same individuals were generated by The SNP Consortium (http://snp.cshl.org).
Definition of cis- and trans-acting regulators
cis-Regulatory variants were defined as SNPs either within a gene, up to 1 Mb proximal to the start of the gene, or up to 1 Mb distal to the end of the gene. trans-Regulatory polymorphisms were defined as all SNPs elsewhere in the genome. Physical locations of probe sets were obtained from the Affymetrix annotation table (http://www.affymetrix.com). The Rutgers map (http://compgen.rutgers.edu/maps) was used to establish a correspondence between megabase locations on the physical map and positions on the genetic map. Markers that could not be mapped using the Rutgers map, but that were located between physically anchored markers, were placed on the genetic map by linear interpolation.
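The two rules in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `is_cis` and `interpolate_cm` are hypothetical, positions are taken to be base-pair coordinates with gene start smaller than gene end, and anchored markers are assumed to be sorted by physical position on a single chromosome.

```python
import numpy as np

CIS_WINDOW_BP = 1_000_000  # 1 Mb flank on either side of the gene, as in the text


def is_cis(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end, window=CIS_WINDOW_BP):
    """Return True if the SNP lies within the gene or within `window` bp of either end."""
    if snp_chrom != gene_chrom:
        return False  # SNPs on other chromosomes are trans by definition
    return gene_start - window <= snp_pos <= gene_end + window


def interpolate_cm(query_bp, anchor_bp, anchor_cm):
    """Place markers on the genetic map by linear interpolation.

    anchor_bp / anchor_cm: physical (bp) and genetic (cM) positions of the
    anchored markers on one chromosome, sorted by increasing physical position.
    """
    query_bp = np.asarray(query_bp, dtype=float)
    anchor_bp = np.asarray(anchor_bp, dtype=float)
    # Interpolation is only defined for markers flanked by anchored markers.
    if np.any(query_bp < anchor_bp[0]) or np.any(query_bp > anchor_bp[-1]):
        raise ValueError("marker is not flanked by anchored markers")
    return np.interp(query_bp, anchor_bp, anchor_cm)
```

Any SNP for which `is_cis` returns False would be treated as trans for that gene.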
Identification of cis- and trans-acting regulators
Three steps were used to identify cis- and trans-regulatory polymorphisms (see Figure 1).
1) For each probe set, we first identified SNPs in or within 1 Mb of the probe set (cis-SNPs), and then fit a linear regression model containing gender and the cis-SNPs as covariates. If no cis-SNPs were identified for the probe set, we used only gender as a covariate; if more than one cis-SNP was identified, a stepwise algorithm based on the Akaike information criterion (AIC) was applied to choose a predictive set of cis-SNPs; if no SNPs were kept after running the stepwise algorithm, we forced in the SNP with the smallest p-value. The SNPs were coded 0, 1, and 2, representing homozygous rare, heterozygous, and homozygous common genotypes, respectively. Within-family dependence was not modelled, although we did some sensitivity analyses examining the effect of this assumption (see Discussion). We report the nominal, parametric p-values for the test of no association for each SNP (β = 0).
2) Residuals were obtained from the previous linear models containing gender and cis-SNPs (if any). Genome-wide multipoint linkage analysis was then performed on the residuals using the MERLIN-REGRESS command in the statistical genetics software MERLIN.
3) For each of the gene expression phenotypes, we fit a new linear model, which also included gender, to identify SNPs under the linkage peaks of step 2 (logarithm of the odds (LOD) ≥ 2.0) that influence gene expression. These were primarily trans-SNPs because the linkage models used residuals that had already been adjusted for cis-SNPs. We evaluated which of these SNPs significantly and independently predicted the gene expression phenotype by using the stepwise procedure with AIC to choose the optimal set of SNPs in the model.
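The regression part of steps 1 and 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say whether the stepwise search ran forward, backward, or both, so a greedy forward search is assumed here. The function names and the simulated data are hypothetical, the AIC is the Gaussian form up to an additive constant, and the rule that forces in the SNP with the smallest p-value is not implemented.

```python
import numpy as np


def fit_ols(y, X):
    """Least-squares fit; returns coefficients, residuals, and a Gaussian AIC (up to a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    rss = float(resid @ resid)
    return beta, resid, n * np.log(rss / n) + 2 * k


def forward_stepwise_aic(y, base, snps):
    """Greedy forward selection of SNP columns by AIC (assumed search direction).

    base: (n, p) always-included columns (intercept, gender).
    snps: (n, m) genotypes coded 0/1/2.
    Returns the selected column indices and the residuals of the final model.
    """
    def design(cols):
        return np.column_stack([base, snps[:, np.asarray(cols, dtype=int)]])

    selected = []
    remaining = list(range(snps.shape[1]))
    best_aic = fit_ols(y, base)[2]
    while remaining:
        # Try adding each remaining SNP and keep the one with the lowest AIC.
        aic, j = min((fit_ols(y, design(selected + [c]))[2], c) for c in remaining)
        if aic >= best_aic:
            break  # no candidate improves the AIC
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    resid = fit_ols(y, design(selected))[1]
    return selected, resid


# Simulated data for illustration only: 194 individuals, 5 candidate SNPs.
rng = np.random.default_rng(0)
n = 194
gender = rng.integers(0, 2, size=n).astype(float)
snps = rng.integers(0, 3, size=(n, 5)).astype(float)
y = 0.5 * gender + 1.0 * snps[:, 2] + rng.normal(scale=0.5, size=n)
base = np.column_stack([np.ones(n), gender])
selected, resid = forward_stepwise_aic(y, base, snps)
```

In the workflow described above, `resid` would be the phenotype passed to the linkage analysis in step 2. That analysis is run in MERLIN and is not reproduced here.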
The variance components analysis in MERLIN was used to estimate heritability based on: 1) raw gene expression profiles (H_exp); 2) residuals from a stepwise regression analysis containing cis-SNPs (if any) and gender (H_cis); 3) residuals from a stepwise regression analysis containing gender and SNPs (if any) with a LOD score of at least 2.0 (H_lod).
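For orientation, heritability in a simple polygenic variance-components model is usually written as below. This is the textbook definition, given here as an assumption about what is being estimated; the exact model options used in MERLIN are not stated in the text. The symbols \sigma_g^2 (additive polygenic variance) and \sigma_e^2 (residual variance) are introduced for illustration only.

```latex
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
```

Under this reading, H_exp, H_cis, and H_lod are the same ratio estimated on three different phenotypes: the raw expression level, the step 1 residuals, and the step 3 residuals. A drop from H_exp to H_cis or H_lod would then indicate that the SNPs in the regression account for part of the familial resemblance.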
Distribution of the number of expression phenotypes with different numbers of SNPs in the regression models of steps 1 and 3 of our three-step method (table body not recovered; surviving column headings: p ≤ 0.01 and p ≤ 0.001, each appearing twice)
We then performed linkage analysis using residuals for the 3462 expression phenotypes derived from the fitted association models using the associated cis-SNPs and gender. Morley et al. defined two levels of significance: p = 3.7 × 10⁻⁵ (LOD ~ 3.4), and p = 4.3 × 10⁻⁷ (LOD ~ 5.3). Using the same thresholds, we identified 1556 and 337 expression phenotypes, respectively, with at least one marker showing evidence for linkage beyond these thresholds. In comparison, Morley et al. identified 984 and 142 phenotypes, respectively, with at least one region of linkage at these two levels. We found many expression phenotypes whose regulation mapped to shared hotspots on chromosomes 9, 11, 13, 14, and 20.
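The correspondence between the two p-value thresholds and their approximate LOD scores can be checked with the standard asymptotic conversion, in which 2 ln(10) × LOD is treated as a chi-square statistic with 1 degree of freedom and the p-value is halved for a one-sided test. The sketch below assumes this conversion; it is not necessarily the calculation Morley et al. used, and the function name is hypothetical.

```python
import math


def lod_to_pvalue(lod):
    """One-sided asymptotic p-value for a LOD score (chi-square with 1 df, halved)."""
    chi2 = 2.0 * math.log(10.0) * lod      # likelihood-ratio statistic
    z = math.sqrt(chi2)                    # equivalent standard normal deviate
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail P(Z > z)


for lod in (3.4, 5.3):
    print(f"LOD {lod}: p ~ {lod_to_pvalue(lod):.2e}")
```

Under this approximation, LOD scores of 3.4 and 5.3 give p-values of the same order as the two thresholds quoted above.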
We then performed a second set of stepwise linear regression analyses for the expression phenotypes, including gender and SNPs that showed LOD scores ≥ 2.0. Of the 3462 phenotypes, 3034 had at least one linkage peak. At a p-value threshold of 0.01, 930 of the 3034 phenotypes were significantly associated with one marker and another 893 were significantly associated with more than one marker, for a total of 1823. The remaining 1639 phenotypes (counted out of all 3462, so including those without a linkage peak) showed no significant association with any marker. Again, a more stringent significance level reduces the number of significant trans-SNPs identified, but not by as much as in Step 1. Focusing on the most significant SNP for each phenotype, we found that 1514 (83.0%) of the 1823 associated expression phenotypes are most strongly influenced by a trans-acting transcriptional regulator.
Ten phenotypes whose expression level is significantly regulated by both cis- and trans-acting determinants (table body not recovered). Column groups: cis-association analysis (Step 1); linkage analysis of residuals (Step 2), signals at cis-SNPs; association analysis under linkage peaks (Step 3). Column headings: p-value for cis-SNP with peak LOD score; variation explained (R², %); cis-SNP with peak LOD score; peak (cis) LOD score; number of SNPs in model.
Distribution of heritability estimates for expression phenotypes
Discussion & conclusion
Genetic and environmental factors influence gene expression through complex pathways. Useful insight can therefore be gained by jointly considering the effects of covariates and several SNPs when examining factors that influence gene expression. We included gender in all models, and it was highly significant in many of them (data not shown). It would also be interesting to include age to examine more complex genetic relationships. We performed linkage analysis on residuals from models containing gender and cis-SNP effects, rather than on raw expression intensities. This approach may reduce residual variance and hence make it possible to identify additional factors influencing expression.
We allowed multiple SNPs to be considered for each linear regression and hence we identified phenotypes that are associated with several different SNPs in different parts of the genome. For some genes, a very large proportion of the variability was explained by a combination of several SNPs (see Table 2). Often, several nearby SNPs all show univariate associations with an expression phenotype. Because these SNPs are correlated, only some of them tend to be retained by the stepwise regression; a more parsimonious model can capture the association in a genomic region. We identified some of the same cis-controlled phenotypes as Morley et al. and Cheung et al., but our statistical significance was reduced relative to theirs. This may be a consequence of including several SNPs as well as gender in each model, in conjunction with a small sample size. We did not examine interactions between SNPs or genes; however, it would be interesting to model interactions between cis-SNPs in the regression analysis to explore joint effects. Despite performing linkage on residuals, we sometimes found linkage to regions near the cis-SNPs, probably due to multi-marker linkage patterns that are incompletely explained by allelic variability.
Although we used simple linear regression to explore SNP associations and did not correct for additional familial dependence in this analysis, we compared our first-stage results with generalized estimating equation (GEE) models, using only the 413 phenotypes for which exactly one cis-SNP was available. Our regression analysis identified 15 of these phenotypes as having a significant cis-SNP (p ≤ 0.01), while the GEE model identified only 10 (p ≤ 0.01). Five of these phenotypes were identified by both methods. By ignoring familial clustering, we may have obtained p-values that are too small. It would be worth fitting random-effects models or GEE models to all the data, as well as models that are robust to non-normal distributions. However, the number of families here is quite small, and any general conclusions would be better drawn from a larger sample.
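The effect of ignoring familial clustering can be illustrated with a cluster-robust (sandwich) variance estimate. This is a stand-in for the GEE analysis rather than a reproduction of it: for a linear model with identity link, a GEE with an independence working correlation gives the same coefficients as ordinary least squares and differs only in using sandwich standard errors. The working correlation actually used is not stated in the text, so this choice is an assumption, and the function name and simulated data are hypothetical.

```python
import numpy as np


def ols_cluster_robust(y, X, groups):
    """OLS coefficients with naive and cluster-robust (sandwich) standard errors."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # Naive variance: independent observations with a common error variance.
    sigma2 = resid @ resid / (n - k)
    se_naive = np.sqrt(np.diag(sigma2 * bread))
    # Sandwich "meat": sum over families of the outer product of X_g' e_g.
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    se_robust = np.sqrt(np.diag(bread @ meat @ bread))
    return beta, se_naive, se_robust


# Simulated data for illustration only: 14 families of 10 with a shared family effect.
rng = np.random.default_rng(1)
fam = np.repeat(np.arange(14), 10)
n = fam.size
snp = rng.integers(0, 3, size=n).astype(float)
y = 0.3 * snp + rng.normal(size=14)[fam] + rng.normal(size=n)
X = np.column_stack([np.ones(n), snp])
beta, se_naive, se_robust = ols_cluster_robust(y, X, fam)
```

Comparing `se_naive` with `se_robust` for the SNP coefficient shows how much the nominal p-value depends on the independence assumption. With only 14 families, the sandwich estimate is itself noisy, which is consistent with the caution above about drawing general conclusions from this sample.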
Conceptually, we showed that a sizeable proportion of the observable heritability could be explained by simple SNP associations for these lymphoblastoid expression phenotypes. The 5.5% of the phenotypes where residual heritability remained over 0.4 may be influenced by complex regulation systems.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
- Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.
- Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and whole genome association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.
- Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavaré S, Deloukas P, Dermitzakis ET: Genome-wide associations of gene expression variation in humans. PLoS Genet. 2005, 1: e78. 10.1371/journal.pgen.0010078.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: MERLIN-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.