- Proceedings
- Open Access
- Published:

# Statistical corrections of linkage data suggest predominantly *cis* regulations of gene expression

*BMC Proceedings*
**volume 1**, Article number: S145 (2007)

## Abstract

Morley et al. (*Nature* 2004, **430:**743–747) detected significant linkages to the expression levels of 142 genes (of 3554) at a reported threshold of genome-wide *p* = 0.001 (LOD ≈ 5.3), using 14 three-generation Centre d'Etude du Polymorphisme Humain pedigrees. Most of the linkages (77%) were *trans*, i.e., more than 5 Mb from the expressed gene. However, the analysis did not account for the expected anti-conservative effect of the skewed distribution of score- or regression-based statistics in large sibships, or for the possible variance distortion due to correlations among tests. Therefore, we re-analyzed their data, using a robust score statistic for the entire pedigrees and correcting the *p*-values for skewness. We found that a LOD of 5.3 had a skewness-corrected genome-wide *p*-value of 0.016 instead of 0.001 (a result that we confirmed using simulation), with around 50 expected false positives. We then further corrected for correlation among the (skew-corrected) *p*-values by using Efron's method for obtaining the empirical null distribution. Setting a threshold of FDR = 10% (*Z* = 6.4, LOD = 8.9), we detected linkage for the expression levels of 22 genes, 19 of which are *cis*. Limiting the analysis to *cis* regions, linkage was detected to the expression levels of 46 genes with 4.6 expected false positives (FDR = 10%).

## Background

In their study of genome-wide linkage of expression levels of 3554 genes, Morley et al. [1] determined a genome-wide *p*-value for each phenotype using Gaussian process theory, and then used a form of Bonferroni correction, without accounting for dependencies among the many phenotypes being tested, to estimate the number of expected false positives among their 142 positive findings. However, Tang and Siegmund [2] have pointed out that in large sibships, because of the dependencies among identity-by-descent (IBD) counts, score- or regression-based statistics have a skewed distribution under the null hypothesis of no linkage, even if the phenotypes are exactly normally distributed. They have also provided a skewness-corrected approximation to the genome-wide *p*-value, which shows that approximations based on Gaussian processes can be quite anti-conservative in small samples. Also, Morley et al. [1] reported (and we have also observed, data not shown) that there are substantial correlations of expression levels for many pairs of genes in their data. Efron [3] has shown that correlation among many tests, if ignored, can lead to an excess or a deficit of significant findings, and he proposed a method to correct for this effect.

Therefore, we re-examined the linkage results of Morley et al. [1], using a robust score statistic to map the expression phenotypes, based on IBD counts for all relative pairs in each of the 14 Centre d'Etude du Polymorphisme Humain pedigrees. (Note that analyzing entire pedigrees is more powerful here than considering only the sibships; see below.) We corrected the *p*-values using the method of Tang and Siegmund [2] and then determined the false-discovery rate (FDR) using the method of Efron [4] to correct for the correlations among tests. We compared the results of this analysis to those based on a permutation-based FDR procedure. Finally, because most of our significant linkage signals were in *cis* regions (defined by Morley et al. [1] as within 5 Mb of the expressed gene), we also determined whether our analysis would have been more powerful if we had only tested linkage of each expression trait to the markers within 5 Mb of the gene.

## Methods

### A robust score statistic to map quantitative trait loci (QTL) using extended pedigrees

For notational simplicity, we suppress the index for each family. Let *Y* denote the phenotype for the members of a pedigree. Let *ν*_{
ij
}(*t*) denote the number of alleles IBD at locus *t* between individual *i* and *j*, centered to have expected value 0. Let *A*_{
ν
}(*t*) be the IBD matrix with [*A*_{
ν
}(*t*)]_{
ij
}= *ν*_{
ij
}(*t*). Define Σ to be the phenotypic covariance matrix. Assuming no dominant genetic effect, then according to Tang and Siegmund [2], the conditional covariance matrix

Σ_{
A
}= Cov(*Y*, *Y* | *A*_{
ν
}(*τ*)) = Σ + *αA*_{
α
}(*τ*),

where *α* ≥ 0 denotes the additive genetic effect.

From the working assumption that at a trait locus *τ*, conditional on *A*_{
ν
}(*τ*), *Y* follows a multivariate normal distribution, one can derive a robust score statistic for testing whether there is an additive genetic effect at *τ* [2] in the form *Z*(*τ*) = *l*_{
α
}(*τ*)/[*E*_{0}${l}_{\alpha}^{2}$(*τ*)]^{1/2}. Here,*l*_{
α
}(*τ*) = 2^{-1}∑[-*tr*Σ^{-1}*A*_{
ν
}+ *tr*Σ^{-1}*A*_{
ν
}Σ^{-1}*YY'*].

In practice, unobserved values of the IBDs in *l*_{
α
}(*t*) are replaced by their conditional expectation given the genotypic data, while their variances are estimated from multipoint genotypic data. To make the test robust to the normality assumption of the traits, we use *Z*(*τ*) = *l*_{
α
}(*τ*)/[*E*_{0}(${l}_{\alpha}^{2}$(*τ*) | *Y*)]^{1/2}.

Generation-specific effects in extended pedigrees were allowed in estimating the mean and variance of each trait, while the phenotypic correlations *ρ*_{1} for grandparent-grandchild and *ρ*_{2} for sibs were estimated by maximum-likelihood estimation (MLE) with the genetically natural constraint *ρ*_{1} ≤ *ρ*_{2}/2. The sex-average genetic map provided to Genetic Analysis Workshop 15 by Sung et al. [5] was used. The expected IBD counts were computed by MERLIN [6] using all 2819 SNP markers and full pedigrees. The score statistic was then computed at marker locations using the estimated IBD counts. Let *Z*_{
n
}(*t*) be the score statistic at marker *t* for the *n*^{th} phenotype. We defined the genome scan statistic to be *Z*_{
n
}= max_{
t
}*Z*_{
n
}(*t*) over all marker loci for trait *n*. The genome-wide *p*-value for each *Z*_{
n
}(i.e., for each of the 3554 traits) was then computed using the skewness correction described in the Appendix.

Our theoretical calculation shows that, for these 14 pedigrees, using only sibships causes a loss of power equal to roughly 35% of the sample size. Here, we compare linkage scores for sibships and for entire pedigrees, and then we use the more powerful pedigree-based tests for analysis of the effects of our correction procedures.

### Control FDR based on the empirical null distribution

One useful method for addressing the multiple testing problem is to control the FDR [7]. For our problem, a successful FDR procedure requires 1) accurate evaluation of the genome-wide *p*-value for each trait, and 2) adjustment for the correlations among the genome scan test statistics. Here, we correct FDR using Efron's method to estimate the empirical null distribution [4]. For trait *n*, we computed the genome scan statistic *Z*_{
n
}and approximated the genome-wide *p*-value *p*_{
n
}using the skewness-correction method described in the Appendix (accuracy is checked using a Monte Carlo simulation). We then transformed *p*_{
n
}to the normal quantile *q*_{
n
}= Φ^{-1}(1 - *p*_{
n
}) and applied Efron's method on {*q*_{
n
}} to estimate the empirical null distribution *N*(*μ*, *σ*^{2}) The expected number of false positives for threshold *q* is 3554Φ((*q* - *μ*)/*σ*) and the FDR is estimated as FDR = 3554Φ((*q* - *μ*)/*σ*)/#{*q*_{
n
}> *q*}. Here, we have implicitly assumed that the proportion of traits without linkage signals is close to one.

### Control FDR using permutations

To validate the FDR results obtained by correcting for skewness and for the empirical null distribution, we used 1000 permutations to determine the number of false positives and hence the FDR following Efron's method [3]. For permutation *n*, we computed the genome scan statistics ${Z}_{1}^{n},\cdots ,{Z}_{3554}^{n}$, and computed ${Y}_{0}^{n}=\#\{{Z}_{k}^{n}<3.5\}$ and ${Y}_{1}^{n}(b)=\#\{{Z}_{k}^{n}>b\}$ for *b* > 3.5, where 3.5 is the median value of *Z*. The correlation among the genome scan statistics causes ${Y}_{0}^{n},{Y}_{1}^{n}(b)$ to be correlated. So we can fit a linear regression model *Y*_{1}(*b*) = *a*_{1} + *a*_{2}*Y*_{0} + *ε* to the 1000 pairs of (${Y}_{0}^{n},{Y}_{1}^{n}(b)$). For the observed data, we computed the genome scan statistics, *Y*_{0} and *Y*_{1}, then computed the expected number of false positives among the *Y*_{1}(*b*) positive findings as *a*_{1} + *a*_{2}*Y*_{0}. The estimated FDR for threshold *b* was then (*a*_{1} + *a*_{2}*Y*_{0})/*Y*_{1}. The permutation-based FDR procedure does not require accurate evaluation of genome-wide *p*-values or appropriate correction for the correlations among tests, but it is computationally intensive.

### Search for *cis*-regulated genes

If most linkages prove to be in *cis* regions, then the power to detect these linkages could be increased by considering only the markers within 5 Mb of that gene, because genome-wide *p*-values would have to be corrected only for this small proportion of markers for each expression trait. We searched the location of the 3554 gene names on http://sky.bsd.uchicago.edu/genequery.html. The markers within 5 Mb of each target gene were identified and the scan statistic was obtained as the maximum score for these markers. Because the number of markers and the genetic lengths in the *cis* region are highly variable, we evaluated the region-wide *p*-values empirically. We ran 8000 permutations and fit a quadratic curve (log *p*_{
n
}= *α*_{n,0 }+ *α*_{n,1}*b* + *α*_{n,2}*b*^{2}) to the results to predict the region-wide *p*-value for the *n*^{th} trait. The form of the *p*-value is suggested by the formula in Appendix. We then transformed *p*_{
n
}to normal quantile *q*_{
n
}and estimated the empirical null distribution based on the quantitles. The expected number of false positives and the FDR are based on the estimated empirical null distribution.

## Results

### Genome-wide *p*-value

To evaluate the validity of the skewness correction procedure described in the Appendix, we used that procedure to estimate the genome-wide *p*-value associated with the LOD score threshold of 5.3 (*Z* = 4.94) assumed by Morley et al. to have a genome-wide *p*-value of 0.001 based on Gaussian process theory [1]. Our skewness-correction procedure, however, determined that *Z* = 4.94 has a genome-wide *p*-value of 0.016. We then carried out 1600 Monte Carlo simulations (assuming no linkage) of genotypes for 2800 SNP markers (with minor allele frequencies from 0.25 to 0.5 assigned randomly) at 1.2-cM spacing, in 14 families with eight siblings and two parents per family. The genome-wide *p*-value for *Z* = 4.94 was found to be 0.0195 (SD = 0.0035), in close agreement with the skewness-corrected theoretical result, which supports the validity of the correction. Note that in the remainder of our analyses, we applied an FDR threshold of 10% (*Z* = 6.4) rather than a *p*-value threshold. When a large number of true positive results is expected, assessing significance by FDR might have more practical value than a family-wise error rate.

### Analysis of sibships vs. pedigrees

Figure 1 illustrates differences between linkage results for sibships and for pedigrees. In the two panels to the left, the score statistic results are shown for two expressed genes (*ZP3* and *TM7SF3*) for the chromosome on which the gene is located, with the gene location in these two cases being directly under the peak score. Larger scores are observed for full pedigree analysis. In the panel to the right, the score based on sibship data is plotted against the score based on full pedigree data for 3554 traits. The largest (most significant) scores are larger using the full pedigree data.

### Corrected vs. uncorrected linkage results

Following Efron [4], we estimated the empirical null to be *N*(0.25, 1.05^{2}). The *Z*_{
n
}threshold of 6.4 for FDR = 10% was determined as shown in Figure 2. Using this threshold, we observed 22 gene expression levels with significant evidence of linkage (Tables 1 and 2). Among those genes, three were mapped to *trans* loci and 19 to *cis* loci. Using only sibships, we applied the same procedure and found evidence of linkage for only six genes at FDR = 10%. Similar results were obtained using permutation-based FDR method: 19 gene expression levels have significant evidence of linkage, among which 2 were mapped to *trans* loci and 17 to *cis* loci.

Table 1 shows, for comparison, the results of the uncorrected analysis reported by Morley et al. [1]. Using a *Z*_{
n
}threshold of 4.94 (LOD = 5.3 using regression statistics and an assumption of a Gaussian distribution), they reported 142 significant linkages, most of them *trans*, and calculated FDR = 2.5%. However, using the method described above, we expect 50 false-positive results at a threshold of 4.94 after correcting only for skewness, and 110 such results after further correcting for the empirical null. Therefore, we estimate FDR = 77.5% for the results reported by Morley et al. [1].

Finally, Figure 3 and Table 1 summarize the results of the corrected analysis limited to *cis* regions (within 5 Mb of each gene). Because this procedure maximizes *Z*_{
n
}over a smaller number of tests, it detected 46 significant linkages at FDR = 10%.

## Discussion

We have addressed two issues relevant to linkage analysis of multiple traits. First, in data from family constellations larger than one sibling pair, the dependence of IBD sharing for different pairs of individuals within each family will create right-skewed score and regression tests, which we corrected using the method of Tang and Siegmund [2]. Second, when many tests are carried out, and there are correlations among tests, the distribution of test statistics under the null hypothesis can deviate in either direction from Gaussian expectation, which we corrected by the method of Efron [4]. We show that very similar results are obtained by applying these corrections to the data or by computing FDR empirically by permutation. This suggests that our corrections are valid and can be used in place of the very time-consuming permutation procedure.

Our analyses detected far fewer significant tests than the analysis of Morley et al. [1]. There are several differences between these analyses: they selected 142 linkage signals based on a genome-wide *p*-value threshold of 0.001, but without correcting for skewness; and they computed the expected number of false positives by multiplying this *p*-value by the number of tests, without correcting for the correlations among tests. Our results may be more plausible biologically, in that most of the significant linkages of expression levels are *cis*, i.e., close to the gene, where regulatory elements are known to exist. This is consistent with the result of a larger recent gene expression linkage study [8] of 20,413 transcripts in 1200 individuals from 40 Mexican-American families, where 95% of LOD scores >5.0 were located in the *cis* region of the expressed gene.

We would therefore suggest that in linkage studies of correlated traits in larger families, more accurate genome-wide inferences can be made if *p*-values are corrected for skewness caused by correlations of IBD sharing proportions for pairs of relatives, and if the expected proportion of false-positive results is corrected based on the empirical null distribution of test statistics. This proposal requires further testing where the "true" positives are known, using simulation of both expression levels and marker genotypes or using data for linkages that have been validated biologically.

## Appendix

Given skewness *γ*, inter-marker distance Δ, genetic length of the genome *L*, and the average recombination rate *β*, the genome-wide *p*-value can be approximated by

where *ϕ*(*ξ*) = *E*exp(*ξZ*(*t*)) ≈ *ξ*^{2}/2 + *γξ*^{3}/6 and *ξ* is chosen as the solution of *ϕ*'(*ξ*) = *ξ* + *γξ*^{2}/2 = *b*. The function $\nu (x)=2{x}^{-2}[-2{\displaystyle {\sum}_{k=1}^{+\infty}\Phi (-x{k}^{1/2}/2)}]$[9] for *x* > 0. It can be approximated by *ν*(*x*) ≈ exp(-0.583*x*) very accurately for 0 <*x* < 2; the series converge fast for large *x*. For GAW15 linkage data, *β* = 0.033, *γ* = 0.427 (detail omitted), and Δ = 3300 cM/2819 ≈ 1.2 cM.

## References

- 1.
Morley M, Molony CM, Teresa M, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.

- 2.
Tang H-K, Siegmund D: Mapping quantitative trait loci in oligogenic models. Biostatistics. 2001, 2: 147-162. 10.1093/biostatistics/2.2.147.

- 3.
Efron B: Correlation and large-scale simultaneous significance testing. J Am Stat Assoc. 2007, 107: 93-103. 10.1198/016214506000001211.

- 4.
Efron B: Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004, 99: 96-104. 10.1198/016214504000000089.

- 5.
Sung Y, Di Y, Fu AQ, Rothstein JH, Sieh W, Tong L, Thompson EA, Wijsman EM: Comparison of multipoint linkage analyses for quantitative traits in the CEPH data: parametric LOD scores, variance components LOD scores, and Bayes factors. BMC Proc. 2007, 1 (Suppl 1): S93-

- 6.
Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.

- 7.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995, 57: 289-300.

- 8.
Göring HHH, Curran JE, Johnson MP, Dyer TD, Jowett JBM, Mahaney MC, MacCluer JW, Collier GR, Moses EK, Blangero J: Large-scale genetic investigation of genome-wide transcriptional profiles [abstract]. Annual Meeting of The American Society of Human Genetics, 10–14 October 2006; New Orleans. 2006, 171-

- 9.
Siegmund D: Sequential Analysis: Tests and Confidence Intervals. 1985, New York: Springer-Verlag

## Acknowledgements

Support for this work was provided by the Stanford Graduate Fellowship (JS), NIH grant HG000848 (DOS), NIMH grant K24 MH64197 (DFL), and the Eleanor Nichols Endowment (Stanford University).

This article has been published as part of *BMC Proceedings* Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

## Author information

## Additional information

### Competing interests

The author(s) declare that they have no competing interests.

## Rights and permissions

## About this article

#### Published

#### DOI

### Keywords

- Additive Genetic Effect
- Linkage Result
- Empirical Null Distribution
- Large Sibship
- Conditional Covariance Matrix