A robust score statistic to map quantitative trait loci (QTL) using extended pedigrees
For notational simplicity, we suppress the index for each family. Let Y denote the phenotype for the members of a pedigree. Let ν
ij
(t) denote the number of alleles IBD at locus t between individual i and j, centered to have expected value 0. Let A
ν
(t) be the IBD matrix with [A
ν
(t)]
ij
= ν
ij
(t). Define Σ to be the phenotypic covariance matrix. Assuming no dominant genetic effect, then according to Tang and Siegmund [2], the conditional covariance matrix
Σ
A
= Cov(Y, Y | A
ν
(τ)) = Σ + αA
α
(τ),
where α ≥ 0 denotes the additive genetic effect.
From the working assumption that at a trait locus τ, conditional on A
ν
(τ), Y follows a multivariate normal distribution, one can derive a robust score statistic for testing whether there is an additive genetic effect at τ [2] in the form Z(τ) = l
α
(τ)/[E0(τ)]1/2. Here,l
α
(τ) = 2-1∑[-trΣ-1A
ν
+ trΣ-1A
ν
Σ-1YY'].
In practice, unobserved values of the IBDs in l
α
(t) are replaced by their conditional expectation given the genotypic data, while their variances are estimated from multipoint genotypic data. To make the test robust to the normality assumption of the traits, we use Z(τ) = l
α
(τ)/[E0((τ) | Y)]1/2.
Generation-specific effects in extended pedigrees were allowed in estimating the mean and variance of each trait, while the phenotypic correlations ρ1 for grandparent-grandchild and ρ2 for sibs were estimated by maximum-likelihood estimation (MLE) with the genetically natural constraint ρ1 ≤ ρ2/2. The sex-average genetic map provided to Genetic Analysis Workshop 15 by Sung et al. [5] was used. The expected IBD counts were computed by MERLIN [6] using all 2819 SNP markers and full pedigrees. The score statistic was then computed at marker locations using the estimated IBD counts. Let Z
n
(t) be the score statistic at marker t for the nth phenotype. We defined the genome scan statistic to be Z
n
= max
t
Z
n
(t) over all marker loci for trait n. The genome-wide p-value for each Z
n
(i.e., for each of the 3554 traits) was then computed using the skewness correction described in the Appendix.
Our theoretical calculation shows that, for these 14 pedigrees, using only sibships causes a loss of power equal to roughly 35% of the sample size. Here, we compare linkage scores for sibships and for entire pedigrees, and then we use the more powerful pedigree-based tests for analysis of the effects of our correction procedures.
Control FDR based on the empirical null distribution
One useful method for addressing the multiple testing problem is to control the FDR [7]. For our problem, a successful FDR procedure requires 1) accurate evaluation of the genome-wide p-value for each trait, and 2) adjustment for the correlations among the genome scan test statistics. Here, we correct FDR using Efron's method to estimate the empirical null distribution [4]. For trait n, we computed the genome scan statistic Z
n
and approximated the genome-wide p-value p
n
using the skewness-correction method described in the Appendix (accuracy is checked using a Monte Carlo simulation). We then transformed p
n
to the normal quantile q
n
= Φ-1(1 - p
n
) and applied Efron's method on {q
n
} to estimate the empirical null distribution N(μ, σ2) The expected number of false positives for threshold q is 3554Φ((q - μ)/σ) and the FDR is estimated as FDR = 3554Φ((q - μ)/σ)/#{q
n
> q}. Here, we have implicitly assumed that the proportion of traits without linkage signals is close to one.
Control FDR using permutations
To validate the FDR results obtained by correcting for skewness and for the empirical null distribution, we used 1000 permutations to determine the number of false positives and hence the FDR following Efron's method [3]. For permutation n, we computed the genome scan statistics , and computed and for b > 3.5, where 3.5 is the median value of Z. The correlation among the genome scan statistics causes to be correlated. So we can fit a linear regression model Y1(b) = a1 + a2Y0 + ε to the 1000 pairs of (). For the observed data, we computed the genome scan statistics, Y0 and Y1, then computed the expected number of false positives among the Y1(b) positive findings as a1 + a2Y0. The estimated FDR for threshold b was then (a1 + a2Y0)/Y1. The permutation-based FDR procedure does not require accurate evaluation of genome-wide p-values or appropriate correction for the correlations among tests, but it is computationally intensive.
Search for cis-regulated genes
If most linkages prove to be in cis regions, then the power to detect these linkages could be increased by considering only the markers within 5 Mb of that gene, because genome-wide p-values would have to be corrected only for this small proportion of markers for each expression trait. We searched the location of the 3554 gene names on http://sky.bsd.uchicago.edu/genequery.html. The markers within 5 Mb of each target gene were identified and the scan statistic was obtained as the maximum score for these markers. Because the number of markers and the genetic lengths in the cis region are highly variable, we evaluated the region-wide p-values empirically. We ran 8000 permutations and fit a quadratic curve (log p
n
= αn,0 + αn,1b + αn,2b2) to the results to predict the region-wide p-value for the nth trait. The form of the p-value is suggested by the formula in Appendix. We then transformed p
n
to normal quantile q
n
and estimated the empirical null distribution based on the quantitles. The expected number of false positives and the FDR are based on the estimated empirical null distribution.