The hypotheses and the test statistic
For an arbitrary location τ in the genome consider testing the following hypotheses:
H0 : τ = τ vs. H0 : τ ≠ τ,
where τ* is the true location of the QTL controlling the expression level of a continuous phenotype. If we perform the above hypothesis test at an α-level, then it is easily seen that the collection of all loci τ that the null hypothesis is not rejected forms an 1 - α confidence region for the location of the QTL. In addition to allowing us to derive confidence regions with known statistical properties, this formulation of the null and alternative hypothesis helps circumvent the need for multiplicity adjustment for the number of locations tested.
In order to test the above hypotheses, we made use of the method of the squared differences (SQD) of phenotypic values of sib pairs introduced by Haseman and Elston (HE) [4]. In brief, let us assume that we have phenotypic values of a quantitative trait for n sib-pairs, and let y
i
be the squared difference of the expressions of the two siblings in the ith family. In addition, define π
i
to be the proportion of alleles shared identical by descent (IBD) by the sib pair at the trait locus. HE [4] proved that under the regression model
Ey
i
= β0 + β1π
i
,
and assuming null dominance effect, or fairly large sample size, the coefficient β1 is simply -2, where is the additive variance component attributed to the trait locus. Fulker and Cardon [5] demonstrated that the same holds if π
i
is substituted by π
i
(G
i
), the expected number of alleles shared IBD given the marker genotypes (G
i
) of the family. Based on this observation, one can easily test Eq. (1) by simply testing the following hypotheses for each locus τ
(3)
where β1τis the regression coefficient of the SQDs y values, on the π
τ
(G) values, the observed IBD sharing by the sib pair at the locus τ. Thus, in order to construct the confidence region, we only need to estimate the additive effect of the trait locus. Such an estimate can be obtained from the data themselves after a linkage signal has been established, i.e., the maximum test statistic used in the preliminary analysis exceeds a certain pre-chosen threshold. Assuming that the total number of sib pairs (n) is sufficiently large, we can randomly split the data into two groups of n1 and n2 = n - n1 pairs. Using the data from the first group, we can estimate the additive genetic variance as , where is the estimate of the regression coefficient of the SQD values on the observed IBD sharing at the locus τ
max
, the location where the maximum test statistic of the whole genome screening occurred. Then, based on the data from the second group and for each location τ on the genome, we fit the regression in Eq. (2) to obtain the coefficients . Finally, we test
to determine whether τ should be included in the confidence region for the location of the putative trait contributing gene.
Data and phenotypes
Our data consisted of GE measurements on 3554 genes for 14 three-generational CEPH families, averaging 14 members per pedigree for a total of 194 people. We chose to analyze the GEs of 25 genes that have been suggested to be (co-)regulated by one or possibly multiple SNPs located on a small genomic region of 5 cM toward the end of chromosome 14 [3]. Genotypes on about 2800 SNPs densely covering the whole genome (on average 1 SNP/cM) were also available for all individuals. Because most of the families were multigenerational with more than two children, we extracted all possible combinations of sib pairs, resulting in a sample of 376 pairs. Note that this procedure does not yield an independent sample. We could select one sib pair from each family and exclude it from the analysis to remove this dependency [6]. However, previous experience with the CSI version for binary traits [7] suggests that this dependency has a minimal effect on the confidence regions. Because we expected the CSI-QTL method to have a similar behavior as its counterpart for binary traits, we opted to include all the pairs in the analysis.
Analyses
Each GE of interest was analyzed according to the following two-stage protocol. First, using the HE test statistic we scanned the whole autosomal genome, 22 chromosomes, to uncover chromosomes that potentially harbor regulating factors. In an attempt to hold the false-discovery rate at a low level, we selected for further analysis only those chromosomes that had a minimum observed standardized score for β1 of less than -3.09 (pointwise significant level of 0.001). In the second stage, we focused only on the chromosome(s) that were identified on the preliminary analysis and applied our proposed method to construct 95% confidence regions (CRs) for the true location of the identified putative QTL. Although the sample size of 376 sib pairs seems to be sufficiently large for each stage of the analysis, splitting them into two groups as we described earlier would lead to too small a sample size for either. Thus, we decided to use all the sib pairs for both estimating the additive variance and constructing the CRs. We expect this to result in relatively conservative intervals with true coverage probability higher than their nominal value.