The hypothesis, test statistics, and confidence set
Suppose there is a chromosomal region of length C with at most one disease gene in it. We want to localize the gene, if it exists, by constructing a confidence set for its locus with coverage probability p, such that the probability of excluding the true disease locus from the confidence set is controlled at level α = 1 - p. If there is a disease gene on the map, a type I error can be made at only one position (the true locus), so no multiplicity adjustment is needed [1]. By the duality between confidence sets and hypothesis tests, this is equivalent to testing the following hypotheses at level α for every position on the chromosome:
$$H_0: d = d_0 \quad \text{vs.} \quad H_a: d \neq d_0, \tag{1}$$
where $d$ is the true but unknown map position of the disease gene on the chromosome, and $d_0 \in [0, C]$ is the tested map position.
The LOD score, the conventional measure of support for linkage versus absence of linkage, can be used as a test statistic here, denoted by $\lambda_{d_0}$:
$$\lambda_{d_0} = \log_{10}\frac{L(d_0)}{L(\infty)} = \mathrm{LOD}(d_0).$$
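As a small illustration of this definition, the LOD score can be computed from two log-likelihoods. The function name and the choice of natural-log likelihoods as inputs are assumptions made for this sketch; they are not part of the method description.

```python
import math


def lod_score(loglik_d0, loglik_unlinked):
    """LOD(d0) = log10[L(d0) / L(inf)], computed from natural-log likelihoods.

    Both arguments are hypothetical inputs: the natural log of the
    likelihood at the tested position d0 and at d = infinity (no linkage).
    """
    return (loglik_d0 - loglik_unlinked) / math.log(10.0)


# A likelihood ratio of 1000 in favour of linkage corresponds to a LOD of 3.
print(round(lod_score(math.log(1000.0), 0.0), 6))
```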
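Two alternative test statistics based on the generalized likelihood ratio, GLRT ($\lambda^{*}_{d_0}$) and GLRT/MA ($\lambda^{MA}_{d_0}$), can also be used:
$$\lambda^{*}_{d_0} = -2\ln\frac{L(d_0)}{L(\hat{d})} = 4.6\,[\mathrm{LOD}(\hat{d}) - \mathrm{LOD}(d_0)],$$ and
$$\lambda^{MA}_{d_0} = E_M[\lambda^{*}_{d_0}] = 4.6\,[\mathrm{MALOD}(\hat{d}) - \mathrm{MALOD}(d_0)],$$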
where $\mathrm{LOD}(\hat{d})$ is the maximum LOD score, with the maximum taken over $[0, C]$ as well as over $d = \infty$ (the disease locus is off the map). The model averaging LOD score MALOD is defined as
$$\mathrm{MALOD}(d) = \sum_{i=1}^{S} \mathrm{LOD}(d \mid M_i)\, P(M_i),$$
which is an average of LOD scores over a set of $S$ disease models ($M_i$, $i = 1, \ldots, S$) compatible with the data. More specifically, the disease models considered are those that are consistent with the identical-by-descent (IBD) probabilities estimated at the hypothesized trait locus, or with perturbations of those probabilities. This setup accounts both for the model uncertainty associated with the estimated IBD probabilities and for the uncertainty in estimating them. The weights assigned to the models, $P(M_i)$, are bimodal: models obtained from the IBD estimates receive larger weights than those obtained from the perturbations. More details can be found in Wan [2].
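The two generalized likelihood ratio statistics can be sketched in a few lines. The constant 4.6 in the formulas is 2 ln(10) ≈ 4.605, which converts a difference of base-10 LOD scores into the usual -2 ln(likelihood ratio) scale. The function names, the example LOD values, and the example weights below are hypothetical; the sketch only assumes that MALOD is a weighted average of per-model LOD scores with weights P(M_i) that sum to 1.

```python
import math

TWO_LN_10 = 2.0 * math.log(10.0)  # about 4.605, the "4.6" in the text


def malod(lods_by_model, weights):
    """Weighted average of per-model LOD scores at one map position.

    lods_by_model[i] is the LOD under disease model M_i and weights[i]
    is P(M_i); the weights are assumed to sum to 1.
    """
    if len(lods_by_model) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("model weights must sum to 1")
    return sum(w * lod for w, lod in zip(weights, lods_by_model))


def glrt(lod_max, lod_d0):
    """GLRT-type statistic: 2 ln(10) * [maximum LOD - LOD at d0]."""
    return TWO_LN_10 * (lod_max - lod_d0)


# Hypothetical numbers: three models, the IBD-based one weighted most.
weights = [0.6, 0.2, 0.2]
malod_d0 = malod([1.0, 0.5, 1.5], weights)
malod_max = malod([2.0, 1.5, 2.5], weights)
print(glrt(malod_max, malod_d0))
```

Passing plain LOD scores to `glrt` gives the GLRT statistic; passing MALOD values, as in the example, gives GLRT/MA.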
A confidence set for the disease locus is then constructed by including all positions that are not rejected. Because the distribution under $H_0$ cannot be found analytically for any of the three test statistics, we approximate the null distribution either by simulation or by an asymptotic distribution. For the simulation-based approach, data for multiple markers are simulated jointly, conditional on the affection status and pedigree structure, at each hypothesized disease position $d_0$. From the simulated marker data, the null distribution at that hypothesized position is obtained as a Monte Carlo estimate. The observed test statistic $\lambda_{d_0}$ is then compared with this null distribution to determine whether $d_0$ should be included in the confidence set. It is worth emphasizing that at each hypothesized disease position, data for all markers (multipoint) are simulated, regardless of the marker interval in which the hypothesized disease locus lies. Because there are infinitely many putative disease loci to be tested, a practical strategy is needed to discretize the chromosome so that only a finite number of positions need to be tested without compromising the level of coverage. To further improve the computational efficiency of this simulation-based procedure, an importance sampling (IS) component was also proposed. In the following, we describe the integrated procedure and the asymptotic approaches.
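The simulation-based test inversion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `observed_lod` and `simulate_null_lods` are hypothetical callables standing in for the multipoint linkage computations, the toy functions are not a genetic model, and the sketch assumes that a small LOD at the tested position counts as evidence against that position (a lower-tail test).

```python
import random


def mc_p_value(observed, null_values):
    """Lower-tail Monte Carlo p-value: share of simulated values <= observed.

    Assumption of this sketch: a small LOD(d0) is evidence against d0.
    """
    count = sum(1 for x in null_values if x <= observed)
    return count / len(null_values)


def confidence_set(positions, observed_lod, simulate_null_lods, alpha, n_sims):
    """Keep every tested position whose p-value exceeds alpha."""
    kept = []
    for d0 in positions:
        null_lods = simulate_null_lods(d0, n_sims)
        if mc_p_value(observed_lod(d0), null_lods) > alpha:
            kept.append(d0)
    return kept


# Toy stand-ins: the observed LOD peaks at 10 cM, and the null LOD at any
# tested position is drawn from a normal distribution centred at 2.
rng = random.Random(0)


def toy_observed(d0):
    return 2.0 - 0.05 * (d0 - 10.0) ** 2


def toy_null(d0, n):
    return [rng.gauss(2.0, 0.5) for _ in range(n)]


print(confidence_set(range(0, 21), toy_observed, toy_null, alpha=0.05, n_sims=2000))
```

Every position is tested at the same level α with no multiplicity correction, in line with the argument that only the test at the true locus can produce a type I error.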
An integrated procedure based on LOD
We begin with a broad search for chromosomal regions to be included in or excluded from the confidence set. We refer to this broad search as the adaptive component of the integrated procedure. Specifically, the chromosome of interest is divided into intervals by the genetic markers, and each interval is considered in turn. Each such chromosomal segment is divided into two equal halves. For each half, we test the two end points (L and R) and the mid-point (M), and infer whether L-M and/or M-R should be included in or excluded from the confidence set, based on properties of the LOD scores such as unimodality between two markers [2, 3]. If no inclusion/exclusion decision can be made for an interval (L-M or M-R), it is further divided into two equal halves, until either a decision can be made or the length of the segment falls below a preset threshold.
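The recursive structure of the adaptive step can be sketched as below. The decision rule itself is not spelled out in this section (it relies on LOD properties described in [2, 3]), so `decide` is a hypothetical callable, and `toy_decide` is a placeholder that is explicitly not the authors' rule. The sketch only shows the bisection and the stopping threshold.

```python
def adaptive_search(left, right, decide, min_length=1.0):
    """Bisect [left, right] and classify each half, recursing when needed.

    decide(a, b) is a hypothetical callable returning "include",
    "exclude", or None when no decision can be made for [a, b].
    Undecided pieces shorter than min_length are returned with the
    label "undecided" for later refinement.
    """
    mid = (left + right) / 2.0
    pieces = []
    for a, b in ((left, mid), (mid, right)):
        label = decide(a, b)
        if label is not None:
            pieces.append((a, b, label))
        elif b - a < min_length:
            pieces.append((a, b, "undecided"))
        else:
            pieces.extend(adaptive_search(a, b, decide, min_length))
    return pieces


def toy_decide(a, b):
    # Placeholder rule, NOT the rule of [2, 3]: pretend positions in
    # [4.3, 11.7] pass the test; agreeing end points decide the piece,
    # disagreeing end points leave it undecided.
    inside_a = 4.3 <= a <= 11.7
    inside_b = 4.3 <= b <= 11.7
    if inside_a and inside_b:
        return "include"
    if not inside_a and not inside_b:
        return "exclude"
    return None


for piece in adaptive_search(0.0, 16.0, toy_decide, min_length=1.0):
    print(piece)
```

With a 1 cM threshold, the only pieces left undecided in this toy run are the two short segments that straddle the boundaries of the accepted region.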
One could set the threshold small enough that interpolation based on the two end points of any remaining undecided segment would give a coverage probability close to the nominal level. However, this would be computationally inefficient because a large number of simulated null distributions would have to be constructed. Instead, we used a relatively coarse grid (leading to a threshold of 1 cM) in the adaptive step and adopted an importance sampling strategy to further refine the remaining segments (all shorter than the threshold) without any additional simulation. Specifically, suppose $d_0$ is an interior point of one such segment whose left end point is $d_L$. We would like to test the hypotheses in Eq. (1) to determine whether $d_0$ should be included in the confidence set. Let
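$$X = \mathrm{LOD}(d_0) = \log_{10}\frac{P(G \mid D = d_0)}{P(G \mid D = \infty)}$$
denote the random variable corresponding to the LOD score when the disease locus is hypothesized to be at position $d_0$, where $G$ is the collection of genotypes of all individuals at all marker loci. Then the c.d.f. of $X$ can be written as
$$F_X(x) = P(X \le x \mid D = d_0) = \sum_{G} I\{\mathrm{LOD}(d_0; G) \le x\}\, \frac{P(G \mid D = d_0)}{P(G \mid D = d_L)}\, P(G \mid D = d_L).$$
Thus, the c.d.f. of the LOD score at $d_0$ can be estimated by
$$\hat{F}_X(x) = \frac{1}{N}\sum_{i=1}^{N} I\{\mathrm{LOD}(d_0; G_i) \le x\}\, \frac{P(G_i \mid D = d_0)}{P(G_i \mid D = d_L)},$$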
which makes use of the $N$ sets of simulated marker data ($G_i$, $i = 1, \ldots, N$) with the disease locus hypothesized to be at $d_L$. Note that these simulated marker data are available from the adaptive step, and thus no additional simulations are needed. The importance sampling weight, $P(G_i \mid D = d_0)/P(G_i \mid D = d_L)$, can be shown after some algebra to equal $10^{\mathrm{LOD}(d_0; G_i) - \mathrm{LOD}(d_L; G_i)}$, and thus can be easily calculated. We then test the inclusion/exclusion of $d_0$ based on this estimated null distribution. Our simulation study [2] indicated accurate estimation with substantial gains in computational efficiency, because no additional simulations are needed to estimate the distributions at the interior points of a segment after the adaptive step. Additional efficiency can be gained by also using the simulated marker data at the right end point [2].
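The importance sampling estimate can be sketched as follows. Because each LOD score is a base-10 log-likelihood ratio against the same unlinked likelihood, the weight P(G_i|D = d0)/P(G_i|D = d_L) can be written as 10 raised to the difference of the two LOD scores computed on G_i, and the sketch uses that identity. The three-valued "genotype" and its probabilities are hypothetical and chosen only so that the exact answer is known; the function name is likewise an assumption.

```python
import math
import random


def is_cdf_estimate(x, lods_at_d0, lods_at_dL):
    """Estimate P(LOD(d0) <= x | D = d0) from data simulated under d_L.

    lods_at_d0[i] and lods_at_dL[i] are LOD(d0) and LOD(d_L) computed on
    the i-th simulated data set G_i (disease locus placed at d_L).
    The weight 10**(LOD_i(d0) - LOD_i(d_L)) equals
    P(G_i | D = d0) / P(G_i | D = d_L).
    """
    total = 0.0
    for lod0, lod_left in zip(lods_at_d0, lods_at_dL):
        if lod0 <= x:
            total += 10.0 ** (lod0 - lod_left)
    return total / len(lods_at_d0)


# Toy check with a three-valued "genotype" G (hypothetical probabilities).
p_unlinked = [1 / 3, 1 / 3, 1 / 3]  # P(G | D = infinity)
p_dL = [0.5, 0.3, 0.2]              # P(G | D = d_L)
p_d0 = [0.2, 0.3, 0.5]              # P(G | D = d0)

rng = random.Random(1)
sims = rng.choices([0, 1, 2], weights=p_dL, k=20000)  # G_i drawn under d_L
lods_d0 = [math.log10(p_d0[g] / p_unlinked[g]) for g in sims]
lods_dL = [math.log10(p_dL[g] / p_unlinked[g]) for g in sims]

# Exact value for comparison: P(LOD(d0) <= -0.04 | D = d0) = 0.2 + 0.3 = 0.5
print(is_cdf_estimate(-0.04, lods_d0, lods_dL))
```

No data are simulated under d0 itself: the draws made at the left end point are reweighted, which is where the computational saving comes from.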
Asymptotic approaches based on GLRT and GLRT/MA
When the sample size is moderate or large and/or the family structures are not extremely heterogeneous, we can approximate the null distribution of the GLRT by a $\chi^2$ distribution. Because this approximation is computationally efficient, one can further take model uncertainty into account by using the test statistic GLRT/MA, whose limiting distribution we approximate by a weighted sum of independent $\chi^2$ variables, with the cautionary note that the actual asymptotic distribution may be more complicated because the component $\chi^2$ variables are dependent.
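A tail probability under a weighted sum of independent chi-square variables has no simple closed form, but it is easy to approximate by Monte Carlo. The sketch below is an illustration under stated assumptions: the degrees of freedom of each component (`df`, defaulting to 1) and the example weights are not given in this section and are hypothetical, and the components are treated as independent, which the text itself flags as an approximation.

```python
import random


def weighted_chi2_p_value(observed, weights, df=1, n_draws=20000, seed=0):
    """Monte Carlo upper-tail p-value for sum_i weights[i] * chi2_df.

    Each chi-square draw is built as a sum of df squared standard
    normals. df=1 is an assumption of this sketch.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        total = 0.0
        for w in weights:
            chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            total += w * chi2
        if total >= observed:
            exceed += 1
    return exceed / n_draws


# Hypothetical model weights P(M_i) and an observed GLRT/MA value of 4.0.
print(weighted_chi2_p_value(4.0, [0.6, 0.2, 0.2]))
```

A tested position would be excluded from the confidence set when this p-value falls below α, since large values of the statistic indicate that the likelihood at the tested position is far below its maximum.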