Haseman-Elston method
The simple regression method of Haseman and Elston [8] offers an effective framework for studying linkage between markers and disease. Later, Elston et al. [7] proposed modifications to the original Haseman-Elston method [8] to improve its power. It is based on regressing the mean-corrected cross-product of the sibling pair's trait values Y1j and Y2j, CPj = (Y1j - m)(Y2j - m), where m is the trait mean, on the estimated proportion of alleles shared identical by descent (IBD) by the sibling pair.
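To make the regression concrete, the following is a minimal MATLAB sketch for a single marker; the variable names y1, y2, and pihat are hypothetical (sib-1 trait values, sib-2 trait values, and the estimated IBD proportions for each pair, respectively).

% Revised Haseman-Elston step for one marker (illustrative only).
m  = mean([y1; y2]);                    % overall trait mean
CP = (y1 - m) .* (y2 - m);              % mean-corrected cross-product per sib pair
b  = [ones(size(pihat)), pihat] \ CP;   % least squares: CP = b(1) + b(2)*pihat
% Evidence of linkage corresponds to a positive slope b(2).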
The model and prior specifications
Assume that there are p markers on the whole genome and n samples (sib pairs). We then form a model that includes a number of different marker loci to study their simultaneous effects, which is best approached in a linear regression fashion such as
yj = μ + x1jβ1 + x2jβ2 + ... + xpjβp + εj   (1)

where μ is the mean, yj is the observed phenotypic value (CPj) for the jth sibling pair, xij is the proportion of IBD at the ith marker in the jth sample, and βi is the effect of the ith marker. The error term is assumed to follow ε ~ N(0, φ^(-1)I), with φ being a precision parameter. To explore promising subsets (sets of markers having evidence of linkage) over the entire model space efficiently, a binary indicator γi is used to represent exclusion or inclusion of the ith marker in the model [1]. A model is then represented by γ = (γ1,..., γp), and Eq. (1) can be reduced to the pγ = Σi γi included variables by ignoring the columns of X for which γi = 0. We denote the corresponding design matrix as Xγ and the coefficient parameters as βγ. When epistases are considered, the indicator vector is extended to γ = (γ1,..., γp, γ(1, 2),..., γ(i, j),..., γ(p-1, p)), where γ(i, j) is an indicator of an epistasis between the ith and jth markers, and Eq. (1) is extended by adding a term for each epistatic effect between loci j1 and j2 ≤ p. Therefore, a general model describing both main effects and epistases can be written as Y = μ + Xγβγ + ε, where some of the columns of Xγ are formed by multiplying columns of X to build the design-matrix columns for epistases, and βγ = [β1,..., βp, β1,2,..., βp-1,p].
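As an illustration of how the design matrix Xγ can be assembled, the MATLAB sketch below uses hypothetical names: X is the n-by-p matrix of IBD proportions, gmain is a logical vector of main-effect indicators, and gepi is a logical p-by-p matrix whose upper triangle marks included epistases.

% Build the reduced design matrix X_gamma for a given indicator configuration.
Xg = X(:, gmain);                           % columns for included main effects
[ii, jj] = find(triu(gepi, 1));             % index pairs of included epistases
for k = 1:numel(ii)
    Xg = [Xg, X(:, ii(k)) .* X(:, jj(k))];  % interaction column for markers (i, j)
end
% The working model is then Y = mu + Xg*beta_gamma + eps, as in the text.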
The prior distribution for the unknown parameters Φγ = (βγ, γ, φ^(-1)) can be decomposed as π(βγ, γ, φ^(-1)) = π(βγ | γ, φ^(-1)) π(γ) π(φ^(-1)) under a simple independence assumption. We assume the prior on Φγ to be in the conjugate normal-gamma family, namely,
where c is an unknown positive scalar.
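The displayed prior itself is not recoverable here; purely for orientation, one common conjugate normal-gamma specification consistent with the decomposition above (a g-prior-type choice, not necessarily the one used in this work) is

βγ | γ, φ ~ N(0, c φ^(-1) (Xγ′Xγ)^(-1)),    φ ~ Gamma(a, b),

in which the scalar c plays the role of the unknown positive constant mentioned above.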
Posterior inference
The statistical inference for the identification of marker loci having evidence of linkage is retrieved through the posterior distribution of γ, which is given by Bayes' rule, π(γ|Y) ∝ π(γ)f(Y|γ). After the nuisance parameters βγ and φ^(-1) are integrated out of the likelihood, the posterior simplifies to
(2)
When a large number of markers is involved, Eq. (2) is estimated via MCMC algorithms by simulating samples from the posterior distribution without knowing the normalizing constant, but at the risk of false inferences and initialization biases. We instead use perfect sampling, described as follows.
Posterior simulation via perfect sampling
Under a non-epistatic model, γ = (γ1,..., γp), for example, we simulate samples from Eq. (2) by updating γ in a component-wise manner. Each component γi is chosen consecutively or via a random permutation of its index (1,..., p). Then the probability that γi is set to 1, conditional on the other most recently updated components, is given by a Bernoulli trial such as
(3)
where γ(-i) = (γ1,..., γi-1, γi+1,..., γp) and γ(i) = (γ1,..., γi-1, 1, γi+1,..., γp). There are 2^(p-1) possible configurations of Eq. (3). The original perfect sampling method, CFTP [5], entails running parallel chains from every possible 2^(p-1) state from the past time -T to 0 repeatedly until coalescence is achieved. However, our approach does not require running all of these chains, based on the following two ideas.
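A minimal MATLAB sketch of one such component-wise update is given below; logpost is a hypothetical helper that evaluates the unnormalized log posterior of Eq. (2) for a given indicator vector.

function gamma = update_component(gamma, i, logpost)
% One Bernoulli update of gamma_i given the other components (Eq. (3)).
g1 = gamma; g1(i) = 1;                             % configuration with gamma_i = 1
g0 = gamma; g0(i) = 0;                             % configuration with gamma_i = 0
p1 = 1 / (1 + exp(logpost(g0) - logpost(g1)));     % P(gamma_i = 1 | rest, Y)
gamma(i) = double(rand() < p1);                    % Bernoulli draw
end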
First, for t ∈ [-T, 0), instead of attempting to run all possible chains, we construct sandwich distributions that bound all possible values of Eq. (3), such as
(4)
so that an update is done only on these two distributions. This is because the coupling of these sandwich distributions implies the coalescence of all other chains in between. Second, rather than tracing each γi exactly, we keep track of its support, i.e., the set of values it can still take, which further reduces the computational burden. That is, for a random seed generated from a uniform distribution on (0, 1), if the seed yields the same update under both sandwich distributions, the value of γi is taken as determined and its support is assigned that single value. On the other hand, if the seed falls between the two bounds, the update is indeterminate and the support records both values, {0, 1}. Then, for all i = 1, 2,..., p, an updating rule is formulated as
(5)
Coalescence is achieved when all supports have settled at time 0, i.e., the support of every γi contains exactly one value. This procedure is implemented iteratively as follows. At T = -1, for each element of the current support St, we determine the two sandwich distributions and update S0 based on Eq. (5). If coalescence is achieved, the support S0 is reported as a draw from the target posterior in Eq. (2). Otherwise, we move back to T = -2, repeat the updating for t ∈ [-2, 0], and check for coalescence at 0. The whole procedure is repeated, and a sample is drawn only if coalescence occurs at 0; otherwise, the starting time is shifted further back, preferably to -2T [5], and the updates are performed by reusing the same random seeds, which is critical to preclude the supports from growing [5]. One of the main keys of our method is the construction of the two bounds in Eq. (4). We have recently proposed how to build these bounds approximately so that perfect sampling succeeds even in high-dimensional spaces; the manuscript may be obtained upon request.
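The support update of Eq. (5) can be sketched in MATLAB as follows; plo and phi are hypothetical names for the lower and upper sandwich probabilities bounding Eq. (3), and u is the shared uniform seed for this component and time step.

function S = update_support(u, plo, phi)
% Support of gamma_i after one sandwiched update (Eq. (5)).
if u < plo
    S = 1;          % both bounding chains set gamma_i = 1
elseif u >= phi
    S = 0;          % both bounding chains set gamma_i = 0
else
    S = [0 1];      % indeterminate: keep both possible values
end
end
% Coalescence at time 0 means every support is a single value; otherwise the
% start time is doubled backwards and the same seeds u are reused.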
Model space prior for epistases
To account for epistatic effects, we consider two different model space priors π(γ). An independence prior is usually used when the markers are believed to influence the trait entirely independently of each other. In this case, we have
(6)
where w1 and w2 are hyper-priors for the inclusion of main effects and epistases, respectively. It is reasonable to choose w2 ≤ w1 ≤ 0.5. Alternatively, we can embed a dependence structure between main effects and epistases [9], such as
where the conditional probability for an epistasis to be included, i.e., for γ(i, j) = 1, takes on four different values depending on the main effects,
(7)
This dependent relationship can be advantageous in that we can reduce the size of the model space by restricting which epistases may enter the model. For example, if we believe that an epistasis should be considered only if at least one of the main effects is significant, we let w00 = 0. However, because we might then miss marker loci that affect the phenotype primarily through epistasis, it may be more reasonable to have 0 ≤ w00 ≤ w01 ≤ w11 ≤ 0.5. The hyper-priors, (w1, w2) in Eq. (6) and (w11, w01, w00) in Eq. (7), indirectly control the expected numbers of effects in the model. Therefore, small values are essential because we expect only a small number of markers to be linked to the trait.
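For illustration, the two model space priors can be evaluated as follows in MATLAB; this is a hedged sketch in which gmain is a logical vector of main-effect indicators and gepi a logical p-by-p matrix whose upper triangle marks included epistases.

function lp = logprior_indep(gmain, gepi, w1, w2)
% Independence prior in the spirit of Eq. (6): each effect enters independently.
p = numel(gmain);
e = gepi(triu(true(p), 1));                       % epistasis indicators for pairs i < j
lp = sum(gmain)*log(w1) + sum(~gmain)*log(1 - w1) ...
   + sum(e)*log(w2)     + sum(~e)*log(1 - w2);
end

function lp = logprior_dep(gmain, gepi, w1, w11, w01, w00)
% Dependent prior in the spirit of Eq. (7): the inclusion probability of an
% epistasis depends on how many of its two main effects are in the model.
p = numel(gmain);
w = [w00, w01, w11];
lp = sum(gmain)*log(w1) + sum(~gmain)*log(1 - w1);
for i = 1:p-1
    for j = i+1:p
        wk = w(gmain(i) + gmain(j) + 1);          % 0, 1, or 2 included main effects
        if gepi(i, j), lp = lp + log(wk); else, lp = lp + log(1 - wk); end
    end
end
end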
Selection criterion
After we collect samples using perfect sampling, markers that are tightly linked to the trait genes are identified through estimates of marginal posterior probabilities. To this end, we simply count the relative frequencies of model visits in the samples, and the marginal posterior probability that the ith marker is important is estimated by summing the posterior over models containing this marker. We then list the estimated marginal posterior probabilities in numerical order, and their patterns are used to gauge the importance of effects. In making this decision, the model space priors (w1, w2) in Eq. (6) and (w11, w01, w00) in Eq. (7) play an important role as threshold values: if the marginal posterior probability of a marker exceeds these values, we declare the corresponding effect significant.
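A minimal sketch of this selection step in MATLAB, assuming G is a hypothetical m-by-(number of effects) matrix whose rows are the perfect-sampling draws of γ and w1 is the prior inclusion probability used as the threshold for main effects:

probs    = mean(G, 1);            % estimated marginal posterior inclusion probabilities
selected = find(probs > w1);      % effects whose posterior probability exceeds the prior
% Analogously, epistatic columns of G would be compared against w2 (or w11, w01, w00).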
Data
We used rheumatoid factor IgM as the quantitative trait, with microsatellite scans for 511 multiplex families over the 22 autosomal chromosomes. The IBD values were obtained using the statistical software MERLIN [10]. A total of 590 independent sib pairs and 407 microsatellite markers were used in the analysis.
Our programs were written in MATLAB, and each was run on a Super Macintosh G5 with a 2.66-GHz quad processor.