A Bayesian genome-wide linkage analysis of quantitative traits for rheumatoid arthritis via perfect sampling.

Rheumatoid arthritis is a complex disease caused by a combination of genetic, environmental, and hormonal factors, and their additive and/or non-additive effects. We performed a linkage analysis to provide evidence of rheumatoid factor IgM on linkage, based on Bayesian variable selection coupled with the new Haseman-Elston method. For statistical inferences to estimate unknown parameters, we utilized the perfect sampling algorithm, an emerging simulation technique that alleviates concerns over convergence and sampling mixing. Our methods provide powerful and conceptually simple approaches to simultaneous genome scans of main effects and all possible pairwise interactions. We apply them to the Genetic Analysis Workshop 15 data (Problem 2) provided by the North American Rheumatoid Arthritis Consortium (NARAC).


Background
Rheumatoid arthritis (RA) is a clinically heterogeneous disorder with variability in severity, disease course, and response to therapy. Although the exact cause of rheumatoid arthritis is still unknown, RA is known to be a complex disease caused by a combination of genetic, environmental, and hormonal factors, and their additive and/or non-additive effects (epistases or gene × environment interactions). Genetic risk factors not only determine susceptibility for the disease but also correlate with disease severity and phenotype. Among phenotypes, rheumatoid factor IgM is a significant and common measure for diagnosis of RA. Therefore, genetic linkage analyses of IgM levels may reveal major differences in chromosomal regions showing evidence for linkage.
While recent interest has been focused on genome scans using a large number of marker loci, the common approaches of existing statistical methods produce often inconsistent results. This is due in part to the fact that they test markers one after another and fail to capture the substantial information of epistases among disease loci. The use of Bayesian model selection has been the popular method of remedying the pitfalls of conventional methods in recent years, whereby identifying loci with significant effects is viewed as a model selection problem. Unlike conventional methods suggesting a single best model, Bayesian methods consider multiple possible models along with their probabilities to incorporate model uncertainty. One of the powerful Bayesian model selection approaches is the use of stochastic search variable selection (SSVS) [1][2][3][4], in which Markov chain Monte Carlo (MCMC) sampling algorithms are used to sample from the posterior distributions, thus making identification of promising subsets even for many candidate variables (markers) feasible.
Although Bayesian approaches with MCMC techniques have made intensive computations possible and efficient on large-scale data sets arising in modern genomic and genetic applications, an application of Bayesian model selection is still quite challenging and limited from both a computational standpoint as well as the sensitivity to the choice of prior distributions. The usage of MCMC has been often controversial due to the uncertainty of convergence and the dependence on starting positions. In addition, the samples obtained by MCMC are correlated, which can drastically reduce the efficiency of the approaches. These drawbacks of MCMC, however, can be overcome by perfect sampling, which was first proposed by Propp and Wilson [5] under the name of coupling from the past (CFTP). Perfect sampling uses a scheme of coupling chains in order to guarantee that samples are exactly from the target distribution of interest. The basic idea is to run coupled chains that start from all initial states from the past time -T and run them to time 0, in which at any instant of time t ∈ [-T, 0), the same random seed and an updating function are applied to all possible chains. Once all the chains meet (coalesce), from this time onward, due to the common random seed and an updating function, they follow the same path, and at a time 0 they arrive at the same state, which is then an exact sample from the posterior distribution. Therefore, this procedure guarantees that the effect of initial states wears off, yielding an exact sample regardless of starting values. Although perfect sampling suggests the ideal approach to draw an exact sample, the framework of running chains from all possible states is almost infeasible because of the large number of markers involved.
Motivated by Huang and Djuric [6], we propose an efficient implementation of perfect sampling for high-dimensional data. Then, coupled with the new Haseman-Elston method [7], we carry out screening to identify susceptibility alleles that are more closely linked to rheumatoid factor IgM. We further evaluate their possible epistases. Most existing methods adopt a two-stage procedure to screen epistases, in which epistases are only considered for previously selected markers with significant main effects, and thereby they are bound to miss important loci whose effects influence a trait primarily through epistasis. In contrast, we perform an efficient simultaneous screening both on main effects and epistases. Our methods can handle large problems involving up to thousands of markers without any strict conditions in a reasonable running time. We apply these methods to the RA data of Genetic Analysis Workshop 15 (GAW15) (Problem 2).

Haseman-Elston method
The simple regression method of the Haseman-Elston [8] offers an effective framework for studying linkage between markers and disease. Later, Elston et al. [7] proposed modifications to the original Haseman-Elston method [8] to improve its power. It is based on regression of the squared sum of mean-centered trait values, CP j = (Y 1jm)(Y 2j -m), with mean m on the estimated proportion of alleles shared identically by decent (IBD) by the sibling pair, Y 1j , Y 2j .

The model and prior specifications
Assume that there are p markers on the whole genome with n dependent data (samples). Then, we form a model that includes a number of different marker loci to study their simultaneous effects, which can be best approached in a linear regression fashion such as where μ is the mean, y j is an observed phenotypic value (CP j ) for each sibling pair, x ij is a proportion of IBD for i th marker in j th sample, and β i is an effect of the i th marker.
The variance of the trait is assumed to be ε ~ N(0, φ -1 I), with φ being a precision parameter. To explore promising subsets (a set of markers having evidence of linkage) over the entire model space efficiently, a binary indicator γ i is used to represent an exclusion or inclusion of i th marker in the model [1]. Then a model is represented by γ = (γ 1 ,..., γ p ) and Eq. (1) can be reduced to the p γ = I p γ variables by ignoring columns of X for which γ i = 0. We denote the corresponding model as X γ and coefficient parameters as β γ .

Posterior inference
The statistical inference for the identification of marker loci having evidence of linkage is retrieved through the posterior distribution of γ, which is given by Bayes' rule, π (γ|Y) ∝ π (γ)f(Y|γ). After nuisance parameters β γ and φ -1 are integrated out from the marginalized likelihood, it is simplified to When there are a large number of markers involved, Eq.
(2) is estimated via MCMC algorithms by simulating samples from the posterior distribution without knowing the normalizing constant, but at a risk of false inferences and being subject to initialization biases. We use perfect sampling described as follows.

Posterior simulation via perfect sampling
Under a non-epistatic model, γ = (γ 1 , K, γ p ), for example, we simulate samples from Eq. The original perfect sampling method, CFTP [5], entails running parallel chains from every possible 2 p-1 state from the past time -T to 0 repeatedly until it achieves coalescence. However, our approaches do not require running all these chains based on two following ideas. we decide two sandwich distributions and update S 0 based on Eq. (5). If the coalescence is achieved, a support S 0 is reported as a draw from the target posterior in Eq. (2). Otherwise, we move back at T = -2 and repeat updating for t ∈ [-2, 0], and then check coalescence at 0. The whole procedure is repeated, and a sample is drawn only if coalescence occurs at 0. Otherwise, the starting time is shifted further back, preferably at -2T [5] and the updates perform by reusing the same random seed, which is critical to preclude the space from growing [5]. One of the main keys in our methods is to construct two bounds, and . We have recently proposed how to build these bounds, approximately to succeed the perfect sampling even for high dimensional spaces. The manuscript may be obtained upon request.

Model space prior for epsitases
To account for epistatic effects, we consider two different model space priors of π (γ). An independence prior is usually used when it is believed that effects of markers influ- This dependent relationship can be advantageous in that we can reduce the size of the model space by limiting certain epistases to be included in the model. For example, if we believe that an epistasis should be considered only if at least one of the main effects is significant, we let w 00 = 0. However, because we might miss important marker loci that might affect a phenotype primarily through epistasis, it may be more reasonable to have 0 ≤ w 00 ≤ w 01 ≤ w 11 ≤ 0.5. The hyper-priors, (w 1 , w 2 ) in Eq. (6) and (w 11 , w 01 , w 00 ) in Eq. (7), can indirectly control the expected numbers of effects in the model. Therefore, small values are essential because we expect there are a small number of markers linked to the trait.

Selection criterion
After we collect samples using perfect sampling, the identification of markers that are tightly linked to the genes is given by estimates of marginal posterior probabilities. To this end, we simply count the relative frequencies of model visits in the samples, and the marginal posterior of the i th marker being important is estimated by summing over the posterior of models containing this marker. Then, we list the estimates of marginal posterior probabilities in a numerical order. Their patterns are used to gauge the importance of effects. When the decision is made, the model space prior (w 1 , w 2 ) in Eq. (6) and (w 11 , w 01 , w 00 ) in Eq. (7) play an important role as threshold values. If the marginal posterior probability of the marker is higher than these values, we decide that the corresponding effect of this marker is significant.

Data
We used rheumatoid factor IgM as the quantitative trait values and microsatellite scans for 511 multiplex families over the 22 autosomal chromosomes. The IBD values were obtained using the statistical software MERLIN [10]. A total of 590 independent sib pairs and 407 microsatellite markers were used in the analysis.
Our programs were written in MATLAB and each was run on Super Macintosh G5 with a 2.66 GHz quad-processor.

Choice of hyper-parameters
Before we ran perfect sampling, we had to decide hyperparameters (c, w 1 ) under a non-epistatic model, and (c, w 1 , w 2 ) in Eq. (6) or (c, w 11 , w 01 , w 00 ) in Eq. (7) under an epistatic model. To specify these values, rescaling and sensitivity analysis may be advisable. We checked the sensitivity of our methods toward the choice of c by re-running our algorithms for several values of c between 1 and 10.
The results were not sensitive (data not shown). Choosing (w 1 , w 2 ) and (w 11 , w 01 , w 00 ) in the model space prior is rather straightforward. A smaller value should provide smaller estimates of marginal posteriors, but our results were robust to these values since we took them as threshold values to select important effects. Therefore, in this paper, we only reported the results by fixing (c, w 1 ) = (5, 0.01) for a non-epistatic model and (c, w 1 , w 2 ) = (5, 0.01, 0.01) in Eq. (6) and (c, w 11 , w 10 , w 00 ) = (5, 0.01, 0.01, 0.005) in Eq. (7) for an epistatic model for comparison.

Main effects
We first performed screening of main effects only. We collected 500 samples from perfect sampling. The average coupling time to achieve coalescence for one sample was about 1 minute. Figure 1 displays an empirical frequency of each effect to estimate a marginal posterior probability, π (λ i = 1|Y). We found the highest peak on chromosome 6, and suggestive peaks on chromosomes 2, 4, 5, 11, 19, and 21, which had estimated marginal probabilities greater than w 1 = 0.01.

Total effects (main and interaction effects)
To assess the evidence for epsitases, we included all main effects and two-way pairwise interaction terms in the model. Therefore, the total number of effects considered was 83,028. We compared two different assumptions Eq. (6) and Eq. (7). We collected 500 samples. The average coupling time to achieve coalescence for one sample was about 25 minutes under Eq. (6) and 21 minutes under Eq. (7). The summary of the results is given in Table 1. The same significant main effects as in the above "main effects" were found. Additionally, we found three suggestive interactions effects between chromosomes 6 and 16, 6 and 19, and 6 and 21 under both Eq. (6) and Eq. (7) prior assumptions.

Conclusion
We have applied Bayesian variable selection via perfect sampling to the RA data of GAW15 to identify markers linked to rheumatoid factor IgM. Our methods can accommodate a large number of markers, permit epistatic effects to be considered in the models, and evaluate all effects simultaneously. Therefore, they have significant advantages over the classic approaches. As opposed to other Bayesian methods, our methods do not require any tunings relating to convergence issues of MCMC techniques and there is no dependence on initial values. Therefore, they are reliable even from a small number of drawn samples.
Our analyses have revealed that there is a strong evidence for main effects on chromosome 6, and also marginal evidence for epistases between chromosomes 6 and 16, 6 Marginal posterior probabilities of component λ Figure 1 Marginal posterior probabilities of component λ. The highest peak on chromosome 6, and suggestive peaks on chromosomes 2, 4, 5, 11, 19, and 21. All had estimated marginal probabilities higher than the model prior w = 0.01 (dotted line).