Bayesian variable selection with a probit model
A binary probit model is used to implement the BVS method. Let (y, X) denote the observed data, where y is an n × 1 dichotomous outcome vector coded as 1 or 0 for RA affected or RA unaffected, respectively, and X is the n × p predictor matrix. Let z be an n × 1 vector of latent variables, in which each z_i, associated with the categorical outcome y_i, is described by a linear regression model:

z_i = X_i β + ε_i,  ε_i ~ N(0, σ²),  i = 1, ..., n.  (1)
The relationship between z_i and y_i is defined by y_i = 1 if z_i > 0 and y_i = 0 otherwise. The likelihood function of the model defined in Eq. (1) may be written as:

f(z|β, σ) = N_n(Xβ, σ²I).  (2)
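To make the latent-variable formulation concrete, the following minimal Python sketch simulates data from the model in Eqs. (1) and (2); the dimensions and coefficient values are arbitrary illustrations, not those of the GAW15 data.

```python
# Minimal simulation of the latent-variable probit model in Eqs. (1)-(2).
# n, p, and beta are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))                 # predictor matrix (n x p)
beta = np.zeros(p)
beta[[2, 7]] = 1.5                              # two "causal" predictors
z = X @ beta + rng.normal(0.0, 1.0, size=n)     # latent z_i = X_i beta + eps_i
y = (z > 0).astype(int)                         # y_i = 1 iff z_i > 0
```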
The variable selection problem arises from the fact that it would be preferable to exclude some unknown subset of the predictors that have negligible influence on the outcome. Thus, statistical models for the variable selection problem can be represented by a selection vector, a set of binary indicator variables γ = (γ_1, ..., γ_p), where γ_j = 1 or 0 corresponds to inclusion or exclusion of predictor j in the model, respectively. The prior distribution of the model indicator variables, π(γ), is chosen to reflect prior belief about whether particular SNPs are associated with RA status, in our case. A reasonable choice of prior is to take the γ_j (j = 1, ..., p) to be independent with probability π(γ_j = 1) = 1 − π(γ_j = 0) = p_j, thus

π(γ) = ∏_{j=1}^{p} p_j^{γ_j} (1 − p_j)^{1−γ_j}.  (3)
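As a small illustration, the log of the prior in Eq. (3) can be evaluated for any selection vector as below; the common inclusion probability p_j = 0.05 is an arbitrary choice for the sketch.

```python
# Log of the independent Bernoulli prior pi(gamma) in Eq. (3).
# A single inclusion probability pj is assumed for all SNPs here.
import numpy as np

def log_prior(gamma, pj=0.05):
    gamma = np.asarray(gamma)
    return np.sum(gamma * np.log(pj) + (1.0 - gamma) * np.log(1.0 - pj))
```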
The residual variance σ² for the γth model is modeled as a realization from an inverse gamma prior:

π(σ²|γ) = IG(ν/2, νλ_γ/2),  (4)

which is equivalent to νλ_γ/σ² ~ χ²_ν. Because the value of the selection vector, γ, is of interest and is unknown, the uncertainty underlying variable selection can be modeled by a mixture prior:

π(β, σ, γ) = π(β|σ, γ) π(σ|γ) π(γ).  (5)
The posterior distribution of (β, σ, γ) is proportional to the product of the likelihood function of the model in Eq. (2) and the prior defined in Eq. (5):

π(β, σ, γ|z) ∝ f(z|β, σ) π(β|σ, γ) π(σ|γ) π(γ).  (6)

Therefore, integrating out β and σ from Eq. (6) yields the posterior distribution of the selection vector γ, up to a normalizing constant:

π(γ|z) ∝ g(γ) ≡ π(γ) ∫∫ f(z|β, σ) π(β|σ, γ) π(σ|γ) dβ dσ.  (7)
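With a conjugate normal prior on the coefficients of the selected predictors, this double integral has a closed form, which is what makes the Metropolis ratio below cheap to evaluate. As a sketch, assuming β_γ|σ², γ ~ N(0, cσ²I) with a hyperparameter c (the specific prior used here is the one given in Sha et al. [2]), standard conjugate calculations give

g(γ) ∝ π(γ) c^{−|γ|/2} |X_γ'X_γ + I/c|^{−1/2} (νλ_γ + z'z − z'X_γ(X_γ'X_γ + I/c)^{−1}X_γ'z)^{−(n+ν)/2},

where X_γ denotes the columns of X selected by γ and |γ| the number of selected predictors.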
Based on this setting, a Metropolis algorithm with Gibbs sampling is used to sample (γ, z) as follows: 1) Metropolis step: sample γ from π(γ|z) ∝ g(γ), accepting a proposed move with probability min{g(γ^new)/g(γ^old), 1}; 2) Gibbs step: (z|γ, X, y) has a truncated normal distribution.
To generate each transition from γ^old to γ^new, the Metropolis algorithm uses the deletion, addition, and swapping moves discussed by Brown et al. [4]. Details of the prior specification, the posterior distribution, and the updating procedure can be found in George and McCulloch [5] and Sha et al. [2].
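A minimal Metropolis-within-Gibbs sketch of this sampler is given below. For concreteness it assumes the conjugate prior β_γ|σ², γ ~ N(0, cσ²I) so that g(γ) has the closed form shown after Eq. (7); the hyperparameters (c, ν, λ, p_j), the initialization, the move probabilities, and the helper names (log_g, propose, bvs) are illustrative choices for this sketch, not the settings of Sha et al. [2].

```python
# Sketch of the Metropolis-within-Gibbs BVS sampler; assumes the conjugate
# prior beta_gamma | sigma^2, gamma ~ N(0, c * sigma^2 * I). Hyperparameters
# (c, nu, lam, pj) and all tuning choices are illustrative.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def log_g(gamma, z, X, c=10.0, nu=2.0, lam=1.0, pj=0.05):
    """log g(gamma): prior pi(gamma) times the marginal of z with beta and
    sigma integrated out (the closed form shown after Eq. (7))."""
    n, k = len(z), int(gamma.sum())
    lp = k * np.log(pj) + (len(gamma) - k) * np.log(1.0 - pj)
    if k == 0:
        quad, logdet = z @ z, 0.0
    else:
        Xg = X[:, gamma.astype(bool)]
        A = Xg.T @ Xg + np.eye(k) / c
        b = Xg.T @ z
        quad = z @ z - b @ np.linalg.solve(A, b)
        logdet = np.linalg.slogdet(A)[1] + k * np.log(c)
    return lp - 0.5 * logdet - 0.5 * (n + nu) * np.log(nu * lam + quad)

def propose(gamma):
    """Addition/deletion (flip one gamma_j) or swapping move (Brown et al. [4])."""
    g = gamma.copy()
    inc, exc = np.flatnonzero(g == 1), np.flatnonzero(g == 0)
    if rng.random() < 0.5 or len(inc) == 0 or len(exc) == 0:
        j = rng.integers(len(g))
        g[j] = 1 - g[j]                                   # add or delete one SNP
    else:
        g[rng.choice(inc)], g[rng.choice(exc)] = 0, 1     # swap two SNPs
    return g

def bvs(X, y, n_iter=1000, c=10.0, nu=2.0, lam=1.0):
    n, p = X.shape
    gamma = np.zeros(p, dtype=int)
    z = np.where(y == 1, 0.5, -0.5)               # crude starting latent values
    samples = []
    for _ in range(n_iter):
        # 1) Metropolis step: accept with prob min{g(gamma_new)/g(gamma_old), 1}
        cand = propose(gamma)
        if np.log(rng.random()) < log_g(cand, z, X, c, nu, lam) - log_g(gamma, z, X, c, nu, lam):
            gamma = cand
        # 2) Gibbs step: draw (sigma^2, beta) | (z, gamma), then z | (beta, y)
        k = int(gamma.sum())
        if k > 0:
            Xg = X[:, gamma.astype(bool)]
            Ainv = np.linalg.inv(Xg.T @ Xg + np.eye(k) / c)
            bhat = Ainv @ (Xg.T @ z)
            s = nu * lam + z @ z - z @ (Xg @ bhat)
            sig2 = s / (2.0 * rng.gamma((n + nu) / 2.0))  # inverse-gamma draw
            mu = Xg @ rng.multivariate_normal(bhat, sig2 * Ainv)
        else:
            sig2 = (nu * lam + z @ z) / (2.0 * rng.gamma((n + nu) / 2.0))
            mu = np.zeros(n)
        sd = np.sqrt(sig2)
        lo = np.where(y == 1, -mu / sd, -np.inf)  # z_i > 0 when y_i = 1
        hi = np.where(y == 1, np.inf, -mu / sd)   # z_i <= 0 when y_i = 0
        z = truncnorm.rvs(lo, hi, loc=mu, scale=sd, random_state=rng)
        samples.append(gamma.copy())
    return np.array(samples)
```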
Iterative Bayesian variable selection
As mentioned, BVS requires long runs, and the Metropolis algorithm can take a long time to find a suitable subset of SNPs; in the worst case, BVS might fail to provide a promising subset at all. To overcome these problems, we propose running BVS iteratively with a relatively small number of iterations per run, a procedure we term IBVS, to speed up the search for promising subsets of SNPs. There are two basic ideas behind IBVS. First, if γ_j is not significant at an early stage of a long BVS run, then the jth marker tends to be excluded from the final model. Second, a model with high posterior probability is more likely to appear at an early stage of the run. These observations justify applying BVS repeatedly with short runs rather than once with a long run. IBVS is implemented by the following steps (a sketch is given after the list): 1) Run BVS starting with the full model, i.e., the model containing all SNPs. 2) Choose a model for the next BVS run, e.g., the model with the highest posterior probability. 3) Repeat Steps 1 and 2 with the model chosen in Step 2 until a certain number of SNPs remain in the model.
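The sketch below wraps the bvs() sampler from the previous section in the IBVS loop. Treating the most frequently visited model in each short run as the highest-posterior-probability model, and the default stopping size of 10 SNPs, are illustrative implementation choices.

```python
# Sketch of the IBVS outer loop around the bvs() sampler above.
import numpy as np

def ibvs(X, y, target=10, n_iter=500):
    active = np.arange(X.shape[1])            # Step 1: start from the full model
    while len(active) > target:
        samples = bvs(X[:, active], y, n_iter=n_iter)
        # Step 2: visit frequency as a proxy for posterior model probability
        models, counts = np.unique(samples, axis=0, return_counts=True)
        best = models[counts.argmax()].astype(bool)
        if best.sum() == 0 or best.all():     # degenerate choice; stop early
            break
        active = active[best]                 # Step 3: repeat on the chosen model
    return active
```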
Materials
There were 100 replicates in the GAW15 Problem 3 data sets. Each replicate consisted of 1500 nuclear families (two parents and two offspring) that had an affected sibling pair (ASP) and 2000 unrelated control subjects with no first-degree relatives with RA. Three marker sets were provided: 1) a set of 730 microsatellite markers fairly evenly spaced across the genome with an average intermarker distance of about 5 cM; 2) a set of 9187 SNPs distributed across the genome to mimic a 10 K SNP chip set; 3) a very dense map of 17,820 SNPs on chromosome 6. We utilized the second marker set for our analysis. According to the answers distributed by GAW15, there are three loci (DR, C, and D) on chromosome 6 that increase the risk of RA. Loci DR and C are located between SNP6_153 and SNP6_154, and are in complete linkage disequilibrium. Locus C increases RA risk only in women. Locus D is located between SNP6_161 and SNP6_162, and has a rare minor allele (frequency 0.0083). We focused our analysis on the 674 SNPs on chromosome 6 to evaluate the IBVS method.
We first constructed three case-control panels for each of the 100 replicates. The first panel included both males and females. One affected offspring was randomly selected from each of the 1500 families that had an ASP. These 1500 unrelated affected subjects were used as cases. The 2000 unrelated control subjects were used as controls. In addition, because locus C increases RA risk only in women, we also constructed a female case-control panel and a male case-control panel. To maximize the number of female cases, we randomly selected one affected female offspring from each of the ASP families that had at least one female offspring. The female case-control panel consisted of ~1400 unrelated affected female offspring selected from the ASP families and ~1000 female controls. The male case-control panel was constructed similarly. It consisted of ~680 cases and ~1000 controls. Therefore, a total of three case-control panels (total, female, and male, respectively), each having 100 replicates, were constructed. This data set was named DS1.
Second, in order to evaluate the performance of IBVS when p ≫ n, we constructed a subset case-control panel from each of the case-control panels in DS1 by randomly selecting 50 cases and 50 controls. The same 674 SNPs were kept in the panel. This data set was called DS2 (n = 100, p = 674).
Finally, in order to compare the performance of IBVS and traditional SVS directly, we selected a subset of 50 SNPs located between SNP6_128 and SNP6_177 from DS2. This data set was named DS3 (n = 100, p = 50).
As a comparison, we also carried out association analyses using BVS and SVS, the latter implemented in the Proc Logistic procedure in SAS. We summarized all results obtained from IBVS, BVS, and SVS by calculating the power for each SNP indicator, γ_j, as follows:

Power_j = (1/n) ∑_{i=1}^{n} I(γ_ij = 1),  (8)

where I is the indicator function satisfying I(γ_ij = 1) = 1 if γ_ij = 1 and I(γ_ij = 1) = 0 otherwise, and n is the total number of replicates. As with other iterative methods, e.g., Newton's method, many stopping rules could be applied in Step 3 of IBVS. We used a predetermined number of SNPs (10), based on empirical experience, to stop the IBVS algorithm.
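Given the final selection indicators from each replicate, the power calculation in Eq. (8) reduces to a column mean; in this sketch, gamma_by_rep is an assumed (replicates × SNPs) 0/1 matrix of selection indicators.

```python
# Power per SNP (Eq. (8)): fraction of replicates in which gamma_ij = 1.
import numpy as np

def power_per_snp(gamma_by_rep):
    gamma_by_rep = np.asarray(gamma_by_rep)   # shape: (replicates, SNPs)
    return gamma_by_rep.mean(axis=0)          # (1/n) * sum_i I(gamma_ij = 1)
```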