We randomly selected one RA case from the affected-sibling pair (ASP) in the first replicate of the GAW15 simulated data for Problem 3. After selecting all of the controls, our final data set included 1500 cases and 2000 controls. In order to ensure a significant finding we reviewed the answers to the simulated data prior to our analyses. Since the strongest signal for RA was simulated to the HLA region on chromosome 6, we limited our analyses to the dense genotyping for chromosome six. In total 17,820 SNPs were simulated on chromosome 6, yielding an average inter-marking spacing of 9586 base pairs. This corresponds to the density one would expect from a genome-wide 300,000 K SNP set.
Model
For each SNP i, we modeled the 2 × 2 allele-by-disease status table using a hypergeometric likelihood with OR = exp[β
i
] [7]. The prior on the log allelic odds ratio β
i
is a mixture of point mass at 0 with a distribution of N(μ
j
, σ
j
), where j = 1,.., J, and J is the number of non-null classes. For example,
where β
ij
is ~N(μ
j
, σ
j
) and X
i
is binomial or trinomial (0,.., J) with probabilities
We considered two ways to separate the markers into associated (non-null, X > 0) and non-associated (null, X = 0) classes. First, we naively assume all non-null loci are derived from the same distribution (J = 1). Second, we assume some markers are positively associated with the outcome, i.e., OR > 1, and others are inversely associated with the outcome, i.e., OR < 1 (J = 2).
A vast majority of the disease-marker associations will be null, so we used conjugate priors to update μ
j
and σ
j
. Conjugate priors are helpful when the number of non-null loci is small and they may provide information distinguishing between classes, i.e., OR < 1 or OR > 1, for the model where J = 2. In principle non-identifiability is a problem; however, by putting very small prior probabilities on identical alternative parameterizations we may avoid this issue [8]. The conjugate priors for μ
j
and σ
j
were μ
j
|σ
j
~ N(μ0, σ
j
/κ0) and σ
j
~ Inv-χ2(σ
j
, ν0). The hyperparameters we used were,
for J = 1
μ0 = log(2), κ0 = 5, ν0 = 5, σ02 = log(2)/2,
and for J = 2
μ10 = log(2), κ10 = 5, ν10 = 5, σ102 = log(2)/2,
μ20 = -log(2), κ20 = 5, ν20 = 5, σ202 = log(2)/2.
We put Dirichlet priors on π = (π0,.., π
J
). For example, when J = 1, π ~Dirichlet (1, 999). To account for differences in prior probability of association, we also varied Dirichlet hyperparameters across regions. We selected three candidate regions with identical, high prior probability of association after reviewing the answers for the simulated data and performing a literature search on RA. The HLA region, where the causal SNP was simulated, and two upstream regions from the literature search were up-weighted. The two upstream regions contained several genes, including WISP3 and VIP, with a potential biologic role in RA [9, 10]. We fixed the ratio of prior odds of association between candidate regions and non-candidate regions at ≈ 53. For example, when J = 1 we set π ~Dirichlet(1, 999) in non-candidate regions and π ~Dirichlet(0.25, 4.75) in candidate regions. In the interest of time and to reduce the computational burden we chose every fifth SNP from the dense data on chromosome 6, for a total of 3564 markers. Of these, 62 were in the HLA region and 22 in each of the two upstream regions. In principle the latent class analysis has no limit as to the number of SNPs that can be analyzed, and given optimized code, more computing resources and more time, an analysis of 300,000–500,000 markers from a genome-wide scan is feasible. Each analysis presented here took approximately 12 hours on three nodes (each with two 3.2 GHz CPUs), so a scan with 300,000 markers would take less than five days on a 30-node cluster.
Model fitting
We used three parallel Gibbs sampling chains with 3000 iterations each in order to fit the model. The parameters μ
j
, σ
j
, and π
j
could be updated by directly sampling from their conditional posterior distributions. The parameters β and X were simultaneously updated using the Metropolis-Hastings algorithm.