SSGS combines hierarchical priors to model alleles within loci with the conditional logistic regression likelihood to model the probability of transmission to a diseased child in case-parent triad data. The full details are given in Swartz et al. [3], and we give a brief overview of the method here.
To review the likelihood, we first define our notation: $D^{+}$ denotes that a child is affected; $g = (g_m, g_f)$ denotes the child's genotype, where the subscript $m$ marks the allele transmitted from the mother and $f$ the allele transmitted from the father; and $G_p$ denotes the parental genotypes. Then, given the parental genotypes $G_m$ and $G_f$, the probability of transmitting genotype $g$ to the affected child is given by [3, 4, 8]:
$$\Pr(g \mid G_m, G_f, D^{+}) = \frac{RR[(g_m, g_f)]}{\sum_{g^{*} \mid G_m, G_f} RR[(g^{*}_m, g^{*}_f)]},$$
where $g^{*} \mid G_m, G_f$ represents all possible pairings of transmitted and non-transmitted alleles consistent with the parental genotypes, and $RR[(g_m, g_f)]$ is a relative risk function defined previously as a conditional logistic regression function [3].
Using generalized transmission-disequilibrium test (GTDT) coding as described by Schaid [8], we must omit a reference allele for model identifiability. Using calculations from Thomas et al. [4], we select the most prevalent allele as our reference allele. Then, using $l = 1, \ldots, L$ to index the loci and $a = 1, \ldots, A_l - 1$ to index the alleles within a locus (where $A_l$ denotes the maximum number of alleles at locus $l$), each $\beta_{la}$ refers to an allele main effect for allele $a$ at locus $l$, with the usual interpretation as the log relative risk for transmission to the case. Note that using GTDT coding assumes additive allelic effects [3].
Next, we assume a multivariate normal prior for the main effects and use hierarchical latent indicators to stochastically search through the main effects that impact the probability of transmission [3, 9]. We denote the first latent indicator as the vector $\lambda = (\lambda_1, \ldots, \lambda_L)$. Each element $\lambda_l$ has a Bernoulli distribution with probability $p_l$, where $p_l$ is the prior probability that locus $l$ is associated with the disease. Similarly, we define a partitioned allele indicator vector, conditional on the loci selected, as $\alpha \mid \lambda = (\alpha_1, \ldots, \alpha_L)$, with each partition $\alpha_l$ indicating the alleles at locus $l$ associated with the disease. Each element of $\alpha_l$ has a Bernoulli prior with probability $q_{la}$, where $q_{la}$ is the prior probability that locus $l$ and allele $a$ at locus $l$ are associated with the disease. This dual hierarchical prior structure restricts the stochastic search to models that select alleles within loci.
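The effect of the two indicator levels can be illustrated with a short sketch. It is a simplification under stated assumptions: the allele indicators are drawn independently of the locus indicators, and an allele counts as selected only when its locus is selected too. The function names and the prior values in the example are hypothetical.

```python
import random

def sample_indicators(p, q, rng):
    """Draw locus indicators and allele indicators from Bernoulli priors.

    p[l] is the prior probability for locus l; q[l][a] is the prior
    probability for allele a at locus l.
    """
    lam = [1 if rng.random() < p_l else 0 for p_l in p]
    alpha = [[1 if rng.random() < q_la else 0 for q_la in q_l] for q_l in q]
    return lam, alpha

def selected_alleles(lam, alpha):
    """An allele is selected only if both its own indicator and its locus indicator are 1."""
    return [[lam_l * a for a in alpha_l] for lam_l, alpha_l in zip(lam, alpha)]

rng = random.Random(1)  # fixed seed so the draw is reproducible
lam, alpha = sample_indicators([0.5, 0.5], [[0.25, 0.25], [0.25, 0.25, 0.25]], rng)
print(lam, alpha, selected_alleles(lam, alpha))
```

Switching off a locus removes all of its alleles from the model at once, which is what keeps the search within "alleles within loci" models.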
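Conditional on the indicators, we can define the prior for the main effects. Let $\gamma_{la} = \lambda_l \alpha_{la}$, so that $\gamma_{la} = 1$ indicates an allele main effect only when both the locus and the allele are indicated. We can define the conditional prior for the coefficients, $\beta \mid \gamma$, as
$$\pi(\beta \mid \gamma) = \mathrm{MVN}(0, D_\gamma R D_\gamma),$$
where $D_\gamma = \mathrm{diag}(k_{11}, \ldots, k_{L, A_L - 1})$, and each $k_{la}$ is defined as
$$k_{la} = \begin{cases} c\tau & \text{if } \gamma_{la} = 1, \\ \tau & \text{if } \gamma_{la} = 0, \end{cases}$$
where $c$ is large, $\tau$ is small, and $R$ is either an identity matrix or a covariance matrix defined by genetic correlation. (For more details concerning $c$ and $\tau$ and the flexibility of SSGS using these parameters, see Swartz et al. [3] and references therein.)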
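To see how the indicators shape the prior, the covariance can be assembled directly. The sketch below assumes the usual stochastic-search form, in which the diagonal scale is $c\tau$ for an indicated effect and $\tau$ otherwise; that form, the function names, and the numeric values are assumptions made for illustration, not taken from the SSGS software.

```python
def prior_scales(gamma, c, tau):
    """Diagonal of D_gamma: c * tau where gamma is 1, tau where gamma is 0 (assumed form)."""
    return [c * tau if g == 1 else tau for g in gamma]

def prior_covariance(gamma, R, c, tau):
    """Elementwise form of D_gamma R D_gamma for a diagonal D_gamma."""
    k = prior_scales(gamma, c, tau)
    n = len(k)
    return [[k[i] * R[i][j] * k[j] for j in range(n)] for i in range(n)]

# Hypothetical settings: three allele effects, identity R, c large, tau small.
gamma = [1, 0, 1]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
cov = prior_covariance(gamma, identity, c=100.0, tau=0.1)
print([cov[i][i] for i in range(3)])
```

Effects with an indicator of 0 receive a prior tightly concentrated around zero, while indicated effects receive a diffuse prior, so the indicators act as a soft on/off switch.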
To define $R$, we start with a blocked matrix, $L$, whose diagonal blocks represent within-locus covariance and whose off-diagonal blocks represent between-locus covariance [3]. We model the within-locus covariance by modeling the presence of an allele with a multinomial distribution. The covariance is then defined in the usual way for a multinomial distribution, using the normalized allele frequencies as cell probabilities. To model between-locus covariance, each element is simply the allele-wise linkage disequilibrium value. Once the elements of the blocks are defined, we set $R = L^{-1}$. More details are given in Swartz et al. [3].
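The two kinds of blocks can be sketched as follows. The within-locus block uses the single-draw multinomial covariance. For the between-locus block, the sketch assumes the standard linkage disequilibrium coefficient (haplotype frequency minus the product of the two allele frequencies), because the exact expression is not given here. The function names and frequencies are hypothetical.

```python
def within_locus_block(freqs):
    """Single-draw multinomial covariance: p_a(1 - p_a) on the diagonal, -p_a p_b elsewhere."""
    n = len(freqs)
    return [[freqs[a] * (1.0 - freqs[a]) if a == b else -freqs[a] * freqs[b]
             for b in range(n)] for a in range(n)]

def between_locus_block(hap_freqs, freqs_1, freqs_2):
    """Allele-wise LD, assumed here to be D_ab = p_ab - p_a * p_b.

    hap_freqs[a][b] is the frequency of the haplotype that carries allele a
    at the first locus and allele b at the second locus.
    """
    return [[hap_freqs[a][b] - freqs_1[a] * freqs_2[b]
             for b in range(len(freqs_2))] for a in range(len(freqs_1))]

# Hypothetical allele frequencies for the non-reference alleles at two loci.
block_11 = within_locus_block([0.5, 0.3])
block_12 = between_locus_block([[0.25, 0.05], [0.10, 0.05]], [0.5, 0.3], [0.4, 0.2])
print(block_11)
print(block_12)
```

Once all blocks are assembled into the full matrix, a standard linear algebra routine would supply its inverse.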
Our posterior distribution is intractable; therefore, we use MCMC simulation to sample from it. The parameters $\lambda$ and $\alpha$ are updated using a Gibbs sampler, while the $\beta$ values are updated by locus using a multivariate Metropolis-Hastings step [3]. Software to implement SSGS is available at http://www.epigenetic.org/Linkage/ssgs-public/.
We are mainly interested in the marginal posterior of the $\gamma_{la}$ values given the data. We use the proportion of iterations in which $\gamma_{la} = 1$ to estimate the posterior probability that each allele is associated with transmission to the case. Once the posterior probabilities are calculated, we use the median model decision rule developed by Barbieri and Berger [10]: select genes with posterior probability of inclusion greater than or equal to 0.5.
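The estimate and the decision rule amount to a column mean followed by a threshold. A minimal sketch, assuming the sampler output is stored as one list of 0/1 indicators per iteration; the function names and the toy draws are hypothetical.

```python
def inclusion_probabilities(gamma_draws):
    """Proportion of MCMC iterations in which each indicator equals 1."""
    n_iter = len(gamma_draws)
    return [sum(column) / n_iter for column in zip(*gamma_draws)]

def median_model(probabilities, threshold=0.5):
    """Indices whose posterior inclusion probability is at least the threshold."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]

# Toy output: four iterations, three allele indicators.
draws = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
]
probs = inclusion_probabilities(draws)
print(probs)                # [0.75, 0.5, 0.0]
print(median_model(probs))  # [0, 1]
```

Because the rule uses "greater than or equal to", an indicator that is on in exactly half of the iterations is retained.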
Recall that conditional logistic regression models the probability of transmission to the affected offspring and can include other covariates, such as environmental factors, in the model. We compare SSGS with the TDT, as implemented in the TDTEX program in the Statistical Analysis for Genetic Epidemiology release 5.0 (SAGE 5.0) [11]. TDTEX performs McNemar's test on counts of transmitted versus non-transmitted alleles, which is a powerful $\chi^2$ test of association that cannot include additional covariates.
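For a single allele, McNemar's statistic depends only on the two discordant counts from heterozygous parents. The sketch below is a generic illustration of the test, not the TDTEX implementation; the function name and the counts are hypothetical.

```python
import math

def tdt_mcnemar(transmitted, not_transmitted):
    """McNemar chi-square (1 df) for one allele, from heterozygous-parent counts."""
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: the allele was transmitted 60 times and not transmitted 40 times.
chi2, p_value = tdt_mcnemar(60, 40)
print(chi2, round(p_value, 4))
```

The statistic uses only the transmission counts, which is why covariates cannot enter this test.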
Data analysis
We chose to analyze families from Replicates 1 and 2 of Problem 3. Because this method requires case-parent triads, we randomly selected one member of each affected sib pair to be the case. We analyze the first 250 families from both replicates, focusing on six microsatellite markers from simulated chromosome 6: markers 35 to 40. This gave us a total of 59 alleles from Replicate 1 and 57 alleles from Replicate 2. According to the simulation answers, which we consulted in advance, these markers are far enough away from any of the disease-associated loci to be assumed independent of them. We analyze these markers under four different prior model specifications. We use the same values for $p_l$ and $q_{la}$ as in the sensitivity analysis (0.5, 0.25, 0.1, and 0.01), combined with either the identity matrix or a dependence structure in $R$ defined as a function of allele frequencies and linkage disequilibrium, as in Swartz et al. [3]. Additionally, we compare the results from SSGS with standard inference from conditional logistic regression using Stata 8 and the TDT as implemented in TDTEX.
In order to evaluate the method in the presence of a signal, we performed a second analysis of all four methods mentioned above, applied to markers closer to the simulated DR locus, a gene locus involved in increasing risk for the disease. Using the same algorithm as described for the simulated microsatellites [12], we generated three dense microsatellites using the dense single-nucleotide polymorphisms (SNPs) from chromosome 6 at the following locations: 1) 48.40 cM, 2) 49.44 cM, and 3) 51.52 cM. (When constructing the microsatellites, we omitted the SNP located exactly on the DR locus.) From the "answers", we know that dense microsatellite 2 is the closest to the DR locus. Because the simulated signal was so strong, we only used data from Replicate 1 for this analysis.
For these sets of markers, because we used only 250 families, not all alleles appeared in our sample, and some alleles had very low frequencies. Therefore, we defined a minor allele (MA) as any allele with a frequency of less than 4% in our sample and pooled the MAs into one pseudo-allele. If the pseudo-allele still had a frequency of less than 4% after pooling, we then pooled it with the least frequent allele with frequency greater than 4% in the data set.
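The pooling rule can be written as a short procedure. This is a sketch under stated assumptions: alleles are given as a flat list of observed labels at one marker, an allele at exactly 4% is treated as common, and ties for "least frequent" are broken by first appearance. The function name and the pooled label are hypothetical.

```python
from collections import Counter

def pool_minor_alleles(alleles, threshold=0.04, pooled_label="MA"):
    """Relabel minor alleles at one marker as a single pseudo-allele."""
    counts = Counter(alleles)
    total = len(alleles)
    minor = {a for a, n in counts.items() if n / total < threshold}
    if not minor:
        return list(alleles)
    pooled_freq = sum(counts[a] for a in minor) / total
    if pooled_freq < threshold:
        # Pseudo-allele still too rare: absorb the least frequent common allele.
        common = [a for a in counts if a not in minor]
        if common:
            minor.add(min(common, key=lambda a: counts[a]))
    return [pooled_label if a in minor else a for a in alleles]

# Hypothetical sample of 100 alleles: C and D are minor and together reach only 3%.
sample = ["A"] * 60 + ["B"] * 37 + ["C"] * 2 + ["D"] * 1
print(Counter(pool_minor_alleles(sample)))
```

In this example the pseudo-allele absorbs B, the least frequent common allele, because C and D together stay below the threshold.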