Maximum likelihood with clustered markers
Abecasis et al. [2] proposed a clustered markers approach to model LD, which they implemented in the computer program MERLIN. The main idea is to cluster tightly linked markers and to model LD between the markers in a cluster by estimating haplotype frequencies with the expectation-maximization (EM) algorithm. The haplotype frequencies are estimated for each cluster independently. Markers are clustered together when the level of LD exceeds a threshold to be specified by the user. The approach requires that there are no recombination events between markers in a cluster. When the genotypes indicate an obligate recombinant between markers in a cluster, genotypes are set to missing to remove the obligate recombinant. Furthermore, the approach does not model LD between clusters.
Independent markers approach
First, pair-wise LD correlation coefficients are estimated between pairs of markers. When the level of LD between a pair of markers exceeds a threshold, which must be specified by the user, one of the markers is removed from the data set. As a result, the remaining set of markers will not have markers with levels of LD that exceed the threshold.
A first-order Markov model
We propose to use a first-order Markov model to handle LD between the markers; LD between each pair of markers is modeled with an approach similar to the one proposed by Yang et al. [3] for a marker and a trait. In contrast with their work, we will compute identity-by-descent (IBD) statistics such as Z
pairs
[1], and will not model LD between marker and trait alleles.
The first-order Markov model is applied to the alleles of each founder individual independently:
P(G
i
|α, β) = P(Gl
i
|αl)∏l=2,...,LP(Gl+1
i
|Gl
i
, α(l,l+1), β(l,l+1)),
where i is founder, where Gl
i
represents an allele (which can be paternal or maternal) for marker l of a founder i, and L is the number of markers. α and β are the LD parameters that quantify the linkage between the alleles of two adjacent SNPs:
P(A|a, α(l,l+1), β(l,l+1)) = α(l,l+1), P(A|b, α(l,l+1), β(l,l+1)) = β(l,l+1).
In this equation, the alleles of marker l are denoted by (A, B); the alleles of marker l + 1 are denoted by (a, b). Our approach assumes Hardy-Weinberg equilibrium between founder alleles. MERLIN and the independent markers approach also make this assumption. The goal is to infer the posterior probability distributions of segregation indicators of the non-founders (NF) for locus l, denoted by s
NF
l, given marker data, integrated over the nuisance parameters α(l,l+1) and β(l,l+1):
Here, G
i
with iε{F} represents the vector of paternal or maternal founder alleles for all marker loci; when multiple families are available, the product is implied to run over the founder alleles in all families. P(G
i
|α, β) is the first-order Markov model given by Eq. (1). Thus, the segregation indicators of individuals in different families become dependent through the jointly shared, unobserved model parameters αand βgiven marker data M.
Evaluation of the integral in Eq. (2) is computationally infeasible when the number of families is large (>4). Therefore, we use the following computationally feasible procedure:
1. For each pair of adjacent markers (l, l + 1), infer P(α(l,l+1), β(l,l+1)|M(l,l+1)) using the marker data of all families. If the number of markers is L, this step entails L - 1 independent inference problems, independent of the number of families. Thus, the marker data of all families for the two adjacent markers (l, l + 1) is used to estimate the parameters α(l,l+1), β(l,l+1). In this step we use set of discrete values for these parameters so that exact computation using the junction tree algorithm [4] is feasible when the number of founders per family is limited. We assume a uniform prior distribution over αand β.
2. For each family f, independently compute the IBD-sharing statistics for marker l from
where s
f
denotes the segregation indicators of the non-founder individuals in family f. The integrals are carried out approximately using the set of discrete values for α and β.
Application to data
We compared the Bayesian first-order Markov approach, the maximum-likelihood first-order Markov approach, the maximum-likelihood clustered markers approach implemented in MERLIN, and the independent SNPs approach on the 61 families from the Canadian population of GAW15 Problem 2, genotyped with the Illumina marker set. These are affected sib-pair families in which generally only the children have marker data. It has been reported that LD between markers in the 6 K Illumina set appears to be lower than LD between markers in the 10 K Affymetrix set, although significant LD is present [5].
We analyzed chromosomes 1, 2, 3, 4, and 6 in this data set. We estimated the parameters α(l,l+1) and β(l,l+1) for the markers on these chromosomes for which Illumina reports the Decode centimorgan chromosomal location using the procedure described above. For the Bayesian first-order Markov approach, we used a set of ten discrete values for the parameters α and β, chosen such that they covered 99.9% of the posterior probability mass. They were determined for each pair of adjacent markers independently. The parameters of the maximum-likelihood first-order Markov were determined using the same set of discrete values.