A Hidden Markov Model (HMM) is defined by its states and transitions between states [7]. The hidden property is corresponding to the fact that states are not observed. Rather, emitted symbols in an alphabet are observed and those are related to the states according to some function.

### Review of Hidden Markov Models for intercross genotyping

Our haplotyping HMM approach is quite similar to the genotyping approach described for outbred lines in [5]. A first overview of the extension for haplotyping is also found there. A more formal description of the corresponding, slightly simpler, case for genotype probabilities in inbred lines can be found in [8], corresponding to the implementation found in R/qtl [9].

The state space in our model consists of a set of binary flags, determining the current phase in some individuals. In the case of finding genotype probabilities in an *F*_{2} cross, those flags represent the two *F*_{1} parent individuals, where the meiosis in each individual can transmit an allele originating from either the grandfather or the grandmother. Transitions between states represent recombination events. The Markov model is of the continuous-time variety, where time in this case corresponds to the mapping distance. Emission symbols correspond to the marker data. This constitutes the hidden quality of the model, as grandparental marker data might not map uniquely to the state.

### Application of HMMs for haplotyping

The genotype probability model in the intercross gains its speed from the fact that only a single “focus” individual, its parents and grandparents need to be analyzed at the same time. In the model, what is essentially tracked is the gametes generated by meiosis in the parents, expressed as grandparental origin in each locus. As two meioses are tracked, there are 2 ∗ 2 = 4 states.

The same general approach, where analysis takes place over a pedigree including a focus individual and the immediate ancestors to that focus individual, can be extended to include further generations of ancestors. Adding one more generation would imply 64 states (tracking 6 independent meioses), as even the phase in alleles that were not transmitted need to be calculated to accurately track possible state transitions. Adding another generation increases the number of states to 16, 384, which in practice is not tractable. As noted above, the genotyping model tracking phase in the parent, but that information is never recorded explicitly. In order to make the implicit tracking of phase in individual analyses explicit, we introduce a parameter, called skewness, per marker per individual. This parameter indicates the probability of the “true” ordering of the unordered marker value pair {*A*, *B*} to be *AB* or *BA*, respectively. Rather than simply using binary emission probabilities for supported or unsupported marker values, the emission probability will be dependent on the state realisation considered. A skewness parameter value of 0.5 is neutral to ordering, while a value of 0.0 exlusively permits *AB*, and a value of 1.0 will exclusively permit *BA*. A dataset is phased if the skewness parameters for all heterozygous marker value pairs take such extreme values.

For the haplotyping application, we are using the 64-state model with the addition of skewness parameters. The approach of centering each analysis on a focus individual is kept from the genotyping model. A full haplotyping iteration is thus only a matter of computing the model for each marker in each focus individual. Although no ancestral genotype information is used for the grandparents, the present of linkage will favour skewness assignments mapping linked alleles to the same strand. With a proper setup, the skewness parameters for all marker-individual pairs can be trained in an iterative manner based on analyses in the pedigrees where the individuals appear.

### Training methodology

We use a modified Baum-Welch approach [10] for training the skewness parameters. This approach is based on expectation-maximisation of the model probability for observed data under optimisation of some or all model parameters. The actual value distribution of a parameter under optimisation in one iterated realisation of the model is used as the prior distribution in the next one, until convergence is reached. The skewness parameter for the first heterozygote on each chromosome in each individual is fixed to 0.0, to avoid unnecessary symmetries. The transition parameters related to recombination can be kept fixed, based on a pre-determined marker map with (Haldane) mapping distances, or be subject to optimisation in tandem with the skewness parameter set.

The strand numbering is *not* directly connected to parental origin. Rather, the model is computed multiple times, allowing for the (grand)parental origin of each strand to be either the sire or dam for the respective parent. When the model has converged, most such alternatives can be truncated early due to extremely low likelihood, as a form of optimisation similar to a beam search. The total number of such shift combinations is 8 in the 64-state model, as shifts are only needed for the focus individual and its parents. The same skewness parameter is appearing in multiple separate analyses, as one individual can appear in multiple focus analyses: A parent will appear in the analysis pedigree of every offspring. An individual acting as the focus individual in one pedigree can also be a parent or a grandparent in another, in a multi-generational setting. Furthermore, some pedigrees can be completely uninformative for a specific parameter. The “optimal” skewness assignment based on such a pedigree will always be the a priori value of the parameter from the previous iteration, and it could be called skewness-agnostic. For these two reasons, a voting scheme is implemented. The skewness contribution from each analysis pedigree corresponds to the deviation from the a priori value used. Thus, uninformative locations are not hampering convergence if information is found in other pedigrees. Furthermore, the voting is scaled for contributions to ancestors more than one generation away from the focus individual. This is due to the fact that there is a single meiosis event truly tracked, the one resulting in a gamete forming the parent individual (*F*_{1}). The offspring to that individual are only giving different aspects of information on the same event, and should be weighted accordingly. If this was not done, the haplotype estimates in an *F*_{0} individual would be biased towards the allele combination found in *F*_{1} individuals with a high number of *F*_{2} offspring.

The voting approach where lack of information is ignored was chosen to accelerate convergence. However, in early iterations information is lacking in many loci and those few fragments of information that are present can turn out to be incorrect. For this reason, the new value *p′* for the a skewness parameter with an a priori value of *p* is limited by clamping to satisfy the condition |*p* – *p′*| ≤ *p*(1 – *p*) min(1/(0.5 + (1 – *p*)), 1/(0.5 + *p*)). This clamping puts a limit to the relative change in preference of one phase assignment over the other, as expressed by the ratio *p*/(1 – *p*), to a factor of 3. For example, a skewness parameter value starting out at 0.5 will always be found in the range [0.25, 0.75] after one iteration. If the value is updated to 0.25, it will be found in the range [0.1, 0.5] in the following iteration. By limiting the rate of change for the probability ratio between these two complementary states, unstable behaviour is avoided.

In the *F*_{0} generation, the only definition of the alleles arises from the initialisation of the skewness value in one marker. In other words, the two strands can only be defined and separated relative to that anchor marker. Phasing a full chromosome, especially a long one, is problematic as the linkage between a distant marker and this anchor will be very weak or non-existent. The haplotyping problem based on meioses is inherently local, along the chromosome, and between closely related individuals. Skewness values might converge in different directions in different regions of the chromosome, assuming an extraneous recombination event. The traditional Baum-Welch approach, even with the modified voting scheme, would only resolve such a situation at a very slow pace, if at all. Therefore, an inversion step is added to the algorithm, where inverting all the skewness values downstream of the current marker is tested in each iteration.

If the total probability of the observed sequences, over all pedigrees, increases with such a change, it is accepted. Due to the model structure, such an inversion can be performed using the normal data structures of hidden Markov model algorithms, by essentially only modifying the state vector when combining forward and backward probabilities. In practice, care needs to be taken to avoid oscillatory behaviour, especially with respect to inversion events.

### Summarised algorithm

The following algorithm summarises the haplotyping approach described above for a single chromosome. As different chromosomes are completely unrelated from a linkage and phasing perspective, the whole approach can simply be repeated per chromosome. Isolating the phasing operation per chromosome can also simplify parallelisation of the computations and reduce the amount of memory needed for each step.

1. Initialise skewness values to 0.5 everywhere, except at one heterozygous marker in every individual

2. Loop until convergence (e.g. a minimum sum of skewness changes, or a fixed number of iterations based on chromosome length)

(a) Loop over all focus individuals

i. Loop over all markers

A. Compute the marginalised probability for strand realisations corresponding to observed marker values, taking all 8 parental shifts (defined above) into account, using suitable hidden Markov model algorithms.

B. Contribute the deviation between the current skewness value, and the resulting ratio, for each individual in the analysis pedigree, as a vote. Divide the votes to grandparents by the total number of (half-)siblings sharing that parent.

C. Compute and contribute probability changes arising from possible skewness inversions in each individual in the dataset in a similar manner.

(b) Loop over all individuals

i. Loop over all markers

A. Update skewness values according to recorded votes. Cap the maximum possible change in the ratio *p*/(1 – *p*) between iterations, in order to retain stability.

B. If the recorded votes for a favorable inversion at this position exceeds some threshold, perform an inversion of all downstream skewness values.

### Materials

The common dataset prepared for the 14th QTL-MAS workshop was used. This dataset contains 3,226 individuals, spread over 5 generations, with 20 founders. 10, 031 SNP markers were defined over 5 chromosomes all about 100 million base pairs in length. A uniform recombination rate of 1 cM/Mbp was applied for both sexes. Haplotypes for the founders were sampled from a simulated population of 5, 000, using the methods described in [11].