We analyzed the 100 replicates simulated for GAW 15 (Problem 3). To apply a phylogeny-based method, we need to work on a candidate region where the disease susceptibility (DS) site is typed and where the recombination rate is low. We used the simulation answers to choose a 200-kb region of chromosome 6 around the DR locus that contained two DS sites: the DR locus and locus C. In this region, nine single-nucleotide polymorphisms (SNPs), including locus C, were selected. A tenth biallelic locus was added, corresponding to the DR locus in which the lower risk alleles DR1 and DRX were pooled. The linkage disequilibrium is low among these ten sites: the highest r² is between locus C and SNP 4 (r² = 0.65), and it is the only pair of loci with an r² above 0.2.
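As an illustration of the linkage disequilibrium measure quoted above, the following minimal sketch computes the standard r² between two biallelic sites. It assumes phased haplotypes coded as 0/1 alleles; the function name `r_squared` and the input layout are hypothetical, and the text does not specify how the reported r² values were actually obtained.

```python
def r_squared(haplotypes, i, j):
    """Standard r^2 between biallelic sites i and j.

    haplotypes: list of sequences of 0/1 alleles (assumed phased; hypothetical layout).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at site j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                   # a monomorphic site carries no LD information
    return d * d / denom
```

Applying such a function to every pair of the ten sites would give the pairwise r² values from which the maximum and the 0.2 threshold are read.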
For each replicate, the first affected child of the first 500 families was selected to obtain 500 trios. Missing data were generated at the different loci (with the same percentage of missing data at each locus) for both parents and children. In each replicate, the same individuals had their genotypes missing at the same loci, in order to ensure a similar pattern of missing data across replicates.
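One way to obtain an identical missing-data pattern in every replicate is to draw a single mask once and reuse it. The sketch below assumes that the missing genotypes are chosen uniformly at random, which the text does not state; the names `make_missing_mask` and `apply_mask`, the seed, and the data layout are all hypothetical.

```python
import random

def make_missing_mask(n_individuals, n_loci, rate, seed=0):
    """Draw one set of (individual, locus) pairs to blank out.

    The same number of individuals is masked at every locus, so each locus
    has the same percentage of missing data. Uniform sampling is an assumption.
    """
    rng = random.Random(seed)
    n_missing = round(rate * n_individuals)
    mask = set()
    for locus in range(n_loci):
        for ind in rng.sample(range(n_individuals), n_missing):
            mask.add((ind, locus))
    return mask

def apply_mask(genotypes, mask):
    """Return a copy of genotypes (rows = individuals) with masked entries set to None."""
    return [
        [None if (ind, locus) in mask else g for locus, g in enumerate(row)]
        for ind, row in enumerate(genotypes)
    ]
```

Calling `apply_mask` with the same mask on each replicate leaves the same individuals untyped at the same loci in all of them.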
Reconstruction of missing data and missing phases
Missing phases and missing genotypes were reconstructed either by an algorithm that infers the most probable haplotypes without missing data for each individual, or by a multiple imputation method. For both methods, the first step was the inference of all possible haplotypic configurations and their probabilities, which was performed with the software ZAPLO. The first method then consists of picking the most likely haplotypes for each individual. The only families kept for the analysis were those with a low level of haplotype uncertainty, i.e., families whose best configuration had a posterior probability >50% and exceeded the posterior probability of the second-best configuration by at least 25%. Similar results were obtained with other cut-off values (data not shown). The multiple imputation procedure is the same as the one described in Croiseau et al. Briefly, it consists of repeating two steps: 1) given the current values of two parameters (population haplotype frequencies and affected child genotype frequencies), sampling a complete data set according to the posterior probabilities of each genotypic configuration, and 2) given the current data set, updating the two parameters. After a burn-in period of 1000 iterations, the current complete data set was retained every 1000 iterations. We ran the algorithm until we obtained ten complete data sets.
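The family selection rule used with the most-likely-haplotype method can be sketched as follows. This is only an illustration: it assumes that the 25% difference is an absolute difference between posterior probabilities, and the function name `keep_family` and its input (a list of configuration posteriors for one family) are hypothetical rather than part of ZAPLO.

```python
def keep_family(posteriors, min_best=0.50, min_gap=0.25):
    """Decide whether a family has low enough haplotype uncertainty to be kept.

    posteriors: posterior probabilities of the family's possible haplotypic
    configurations (hypothetical input format).
    The gap is taken as an absolute difference in probability (assumption).
    """
    ranked = sorted(posteriors, reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0   # a single configuration is unambiguous
    return best > min_best and (best - second) >= min_gap
```

For example, a family with posteriors 0.7, 0.2 and 0.1 would be kept, whereas one with posteriors 0.55 and 0.45 would be excluded because the two best configurations are too close.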
Identification of the susceptibility sites
The identification of the DS sites was performed with the software ALTree. First, 1000 equiparsimonious unrooted trees were reconstructed for the 30 most frequent haplotypes using the parsimony method implemented in the software PAUP*, version 4.0b10. To ensure that various tree configurations were explored, PAUP* was launched 10 times, 100 trees being retained each time. Then, a new character called S, which represents the disease status, was defined for each haplotype. The state of this character depends on the proportion of cases carrying a given haplotype (state 1 for a large proportion of cases and 0 otherwise). The character state changes were optimized on the tree for each character (including S) using the deltran option. A correlated evolution index (V_i) was calculated between the changes of each site i and the changes of the character S. This index was defined as the difference between the number of observed and expected co-mutations between site i and character S, divided by the square root of the number of expected co-mutations. To take into account the 10 imputed data sets, we calculated the median of V_i over these 10 data sets. Finally, the sites with V_i ≤ 0 were discarded and up to two sites with the highest V_i were retained as putative DS sites.
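The index and the final selection step described above can be illustrated with a short sketch. It takes the observed and expected numbers of co-mutations as given inputs, since their computation on the trees is done by ALTree and is not detailed here; the function names and the dictionary layout are hypothetical and are not ALTree's interface.

```python
from math import sqrt
from statistics import median

def correlated_evolution_index(observed, expected):
    """V_i = (observed - expected) / sqrt(expected) for one site.

    observed, expected: numbers of co-mutations between site i and character S.
    expected must be strictly positive (assumption of this sketch).
    """
    return (observed - expected) / sqrt(expected)

def select_ds_sites(v_by_dataset, n_keep=2):
    """Pick putative DS sites from per-data-set V_i values.

    v_by_dataset: dict mapping a site name to its list of V_i values,
    one per imputed data set (hypothetical layout).
    """
    med = {site: median(values) for site, values in v_by_dataset.items()}
    positive = {site: v for site, v in med.items() if v > 0}   # discard sites with V_i <= 0
    return sorted(positive, key=positive.get, reverse=True)[:n_keep]
```

With ten imputed data sets, each list in `v_by_dataset` would hold ten values; taking the median makes the ranking less sensitive to a single atypical imputation.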