The first 500 families of each of the 100 replicates simulated for GAW 15 (Problem 3) were considered, and case-parent trios were obtained by selecting both parents and the first affected sib in each sibship. With knowledge of the answers, we chose to focus on the region of chromosome 6 containing both the DR and C loci, with the aim of detecting the effect of the C locus. In this region, nine SNPs (including locus C) were selected. A tenth biallelic locus was added, corresponding to the DR locus with the lower-risk alleles DR1 and DRX pooled.
Starting from the complete data, we randomly deleted genotypes at locus C to generate different levels of missing data, while keeping complete information at the other loci. To limit the impact of variation in missing-data patterns between replicates, we deleted the genotypes of the same individuals in every replicate and used the same proportion of missing data for each family member. The proportion of missing data was varied between 5 and 50 percent. A multiple imputation (MI) algorithm [3] that we recently developed to deal with case-parent trio data [4] was then applied to each sample.
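The deletion scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the data layout (one list of locus C genotypes per family member, indexed by family), and the seed are hypothetical and do not come from the original study. The key point is that the mask is drawn once and then reused for every replicate.

```python
import random

def make_missing_mask(n_families, proportion,
                      members=("father", "mother", "child"), seed=2007):
    """Draw, once, the families whose locus C genotype is deleted for each member.

    The same number of families is drawn for every family member, so the
    proportion of missing data is identical across members (hypothetical layout).
    """
    rng = random.Random(seed)
    n_missing = round(n_families * proportion)
    return {m: set(rng.sample(range(n_families), n_missing)) for m in members}

def apply_mask(genotypes, mask, missing=None):
    """Return a copy of one replicate with the masked locus C genotypes removed.

    `genotypes` maps a family member to a list of genotypes indexed by family.
    """
    return {
        m: [missing if k in mask[m] else g for k, g in enumerate(gts)]
        for m, gts in genotypes.items()
    }

# One mask, reused for all replicates, so the same individuals are affected.
mask = make_missing_mask(n_families=500, proportion=0.05)
```

Reusing a single mask keeps the pattern of missing data fixed, so differences between replicates reflect sampling variation in the genotypes rather than in which individuals are untyped.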
Briefly, the principle of this method is to fill in missing data with values predicted from the observed data. For each family containing a missing value, a haplotype is drawn from among all compatible haplotypes with a probability given by the current posterior distribution (at the starting point, this posterior distribution comes from an expectation-maximization algorithm). Population haplotype frequencies are then updated using the new posterior distribution obtained from the current complete data file. These two steps are iterated a large number of times. Once the stationary distribution is reached (here, after a burn-in period of 1000 iterations), a small number of complete data sets (here, nine) are saved, one every 1000 iterations. Each simulated complete data set is analyzed separately, and the results are combined to produce estimates that incorporate missing data uncertainty [3, 5, 6].
Inference of missing values is performed using observed genotypes, affection status data, and family structure.
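The two alternating steps and the saving schedule can be illustrated with a deliberately simplified toy chain. It is not the algorithm of [3, 4]: here each family contributes a single haplotype label, affection status and family structure are ignored, frequencies are redrawn from a Dirichlet posterior with a flat prior, and the first data set is saved one spacing after the burn-in. All of these choices, and every function name, are assumptions made for illustration.

```python
import random
from collections import Counter

def draw_compatible(compatible, freqs, rng):
    """Imputation step: pick one compatible haplotype, weighted by current frequencies."""
    weights = [freqs[h] for h in compatible]
    return rng.choices(compatible, weights=weights, k=1)[0]

def update_frequencies(haplotypes, support, rng, prior=1.0):
    """Update step: redraw frequencies from a Dirichlet posterior (assumed flat prior)."""
    counts = Counter(haplotypes)
    draws = {h: rng.gammavariate(counts[h] + prior, 1.0) for h in support}
    total = sum(draws.values())
    return {h: g / total for h, g in draws.items()}

def run_chain(observed, ambiguous, support, start_freqs,
              burn_in=1000, n_datasets=9, spacing=1000, seed=0):
    """Alternate the two steps and keep one completed data set every `spacing` iterations."""
    rng = random.Random(seed)
    freqs = dict(start_freqs)  # e.g. estimates from an EM algorithm
    saved = []
    for it in range(1, burn_in + n_datasets * spacing + 1):
        imputed = [draw_compatible(c, freqs, rng) for c in ambiguous]
        completed = list(observed) + imputed
        freqs = update_frequencies(completed, support, rng)
        if it > burn_in and (it - burn_in) % spacing == 0:
            saved.append(completed)
    return saved
```

With the defaults, the chain runs for 10,000 iterations and returns nine completed data sets, each of which would then be analyzed separately.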
In the present study, analysis was performed using a conditional logistic regression method [2, 7, 8] that compares the genotype of an affected child (case, c) to the three possible genotypes that can be formed by the untransmitted parental alleles (pseudo controls, $\mathrm{pc}_j$ with j = 1 to 3). The likelihood of the data is written as a linear function:
where the indicator variable takes the value 1 if the case or pseudo control j in family k has genotype i, and 0 otherwise; $\beta_i = \log \mathrm{OR}_i$, with $\beta_0$ being the baseline risk for the reference genotype. Under the null hypothesis of no association, the log likelihood is simply $\ln(L_0) = \beta_0$.
For each of the m complete data files i, we calculate the likelihood ratio test $d_i$ as $d_i = 2[\ln(L_1) - \ln(L_0)]$ and combine the $d_i$ across data sets using the method described in Schafer [5] and Rubin and Little [6]. The power to detect the association with each locus was obtained by computing the proportion of replicates for which the test is significant at a nominal level of 5% at each marker. Given the fact that the DR locus is located in the studied region and has a strong effect on the disease, we also performed tests conditional on the DR locus to see if the association remains at the other loci after accounting for the DR locus effect.
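As an illustration of the combination step, the sketch below implements one widely used rule for pooling chi-square-type statistics from m imputed data sets, the one based on the variance of their square roots. Whether this is exactly the rule applied in the present study is an assumption, and the function names are hypothetical. The second function computes power as the proportion of replicates with a p-value below the nominal level.

```python
import math
from statistics import mean, variance

def combine_chi_squares(d, k):
    """Pool m statistics d_1..d_m, each nominally chi-square with k degrees of freedom.

    Returns (D, nu): D is referred to an F distribution with k and nu
    degrees of freedom. Assumes all d_i are non-negative.
    """
    m = len(d)
    if m < 2:
        raise ValueError("at least two imputed data sets are needed")
    # Relative increase in variance due to missing data.
    r = (1 + 1 / m) * variance([math.sqrt(x) for x in d])
    D = (mean(d) / k - (m + 1) / (m - 1) * r) / (1 + r)
    nu = math.inf if r == 0 else k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2
    return D, nu

def power(p_values, alpha=0.05):
    """Proportion of replicates whose test is significant at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

D, nu = combine_chi_squares([1.0, 4.0, 9.0], k=1)   # D = 6/7, nu = 6.125
```

The p-value for each replicate would then be the upper tail of the F distribution at D, which can be obtained for instance with `scipy.stats.f.sf(D, k, nu)` if SciPy is available.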