The impact of complex informative missingness on the validity of the transmission/disequilibrium test (TDT)

The transmission/disequilibrium test was introduced to test for linkage and association between a marker and a putative disease locus using case-parent triads. Several extensions have been proposed to accommodate incomplete triads. Some strategies assumed that parental genotypes were missing completely at random, and some methods allowed informative missingness for parental genotypes. However, all of these tests assumed that offspring genotypes were missing completely at random and concluded that the transmission/disequilibrium test remained valid when incomplete triads were excluded from the analysis. In this article, the conditional distribution of ascertained triads allowing informative missingness for offspring genotypes, as well as their parental genotypes, was derived and several tests under such scenarios were evaluated. In simulations, independent triads were ascertained from the Genetic Analysis Workshop 15 simulated data (Problem 3). When offspring genotypes were missing informatively, simulation results revealed inflated type I error and/or reduced power for the transmission/disequilibrium test excluding incomplete triads.

from Genetic Analysis Workshop 15, St. Pete Beach, Florida, USA. 11–15 November 2006. Published: 18 December 2007. BMC Proceedings 2007, 1(Suppl 1):S26. Supplement: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. Editors: Heather J Cordell, Mariza de Andrade, Marie-Claude Babron, Christopher W Bartlett, Joseph Beyene, Heike Bickeböller, Robert Culverhouse, Adrienne Cupples, E Warwick Daw, Josée Dupuis, Catherine T Falk, Saurabh Ghosh, Katrina A Goddard, Ellen L Goode, Elizabeth R Hauser, Lisa J Martin, Maria Martinez, Kari E North, Nancy L Saccone, Silke Schmidt, William Tapper, Duncan Thomas, David Tritchler, Veronica J Vieland, Ellen M Wijsman, Marsha A Wilcox, John S Witte, Qiong Yang, Andreas Ziegler, Laura Almasy and Jean W MacCluer. Proceedings: http://www.biomedcentral.com/content/pdf/1753-6561-1-S1-info.pdf. This article is available from: http://www.biomedcentral.com/1753-6561/1/S1/S26. © 2007 Guo; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Background
Recently, family-based association studies have drawn substantial attention in genetic studies as a way to avoid spurious association due to population admixture. The transmission/disequilibrium test (TDT) by Spielman et al. [1] was proposed to test for linkage and association between a marker and a disease locus using ascertained case-parent triads. However, parental genotypes may be unavailable due to refusals or other unknown causes. Assuming that only one parental genotype is available and the other one is missing completely at random (MCAR), Clayton [2] and Weinberg [3] proposed likelihood ratio tests, and Sun et al. [4] introduced the TDT for families in which only one parent is available (1-TDT) to incorporate such dyads (affected offspring with one parental genotype). Later, the expectation maximization algorithm based haplotype relative risk (EM-HRR) proposed by Guo et al. [5] extended the haplotype relative risk (HRR) test [6] to accommodate both dyads and monads (affected offspring without parental genotypes). However, when missingness cannot be ignored (i.e., the missing pattern of parental genotypes is related to the disease under study), the assumption of MCAR is violated and these tests may be invalid. For parental genotypes that are missing informatively, Allen et al. [7] and Chen [8] proposed likelihood ratio tests to ensure the validity of testing for association between a candidate gene and a disease. However, the cost of accounting for informative missingness is reduced power. When the missing pattern is in fact completely at random, Allen et al.'s strategy can be less powerful than the 1-TDT [7]. This is also true for Chen's method (see Table 4 [8]).
The power of Chen's score statistic with 1 degree of freedom is less than that of the TDT using only intact triads for a common (rare) allele under the dominant (recessive) disease model, as is the power of the score statistic with 2 degrees of freedom for both rare and common variant alleles under the multiplicative inheritance model. This means that the inclusion of dyads reduces the power of the score test in these cases.
Regardless of the missing pattern assumed for parental genotypes, the above-mentioned methods assumed that offspring genotypes were MCAR. In the following, the conditional distribution of ascertained triads is derived allowing informative missingness for offspring genotypes as well as parental genotypes, and several tests are evaluated under such scenarios.

Distribution of ascertained triads
First, it was assumed that the data consisted of genotypes of bi-allelic markers such as single-nucleotide polymorphisms (SNPs). Therefore, there are exactly two alleles, $B_1$ and $B_2$, at the marker locus. The distribution of complete triads was derived as follows. Let $G_o$, $G_{pf}$, and $G_{pm}$ be the offspring's, father's, and mother's genotypes, respectively. Let $G_{of}$ and $G_{om}$ be the offspring alleles inherited from the father and mother, respectively. Here, imprinting was not considered. Four joint probabilities are needed, each combining a given parental genotype with the transmission of a given allele to the offspring from that parent, all conditional on the offspring's affected status. When the disease model is recessive, Ott (Table 2, [9]) showed that these probabilities are $(s + \delta/r)s$ for a $B_1B_1$ parent transmitting $B_1$, $(s + \delta/r)(1 - s) - \theta\delta/r$ for a $B_1B_2$ parent transmitting $B_1$, $(1 - s - \delta/r)s + \theta\delta/r$ for a $B_1B_2$ parent transmitting $B_2$, and $(1 - s - \delta/r)(1 - s)$ for a $B_2B_2$ parent transmitting $B_2$, where $r$ is the allele frequency of the recessive disease allele and $s$ is the allele frequency of marker allele $B_1$. The parameter $\theta$ denotes the recombination fraction, and $\delta = p(aB_1) - p(a)p(B_1)$ denotes the disequilibrium coefficient between the marker and the disease locus.
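The four recessive-model probabilities can be checked numerically. The following is a minimal sketch, assuming the symbols lost from the formulas are the disequilibrium coefficient (called `delta` here) and the recombination fraction (`theta`); the function name and the parameter values are illustrative and do not come from the article.

```python
def recessive_transmission_probs(s, r, delta, theta):
    """Joint probabilities of parental genotype and transmitted allele
    for a parent of an affected offspring under a recessive model.

    s: frequency of marker allele B1
    r: frequency of the recessive disease allele
    delta: disequilibrium coefficient (assumed symbol)
    theta: recombination fraction (assumed symbol)
    Returns probabilities in the order:
    (B1B1 transmits B1, B1B2 transmits B1, B1B2 transmits B2, B2B2 transmits B2).
    """
    p_11 = (s + delta / r) * s
    p_12 = (s + delta / r) * (1 - s) - theta * delta / r
    p_21 = (1 - s - delta / r) * s + theta * delta / r
    p_22 = (1 - s - delta / r) * (1 - s)
    return p_11, p_12, p_21, p_22


# Illustrative values only: linkage and association both present.
linked = recessive_transmission_probs(s=0.3, r=0.2, delta=0.05, theta=0.0)
# No association (delta = 0) and no linkage (theta = 0.5), respectively.
no_association = recessive_transmission_probs(s=0.3, r=0.2, delta=0.0, theta=0.0)
no_linkage = recessive_transmission_probs(s=0.3, r=0.2, delta=0.05, theta=0.5)
total = sum(linked)
```

In this sketch the four probabilities sum to one, and the two heterozygous-parent terms coincide when either `delta = 0` or `theta = 0.5`, which is the null hypothesis the TDT targets.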
Let $I_f$, $I_m$, and $I_o$ be binary indicator functions for the father, mother, and offspring having missing genotype information. For example, $I_f = 1$ if the father's genotype is missing and 0 otherwise. Let $P_{o11}$, $P_{o12}$, and $P_{o22}$ denote missing rates for offspring with $B_1B_1$, $B_1B_2$, and $B_2B_2$ genotypes, respectively. Similarly, let $P_{f11}$, $P_{f12}$, and $P_{f22}$ ($P_{m11}$, $P_{m12}$, and $P_{m22}$) denote missing rates for fathers (mothers) with $B_1B_1$, $B_1B_2$, and $B_2B_2$ genotypes, respectively. Note that we do not assume any pattern for the nine missingness parameters, i.e., missingness of a given parental genotype can be dependent on or independent of the other parent's and/or the offspring's genotype. Assuming random mating, one can calculate the conditional probability of ascertaining a complete triad with the father's, mother's, and affected offspring's genotypes being $B_1B_1$, $B_1B_2$, and $B_1B_2$, respectively. The distribution of the remaining ascertained triads can be derived in a similar manner and is displayed in Table 1. $P^{k}_{i,j}$ and $M^{k}_{i,j}$ are the conditional probability and observed count for each type of triad, where $k$ = "0", "1", or "2" represents the total number of $B_1$ alleles transmitted to the offspring, and $i, j$ = "0", "1", or "2" represent the total number of $B_1$ alleles for fathers and mothers, respectively.

Validity of the TDT under various missing patterns
Using the probabilities in Table 1, the conditional probability of a heterozygous parent transmitting the $B_1$ ($B_2$) allele to the affected offspring can be calculated; denote it by $T_1$ ($T_2$). When there is no linkage or no association, $T_1 = T_2$ if and only if offspring genotypes are missing completely at random ($P_{o11} = P_{o12} = P_{o22}$). Therefore, when offspring genotypes are missing informatively (at least two of $P_{o11}$, $P_{o12}$, and $P_{o22}$ are not equal), $T_1 \neq T_2$ under the null hypothesis, and the TDT that excludes incomplete triads from the analysis is not a valid test for linkage and association. The same holds for the HRR proposed by Falk and Rubinstein [6], which is a valid test for association in the presence of linkage.
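The effect of offspring missingness on transmissions from heterozygous parents can be illustrated with an exact enumeration under the null hypothesis. This is a simplified sketch rather than the article's derivation: it assumes random mating, no association (so transmitted and non-transmitted alleles are independent draws with frequency `s` for the first marker allele), and fully observed parents. The function name and the missing rates are hypothetical.

```python
from itertools import product


def het_b1_transmission_share(s, miss_offspring):
    """Share of heterozygous-parent transmissions that carry B1 among
    triads whose offspring genotype is observed, under the null.

    s: frequency of marker allele B1
    miss_offspring: missing rate keyed by the offspring's number of B1
        alleles (2 -> B1B1, 1 -> B1B2, 0 -> B2B2)
    """
    allele_freq = {1: s, 2: 1 - s}
    b1_weight = 0.0
    b2_weight = 0.0
    # Enumerate (transmitted, non-transmitted) alleles of father and mother.
    for ft, fn, mt, mn in product((1, 2), repeat=4):
        prob = (allele_freq[ft] * allele_freq[fn]
                * allele_freq[mt] * allele_freq[mn])
        n_b1 = (ft == 1) + (mt == 1)  # offspring's number of B1 alleles
        observed = prob * (1 - miss_offspring[n_b1])
        for transmitted, untransmitted in ((ft, fn), (mt, mn)):
            if transmitted != untransmitted:  # heterozygous parent
                if transmitted == 1:
                    b1_weight += observed
                else:
                    b2_weight += observed
    return b1_weight / (b1_weight + b2_weight)


# Equal missing rates (MCAR) versus rates that depend on offspring genotype.
share_mcar = het_b1_transmission_share(0.3, {2: 0.2, 1: 0.2, 0: 0.2})
share_informative = het_b1_transmission_share(0.3, {2: 0.5, 1: 0.2, 0: 0.0})
```

Under equal offspring missing rates the share is one half, as the TDT requires. When offspring carrying more copies of the first allele are more often missing, the share falls below one half even though there is neither linkage nor association.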

Simulations
Unrelated nuclear families, each with two affected siblings and complete parental genotypes, were taken from the Genetic Analysis Workshop 15 simulated data (Problem 3). Of the 100 replicates provided, the first 10 were pooled together. To satisfy the assumption of independence among ascertained triads, we randomly selected only one affected offspring from each nuclear family to form the new population for simulations. To reflect realistically complex disease models, missing status was assigned to the affected offspring and their parents. The missing patterns considered were the recessive, dominant, and additive genetic effect models for both major and minor alleles, as indicated in the second column of Table 2. Therefore, only a proportion of families with an affected offspring were eligible for ascertainment, and the total number of families ascertained, including triads, dyads, and monads, was 200.
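The assignment of genotype-dependent missingness can be sketched as follows. This is an illustration under stated assumptions, not the article's simulation code: genotypes are coded as the number of copies of one allele, the toy families are generated without conditioning on disease status, a family whose offspring genotype is missing is treated as not ascertained, and all function names, rates, and the seed are hypothetical.

```python
import random


def draw_genotype(rng, s):
    """Number of copies (0, 1, or 2) of the allele with frequency s."""
    return (rng.random() < s) + (rng.random() < s)


def draw_child(rng, father, mother):
    """Each parent transmits the allele with probability (copies / 2)."""
    return (rng.random() < father / 2) + (rng.random() < mother / 2)


def ascertain(families, miss_father, miss_mother, miss_offspring, n_target, rng):
    """Assign missing status by genotype and count triads, dyads, and monads
    until n_target families have been ascertained."""
    counts = {"triad": 0, "dyad": 0, "monad": 0}
    for father, mother, child in families:
        if sum(counts.values()) == n_target:
            break
        # Assumption: a missing offspring genotype makes the family ineligible.
        if rng.random() < miss_offspring[child]:
            continue
        n_parents = ((rng.random() >= miss_father[father])
                     + (rng.random() >= miss_mother[mother]))
        counts[("monad", "dyad", "triad")[n_parents]] += 1
    return counts


rng = random.Random(2007)
families = []
for _ in range(5000):
    f, m = draw_genotype(rng, 0.3), draw_genotype(rng, 0.3)
    families.append((f, m, draw_child(rng, f, m)))

# Hypothetical missing rates keyed by number of copies of the allele.
rates = {2: 0.5, 1: 0.2, 0: 0.0}
counts = ascertain(families, rates, rates, rates, n_target=200, rng=rng)
```

Only the triads would enter the TDT and HRR, while the dyads would additionally enter the 1-TDT and EM-HRR, so the split among the three family types determines how much data each test uses.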
"SNP6_150" on chromosome 6 and "SNP15_55" on chromosome 15 were used in power and type I error simulations, respectively. Several other SNPs were also considered and gave similar results, which are not shown here. For SNP6_150 (SNP15_55), genotype frequencies are 0.41 (0.31) for the major homozygote, 0.46 (0.50) for the heterozygote, and 0.13 (0.19) for the minor homozygote. A total of 1000 repetitions were conducted for each of the power and type I error simulations. The TDT and HRR were applied to the subset of complete triads. The 1-TDT [4] and EM-HRR [5] were both applied to the subset of complete triads and dyads.
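For reference, the TDT applied to complete triads is the McNemar-type statistic $(b - c)^2/(b + c)$, where $b$ and $c$ count heterozygous parents transmitting the first and second allele, respectively; it is compared with a chi-square distribution with 1 degree of freedom. The sketch below computes it from genotype triads coded as the number of copies of the first allele. The function name and the example triads are hypothetical, and the code is not taken from the article.

```python
def tdt_statistic(triads):
    """Return (b, c, statistic) for complete triads.

    triads: iterable of (father, mother, child) genotypes, each coded as
        the number of copies (0, 1, or 2) of the first marker allele.
    b: transmissions of the first allele from heterozygous parents
    c: transmissions of the second allele from heterozygous parents
    """
    b = c = 0
    for father, mother, child in triads:
        parents = (father, mother)
        n_het = sum(g == 1 for g in parents)
        # Homozygous parents transmit one copy (genotype 2) or none (genotype 0).
        from_homozygous = sum(g // 2 for g in parents if g != 1)
        first_from_het = child - from_homozygous
        if not 0 <= first_from_het <= n_het:
            raise ValueError(f"Mendelian inconsistency: {(father, mother, child)}")
        b += first_from_het
        c += n_het - first_from_het
    statistic = (b - c) ** 2 / (b + c) if b + c > 0 else 0.0
    return b, c, statistic


example = [(1, 1, 2), (1, 0, 1), (1, 2, 2), (1, 1, 1), (2, 1, 1), (1, 0, 0)]
b, c, stat = tdt_statistic(example)
# The 5% critical value of a chi-square with 1 degree of freedom is about 3.841.
reject = stat > 3.841
```

When both parents and the offspring are heterozygous, which parent transmitted which allele is ambiguous, but the counts are not: one transmission of each allele is recorded, which is what the subtraction above yields.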

Results
In Table 2, the first column indicates the model of missingness (1, MCAR for all genotypes; 2, informative missingness for parental genotypes and MCAR for offspring genotypes; 3, informative missingness for all genotypes). The three brackets in the second column display missing rates for the father, mother, and offspring, respectively. The results in the first seven rows indicate that, when offspring genotypes are MCAR, the TDT and HRR are valid tests at the 5% nominal level, as seen in Guo et al. [10]. However, the 1-TDT and EM-HRR were invalid because of type I error inflated above the nominal level when parental genotypes were missing informatively (rows 2-7), which matches the results in Allen et al. [7] and Chen [8]. In addition to previous findings, we also discovered that the power of the 1-TDT and EM-HRR can be not only inflated (rows 2-4) but also reduced (rows 5-7) compared with the scenario under MCAR (row 1), depending on whether the missing rate for genotype "11" is preferentially higher or lower.
The remaining missing patterns (rows 8-13) are those in which genotypes of all family members are missing informatively. When incomplete triads are excluded from the analysis, the TDT and HRR are no longer valid for testing linkage and association. However, incorporation of dyads and monads reduced such biases. We also found that the power of the TDT and HRR excluding incomplete triads can be either reduced (rows 8-10) or inflated (rows 11-13) compared with the scenario under MCAR (row 1), depending on whether the missing rate for genotype "11" is preferentially higher or lower.
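To judge whether an empirical type I error rate from 1000 repetitions departs from the 5% nominal level, it can help to compare it with the Monte Carlo sampling error. The following sketch uses a normal approximation to the binomial, which is an assumption of this illustration and not a procedure described in the article; the function names are hypothetical.

```python
import math


def nominal_level_band(alpha=0.05, n_rep=1000, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true
    rejection probability equals alpha (normal approximation)."""
    se = math.sqrt(alpha * (1 - alpha) / n_rep)
    return alpha - z * se, alpha + z * se


def looks_inflated(rate, alpha=0.05, n_rep=1000):
    """True if the empirical rate exceeds the upper end of the band."""
    return rate > nominal_level_band(alpha, n_rep)[1]


low, high = nominal_level_band()
```

With 1000 repetitions the band is roughly 0.036 to 0.064, so empirical rates well above that range point to genuine inflation rather than simulation noise.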

Discussion
The TDT was introduced to test for linkage and association between a marker and a putative disease locus using case-parent triads. Assuming that offspring genotypes are missing completely at random, the TDT excluding incomplete triads is considered a valid test even when parental genotypes are missing informatively. However, if a specific genotype is preferentially missing for parents, the same is also likely to occur for the affected offspring.
In this article, the conditional distribution of ascertained triads allowing informative missingness for offspring genotypes as well as their parental genotypes was derived. Through mathematical calculations, we showed that the TDT and HRR do not provide a valid test for linkage and association under such a missing pattern. Computer simulations confirmed this conclusion: we observed inflated type I error and/or reduced power for the TDT and HRR under such scenarios. Therefore, if the missing pattern for offspring genotypes is not confirmed to be completely at random, a significant result from the TDT or HRR using only complete triads does not assure true association between the marker and a putative disease locus.