We have previously described a multipoint transmission-disequilibrium test (TDT) method that is based on local smoothing [8]. That study demonstrated that a) TDT statistics at tightly linked markers are correlated, and b) when tightly linked markers are genotyped, the smoothed TDT statistics can achieve greater statistical power than the non-smoothed version. These findings suggest that TDT statistics can be combined across studies, even when the studies have genotyped non-overlapping sets of markers.
Combining data sets with non-overlapping markers and individuals
In this section, we outline our statistical methodology in a simple setting: two studies have genotyped non-overlapping sets of markers on independent sets of individuals in a common genomic region, and both studies have used a case-parents trio design. In each study, the TDT statistics, TDT_A or TDT_B, can be computed at the genotyped markers [9].
To motivate our test statistic, we first consider a marker that has been genotyped in both studies. In Study A, let b1 denote the number of informative transmissions, in which A alleles are transmitted but a alleles are not transmitted, and let c1 denote the converse (i.e., a alleles but not A alleles are transmitted). Likewise, let b2 and c2 denote the corresponding numbers of informative transmissions in Study B. With complete genotype data, we would compute the TDT by pooling the data:
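$$\mathrm{TDT}_{\mathrm{pooled}} = \frac{[(b_1 + b_2) - (c_1 + c_2)]^2}{b_1 + b_2 + c_1 + c_2} = \frac{n_1}{n_1 + n_2}\,\mathrm{TDT}_A + \frac{n_2}{n_1 + n_2}\,\mathrm{TDT}_B + \frac{2\,(b_1 - c_1)(b_2 - c_2)}{n_1 + n_2}, \tag{1}$$
where n_i = b_i + c_i, TDT_A = (b_1 - c_1)^2 / n_1, and TDT_B = (b_2 - c_2)^2 / n_2.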
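We next show that, under the null hypothesis, the last term in Eq. (1) has an expectation of 0. Let R_1 = B_1 + C_1 > 0 (the capital letters denote the random variables), and R_2 = B_2 + C_2 > 0. Under the null hypothesis, L(B_1 | R_1) ~ Binom(R_1, 0.5), L(B_2 | R_2) ~ Binom(R_2, 0.5), and B_1 and B_2 are independent. We then have:
$$E[(B_1 - C_1)(B_2 - C_2) \mid R_1, R_2] = E[B_1 - C_1 \mid R_1]\; E[B_2 - C_2 \mid R_2] = 0,$$
because E[B_i - C_i | R_i] = 2 E[B_i | R_i] - R_i = 0 for i = 1, 2.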
Therefore, under the null hypothesis, the pooled TDT statistic is nearly a weighted average of the corresponding TDT statistics in the respective studies. The weights are proportional to the number of informative parents (n_i = b_i + c_i). Assuming that the two studies sampled comparable populations (e.g., allele frequencies are similar at all loci), we approximate these weights by the number of trios.
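The decomposition of the pooled statistic can be checked numerically. The following is a minimal sketch, assuming the standard TDT form (b - c)^2 / (b + c); the transmission counts and the function names are hypothetical and chosen only for illustration.

```python
def tdt(b, c):
    """Standard TDT statistic for b and c informative transmissions."""
    return (b - c) ** 2 / (b + c)


def pooled_decomposition(b1, c1, b2, c2):
    """Split the pooled TDT into a weighted average plus a cross term.

    The weights are proportional to the informative transmissions
    n_i = b_i + c_i in each study.
    """
    n1, n2 = b1 + c1, b2 + c2
    weighted = (n1 * tdt(b1, c1) + n2 * tdt(b2, c2)) / (n1 + n2)
    cross = 2 * (b1 - c1) * (b2 - c2) / (n1 + n2)
    return weighted, cross


# Hypothetical counts for a marker genotyped in both studies.
b1, c1, b2, c2 = 30, 20, 25, 15
weighted, cross = pooled_decomposition(b1, c1, b2, c2)
pooled = tdt(b1 + b2, c1 + c2)
print(pooled, weighted + cross)
```

For any single data set the cross term is generally nonzero; it is only its expectation under the null hypothesis that vanishes, which is why the pooled statistic is described as nearly, rather than exactly, a weighted average.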
For a marker that is not genotyped in one study (but is genotyped in the other), we impute the TDT statistic using neighboring markers. We then add the observed and imputed TDT scores from the two studies. Suppose M markers have been genotyped by either Study A or Study B. Denote the physical locations of these markers by {t_1, ..., t_M}. Let v^A = (v^A_1, ..., v^A_M) be a vector whose i-th entry indicates whether marker i is genotyped in Study A (v^A_i = 1) or not (v^A_i = 0). Denote the TDT statistics computed from each of the two samples as TDT_A and TDT_B, respectively. Let f_A(t) be the result of fitting a local linear regression to the (t_i, TDT_A(t_i)) data at the markers genotyped in Study A. At each marker, we compute:
$$T_A(t_i) = v^A_i\,\mathrm{TDT}_A(t_i) + (1 - v^A_i)\, f_A(t_i).$$
In other words, if a marker t_i is genotyped in Study A (v^A_i = 1), we simply take the observed TDT statistic; if marker t_i is not genotyped in Study A, we impute its expected TDT statistic using the predicted value f_A(t_i). Similarly, we compute T_B(t) using data from Study B. The combined test statistic at each marker, TDT_comb(t_i), is obtained by adding the scores T_A(t_i) and T_B(t_i) from the two studies.
We implement the imputation step using the loess function in R. The choice of the smoothing parameter depends on many factors, such as the age of the disease mutation, the population under study, and the marker density. While an optimal window size is difficult to define, an examination of inter-marker LD guides our choice: we seek a region within which the genotyped markers are in high LD. Roughly speaking, we are faced with a trade-off between bias and variance: smoothing over a wide region tends to reduce the variance of the imputed statistics at the cost of increased bias. An alternative to loess with a pre-specified bandwidth is therefore a smoothing spline with the degrees of freedom chosen by cross-validation. To properly account for the imputation, and to correct for multiple comparisons, we perform a simulation-based test: conditioning on the parental genotypes, we generate the transmitted and the non-transmitted haplotypes under the null hypothesis, re-impute the TDT statistics, and compute TDT_comb on the simulated data. The observed max_i TDT_comb(t_i) is compared with the null distribution of the corresponding maxima in the simulated data.
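The imputation, combination, and max-statistic steps can be sketched as follows. This is an illustrative Python/NumPy stand-in, not the R loess call used in the analysis: `local_linear` is a simplified tricube-weighted local linear fit, all function names are hypothetical, and weighting the two studies by their numbers of trios is an assumption drawn from the weighted-average argument above.

```python
import numpy as np


def local_linear(t_obs, y_obs, t_new, span):
    """Tricube-weighted local linear fit using the `span` nearest observed markers."""
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    t_new = np.asarray(t_new, dtype=float)
    k = min(span, len(t_obs))
    fitted = np.empty(len(t_new))
    for j, t0 in enumerate(t_new):
        d = np.abs(t_obs - t0)
        # Bandwidth just beyond the k-th nearest marker, so k markers get positive weight.
        h = 1.001 * np.sort(d)[k - 1] + 1e-12
        w = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3
        X = np.column_stack([np.ones_like(t_obs), t_obs - t0])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y_obs * sw, rcond=None)
        fitted[j] = coef[0]  # the intercept is the fitted value at t0
    return fitted


def impute_scores(t, tdt, genotyped, span=5):
    """Observed TDT where the marker is genotyped, smoothed prediction elsewhere."""
    t = np.asarray(t, dtype=float)
    tdt = np.asarray(tdt, dtype=float)
    g = np.asarray(genotyped, dtype=bool)
    smooth = local_linear(t[g], tdt[g], t, span)
    return np.where(g, tdt, smooth)


def combine(score_a, score_b, n_trios_a, n_trios_b):
    """Combine per-marker scores; trio-proportional weights are an assumption."""
    total = n_trios_a + n_trios_b
    return (n_trios_a * np.asarray(score_a) + n_trios_b * np.asarray(score_b)) / total


def empirical_p_value(observed_max, null_maxima):
    """Share of simulated maxima at least as large as the observed maximum."""
    null_maxima = np.asarray(null_maxima, dtype=float)
    return (1 + np.sum(null_maxima >= observed_max)) / (1 + len(null_maxima))
```

In use, `impute_scores` would be called once per study on the full set of marker positions, with `np.nan` at ungenotyped markers. `null_maxima` would hold the maximum combined score from each simulated data set, after re-imputing on the simulated transmissions. The `span` argument plays the role of the smoothing parameter discussed above.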
Data set example
To illustrate our proposed method, we analyze Replicate 1 of the simulated RA data. This data set consists of 1500 nuclear families, each of which has both parents and two affected children genotyped. It is known that there is a strong effect of DR type at the HLA locus on chromosome 6. A simple TDT analysis using all 1500 families unambiguously demonstrates preferential transmission of DR-2 or DR-3 alleles to the affected individuals. However, is the DR allele the sole variant affecting the disease in the region? To address this question, we examine the transmission from parents who are homozygous 1/1 at the DR locus. If the DR locus explains the entire association in the region, conditioning on parents being 1/1, there should not be preferential transmission at any markers nearby. Among 1500 mothers, 70 have genotype 1/1. Our analyses highlight a practical difficulty: performing stratified analysis on a subset of samples further reduces the sample size; thus, stratified analyses are particularly likely to suffer from small sample size even when the main study has good power.
On chromosome 6, we restrict ourselves to the 293 SNP markers falling within 1.5 × 10^6 bp around the DR locus. We consider a situation in which each third of the families is genotyped on a different platform. The 293 SNPs are randomly divided into three sets, and there is no overlap among the three sets of markers or individuals. Because the risk of RA is much higher among women, we hypothesize that there may be a gene × sex interaction. Furthermore, there has been ambiguous evidence regarding maternally transmitted risk elements [10]. Therefore, we look at four types of transmission: father to son, father to daughter, mother to son, and mother to daughter. Because the phase is known for all the affected children, the four types of transmission can be examined independently. For each type of transmission, we perform a TDT analysis on each of the three subsets of families. Because the diagnosis of RA is often ambiguous, we hypothesize that the more severe cases are more likely to carry the genetic risk factor. Therefore, a severity measure, on a scale of 1 to 5, is used as a relative weight. We then combine the three sets of TDT scores to compute TDT_comb, with a bandwidth of approximately 15 markers.
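The random division of markers into non-overlapping sets can be sketched as below. This is only an illustration: the function name and the seed are hypothetical, and the actual assignment used in the analysis is not specified beyond being random.

```python
import numpy as np


def split_markers(n_markers, n_sets, seed=0):
    """Randomly partition marker indices into disjoint, near-equal sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_markers)
    # Sort within each set so markers stay in physical order for smoothing.
    return [np.sort(part) for part in np.array_split(shuffled, n_sets)]


marker_sets = split_markers(293, 3)
```

The same kind of partition can be applied to family indices, so that each third of the families is paired with one marker set.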