Consider a sample of n nuclear families. Suppose that each sampled individual is genotyped at M markers across the genome or in a candidate region, and that all children in the nuclear families are affected. Schaid et al. [1] proposed a general score test for association of a multi-allelic genetic marker using the case-parents design. We first extend this method to handle multiple affected children per family and multi-marker haplotypes. Because each family in the GAW15 Problem 3 data has two affected children, we consider only the case of two affected children here. The approach extends straightforwardly to families with more than two affected children.
General score tests for multiple children
We use $D_1$ and $D_2$ to represent the first and the second affected children, respectively. Let $g_{c1}$, $g_{c2}$, $g_m$, and $g_f$ denote the genotypes of the first child, the second child, the mother, and the father, respectively. The probability of the genotype of an affected child, given the genotypes of the parents, is

$$P(g_{ck} \mid g_m, g_f, D_k) = \frac{P(D_k \mid g_{ck})}{\sum_{g^* \in G} P(D_k \mid g^*)}, \qquad k = 1, 2.$$
Here, G is the set of the four possible genotypes the parents can produce. Choosing a baseline genotype, let r(g) be the relative risk of genotype g relative to the baseline genotype. Following Schaid et al. [1], we use a log-linear model for the relative risk, that is, $r(g) = \exp(X^T\beta)$, where X is the numerical coding of the genotype g (see the Coding section). Then the conditional likelihood of one family is given by

$$L = \prod_{k=1}^{2} \frac{r(g_{ck})}{\sum_{g^* \in G} r(g^*)}.$$
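As an illustration only, the following Python sketch evaluates the conditional likelihood of one family under the log-linear model. It assumes that phased genotypes are stored as pairs of haplotype labels, that the genotype is coded by haplotype counts, and that the two children's genotypes are conditionally independent given the parental genotypes, so that the family likelihood is the product of the two children's terms. The function names are hypothetical and are not taken from the original analysis.

```python
import itertools
import math

def offspring_genotypes(mother, father):
    """Return the four (maternal, paternal) haplotype pairs the parents can transmit."""
    return list(itertools.product(mother, father))

def count_coding(genotype, haplotypes):
    """Code a genotype by the number of copies of each haplotype (hypothetical helper)."""
    return [genotype.count(h) for h in haplotypes]

def relative_risk(x, beta):
    """Log-linear relative risk r(g) = exp(x' beta)."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

def family_likelihood(children, mother, father, haplotypes, beta):
    """Conditional likelihood of the affected children's genotypes given the parents.

    Assumes the children are conditionally independent given the parental genotypes.
    """
    possible = offspring_genotypes(mother, father)
    denom = sum(relative_risk(count_coding(g, haplotypes), beta) for g in possible)
    likelihood = 1.0
    for child in children:
        likelihood *= relative_risk(count_coding(child, haplotypes), beta) / denom
    return likelihood
```

With β = 0 every transmissible genotype has relative risk 1, so each child contributes a factor of 1/4 regardless of the genotypes observed.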
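If there are n families, denote the numerical codings of $g_{c1}$ and $g_{c2}$ in the ith family by $X_{i1}$ and $X_{i2}$, respectively. The likelihood function can be written as

$$L = \prod_{i=1}^{n} \frac{\exp(X_{i1}^T\beta)\,\exp(X_{i2}^T\beta)}{\left[\sum_{g^* \in G_i} \exp(X^{*T}\beta)\right]^2},$$

where $X^*$ is the coding vector associated with a genotype $g^*$ and $G_i$ is the set of the four possible genotypes the parents of the ith family can produce. Following the general form of Rao's score test, the score test statistic satisfies

$$S = U^T V^{-1} U \sim \chi^2_r,$$

where the degrees of freedom r is the rank of the matrix V, the information matrix of the likelihood function L with elements $V_{jk} = -\partial^2 \ln L / \partial\beta_j\,\partial\beta_k\,|_{\beta=0}$, and $U = \partial \ln L / \partial\beta\,|_{\beta=0}$. The quantities U and V can be expressed as

$$U = \sum_{i=1}^{n} \left(X_{i1} + X_{i2} - 2\bar{X}_i\right), \qquad V = 2\sum_{i=1}^{n} V_i,$$

with

$$\bar{X}_i = \frac{1}{4}\sum_{j=1}^{4} X^*_{ij}, \qquad V_i = \frac{1}{4}\sum_{j=1}^{4} \left(X^*_{ij} - \bar{X}_i\right)\left(X^*_{ij} - \bar{X}_i\right)^T,$$

where $X^*_{ij}$, j = 1, 2, 3, 4, are the numerical codings corresponding to the four possible genotypes that the parents of the ith family can produce.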
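To make the computation concrete, the sketch below works through the simplest case, in which the coding is a scalar (for example, the number of copies of one haplotype), so that U and V are numbers and S has one degree of freedom. It assumes that each family contributes the two children's codes minus twice the mean of the four transmissible codes to U, and twice the variance of those four codes to V, which is what differentiating the log-likelihood at β = 0 gives when the two children are treated as conditionally independent given the parents. The function name and input layout are hypothetical. The p-value uses the identity P(χ²₁ > S) = erfc(√(S/2)).

```python
import math

def score_test_scalar(families):
    """Score test for a scalar genotype coding with two affected children per family.

    families: list of (x1, x2, possible), where x1 and x2 are the children's codes
    and possible lists the codes of the four genotypes the parents can transmit.
    Returns (S, p_value), or None if no family is informative (V == 0).
    """
    U = 0.0
    V = 0.0
    for x1, x2, possible in families:
        mean = sum(possible) / 4.0
        var = sum((x - mean) ** 2 for x in possible) / 4.0
        U += x1 + x2 - 2.0 * mean   # score contribution at beta = 0
        V += 2.0 * var              # information contribution at beta = 0
    if V == 0.0:
        return None
    S = U * U / V
    p_value = math.erfc(math.sqrt(S / 2.0))  # chi-square tail with 1 df
    return S, p_value
```

For a vector coding, the same quantities become a score vector and an information matrix. When V is singular, as it is for haplotype-count codings whose elements always sum to 2, a generalized inverse would be needed in place of the ordinary inverse.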
Coding
Suppose that one haplotype block contains m distinct haplotypes, denoted by $h_1, \ldots, h_m$. For each person, the genotype in this block, denoted by g, can be any combination of two haplotypes selected from $h_1, \ldots, h_m$. Under the assumption that the phase information of the genotype is known, we use two different ways to code the genotypes.
The first coding scheme is defined as follows. Let X denote an m-dimensional vector, $X = (x_1, \ldots, x_m)$. The jth element $x_j$ is the number of copies of haplotype $h_j$ in the genotype g, so $x_j$ takes the value 0, 1, or 2 when g has 0, 1, or 2 copies of $h_j$, respectively. In the second coding scheme, we test whether a specific haplotype $h_L$ is associated with the disease. In this case, X is a scalar, taking the value 0, 1, or 2 when g has 0, 1, or 2 copies of $h_L$, respectively. Using this coding, a block with m distinct haplotypes yields m tests. Let $p_1, \ldots, p_m$ denote the p-values of the m tests. To obtain an overall test of association between the haplotype block and the disease, we test the null hypothesis $H_0$ that no haplotype in the block is associated with the disease against the alternative that at least one haplotype is. The p-value for testing $H_0$ is the Bonferroni-corrected minimum, $p = m \times \min\{p_1, \ldots, p_m\}$. Thus, using either of the two coding schemes, we obtain one p-value for each haplotype block (or single marker).
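The two coding schemes and the block-level p-value can be sketched in a few lines of Python. The sketch assumes a phased genotype is stored as a pair of haplotype labels. The function names are hypothetical, and capping the corrected p-value at 1 is an added convention that the text above does not state.

```python
def code_vector(genotype, haplotypes):
    """First coding: number of copies of each of the m haplotypes in the genotype."""
    return [genotype.count(h) for h in haplotypes]

def code_scalar(genotype, target):
    """Second coding: number of copies (0, 1, or 2) of one specific haplotype."""
    return genotype.count(target)

def block_p_value(p_values):
    """Bonferroni-corrected minimum of the m single-haplotype p-values.

    The cap at 1.0 is an assumption added here so the result stays a valid p-value.
    """
    m = len(p_values)
    return min(1.0, m * min(p_values))
```

For example, with haplotypes `["h1", "h2", "h3"]`, the genotype `("h1", "h3")` has vector coding `[1, 0, 1]` and scalar coding 0 with respect to `"h2"`.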
Select significant SNPs by controlling false-discovery rate (FDR)
Suppose we have B haplotype blocks. Let $P_i$ denote the p-value of the test of association between the ith haplotype block and the disease, obtained from the score test statistic discussed above. Denote the ordered p-values by $P_{(1)} \le \cdots \le P_{(B)}$. A block is considered to be associated with the trait if its p-value does not exceed a threshold $\delta_B$, which is determined by controlling the FDR at level α [5]. The threshold $\delta_B$ can be calculated by

$$\delta_B = \max\left\{P_{(i)} : P_{(i)} \le \frac{i\alpha}{B},\ 1 \le i \le B\right\}.$$

We choose the blocks whose p-values satisfy $p \le \delta_B$ as the blocks that have a significant association with the disease.
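A minimal sketch of the thresholding step follows. It assumes that the FDR procedure cited as [5] is the Benjamini-Hochberg step-up rule, in which the threshold is the largest ordered p-value $P_{(i)}$ satisfying $P_{(i)} \le i\alpha/B$. The function names, and the choice to return `None` when no p-value qualifies, are hypothetical.

```python
def fdr_threshold(p_values, alpha=0.05):
    """Largest ordered p-value P_(i) with P_(i) <= alpha * i / B, or None if none qualifies.

    Assumes the Benjamini-Hochberg step-up rule.
    """
    B = len(p_values)
    ordered = sorted(p_values)
    passing = [p for i, p in enumerate(ordered, start=1) if p <= alpha * i / B]
    return max(passing) if passing else None

def significant_blocks(p_values, alpha=0.05):
    """Indices of the blocks whose p-values do not exceed the FDR threshold."""
    delta = fdr_threshold(p_values, alpha)
    if delta is None:
        return []
    return [i for i, p in enumerate(p_values) if p <= delta]
```

Because the rule is step-up, a block can be selected even if its own p-value exceeds its rank-specific bound, provided that a larger ordered p-value meets its bound.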
Haplotype blocks
One of the main objectives of this analysis is to compare the performance of the score test under different definitions of haplotype blocks. We consider three methods for finding haplotype blocks. The first, which we call the tight block method, yields blocks with limited haplotype diversity within each block. The second uses evenly spaced blocks. The third treats each single marker as a block. Many recently developed approaches can be used to find haplotype blocks with limited haplotype diversity within each block. We use a modified version of the approach developed by Zhu et al. [6] to find tight blocks. Consider two biallelic markers: marker A with alleles $A_1$ and $A_2$, and marker B with alleles $B_1$ and $B_2$. Let $p_{11}$ denote the population frequency of haplotype $A_1B_1$, and let $p_{A_i}$ and $p_{B_i}$ denote the population frequencies of alleles $A_i$ and $B_i$ (i = 1, 2), respectively. One measure of linkage disequilibrium (LD), $r^2$, which is directly related to the statistical power of association tests, is defined by

$$r^2 = \frac{\left(p_{11} - p_{A_1}p_{B_1}\right)^2}{p_{A_1}\,p_{A_2}\,p_{B_1}\,p_{B_2}}.$$
The approach of Zhu et al. [6] to finding tight blocks is roughly equivalent to finding blocks in which every pair of markers has an $r^2$ value greater than a threshold $r_0$. For the purpose of the power comparison, we choose $r_0 = 0.2$ for our analysis.
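The pairwise criterion can be illustrated with a short Python sketch. It computes the standard $r^2$ measure from the haplotype frequency of $A_1B_1$ and the two allele frequencies, then checks whether every pair of markers in a candidate block exceeds the threshold. Taking the pairwise LD value to be $r^2$ is an assumption, the function names and the matrix input are hypothetical, and the sketch is not the block-finding algorithm of Zhu et al. [6].

```python
from itertools import combinations

def r_squared(p11, p_a1, p_b1):
    """LD measure r^2 for two biallelic markers.

    p11: frequency of haplotype A1B1; p_a1, p_b1: frequencies of alleles A1 and B1.
    """
    p_a2 = 1.0 - p_a1
    p_b2 = 1.0 - p_b1
    denom = p_a1 * p_a2 * p_b1 * p_b2
    if denom == 0.0:
        raise ValueError("both markers must be polymorphic")
    d = p11 - p_a1 * p_b1  # deviation from linkage equilibrium
    return d * d / denom

def is_tight_block(r2_matrix, markers, r0=0.2):
    """True if every pair of markers in the candidate block has r^2 above r0."""
    return all(r2_matrix[i][j] > r0 for i, j in combinations(markers, 2))
```

For two markers in complete LD with allele frequencies of 0.5, `r_squared(0.5, 0.5, 0.5)` gives 1, and under linkage equilibrium, `r_squared(0.25, 0.5, 0.5)` gives 0.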
We also use the program HaploBlockFinder V0.7 [7] to find the tight blocks. The power calculations resulting from each of these two approaches are very similar. Thus, we only report the results based on tight blocks found by the approach in Zhu et al. [6].