Consider a sample of n nuclear families and N unrelated controls. Suppose that we have genotyped M markers across the genome or in a candidate region for each sampled individual. All children in the nuclear families are affected, and the disease status of the parents is available. We consider this kind of sample because it is the design of the GAW15 Problem 3 data set. To detect disease susceptibility loci under this sample structure, we propose three methods. All three are two-stage approaches: the methods in the first stage screen and select SNPs, and the method in the second stage tests for association at the selected SNPs.
The three approaches we propose for the first stage are based on a test statistic for a case-control study. Consider a case-control study with $N_1$ cases and $N_2$ controls, in which each sampled individual is genotyped at a bi-allelic marker with alleles A and a. To test the association between the marker and the disease, one can use the statistic

$$T = \frac{\hat{p}_1 - \hat{p}_2}{\hat{\sigma}}, \tag{1}$$

where $\hat{p}_1$ and $\hat{p}_2$ are the sample frequencies of allele A in cases and controls, respectively; $\hat{\sigma}^2 = p_0(1 - p_0)\left(\frac{1}{2N_1} + \frac{1}{2N_2}\right)$ is the estimate of the variance of $\hat{p}_1 - \hat{p}_2$; and $p_0$ is the sample frequency of allele A in the whole sample. Under the null hypothesis of no association, this test statistic asymptotically follows a standard normal distribution. When the absolute value of $T$ is large, we reject the null hypothesis of no association. Based on the test statistic $T$, we propose the following three tests that can be used in the first stage to screen SNPs:
1. Consider affected parents of the sampled nuclear families as cases and unaffected parents of the sampled nuclear families as controls. The test statistic $T$ based on this sample is denoted by $T_{cc}$. $T_{cc}$ uses only the nuclear families and does not need the unrelated controls.
2. Consider all the parents of the n sampled nuclear families as cases and the N sampled unrelated controls as controls. The test statistic $T$ based on this sample is denoted by $T_{pc}$. If A is a high-risk allele, the frequency of A among the parents should be higher than that in the controls, because each pair of parents has at least one affected child.
3. The third approach is a combination of $T_{pc}$ and $T_{cc}$. The test statistic of this approach is Fisher's combination of the p-values of the two tests and is given by $T_{cb} = -2(\log P_1 + \log P_2)$, where $P_1$ and $P_2$ are the p-values of the tests $T_{pc}$ and $T_{cc}$, respectively. Under the null hypothesis of no association, $T_{cb}$ follows a $\chi^2$ distribution with 4 degrees of freedom [8].
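As a rough illustration of the first-stage statistics, the following Python sketch computes the allele-based statistic T with a two-sided normal p-value, and then Fisher's combination of two p-values. The function names are hypothetical, and the pooled-variance form p0(1 - p0)(1/(2N1) + 1/(2N2)) is an assumption made for this example rather than something stated in the text. The closed-form tail probability used for the combined test holds only for a χ2 distribution with 4 degrees of freedom.

```python
import math


def allele_test(count_a_cases, n_cases, count_a_controls, n_controls):
    """Allele-based case-control statistic T and its two-sided p-value.

    count_a_* is the number of A alleles observed in the group and n_* is
    the number of individuals (each individual carries two alleles).
    The pooled variance estimate below is an assumption of this sketch.
    """
    p1 = count_a_cases / (2 * n_cases)
    p2 = count_a_controls / (2 * n_controls)
    p0 = (count_a_cases + count_a_controls) / (2 * (n_cases + n_controls))
    var = p0 * (1 - p0) * (1 / (2 * n_cases) + 1 / (2 * n_controls))
    if var == 0:
        raise ValueError("marker is monomorphic in the sample")
    t = (p1 - p2) / math.sqrt(var)
    # Two-sided p-value under the standard normal null distribution.
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return t, p_value


def fisher_combination(p1, p2):
    """Fisher's combination T_cb = -2(log P1 + log P2) and its p-value."""
    t_cb = -2 * (math.log(p1) + math.log(p2))
    # For 4 degrees of freedom the chi-square upper tail is
    # exp(-x/2) * (1 + x/2), so no statistics library is needed.
    p_cb = math.exp(-t_cb / 2) * (1 + t_cb / 2)
    return t_cb, p_cb
```

In this sketch, T_cc would come from calling `allele_test` on affected versus unaffected parents, T_pc from calling it on all parents versus unrelated controls, and the two resulting p-values would be passed to `fisher_combination`.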
We use the PDT [7] to test association in the second stage. Suppose there are $n_i$ affected children in the $i$th family. For a biallelic marker with two alleles A and a, we code the three genotypes aa, Aa, and AA as 0, 1, and 2, respectively. Let $X_{ij}$, $X_{iF}$, and $X_{iM}$ denote the codes of the genotypes of the $j$th child, father, and mother in the $i$th family. Let
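$U_i = \sum_{j=1}^{n_i}\left(X_{ij} - \frac{X_{iF} + X_{iM}}{2}\right)$, $U = \sum_{i=1}^{n} U_i$, and $V = \sum_{i=1}^{n} U_i^2$.
Then the test statistic of the PDT is given by $T_{PDT} = U/\sqrt{V}$. Under the null hypothesis of no association, the PDT statistic asymptotically follows the standard normal distribution.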
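The second-stage computation can be sketched in Python as follows. This is a minimal sketch that assumes the sum form of the PDT: each family contributes the sum, over its affected children, of the child's genotype code minus the average of the two parental codes, and the statistic is the total of these contributions divided by the square root of the sum of their squares. The function name and the input layout are hypothetical.

```python
import math


def pdt_statistic(families):
    """PDT-style statistic and two-sided p-value for one marker.

    families is a list of (father_code, mother_code, child_codes) tuples,
    where codes are 0, 1, or 2 copies of allele A and child_codes lists
    the affected children. The per-family sum form is an assumption.
    """
    u_total = 0.0
    v_total = 0.0
    for father, mother, children in families:
        mid_parent = (father + mother) / 2
        # Family contribution: excess of allele A in the affected children
        # relative to what the parents' genotypes lead us to expect.
        u_i = sum(child - mid_parent for child in children)
        u_total += u_i
        v_total += u_i ** 2
    if v_total == 0:
        raise ValueError("no informative families at this marker")
    z = u_total / math.sqrt(v_total)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Families in which both parents are homozygous contribute zero to both sums, which is why the sketch raises an error when no family is informative.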
When we apply the two-stage approaches, we first apply one of $T_{pc}$, $T_{cc}$, or $T_{cb}$ to each of the $M$ markers and obtain $M$ p-values. We then select the $L$ markers with the smallest p-values (we will discuss later how to choose $L$). Next, we apply the PDT to the $L$ selected SNPs and declare a SNP significant if the p-value of the PDT at this marker is no larger than a threshold $\delta_{L\alpha}$. The threshold $\delta_{L\alpha}$ is determined by controlling the false discovery rate (FDR), the expected ratio of the number of falsely rejected null hypotheses to the total number of rejected null hypotheses, at level $\alpha$. To control the FDR, we can choose the cut-off $\delta_{L\alpha}$ as follows [4]: let $p_{(1)} \le \dots \le p_{(L)}$ be the ordered p-values when we apply the PDT to the $L$ selected markers; then $\delta_{L\alpha} = \max\{p_{(i)} : p_{(i)} \le i\alpha/L\}$.
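The screening and testing steps can be sketched in Python as below. This is an illustrative sketch that assumes the cut-off is the standard step-up rule for FDR control, namely the largest ordered p-value p(i) satisfying p(i) ≤ iα/L, with a cut-off of 0 when no p-value qualifies. The function names, the dictionary input, and the `stage2_test` callable are hypothetical.

```python
def fdr_threshold(p_values, alpha):
    """Largest ordered p-value p_(i) with p_(i) <= i * alpha / L (0.0 if none)."""
    n_tests = len(p_values)
    threshold = 0.0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i * alpha / n_tests:
            threshold = p
    return threshold


def two_stage(stage1_p, stage2_test, n_select, alpha):
    """Screen markers in stage one, then test the survivors in stage two.

    stage1_p maps marker name -> first-stage p-value.
    stage2_test is a callable returning the second-stage p-value of a marker.
    Returns (significant_markers, threshold).
    """
    # Stage 1: keep the n_select markers with the smallest p-values.
    selected = sorted(stage1_p, key=stage1_p.get)[:n_select]
    # Stage 2: test only the selected markers and apply the FDR cut-off.
    stage2_p = {marker: stage2_test(marker) for marker in selected}
    delta = fdr_threshold(list(stage2_p.values()), alpha)
    significant = [m for m in selected if stage2_p[m] <= delta]
    return significant, delta
```

Because the multiplicity correction is applied to L tests rather than M, the cut-off is less stringent than it would be if every marker were tested in the second stage.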
In our simulation studies and in our analysis of the GAW15 simulated data, we calculate the power of the two-stage test to detect one disease locus, say locus D, as follows. Suppose that there are $K$ replicated samples. Let $k$ denote the number of samples in which locus D is selected in the first stage and the p-value of locus D in the second stage is no larger than $\delta_{L\alpha}$. Then the power to detect locus D is $k/K$.
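The power calculation amounts to counting replicates, as in the short Python sketch below. It assumes that each replicate has already been reduced to the set of markers declared significant by the two-stage procedure, so a marker in the set was both selected in the first stage and passed the second-stage threshold. The function name and input layout are hypothetical.

```python
def empirical_power(significant_by_replicate, locus):
    """Fraction k/K of replicates in which `locus` is declared significant.

    significant_by_replicate is a list with one entry per replicate, each
    entry being the collection of markers declared significant there.
    """
    n_replicates = len(significant_by_replicate)
    if n_replicates == 0:
        raise ValueError("at least one replicate is required")
    hits = sum(1 for markers in significant_by_replicate if locus in markers)
    return hits / n_replicates
```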