Genome-wide association tests by two-stage approaches with unified analysis of families and unrelated individuals

Multiple testing is a problem in genome-wide or region-wide association studies. In this report, we consider a study design given by the Genetic Analysis Workshop 15 (GAW15) Problem 3 – nuclear families (parents with their affected children) and unrelated controls. Based on this design, we propose three two-stage approaches to deal with the problem of multiple testing. The tests in the first stage, statistically independent of the association test used in the second stage, are used to screen or select single-nucleotide polymorphisms (SNPs). Then, in the second stage, a family-based association test is performed on a much smaller set of selected SNPs. Thus, the problem of multiple testing is much less severe. Our simulation studies and application to the dense SNP data of chromosome 6 in the GAW15 Problem 3 show that the two-stage methods are more powerful than the one-stage method (using the family-based association test only).


Background
Genome-wide or region-wide association is a promising approach to mapping complex disease genes [1,2]. However, the success of genome-wide or region-wide association studies will depend on whether the information gain of increased number of single-nucleotide polymorphisms (SNPs) will be diluted by the multiple-comparison problem [3]. When tens or hundreds of thousands of SNPs are tested for association, the p-values need to be adjusted for controlling type I error rates. Most multiple-testing adjustment approaches, including Bonferroni correction for controlling the family-wise error rate and the method proposed by Benjamini and Hochberg [4] for controlling the false discovery rate (FDR), become more conservative as more tests are done.
In case-control studies, several authors have proposed a two-stage design that utilizes two independent samples [5,6]. The first sample is used to screen and select SNPs for association tests. The association tests are conducted on the selected SNPs by using the second sample, so that the number of association tests is diminished and the correction for multiple testing is less severe. Recently, in mapping quantitative trait loci using family data, Van Steen et al. [3] proposed an interesting approach that performs the SNP screening and association test using the same sample.
The basic idea of Van Steen et al.'s method is that the screening test based on the traits and between-family genotype scores is statistically independent of the association test that depends on trait values and within-family genotype scores. The screening test is used first to select SNPs, and the association test is performed on a much smaller set of selected SNPs. Unfortunately, the same idea cannot be applied to family-based analyses for qualitative traits.
In this article, we propose several two-stage methods to test association for qualitative traits by using nuclear families (including parental phenotypes) or nuclear families and unrelated controls. To analyze the data set of nuclear families, we compare the allele frequency in affected parents with that of unaffected parents (test I) to screen and select SNPs. Then the pedigree disequilibrium test (PDT) [7] is used to perform the association test on the selected SNPs by comparing the alleles that are transmitted to the children with those that are not transmitted. To analyze the data set that contains nuclear families and unrelated controls, as the data set in the GAW15 Problem 3, we propose two methods to screen SNPs. One is comparing the allele frequency in parents with that in unrelated controls (test II). The other is a combination of test I and II. All the proposed screening tests are independent of the association test, that is, the PDT. Furthermore, because a significant association only depends on the results of the PDT, the proposed two-stage approaches are robust to population admixture. We compare the performance of the proposed methods by using PDT alone through simulation studies and analysis of the data set of the GAW15 Problem 3. Our simulation and the GAW15 data analysis results show that the three proposed two-stage methods have correct type I error rates and, in most cases, are more powerful than the PDT.

Methods
Consider a sample of n nuclear families and N unrelated controls. Suppose that we have genotyped M markers across the genome or in a candidate region for each sampled individual. Also, all children in the nuclear families are affected and the disease status of the parents is available. The reason for considering this kind of sample is the design of the GAW15 Problem 3 data set. To detect disease susceptibility loci, based on the sample structure, we proposed three methods. All three methods are two-stage approaches -the methods in the first stage are used to screen and select SNPs and those in the second stage are used to test the association on the selected SNPs.
The three approaches we propose for the first stage are based on a test statistic for a case-control study. Consider a case-control study with N 1 cases and N 2 controls, and each sampled individual has a genotype at a bi-allelic marker with two alleles A and a. To test the association between the marker and the disease, one can use the statistic: where and are the sample frequencies of allele A in cases and controls, respectively; is the estimate of the variance of -; p 0 is the sample allele frequency of allele A in the whole sample. Under the null hypothesis of no association, this test statistic asymptotically follows a standard normal distribution. When the absolute value of T is large, we reject the null hypothesis of no association. Based on the test statistic T, we propose the following three tests that can be used in the first stage to screen SNPs: 1. Consider affected parents of the sampled nuclear families as cases and unaffected parents of the sampled nuclear families as controls. The test statistic T based on this sample is denoted by T cc . The T cc only uses the nuclear families (does not need the unrelated controls).
2. Consider all the parents of the n sampled nuclear families as cases and the N sampled unrelated controls as controls. The test statistic T based on this sample is denoted by T pc . If A is a high risk allele, the frequency of A among the parents should be higher than that in the controls, because each pair of parents has at least one affected child.
3. The third approach is a combination of the T pc and T cc . The test statistic of this approach is Fisher's combination of the p-values of the two tests and is given by T cb = -2(log P 1 + log P 2 ), where P 1 and P 2 are the p-values of the tests T pc and T cc , respectively. Under the null hypothesis of no association, T cb will follow a χ 2 distribution with 4 degrees of freedom [8].
We use the PDT [7] to test association in the second stage. Suppose there are n i affected children in the i th family.For a biallelic marker with two alleles A and a, we code the three genotypes aa, Aa, and AA as 0, 1, and 2, respectively. Let X ij , X iF , and X iM denote the codes of the genotypes of the j th child, father, and mother in the i th family. Let Then the test statistic of the PDT is given by .
Under null hypothesis of no association, the PDT follows the standard normal distribution.
When we apply the two-stage approaches, we first apply one of T pc , T cc , or T cb to each of the M markers and get M pvalues. Select L markers with the smallest p-values (we will discuss later how to choose L). Then, we apply the PDT to the L selected SNPs, and declare a SNP as significant if the p-value of the PDT at this marker is less than a threshold In our simulation studies and application to analyze the GAW15 simulated data, we use the following method to calculate the power of the two-stage test to detect one dis-ease locus, say locus D. Suppose that there are K replicated samples. Let k denote the number of samples in which locus D is selected in the first stage and the p-value of locus D in the second stage is less than δ Lα . Then, the power to detect Locus D is k/K.

Simulated data
We first evaluate the FDR under null the hypothesis of no marker associated with disease. Under the null hypothesis, the FDR and the family-wise type I error rate will be the same. We also evaluate the statement of the independence between the tests in the first stage and the PDT in the second stage. For these purposes, we generate genotypes for each individual at 100 SNPs. For a given rare allele frequency, we generate the genotypes of the parents of the nuclear families and unrelated controls by assuming Hardy-Weinberg equilibrium and independence between markers. Each of the parents randomly transmits one of the two alleles to a child to form the child's genotype. We consider two different sample sizes, two different rare allele frequencies, and two different numbers of children in each family. For each simulation scenario, we generate 1000 samples to estimate the FDR at nominal level 0.05. For each of the two-stage approaches, we use a threshold α to select SNPs in the first stage, that is, we select all SNPs whose p-values are less than α. We choose several values for the threshold. If a two-stage approach has a correct FDR, we know that the tests used in the first stage and in the second stage should be independent. The results are summarized in Table 1. Table 1 shows that the FDRs of the three two-stage approaches are very consistent with the nominal level, especially when the sample size is large or the minor allele frequency is large (>0.1). The FDR of the one-stage test PDT is slightly conservative for a small sample size. When the sample size is large (>100 families and 120 controls), the FDRs of the PDT are also very consistent with the nominal level.

GAW15 data analysis
We applied three two-stage approaches and the PDT to analyze the dense SNP data of chromosome 6 in the GAW15 Problem 3 (simulated rheumatoid arthritis data). The data contain 100 replicate data sets. In each replicate data set, there are 1500 nuclear families, 2000 unrelated controls and two affected children in each family. Each individual has genotypes at 17,820 SNPs on chromosome 6. From the answer provided with the data set, we know that there are three trait loci: Locus DR, Locus C, and Locus D on chromosome 6. Locus DR affects the risk of rheumatoid arthritis (RA). Locus C increases RA risk only in women. These two loci are in the same position. The typed SNP 3437 on chromosome 6 is in the same position as Locus DR and Locus C, that is, the recombination rates between SNP 3437 and Locus DR and between SNP 3437 and Locus C are both zero. The rare allele of Locus D increases RA risk five-fold. In the dense SNP panel of chro-mosome 6, the SNP that is nearest to Locus D is SNP 3917. The genetic distance between locus D and SNP 3917 is 0.00171 cM, and the physical distance is 1565 bp.
We first compare the power of the four methods to detect SNP 3437. The association between SNP 3437 and RA turns out to be very strong, such that the power of the four methods is all 100%. In order to do the power comparison, we reduce the sample size and, instead of detecting SNP 3437, we detect SNP 3455. SNP 3437 and SNP 3455 are in linkage disequilibrium with r = 0.36 and D' = 0.45. The results of the power comparison for detecting SNP 3455 using a smaller sample size are summarized in Table  2. The results show that the two-stage approaches T pc and T cb , which use both family data and unrelated controls, are more powerful than the PDT, especially when α is 1%, 5%, or 10%. However, the two-stage approach T cc is not as powerful as the PDT when α is less than 0.3. When α is between 0.5 and 0.8, the T cc is slightly more powerful than the PDT (results not shown).
The power of the four methods to detect SNP 3917 is summarized in Table 3. Because the power of all four methods is not very high, we doubled the sample size by merging two replicate data sets together as one sample. The results  in Table 3 show that, in the case of one child in each family, all three two-stage approaches are more powerful than the PDT when α > 0.05. When the value of α is between 0.05 and 0.1, the T cb is much more powerful than the PDT. In the case of two children in each family, the T cb is also more powerful than the PDT when α > 0.01. The other two two-stage approaches, the T pc and T cc , are only slightly more powerful than the PDT when α ≥ 0.05.

Discussion
In this report for genome-wide or region-wide association studies, we proposed three two-stage approaches to analyze family data or data sets that contain family data as well as unrelated controls. Based on our simulation studies and applications of the data sets of the GAW15 Problem 3, we are able to demonstrate that, in the case of one child in each family -the typical data set of the TDT design -all three two-stage approaches are more powerful than the PDT. In almost all the cases we considered, the T cb using family data and unrelated controls is more powerful than the PDT, and in several cases the T cb can double the power of the PDT. How to choose the value of the threshold α is a problem. From our simulation studies, one can see that the value of α around 0.01 may be a good choice for the T pc and T cb . If only the family data are available, we would use the two-stage approach T cc . In the case of one child, the value around 0.1 may be a good choice for α. In the case of two children, the T cc does not benefit much. In general, we need further investigation for choosing the value of α.

Conclusion
Our simulation and the GAW15 data analysis results show that the three proposed two-stage methods have correct type I error rates and, in most cases, are more powerful than the PDT.