### LRT_PC statistic

We recently proposed an LRT_PC approach to test the association between a given SNP window and disease status [6]. To calculate the test statistic, we first perform principal component (PC) analysis on the genotype scores of the sampled individuals. The LRT_PC test statistic is then computed from *n* and *m*, the numbers of cases and controls, respectively, and from the sample variance-covariance matrices of the first *K* PCs in cases, controls, and the pooled sample; the explicit formula is given in [6]. Wang et al. [6] showed that the LRT_PC test is more powerful than Hotelling's *T*^{2} test and the LD contrast test [2–5] in most cases. The power of the LRT_PC test may stem from its ability to capture differences in both the means and the variance-covariance matrices of genotype scores between cases and controls simultaneously.
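As an illustration of how these quantities could be combined, the sketch below assumes the classical normal-theory likelihood ratio form, (n + m) log|S| − n log|S_case| − m log|S_control|, where S, S_case, and S_control denote maximum-likelihood covariance estimates of the first *K* PC scores in the pooled sample, cases, and controls. This form, the symbol names, and the function name `lrt_pc` are assumptions made here for illustration; the definition in [6] is authoritative.

```python
import numpy as np


def lrt_pc(genotypes, is_case, n_pcs):
    """Assumed LRT_PC form: (n+m) log|S| - n log|S_case| - m log|S_control|.

    genotypes: (individuals x SNPs) array of genotype scores for one window.
    is_case:   boolean array, True for cases and False for controls.
    n_pcs:     number of leading principal components (K) to keep.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)

    # PCA on the pooled, column-centered genotype scores via SVD.
    centered = genotypes - genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_pcs].T  # scores on the first n_pcs PCs

    def log_det_cov(x):
        # Maximum-likelihood covariance (divisor = group size), an assumption
        # that matches the normal-theory likelihood ratio.
        cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
        sign, logdet = np.linalg.slogdet(cov)
        if sign <= 0:
            raise ValueError("covariance matrix is singular in this window")
        return logdet

    n = int(is_case.sum())       # number of cases
    m = int((~is_case).sum())    # number of controls
    return (
        (n + m) * log_det_cov(scores)
        - n * log_det_cov(scores[is_case])
        - m * log_det_cov(scores[~is_case])
    )
```

Under this assumed form the statistic grows when the group means differ (the pooled covariance inflates) and when the group covariances differ, which matches the behaviour described above.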

### Two-stage sliding-window approach

Because the LRT_PC test may be more powerful than other multi-marker tests, we wanted to use it to analyze the data set of GAW16 Problem 1. However, the LRT_PC test can only be applied to a small chromosome region. To apply it to genome-wide association studies, we propose a sliding-window approach [7]: we divide all SNPs into contiguous overlapping windows and apply the LRT_PC test in each window. With a window size of *S*, the SNPs are divided into windows covering SNPs 1 to *S*, 2 to *S* + 1, 3 to *S* + 2, and so on.
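The window layout described above can be sketched as a small generator. The function name `sliding_windows` and the use of 0-based, half-open index pairs are choices made here for illustration.

```python
def sliding_windows(n_snps, window_size):
    """Yield (start, end) index pairs for overlapping windows shifted by one SNP.

    Indices are 0-based and half-open, so the 1-based window "1 to S"
    in the text corresponds to (0, S) here.
    """
    if window_size < 1 or window_size > n_snps:
        raise ValueError("window_size must be between 1 and n_snps")
    for start in range(n_snps - window_size + 1):
        yield start, start + window_size
```

With a shift of one SNP per window, `n_snps` SNPs and window size *S* give `n_snps - S + 1` windows, which is why the number of windows in a genome-wide scan is close to the number of SNPs.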

Because neither the exact nor the asymptotic distribution of the LRT_PC statistic is known, we use a permutation approach to estimate the *p*-value of the test. For a genome-wide association study, the number of windows is usually more than 500,000 and the number of permutations is usually no fewer than 1000 (100,000 permutations were used in this study). Permuting every window is therefore not computationally feasible for the sliding-window approach described above, so we propose a two-stage approach in which all individuals are split into two sub-samples. In the first stage, assuming that all individuals are genotyped at all SNPs, we use the first sub-sample to calculate the LRT_PC statistic in every window and select the *R* most promising SNP windows, i.e., those with the largest values of the statistic. In the second stage, only the genotypes at SNPs within the *R* selected windows are used: we use the second sub-sample to assess *p*-values for these windows by permutation and declare significance using the false-discovery rate (FDR) correction of Benjamini and Hochberg [9]. Because permutations are needed only in the second stage, the two-stage approach is computationally much more efficient than the one-stage approach.
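The second-stage computations can be sketched as follows: a permutation *p*-value for one window, followed by the Benjamini-Hochberg step-up rule across the selected windows. The function names, the generic `statistic` callable, the add-one correction in the *p*-value, and the default FDR level of 0.05 are all assumptions made for this sketch; the text does not specify them.

```python
import numpy as np


def permutation_p_value(statistic, data, labels, n_permutations, rng):
    """Estimate a p-value by permuting case/control labels.

    statistic: callable (data, labels) -> float, larger = more extreme.
    The add-one correction below is an assumption, not taken from the text.
    """
    labels = np.asarray(labels)
    observed = statistic(data, labels)
    exceed = 0
    for _ in range(n_permutations):
        permuted = rng.permutation(labels)  # shuffle disease status
        if statistic(data, permuted) >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking windows declared significant.

    Implements the Benjamini-Hochberg step-up rule at FDR level alpha
    (alpha = 0.05 is an assumed default).
    """
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    ranked = p[order]
    thresholds = alpha * np.arange(1, p.size + 1) / p.size
    passed = np.nonzero(ranked <= thresholds)[0]
    reject = np.zeros(p.size, dtype=bool)
    if passed.size:
        # Reject every hypothesis up to the largest rank that passes.
        reject[order[: passed[-1] + 1]] = True
    return reject
```

In the two-stage design, `permutation_p_value` would be called only for the *R* windows retained from the first stage, using second-sub-sample individuals, and `benjamini_hochberg` would then be applied to the resulting *R* *p*-values.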

To analyze the data set of GAW16 Problem 1 using the LRT_PC-based two-stage sliding-window approach, we use the following settings: the window size is 5; the number of windows selected in the first stage, *R*, is 1000; and the first sub-sample is 15% of the total sample (15% of the cases and 15% of the controls). In the first stage, the number of PCs used in the LRT_PC test in each window is 5, i.e., all PCs are retained, so PC analysis does not need to be performed. In the second stage, the number of PCs used in the LRT_PC test in each window, *K*, is chosen so that the first *K* PCs explain 85% of the total variability.
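Two of these settings can be sketched in code: drawing the first sub-sample as the same fraction of cases and of controls, and choosing *K* from the explained variability. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: *K* is taken as the smallest number of PCs whose cumulative share of variance reaches at least the threshold, and the sub-sample is drawn at random without replacement within each group.

```python
import numpy as np


def stratified_split(is_case, first_fraction, rng):
    """Return a boolean mask selecting the first sub-sample.

    The same fraction is drawn from cases and from controls
    (random sampling without replacement is an assumption).
    """
    is_case = np.asarray(is_case, dtype=bool)
    first = np.zeros(is_case.size, dtype=bool)
    for group in (is_case, ~is_case):
        idx = np.flatnonzero(group)
        n_first = int(round(first_fraction * idx.size))
        first[rng.choice(idx, size=n_first, replace=False)] = True
    return first


def choose_n_pcs(genotypes, threshold=0.85):
    """Smallest K such that the first K PCs explain at least `threshold`
    of the total variance of the window (assumed reading of the rule)."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2  # proportional to PC variances
    if explained.sum() == 0:
        raise ValueError("window has no genotype variation")
    cumulative = np.cumsum(explained) / explained.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, cumulative.size)  # guard against rounding at threshold = 1
```

With `first_fraction=0.15`, the mask returned by `stratified_split` marks the first-stage individuals and its complement marks the second-stage individuals.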

To choose the size of the first sub-sample, we performed a power analysis based on a single-marker test, similar to that of Wang et al. [8]. The results showed that the optimal size of the first sub-sample is between 10% and 30% of the total sample. Using these single-marker results as a reference, we set the first sub-sample to 15% of the total sample in this study.

As pointed out by Skol et al. [10], our proposed two-stage approach may not be as powerful as joint analysis. However, the results of Skol et al. also showed that when the first sub-sample is small (15% of the total sample), the power difference between the two-stage approach and joint analysis is also small. To compare the power of the two-stage and one-stage approaches, we conducted a small-scale simulation study (10,000 SNPs and 1000 permutations). The simulation results showed that when the first sub-sample is 15% of the total sample, the power difference between the two approaches is also small. In summary, compared with joint analysis and the one-stage approach, our proposed two-stage approach incurs a small loss of power in exchange for a large gain in computational efficiency.