### Optimal window size defined by PC analysis

We consider a study with a total of $M$ individuals in a data set, with genotype information denoted by vectors $G_i = (g_{i1}, g_{i2}, \dots, g_{iN})^T$ ($i = 1, 2, \dots, M$) at $N$ SNP loci for the $i$th individual. We code the genotype $g_{ij}$ as 0, 1, or 2 for the number of minor (less frequent) alleles at SNP $j$ ($j = 1, 2, \dots, N$) of individual $i$. Let $y_i$ denote the trait value of individual $i$.

In the sliding-window framework, a window denoted as $W_b^l$ is a set of $l$ neighboring SNPs $\{b, b + 1, b + 2, \dots, b + l - 1\}$. A variable-sized sliding window that begins with SNP $b$, denoted as $\Omega^b$, is a collection of windows $W_b^l$ with $l$ ranging from $s$ to $\Gamma^b$, where $s$ and $\Gamma^b$ are the smallest and largest window sizes.

In this study, we apply the PC method to define the optimal window size. The basic idea is to find the largest window size in which a proportion $c_0$ of the total information can be explained by the first $k$ PCs, where $c_0$ and $k$ are predefined criteria. We define this largest window size as the optimal window size. We start with a window $W_b^l$ (beginning at SNP $b$, of size $l$) with $l = s = k + 1$, so that the window length is greater than the number of important PCs.

Let $\Sigma$ denote the sample variance-covariance matrix of the genotypic numerical codes in window $W_b^l$ (the window of size $l$ beginning at SNP $b$), and let $\lambda_j$ denote the $j$th largest eigenvalue of $\Sigma$. Thus, in window $W_b^l$, the variance in the original data explained by the $j$th PC is $\lambda_j$. Let $C = \sum_{j=1}^{k} \lambda_j \big/ \sum_{j=1}^{l} \lambda_j$ be the proportion of the total variability explained by the first $k$ PCs. Our main idea for choosing the optimal window size of each sliding window is to find, among the set of windows $\Omega^b$, the largest window size in which a proportion $c_0$ of the total variability can be explained by the first $k$ PCs.
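The proportion of total variability explained by the first $k$ PCs can be computed directly from the eigenvalues of the window's sample covariance matrix. The following is a minimal sketch, not the authors' implementation: the function name `proportion_explained`, the 0-based start index, and the handling of a window with no variation are all assumptions made for illustration.

```python
import numpy as np


def proportion_explained(genotypes, b, l, k):
    """Proportion of the total variance in a window explained by its first k PCs.

    genotypes: (M, N) array of minor-allele counts coded 0, 1, or 2.
    b: 0-based index of the first SNP in the window (the text counts from 1).
    l: window size; k: number of leading PCs.
    """
    window = np.asarray(genotypes, dtype=float)[:, b:b + l]
    # l x l sample variance-covariance matrix (columns are SNPs).
    cov = np.atleast_2d(np.cov(window, rowvar=False))
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # drop tiny negative round-off
    total = eigvals.sum()
    if total == 0.0:
        # Assumption: a window with no genotypic variation counts as fully explained.
        return 1.0
    return float(eigvals[:k].sum() / total)
```

With two uncorrelated SNPs of equal variance the first PC explains half of the variability, and with perfectly correlated SNPs it explains all of it, which is the behavior the window-size criterion relies on.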

### Bisection method for searching the optimal window size and computational consideration

An exhaustive search for the optimal window size may be computationally demanding, so we propose to use the bisection method. Let $s$ and $\Gamma$ denote the predefined smallest and largest window sizes among a set of windows $\Omega^b$, where $b$ is the starting SNP of the set of windows.

Adapting the bisection method, the search procedure for the optimal window size in $\Omega^b$ consists of the following steps:

Step 1: Let $l$ be the midpoint of $s$ and $\Gamma$, that is, $l = \lfloor (s + \Gamma)/2 \rfloor$, where $\lfloor a \rfloor$ is the largest integer that is less than or equal to $a$.

Step 2: Conduct PC analysis within the window $W_b^l$, that is, the window that begins at SNP $b$ and has size $l$.

Step 3: Calculate $C$ (the proportion of the total variability explained by the first $k$ PCs) for the window $W_b^l$. If $C > c_0$, we let $s = l$, that is, we update the smallest window size $s$. Otherwise, we let $\Gamma = l$, that is, we update the largest window size $\Gamma$.

Step 4: Repeat Step 1 to Step 3 until Γ - *s* ≤ 1.

In the window $W_b^{\Gamma}$ (the window beginning at SNP $b$ whose size is the final value of $\Gamma$), if the proportion of the total variability explained by the first $k$ PCs is greater than $c_0$, the optimal window size is $\Gamma$; otherwise, the optimal window size is $s$.
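Steps 1 to 4 and the final check can be sketched as a short function. This is an illustrative sketch under stated assumptions: `explained` is a hypothetical callable that returns the proportion of variability explained by the first $k$ PCs for a window of a given size at a fixed starting SNP, and the search treats that proportion as not increasing with window size, which is what makes bisection applicable.

```python
def optimal_window_size(explained, s, gamma, c0):
    """Bisection search for the optimal window size at one starting SNP.

    explained: hypothetical callable, window size l -> proportion C for that window.
    s, gamma: predefined smallest and largest window sizes.
    c0: required proportion of variability explained by the first k PCs.
    """
    while gamma - s > 1:          # Step 4: stop once gamma - s <= 1
        l = (s + gamma) // 2      # Step 1: floor of the midpoint
        if explained(l) > c0:     # Steps 2-3: PC analysis on the size-l window
            s = l                 # criterion met: raise the lower bound
        else:
            gamma = l             # criterion not met: lower the upper bound
    # Final check on the window of size gamma.
    return gamma if explained(gamma) > c0 else s
```

Each call evaluates `explained` on roughly $\log_2(\Gamma - s)$ window sizes instead of all $\Gamma - s + 1$ candidates.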

So far, we have not discussed how to choose the starting SNP $b$. For the first window, $b = 1$. To choose $b$ for other windows, the following three methods are typically used. For the $i$th ($i > 1$) window, choose (1) $b = i$; (2) $b = n_i$, where $n_i$ is the middle SNP of the $(i-1)$th window; or (3) $b = m_i + 1$, where $m_i$ is the last SNP of the $(i-1)$th window. In this article, we use the first method to choose the starting SNP $b$.
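The three ways of choosing the next starting SNP can be compared with a small helper. The function name, the method labels, and the rounding used for the "middle" SNP of an even-sized window are assumptions for illustration only; the text does not specify them.

```python
def next_start(method, prev_b, prev_l):
    """Starting SNP of the next window, given the previous window (start prev_b, size prev_l).

    method: "every_snp"  -> b = i, so consecutive windows start one SNP apart;
            "middle"     -> b is the middle SNP of the previous window
                            (assumption: the upper middle for even sizes);
            "after_last" -> b is one past the last SNP of the previous window.
    """
    if method == "every_snp":
        return prev_b + 1
    if method == "middle":
        return prev_b + prev_l // 2
    if method == "after_last":
        return prev_b + prev_l
    raise ValueError(f"unknown method: {method}")
```

For a previous window covering SNPs 1 to 5, the three methods give starting SNPs 2, 3, and 6 respectively; the first method produces the most heavily overlapping windows and therefore the largest number of tests.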

By using the bisection method, our proposed variable-length sliding-window method is computationally efficient. Consider a set of windows $\Omega^b$ with smallest window size $s$, largest window size $\Gamma$, and starting SNP $b$. The computational complexity of finding the optimal window size in $\Omega^b$ using the bisection algorithm is $\Gamma^3 \log_2(\Gamma - s)$. If we have $N$ SNPs in total, the computational complexity of finding all the optimal window sizes is $N\Gamma^3 \log_2(\Gamma - s)$. In this article, we use $\Gamma = 35$ and $s = 4$. Suppose $N$ = 500,000 in a genome-wide association study. Then $\Gamma^3 \log_2(\Gamma - s) \approx 2.1 \times 10^5 < N$, so $N\Gamma^3 \log_2(\Gamma - s) < N^2$. As pointed out by one of the reviewers, the HAPLOVIEW program may be used to find the beginning and end of a window. Using HAPLOVIEW, $N^2$ pairwise $r^2$ values need to be calculated, and calculating each $r^2$ requires estimating haplotype frequencies. Theoretically, our proposed method should therefore be computationally more efficient than HAPLOVIEW. In a preliminary simulation study, our proposed method was about a hundred times faster than HAPLOVIEW.
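The comparison with $N^2$ can be checked numerically for the values used in this article. The snippet below only evaluates the stated cost expressions; the function and variable names are illustrative.

```python
import math


def bisection_cost(gamma, s):
    """Cost term Gamma^3 * log2(Gamma - s) for one starting SNP, as stated in the text."""
    return gamma ** 3 * math.log2(gamma - s)


N = 500_000        # number of SNPs in the genome-wide example
GAMMA, S = 35, 4   # largest and smallest window sizes used in the article

per_start = bisection_cost(GAMMA, S)  # roughly 2.1e5, which is below N
total = N * per_start                 # cost summed over all N starting SNPs

print(f"cost per starting SNP: {per_start:.0f}")
print(f"total cost / N^2:      {total / N ** 2:.2f}")
```

Because the per-window cost depends only on $\Gamma$ and $s$, the total grows linearly in $N$, whereas the number of pairwise $r^2$ values grows quadratically.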

### Score test

After we find the optimal window size for each sliding window, we use the score test statistic based on a logistic model [5] to test for association within each sliding window. Consider $W_b^l$, a window beginning at SNP $b$ with optimal window size $l$. Take $b = 1$ as an example for windows that start at the first SNP. Let $x_i = (x_{i1}, x_{i2}, \dots, x_{ik})^T$ denote the first $k$ PCs of the $i$th individual in this window, where $i = 1, 2, \dots, M$. Suppose that the trait depends on the $k$ PCs through a logistic model; then the score test statistic is given by $T^2 = U'V^{-1}U$, where
$U = \sum_{i=1}^{M} (y_i - \bar{y})(x_i - \bar{x})$,
$V = \frac{1}{M} \sum_{i=1}^{M} (y_i - \bar{y})^2 \sum_{i=1}^{M} (x_i - \bar{x})(x_i - \bar{x})'$,
$\bar{y} = \frac{1}{M} \sum_{i=1}^{M} y_i$, $\bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i$, and $M$ is the sample size. The statistic $T^2$ asymptotically follows a $\chi^2$ distribution with $k$ degrees of freedom. We select significant windows after adjusting for multiple testing using a Bonferroni correction.
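A minimal sketch of the test statistic for one window follows. It assumes the usual form of the score statistic for a logistic model, with $U = \sum_i (y_i - \bar{y})(x_i - \bar{x})$ and $V = \frac{1}{M}\sum_i (y_i - \bar{y})^2 \sum_i (x_i - \bar{x})(x_i - \bar{x})'$, where $x_i$ holds the first $k$ PC scores of individual $i$. The function names are hypothetical, and this is not the authors' code.

```python
import numpy as np


def first_k_pcs(window, k):
    """Scores of each individual on the first k PCs of an (M, l) genotype window."""
    window = np.asarray(window, dtype=float)
    centered = window - window.mean(axis=0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return centered @ eigvecs[:, order]          # (M, k) matrix of PC scores


def score_test_statistic(y, pcs):
    """T^2 = U' V^{-1} U under the assumed form of U and V described above.

    y: length-M trait vector (0/1 for a case-control trait).
    pcs: (M, k) matrix of PC scores; V must be non-singular.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(pcs, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    m = y.shape[0]
    yc = y - y.mean()
    xc = x - x.mean(axis=0)
    u = xc.T @ yc                         # k-vector U
    v = (yc @ yc) / m * (xc.T @ xc)       # k x k matrix V
    return float(u @ np.linalg.solve(v, u))
```

The resulting $T^2$ would be compared against a $\chi^2$ distribution with $k$ degrees of freedom, with the significance level divided by the number of windows tested under the Bonferroni correction. With a single PC the statistic reduces to $M$ times the squared sample correlation between the trait and that PC.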