### Linear discriminant analysis using penalized orthogonal-components regression

As shown by Zhang et al. [5], the POCRE algorithm sequentially constructs orthogonal components to maximize, upon standardization, their correlation to the response residuals; at the same time, the algorithm uses a penalization by means of empirical Bayes thresholding [6] to effectively identify sparse predictors for each component. The POCRE algorithm is computationally efficient because of its sequential construction of leading sparse principal components. In addition, this construction offers other distinct properties, such as the ability to group highly correlated predictors and to allow for collinear or nearly collinear predictors.

Consider the following multiple linear regression:

$$Y = X\beta + \epsilon,$$

where *Y* is the phenotype vector of length *n*; *X* is an *n* × *p* genotype matrix; *β* = (*β*_{1}, …, *β*_{*p*})^{*T*}, with *β*_{*j*} the additive effect of the *j*th genetic variant, *j* = 1, …, *p*; *ε* is a vector of random errors; *n* is the number of individuals; and *p* is the number of SNPs. We further assume that *Y* and *X* are centered at 0. The POCRE algorithm sequentially constructs the orthogonal components *ω*_{1} = *Xw*_{1}, *ω*_{2}, …, where *ω*_{*i*}, *i* ≥ 2, is iteratively built to be orthogonal to *ω*_{1}, …, *ω*_{*i*−1}. Here *w*_{*i*}, *i* ≥ 1, is calculated as *γ*/||*γ*||, where *γ*, together with *α*, minimizes

$$\left\| \left( Y - \hat{Y}_{i-1} \right)\alpha - X\gamma \right\|^{2} + g_{\lambda}(\gamma)$$

subject to ||*α*|| = 1, where $\hat{Y}_{i-1}$ denotes the fitted values based on the first *i* − 1 components (with $\hat{Y}_{0} = 0$), and *g*_{*λ*}(*γ*) is a penalty function defined by a proper regularization on *γ* with tuning parameter *λ*. Zhang et al. [5] used empirical Bayes thresholding methods, as proposed by Johnstone and Silverman [6], to introduce the proper penalty *g*_{*λ*}(*γ*).
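The sequential construction can be illustrated with a minimal numerical sketch. This is not the authors' implementation: soft-thresholding is used here as a simple stand-in for the empirical Bayes thresholding of [6], the relative threshold `lam` is an illustrative tuning choice, and `pocre_sketch` is a hypothetical name.

```python
import numpy as np

def soft_threshold(v, t):
    """Simple stand-in for empirical Bayes thresholding: shrink small loadings to 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pocre_sketch(X, Y, n_components=3, lam=0.1):
    """Schematic POCRE-style construction of sparse orthogonal components.

    Each component omega_i = X w_i is driven by the correlation between the
    (deflated) predictors and the current response residual; small loadings
    are thresholded to zero, giving a sparse w_i.  Deflating X after each
    step keeps successive components orthogonal.
    """
    Xc = X - X.mean(axis=0)          # center X at 0, as assumed in the text
    Yc = Y - Y.mean()                # center Y at 0
    Xi = Xc.copy()                   # deflated copy of X
    resid = Yc.copy()                # current response residual
    W, components = [], []
    for _ in range(n_components):
        gamma = Xi.T @ resid                       # direction most correlated with residual
        gamma = soft_threshold(gamma, lam * np.abs(gamma).max())
        norm = np.linalg.norm(gamma)
        if norm == 0:                              # nothing survives thresholding: stop
            break
        w = gamma / norm                           # w_i = gamma / ||gamma||
        omega = Xi @ w                             # new component
        denom = omega @ omega
        if denom < 1e-12:
            break
        W.append(w)
        components.append(omega)
        # project the residual and Xi off omega (deflation -> orthogonality)
        resid = resid - omega * (omega @ resid) / denom
        Xi = Xi - np.outer(omega, (omega @ Xi) / denom)
    return np.array(W).T, np.array(components).T

# Toy data: 3 of 50 predictors carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:3] = 2.0
Y = X @ beta + rng.normal(size=100)
W, Omega = pocre_sketch(X, Y, n_components=2)
```

The deflation step is what enforces orthogonality: once a component is extracted, both the residual and the predictor matrix have that direction removed, so every later component is orthogonal to it by construction.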

In this case-control genome-wide association study, the LDA is implemented with the POCRE algorithm. First, we define *Y*_{*i*} = 1 if individual *i* is from the case population and *Y*_{*i*} = −1 otherwise. Second, the LDA is carried out with the threshold *c* = 0 by regressing the phenotype vector *Y* = (*y*_{1}, …, *y*_{*n*})^{*T*} against *X* using the POCRE algorithm. The design matrix *X* here contains the genotypic values of both common variants and the macrovariants constructed by the PLS regression. The tuning parameter *λ* is selected using a testing data set, with candidate values from 0.8 to 0.9 in steps of 0.01.
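The classification rule itself can be sketched as follows. Ordinary least squares on the ±1-coded phenotype is used here as a simple stand-in for the POCRE fit (with ±1 coding, the least-squares coefficient vector is proportional to Fisher's LDA direction); `lda_by_regression` is an illustrative name.

```python
import numpy as np

def lda_by_regression(X_train, y_pm1, X_new, c=0.0):
    """Classify by regressing +/-1-coded labels on centered predictors.

    With classes coded Y = +1 (case) and Y = -1 (control), the least-squares
    fit points along Fisher's LDA direction; a new subject is called a case
    when its fitted value exceeds the threshold c = 0.
    """
    x_mean = X_train.mean(axis=0)
    Xc = X_train - x_mean                       # center X at 0, as in the text
    yc = y_pm1 - y_pm1.mean()                   # center Y at 0
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    scores = (X_new - x_mean) @ beta + y_pm1.mean()
    return np.where(scores > c, 1, -1), scores

# Toy data: two well-separated 5-dimensional groups
rng = np.random.default_rng(1)
cases = rng.normal(1.0, 1.0, size=(50, 5))
controls = rng.normal(-1.0, 1.0, size=(50, 5))
X = np.vstack([cases, controls])
y = np.array([1] * 50 + [-1] * 50)
labels, scores = lda_by_regression(X, y, X)
```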

To avoid using the same data twice, which results in overfitting, we selected one of the 200 phenotype replicates and applied the PLS regression to collapse rare variants into macrovariants for each gene. We then used the resulting macrovariants to analyze another phenotype replicate with the POCRE algorithm. Such replicates are not available in real data analysis, so we suggest the traditional approach of splitting the data into training and testing sets, even though the reduced sample size may lower power. When sample size is a concern, we recommend using the whole data set twice, first for collapsing rare variants and second for the association study, even though this may inflate the type I error.
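The two-stage use of separate phenotype replicates can be sketched as below. As a simplified stand-in for the PLS collapsing step, the weight vector is taken as the first PLS direction for one gene's rare variants (the normalized genotype–phenotype covariances); `two_stage_association` and the simulated data are illustrative.

```python
import numpy as np

def collapse_rare_variants(X_rare, y_train):
    """Stage 1: derive collapsing weights from one phenotype replicate.

    The weight vector is the first PLS direction, i.e. the normalized
    covariance between each rare variant and the training phenotype.
    """
    w = X_rare.T @ (y_train - y_train.mean())
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def two_stage_association(X_rare, y_rep1, y_rep2):
    """Derive macrovariant weights on replicate 1, test on replicate 2."""
    w = collapse_rare_variants(X_rare, y_rep1)   # stage 1: collapsing
    macro = X_rare @ w                           # one macrovariant for the gene
    # Stage 2: association of the macrovariant with the *other* replicate,
    # so the same phenotype data are never used twice.
    r = np.corrcoef(macro, y_rep2)[0, 1]
    return macro, r

# Toy data: 10 rare variants in one gene, the first one causal,
# and two independent phenotype replicates sharing the same genetic effect
rng = np.random.default_rng(2)
X_rare = rng.binomial(1, 0.05, size=(200, 10)).astype(float)
effect = X_rare[:, 0] * 2.0
y_rep1 = effect + rng.normal(size=200)
y_rep2 = effect + rng.normal(size=200)
macro, r = two_stage_association(X_rare, y_rep1, y_rep2)
```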

### Data set and preprocessing

Finally, we applied the proposed methods to the binary trait in the GAW17 data. All 697 individuals were kept for our analysis after preprocessing the data with PLINK [7] for quality control. We differentiated genetic variants into three categories: SNPs with MAF ≥ 0.05, SNPs with 0.005 ≤ MAF < 0.05 but no other SNPs within the corresponding genetic regions, and macrovariants for genes with multiple rare SNPs. Three other factors—Age, Sex, and Smoke—were also included to control for environmental effects. For detailed data information, see [8].
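The three-way categorization can be sketched as below. Genotypes are assumed coded as minor-allele counts 0/1/2, a rare SNP is treated as standing alone when no other rare SNP falls in the same gene (one reading of the second category), and `categorize_variants` and the toy gene map are illustrative.

```python
import numpy as np

def categorize_variants(G, genes):
    """Split SNPs into the three categories used in the text.

    G     : n x p genotype matrix, coded as allele counts 0/1/2.
    genes : list of gene labels, one per SNP column.
    Returns (common, lone_rare, multi_rare) lists of column indices:
      common     -- MAF >= 0.05
      lone_rare  -- 0.005 <= MAF < 0.05, only rare SNP in its gene
      multi_rare -- rare SNPs sharing a gene (to be collapsed into macrovariants)
    """
    maf = G.mean(axis=0) / 2.0            # frequency of the coded allele
    maf = np.minimum(maf, 1.0 - maf)      # fold to the minor allele
    common = [j for j in range(G.shape[1]) if maf[j] >= 0.05]
    rare = [j for j in range(G.shape[1]) if 0.005 <= maf[j] < 0.05]
    per_gene = {}
    for j in rare:                        # group rare SNPs by gene
        per_gene.setdefault(genes[j], []).append(j)
    lone_rare = [js[0] for js in per_gene.values() if len(js) == 1]
    multi_rare = [j for js in per_gene.values() if len(js) > 1 for j in js]
    return common, lone_rare, multi_rare

# Toy example: one common SNP, one lone rare SNP, two rare SNPs in one gene
G = np.zeros((1000, 4))
G[:600, 0] = 1   # MAF 0.30 -> common
G[:20, 1] = 1    # MAF 0.010 -> lone rare (only rare SNP in GENE_B)
G[:10, 2] = 1    # MAF 0.005 -> rare, shares GENE_C
G[:30, 3] = 1    # MAF 0.015 -> rare, shares GENE_C
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_C"]
common, lone_rare, multi_rare = categorize_variants(G, genes)
```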