Suppose we have observed data {Y, x, C}, where Y is an n × p matrix containing, for n subjects, a block of p measured variables/phenotypes (eg, methylation values at a genomic region/gene with p CpG dinucleotides), x is an n × 1 vector of an observed trait of interest (eg, HDL phenotype), and C is an n × r matrix whose columns are known confounding factors (eg, age, sex).
PCEV approach
PCEV is a dimension-reduction technique that searches for a linear combination (a principal component) of the columns of Y, y_PCEV = Yw (where w is a p × 1 vector), that maximizes h²(w), the ratio of the variance in Y explained by x to the total variance of Y, while taking into account the confounding factors, C. This new score y_PCEV can then be used as a phenotype in standard statistical models to test for the relationship between Y and x. Searching for y_PCEV is equivalent to projecting the rows of Y onto w, where w is the direction in p-dimensional space most relevant to x. A linear relationship between Y and x can be tested by H01: corr(y_PCEV, x) = 0. This test uses the data twice, once to estimate w and once to test the correlation, and therefore a naïve approach to p value calculation will suffer from Type I error inflation. However, the null H01 is equivalent to H02: h²(w_PCEV) = 0, which uses the data only once. Turgeon et al. [5] derived an analytic test for the null hypothesis H02, which was shown to yield the proper Type I error rate.
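The variance ratio maximized by PCEV can be written out explicitly. The block below is a sketch under an assumed multivariate linear model; the symbols β, Γ, E, V_M, V_R, and σ_x² are introduced here for illustration and do not appear in the text above, and the exact parameterization in [5] may differ.

```latex
% Assumed working model (C may include an intercept column):
%   rows of E are independent with covariance V_R
Y = x\,\beta^{\top} + C\,\Gamma + E,
\qquad \operatorname{Var}(Y_i \mid C_i) = V_M + V_R,
\qquad V_M = \sigma_x^{2}\,\beta\beta^{\top}.

% Proportion of the variance of the component Yw that is explained by x:
h^{2}(w) = \frac{w^{\top} V_M\, w}{w^{\top}\,(V_M + V_R)\, w}.

% Maximizing h^2(w) is equivalent to maximizing w^T V_M w / w^T V_R w,
% a generalized eigenvalue problem:
V_M\, w = \lambda\, V_R\, w,
\qquad w_{\mathrm{PCEV}} = \text{eigenvector with the largest } \lambda .

% With a single trait x, V_M has rank 1, so the solution is available
% in closed form (up to scale):
w_{\mathrm{PCEV}} \propto V_R^{-1}\beta .
```

Under these assumptions, h²(w_PCEV) = 0 exactly when β = 0, which is why the hypothesis on the correlation and the hypothesis on the variance ratio coincide.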
VC-score approach
In a reverse model where x (eg, HDL) is modeled as the response variable and Y as a design matrix of p predictors, the VC model links x to Y using a linear mixed-effects regression model in which Y has an effect on the variance of x instead of on its mean [6]. This approach was developed to test association between a set of rare variants and a phenotype of interest. However, the test can be adapted easily to handle different types of design matrices, such as methylation from a genomic region of interest. This method can be extended to take into account population and family structures. The family-based VC-score approach is a linear mixed-effects model in which a second random effect for genetic relationships (ie, kinship) is added [7].
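A generic form of such a variance component model is sketched below. The symbols α, β, τ, σ², σ_g², g, Φ, and Q are introduced here as assumptions for illustration; the weighting and parameterization actually used in [6] and [7] may differ.

```latex
% Population-based model: Y enters through random effects beta
x = C\alpha + Y\beta + \varepsilon,
\qquad \beta_j \overset{\text{iid}}{\sim} (0,\ \tau),
\qquad \varepsilon \sim N(0,\ \sigma^{2} I_n),

% so that Y affects the variance of x rather than its mean:
\operatorname{Var}(x) = \tau\, Y Y^{\top} + \sigma^{2} I_n,
\qquad H_0 : \tau = 0 .

% Score-type statistic, with \hat{\alpha} estimated under H_0:
Q = (x - C\hat{\alpha})^{\top}\, Y Y^{\top}\, (x - C\hat{\alpha}).

% Family-based extension: a second random effect g for kinship,
% with \Phi the n x n kinship matrix:
x = C\alpha + Y\beta + g + \varepsilon,
\qquad g \sim N(0,\ 2\sigma_g^{2}\,\Phi),
\qquad \operatorname{Var}(x) = \tau\, Y Y^{\top} + 2\sigma_g^{2}\,\Phi + \sigma^{2} I_n .
```

In both versions the association test reduces to a one-parameter test of τ = 0, regardless of the number of columns in Y.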
Phenotypic, methylation, and covariate data
Circulating blood lipids (HDL and TGs) and methylation profiles were measured at baseline and following 3 weeks of daily treatment with 160 mg of micronized fenofibrate [2]. For this study, we investigated HDL and TG changes among 714 participants for whom pretreatment methylation data were available. Because the PCEV approach has only been implemented for use with independent subjects, analyses using this method were conducted using 242 unrelated individuals. The maximum set of unrelated individuals from each pedigree was selected using a greedy algorithm that used the kinship matrix to sequentially remove related individuals [8]. TG values were log-transformed, as this variable was not normally distributed.
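The greedy selection of unrelated individuals can be illustrated with a minimal sketch. This is not the algorithm of [8]; it assumes one plausible rule (repeatedly drop the individual with the most remaining relatives), and the function name `select_unrelated`, the `threshold` parameter, and the toy pedigree are hypothetical.

```python
def select_unrelated(kinship, ids, threshold=0.0):
    """Greedily drop individuals until no retained pair has kinship > threshold.

    kinship: square list-of-lists of pairwise kinship coefficients.
    ids: labels for the rows/columns of kinship.
    Assumed rule (not taken from the source): at each step remove the
    individual with the most remaining relatives.
    """
    n = len(ids)
    # related[i] = indices whose kinship with i exceeds the threshold
    related = {
        i: {j for j in range(n) if j != i and kinship[i][j] > threshold}
        for i in range(n)
    }
    keep = set(range(n))
    while keep:
        # Number of relatives each retained individual still has in the set
        degree = {i: len(related[i] & keep) for i in keep}
        worst = max(degree, key=lambda i: (degree[i], i))
        if degree[worst] == 0:
            break  # no related pairs remain
        keep.remove(worst)
    return [ids[i] for i in sorted(keep)]


# Toy nuclear family: two unrelated parents and two full siblings
ids = ["father", "mother", "child1", "child2"]
K = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]
unrelated = select_unrelated(K, ids)
```

In this toy pedigree the two children are removed first because each is related to three others, leaving the two parents. A greedy rule of this kind is fast but is not guaranteed to find the largest possible unrelated set in every pedigree.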
T-cell pre- and posttreatment DNA methylation levels at 463,995 CpG sites had already been normalized using ComBat [3]. These CpG sites were allocated to 22,319 genes. We also included sites located 20 kb up- and downstream of the gene region. Only the CpG sites with gene annotations were evaluated in the analyses; consequently, we analyzed 401,326 CpG sites. Because PCEV works when the block and sample sizes are comparable [5], we divided the largest gene blocks to obtain 22,488 gene regions with no more than 130 CpG sites per block.
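The division of large gene blocks can be sketched as follows. The text does not state how the blocks were split, so this example assumes contiguous, near-equal chunks along the gene; the function name `split_block` and the default `max_size=130` (taken from the block-size limit above) are illustrative.

```python
import math


def split_block(sites, max_size=130):
    """Split an ordered list of CpG sites into contiguous chunks.

    Assumption (not stated in the source): chunks are contiguous and as
    equal in size as possible, each holding at most max_size sites.
    """
    n = len(sites)
    if n == 0:
        return []
    n_chunks = math.ceil(n / max_size)
    # Spread the remainder over the first `extra` chunks
    base, extra = divmod(n, n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)
        chunks.append(sites[start:start + size])
        start += size
    return chunks


# A hypothetical gene with 300 annotated CpG sites, ordered by position
gene_sites = list(range(300))
blocks = split_block(gene_sites)
block_sizes = [len(b) for b in blocks]
```

Near-equal chunks avoid leaving a very small trailing block, which a fixed-size split of 130 sites would produce for most genes.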
We focused on the pretreatment methylation levels to evaluate the effect of individual CpG sites and genes on explaining the observed heterogeneity in response to treatment. To capture unwanted variability in methylation profiles, which could result from variation in cell purity or batch effects, we constructed principal components of genome-wide methylation levels using 2000 randomly sampled probes from all autosomes. The association analyses between pretreatment methylation probes and blood lipid changes were adjusted for age, sex, study center, smoking status, diagnosed metabolic syndrome status, fasting time at the pre- and posttreatment visits, and the top 4 methylation-derived principal components (PCs).