Based on genes identified as associated with ΔHDL in family-based variance-component association tests (see Zhao et al. [3]), we selected 10 genes in which to explore causal relationships. For each gene, a methylation probe set was defined as all probes within the window (start − 20 kb, end + 20 kb), to capture probes that could be implicated in cis-regulation. To adjust for potential unexplained confounding, principal components (PCs) capturing genome-wide variation in methylation levels were calculated from 2000 randomly sampled probes across all autosomes (see Zhao et al. [3]). The pretreatment methylation levels and ΔHDL were then adjusted for the fixed effects of the top 4 PCs as well as age, sex, smoking, center, fasting time, and metabolic syndrome status, and for a random effect with covariance based on the kinship matrix, to capture effects resulting from familial relationships. The residuals were used in all subsequent MR analyses. SNPs were selected in a large window around each gene (start − 400 kb, end + 400 kb). Windows of this size were necessary to ensure enough SNPs for the constrained instrumental variables (CIV) method described below. Missing values, approximately 0.5% of all SNP data, were imputed using the K-nearest neighbor method with the Bioconductor package impute. When SNPs in the set were highly correlated (correlation > 0.8) with neighboring SNPs, we kept only the 1 SNP closest to the 5′ end of each cluster. The resulting SNP set is referred to as the full set of SNPs (or F). Univariate linear models were fit relating the pretreatment methylation residuals for each CpG near a selected gene to each retained SNP near the same gene. Based on these linear regression results, reduced sets of SNPs (R), containing the SNPs with significant F-statistics (p < 0.05), were constructed for use with some of the MR methods.
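The correlation-based pruning step can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: SNPs are supplied in 5′ to 3′ order, a cluster is a run of consecutive SNPs in which each SNP is correlated with its immediate neighbor, and the absolute Pearson correlation of genotype dosages is used. The function names `pearson` and `prune_correlated` are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP is treated as uncorrelated
    return cov / sqrt(var_a * var_b)

def prune_correlated(genotypes, threshold=0.8):
    """Return indices of SNPs kept after pruning.

    genotypes: list of per-SNP dosage lists, assumed ordered 5' to 3'.
    A SNP whose absolute correlation with the preceding SNP exceeds
    `threshold` extends the current cluster and is dropped, so only the
    first (most 5') SNP of each cluster is kept.
    """
    kept = []
    for j in range(len(genotypes)):
        if j > 0 and abs(pearson(genotypes[j - 1], genotypes[j])) > threshold:
            continue  # same cluster as the previous SNP
        kept.append(j)
    return kept
```

With four SNPs in which the first three are strongly correlated with their neighbors and the fourth is not, the sketch keeps the first and the fourth.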
MR analyses using two-stage least squares (TSLS) [4], Egger regression [5], and our new method, CIV, briefly described below, were performed to evaluate the potential causal effects of variability in pretreatment methylation levels (X) on ΔHDL (Y). In TSLS and Egger regression, SNPs (G) are used to predict the exposure, giving \( \widehat{X} \); the outcome, Y, is then regressed on \( \widehat{X} \) to estimate the causal effect of X on Y. Egger regression additionally adjusts for some of the possible pleiotropic effects and detects small sample bias.
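The two-stage logic can be sketched for TSLS as follows. This is an illustration only, not the software used in the analyses: it assumes X, Y, and the SNP columns are already centered residuals (so no intercept is fitted), covers TSLS but not Egger regression, and uses hypothetical helper names (`solve`, `ols`, `tsls`).

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivot
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):  # back-substitution
        x[c] = (M[c][n] - sum(M[c][k] * x[k] for k in range(c + 1, n))) / M[c][c]
    return x

def ols(columns, y):
    """Least-squares coefficients of y on the given columns (no intercept)."""
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(gram, rhs)

def tsls(G, x, y):
    """Two-stage least squares estimate of the effect of x on y.

    G: list of instrument (SNP) columns; x: exposure; y: outcome.
    """
    gamma = ols(G, x)  # stage 1: regress the exposure on the instruments
    x_hat = [sum(g[i] * c for g, c in zip(G, gamma)) for i in range(len(x))]
    return ols([x_hat], y)[0]  # stage 2: regress the outcome on fitted exposure
```

Calling `tsls(G, x, y)` with G as a list of SNP columns returns a single slope. Interval estimates would come from a separate resampling step.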
The CIV method is designed to adjust causal effect estimates of X on Y when potential pleiotropic exposures, Z, are measured [6]. Naïve inclusion of genotypes with pleiotropic effects among the SNPs used as instruments may lead to biased estimation of the causal effect. CIV finds a penalized linear projection orthogonal to Z to construct a valid and strong instrumental variable. A constrained optimization approach using smoothed penalty functions forces approximately sparse models. The strength of CIV instruments can be measured with a global F-statistic and the concentration parameter [7]. The concentration parameter measures the overall association between X and G, whereas the F-statistic also accounts for the number of instruments used; if there are many weak instruments, this will be reflected in the F-statistic. Instruments with F-statistics < 10 are often considered weak. Simulation studies [6] have compared CIV with TSLS, Egger regression, and other popular MR methods under scenarios varying the instruments’ relative strength, validity, and pleiotropic directions, and showed that CIV estimates causal effects with little to no bias.
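To make the relationship between the two strength measures concrete, one common formulation is given here. The notation (\( \gamma \), \( v \), \( n \), \( k \), \( \sigma_v^2 \)) is introduced for illustration and may differ in detail from the definitions used in [7]. Consider a first-stage model \( X = G\gamma + v \) with n individuals, k instruments, centered X and G, and \( \mathrm{Var}(v) = \sigma_v^2 \). Then

\[ \mu^2 = \frac{\gamma^\top G^\top G \gamma}{\sigma_v^2}, \qquad F = \frac{\widehat{\gamma}^\top G^\top G \widehat{\gamma} / k}{\widehat{\sigma}_v^2}, \qquad \widehat{\sigma}_v^2 = \frac{\lVert X - G\widehat{\gamma} \rVert^2}{n - k - 1}, \]

where \( \mu^2 \) is the concentration parameter and F is the first-stage F-statistic. Under this formulation, F is approximately \( 1 + \mu^2 / k \) in expectation, so the same total association spread over a larger number of instruments yields a smaller F.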
For CIV analysis, the neighborhood around each gene of interest was partitioned into 2 subsets: a set of probes for which causal inference is desired ({X}: the methylation probe set for each gene) and a set of CpGs whose potential pleiotropic effects are of concern ({Z}: methylation probe sets for genes up to 100 kb on either side of the probes in {X}). For each CpG in {X}, causal inference analysis was performed with the CIV, TSLS, and Egger regression methods. Only the CIV method also used the probe set {Z}. For CIV, the full set (F) of SNPs was used; for Egger regression and TSLS, both sets F and R were used as instrumental variables. For all methods, bootstrap confidence intervals, based on 200 bootstrap samples, were constructed for the estimated causal effect of X on Y.
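A bootstrap interval of this kind can be sketched as below. The text does not specify the interval type or the resampling unit, so the sketch assumes a percentile interval and resampling of individuals with replacement, which ignores family structure. The single-instrument ratio estimator `ratio_estimate` is a hypothetical stand-in for the CIV, TSLS, or Egger estimators, and `bootstrap_ci` is likewise an illustrative name.

```python
import random

def ratio_estimate(rows):
    """Single-instrument ratio estimate: cov(g, y) / cov(g, x).

    rows: list of (g, x, y) tuples, one per individual.
    """
    n = len(rows)
    mg = sum(r[0] for r in rows) / n
    mx = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    cov_gx = sum((g - mg) * (x - mx) for g, x, _ in rows)
    cov_gy = sum((g - mg) * (y - my) for g, _, y in rows)
    return cov_gy / cov_gx

def bootstrap_ci(rows, estimator, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for estimator(rows)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        # resample individuals with replacement (assumption: no family blocking)
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    estimates.sort()
    lower = estimates[round(alpha / 2 * n_boot)]
    upper = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Any function that maps a list of individuals to a single estimate can be passed as `estimator`, so the same wrapper could be applied to each MR method in turn.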