Factors associated with heterogeneity in microarray gene expression in peripheral blood mononuclear cells from large pedigrees

Background Genome-wide microarray expression is a rich source of functional genomic data. We examined evidence for differences in expression from peripheral blood mononuclear cells between individuals, examined some of factors that may be responsible and provide recommendations for analysis. Methods A total of 643 individuals from 17 large Mexican American pedigrees had microarray gene expression data generated from peripheral blood mononuclear cells. This data has previously been used to map cis- and trans-expression quantitative trait loci using genome-wide linkage analysis. We estimated both principal components and cell proportions in these data, and tested them for association with clinical factors to provide insight into causes of variation in gene expression between individuals. Results We identified that there were highly significant differences in the second principal component of gene expression between pedigrees, with 3 pedigrees being outliers. The estimated cell proportions identified 1 individual who was a gross outlier, as well as pedigrees that differed from others in their estimated proportions of helper and cytotoxic T cells. Conclusions These phenomena could be from either pedigree-specific genetic variation, technical artefacts, or clinical factors. Incorporating factors that influence gene expression into genetic analysis, and exclusion of outliers could improve the power of genetic mapping of expression traits.


Background
Functional genomics studies face many challenges, including defining the cell type(s) of study, and their relative proportions. In the Genetic Analysis Workshop 19 (GAW19) Mexican American family data [1], microarray gene expression data were obtained from 647 individuals after peripheral blood samples were subjected to cell separation using Histopaque® (Sigma Chemical Co.) which separates mononuclear from polynuclear cells [2]. This is expected to remove eosinophils, neutrophils, and basophils, leaving T-, B-, NK-lymphocytes and monocytes. The proportions of T-and B-lymphocytes and monocytes vary between individuals, are heritable, and genome-wide association studies have identified numerous loci for them (eg, Nalls et al. [3]). Gene expression analysis performed on mixtures of cell types can potentially be confounded by heterogeneity of cell types. Similar observations have been made for epigenetic studies using DNA methylation [4].

Methods
We centered and scaled the gene expression data for each probe to have mean 0 and variance 1. Using a singular value decomposition of the 643 × 20,634 scaled matrix, we found the loadings for the principal components (PCs) and the proportion of total variability accounted for by each (R prcomp). The expression PCs were then used as the response variable to examine the relationship between them and covariates. Specifically, for each of the first 3 PCs we tested for their association with age, gender, medication, blood pressure (BP), hypertension, and smoking status, all measured at visit 1, as well as with pedigree number, one at a time. We also fitted the models for age stratified by gender and, finally, a model with age, gender, and their interaction along with pedigree number. BP measures were missing for 12 individuals at visit 1. Treated systolic blood pressure (SBP) and diastolic blood pressure (DBP) values had 10 and 5 mmHg added to their measured values, respectively, as suggested previously [5]. Statistical significance was defined as p < 0.05.
Estimation of the proportion of cytotoxic (CD8+), helper (CD4+) T-, and B-lymphocytes and monocytes in peripheral blood mononuclear cells from each individual was achieved by identifying gene expression signatures for different cell types from HaemAtlas [6] using 4879 probes that overlapped with the GAW19 data, and the quadratic programming algorithm of Gong et al. [7] as implemented in the R package Cell-Mix [8]. Examination of differences in each of the cell proportions between pedigrees was estimated using analysis of variance.

Descriptive statistics
Of the 647 individuals with gene expression data, 4 did not appear in the phenotype or pedigree files, leaving 643 for analysis. Of those with expression data, at visit 1, the mean age was 39.6 years (SD = 16.9), with 269 males and 374 females. There were 497 nonsmokers, 133 smokers, and 13 with missing smoking data (coded as a third category in the analysis). A total of 123 had hypertension (SBP >140 mmHg or DBP >90 mmHg or on anti-BP medication), 513 did not, with 7 missing. In addition, 631 individuals had BP measurements (12 missing). Of those with BP data, 559 reported they did not take BP medication, 65 reported taking medications, and 7 were unknown. We only adjusted the 65 individuals who reported taking medications (SBP +10 mmHg, DBP +5 mmHg). The mean unadjusted SBP was 121.9 mmHg (122.9 adjusted), SD = 19.4 (20.9 adjusted). The mean unadjusted DBP was 71.2 mmHg (71.8 adjusted), SD = 9.97 (10.5 adjusted).

Association of gene expression principal components with covariates
Gene expression data was available for individuals from 17 pedigrees (Table 1). PC analysis identified 26 PCs that account for 50 % of the original variability (Fig. 1). The first 3 PCs account for approximately 12, 5, and 4 %, respectively, of the variance in gene expression with the first 10 PCs accounting for the majority of the variation. For the first 3 PCs, there was only a nominally significant association of PC3 with age ( Table 2), but not for PC1 or PC2. Gender was not significantly associated with any of the 3 PCs. There was no significant association of age with first 3 PCs in males (p > 0.05), but there was a borderline association within females for PC3, p = 0.03. Models including age, gender and their interaction did not identify significant interactions for any of the first 3 PCs. There does appear to be a slight relationship between BP medication and the first and third PCs ( Table 2). Medicated individuals had lower values of these 2 PCs. A logistic regression of medication as the response with the first 3 PCs as predictors also showed that the probability of medication decreases as the first (p = 0.015) and third (p = 0.025) PCs increase, but there is no effect on the probability of medication for the second PC (p = 0.74). There was no significant association of SBP with each of the 3 PCs, nominal evidence for association of DBP with PC3. Hypertension status was associated with PC1 (p = 0.016, Table 2).
There was no significant association between pedigree number and PC1 or PC3 (p = 0.091 and 0.26, respectively). In contrast, there was a highly significant association between pedigree number and PC2 (R 2 = 0.19, F = 9.137, P < 2:2 × 10 −16 ) . Specifically, 3 pedigrees (5, 6, and 8) had significantly different PC2 values (see Table 1; Figs. 2 and 3). Inclusion of age, gender and the interaction between age and gender did not appreciably alter the findings, with pedigrees 5, 6, and 8 still showing significantly different PC2 values (Fig. 3). PCs were also estimated from unrelated individuals (using data available from Genetic Analysis Workshop 18), and their weights were applied to the remaining subjects, but the conclusions were not altered.  Table 1). For Tc lymphocytes the significance is driven by differences between pedigrees 27 and 3 and the rest, whereas for Th lymphocytes it is driven primarily by differences between pedigrees 3, 5, 27, and 47 (see Table 1 and Fig. 4). One individual from pedigree 8 (ID# T2DG0800552) was identified to be a gross outlier based on Tc and Th proportions (Fig. 4), consistent with an acute viral infection. The estimated proportion of T cytotoxic and helper lymphocytes were significantly correlated with the first 3 PCs (Table 3), although this may be tautological, as some Three pedigrees (7, 9, and 11) had no individuals with gene expression data. Global p = 0.091 for PC1, p < 2.2e-16 for PC2, p = 6.5e-12 for cytotoxic T-cell proportion, and p = 8.5e-9 for helper T-cell proportion. Bold indicates pedigrees with significant differences at α = 0.05/17(pedigrees) level (p < 0.0029)

Discussion
A large number of PCs are detected in the microarray gene expression data. Although age, sex, and other clinical factors were not associated with the 3 first PCs, pedigree number was highly significantly associated with PC2, with 3 pedigrees being gross outliers. Pedigree differences in PC2 could be from genetic variation that is related to pedigree membership that is influencing gene expression. Alternatively, it could be a result of pedigree differences in technical procedures or in the proportion of different cell types in those subjected to analysis. Cell proportions were estimated and 1 individual was shown to be a gross outlier and power may be improved by exclusion of such subjects. An overlapping set of 2 and 4 pedigrees had significant differences in the estimated proportion of Tc and Th lymphocytes, respectively. The analysis did not take pedigree structure into account, potentially leading to inflated type 1 error. In general, identification of factors that are associated with differences between individuals in functional genomics measures can potentially be used to improve the

Conclusions
This is not the first [9,10], nor likely the last description of possible batch effects in functional genomic data. According to the description of the GAW19 expression data, the lab method was as described in Göring et al. [2], while the data that was distributed underwent different processing [1], mostly focused on providing data for probes where the "detection p value" was consistent, with detectable expression across most individuals. It is unlikely that such preprocessing would produce PCs that we observed in the data.