Volume 10 Supplement 7

Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Proceedings.

Open Access

Integrating multiple genomic data: sparse representation based biomarker selection for blood pressure

  • Hongbao Cao1,
  • Wei Guo1,
  • Haide Qin1,
  • Mengyuan Xu1,
  • Benjamin Lehrman1,
  • Yu Tao1 and
  • Yin-Yao Shugart1
BMC Proceedings 2016 10(Suppl 7):40

DOI: 10.1186/s12919-016-0044-7

Published: 18 October 2016



Although many genes have been implicated as hypertension candidates, to date, few studies have integrated different types of genomic data for the purpose of biomarker selection.


Applying a newly proposed sparse representation based variable selection (SRVS) method to the Genetic Analysis Workshop 19 data, we analyzed a combined data set consisting of 11522 gene expressions and 354893 single-nucleotide polymorphisms (SNPs) from 397 subjects (case/control: 151/246), with the aim of identifying potential biomarkers for blood pressure using both gene expression measures and SNP data.


Among the top 1000 variables selected (SNPs/gene expressions = 575/425), the bioinformatics analysis showed that 302 were plausibly associated with blood pressure. In addition, we identified 173 variables that were associated with body weight and 84 associated with left ventricular contractility. Together, 55.9 % of the top 1000 variables showed associations with blood pressure related phenotypes (SNP/gene expression = 348/211).


Our results support the feasibility of the SRVS algorithm in integrating multiple data sets of different structure for comprehensive analysis.


The determinants of blood pressure (BP) are likely to be a complex combination of genetic, environmental, and other potential confounders including age, gender and smoking status [1]. Moreover, it has been documented that heritability accounts for one-third to two-thirds of the variability in BP [2].

Genome-wide association studies (GWAS) [3–6] and gene expression studies [7] have been conducted to identify biomarkers, such as single-nucleotide polymorphisms (SNPs) and gene expression, associated with BP phenotypes. Although many genes have been reported as hypertension candidates [8], to date, a limited number of studies have integrated different types of genomic data to select biomarkers.

Here we used a sparse representation based variable selection (SRVS) method [9] to integrate a gene expression data set and a SNP data set acquired from the same subjects, with the aim of identifying BP related biomarkers and improving the understanding of the genetic mechanisms underlying BP related disease. The SRVS method has been shown to be feasible for identifying schizophrenia candidate biomarkers by integrating functional magnetic resonance imaging data and SNP data [10]. It has also been demonstrated that the use of multiple data types may provide higher power to identify potential biomarkers that would be missed by analyzing each data type independently [11].


Data description

The data set was provided by the Genetic Analysis Workshop 19. Phenotypes were measured at 4 time points and included age; hypertension diagnosis (HD; yes = 1/no = 0); systolic blood pressure (SBP); diastolic blood pressure (DBP); medication status (MS); smoking status (SS; yes = 1/no = 0) and gender. The expression data set consisted of 647 subjects with 16383 expression probes. In the SNP data set, there were 959 subjects with 472049 SNPs, measured from the odd numbered chromosomes (1 to 21). In the current study, we used the data obtained from the third examination, which had roughly balanced hypertension/non-hypertension numbers (Table 1). This data set included 397 subjects from 46 families having both SNP data and gene expression data. For simplicity, we deleted gene expression probes and SNPs with no associated gene, resulting in a combined data set \( X \in R^{397 \times (11522 + 354893)} \) (397 subjects with 11522 gene expression probes and 354893 SNPs). Table 1 summarizes the data set and its clinical measures (age, sex, HD, SBP, DBP, MS, SS).
Table 1

Descriptive statistics of data set

Data set | Subject number (m) | SBP (mean ± SD) | DBP (mean ± SD) | Hypertension cases | Age (mean ± SD) | Sex (male) | MS (taking drug)




Sparse representation-based variable selection

We used 2 regression models to describe the relationship between BP and 6 impact factors: Age, Sex, MS, SS, SNP and gene expression variation.
$$ BP = \sum_{i=1}^{4} \delta_i X_i + y \tag{1} $$
$$ y = \left[ X_5, X_6 \right] \left[ \begin{array}{c} \delta_5 \\ \delta_6 \end{array} \right] + \varepsilon = X\delta + \varepsilon \tag{2} $$

where \( X_i \) for i = 1 to 6 are the 6 impact factors and \( \delta_i \) are the regression coefficients for each factor. In this study, \( BP \in R^{m \times 1} \) is the BP measurement (SBP or DBP), where m is the number of subjects; \( X_1 \) to \( X_4 \in R^{m \times 1} \) are Age, Sex, MS and SS, respectively, with \( \delta_1 \) to \( \delta_4 \in R^{1 \times 1} \); \( X_5 \in R^{m \times 11522} \) represents the gene expression measures, with \( \delta_5 \in R^{11522 \times 1} \); \( X_6 \in R^{m \times 354893} \) represents the SNPs, with \( \delta_6 \in R^{354893 \times 1} \); and \( \varepsilon \in R^{m \times 1} \) is the residual vector. \( X = \left[ X_5, X_6 \right] \in R^{m \times n} \) is the genetic data matrix integrating both gene expression data and SNP data, where n = 11522 + 354893 = 366415 is the total number of gene expression probes and SNPs; the columns of X are normalized to have unit L2 norm. \( \delta = \left[ \begin{array}{c} \delta_5 \\ \delta_6 \end{array} \right] \in R^{n \times 1} \) is the solution to be found.

Here, we used the Linear Least Squares (LLS) method to solve the linear regression given by Eq. (1) and acquire the residual y.
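
The covariate adjustment in Eq. (1) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' Matlab code: the function names `solve_linear_system` and `regress_out` and the toy values are hypothetical, and the fit follows Eq. (1) as written, that is, without an intercept term.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations update both sides together.
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def regress_out(covariates, bp):
    """Least-squares fit of BP = sum_i delta_i X_i + y; returns (delta, y)."""
    p = len(covariates[0])
    # Normal equations: (C^T C) delta = C^T bp, with C the covariate matrix.
    gram = [[sum(row[i] * row[j] for row in covariates) for j in range(p)]
            for i in range(p)]
    rhs = [sum(row[i] * v for row, v in zip(covariates, bp)) for i in range(p)]
    delta = solve_linear_system(gram, rhs)
    residual = [v - sum(d * c for d, c in zip(delta, row))
                for row, v in zip(covariates, bp)]
    return delta, residual


# Toy rows [age, sex, MS, SS]; the values are made up for illustration only.
toy_covariates = [
    [45, 1, 0, 0],
    [60, 0, 1, 0],
    [38, 1, 0, 1],
    [52, 0, 0, 1],
    [70, 1, 1, 0],
    [29, 0, 0, 0],
]
toy_sbp = [128.0, 135.0, 121.0, 130.0, 150.0, 110.0]
delta, y = regress_out(toy_covariates, toy_sbp)
```

The returned residual `y` is orthogonal to every covariate column, which is what makes it usable as a covariate-adjusted phenotype in Eq. (2).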

In this analysis, we assumed that only a small number of variables (e.g., gene expressions or SNPs) were closely associated with the phenotype (BP). Therefore, the underdetermined linear regression problem given by Eq. (2) becomes a sparse problem aiming to find a sparse solution δ, with a few non-zero entries corresponding to BP related genetic variables.

Because \( n \gg m \), we employed the SRVS method proposed by Cao et al. [10] to solve Eq. (2) and identify potential biomarkers (gene expressions/SNPs) associated with BP.

Sparse representation-based variable selection algorithm

  1. Initialize \( \delta^{(0)} = 0 \).

  2. For step l, randomly shuffle X with the Fisher-Yates algorithm [12], then separate X into sub-matrices of size m × k; denote those sub-matrices as \( X_l \in R^{m \times k} \).

  3. Solve the following \( L_p \) minimization problem to get the optimal sparse solution \( \delta_l \in R^{k \times 1} \) for each sub-matrix \( X_l \):
    $$ \min \left\Vert \delta_l \right\Vert_p \quad \text{subject to} \quad \left\Vert y - X_l \delta_l \right\Vert_2 \le \varepsilon $$

  4. Update \( \delta^{(l)} \in R^{n \times 1} \) with \( \delta_l \): \( \delta^{(l)}(I_l) = \delta^{(l-1)}(I_l) + \delta_l \), where \( \delta^{(l)}(I_l) \) and \( \delta^{(l-1)}(I_l) \) denote the \( I_l \)-th entries in \( \delta^{(l)} \) and \( \delta^{(l-1)} \), respectively.

  5. If a stopping rule is not satisfied, update l = l + 1 and go to Step 2. Otherwise, set \( \delta = \delta^{(l)}/l \) and terminate. The non-zero entries in δ correspond to the column vectors selected, that is, variable selection.
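
The five steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the published Matlab toolbox. The sparse step uses simple matching pursuit as a stand-in for the L_p minimization in Step 3. A fixed iteration count `n_iter` replaces the stopping rules. Each iteration is read as one shuffle followed by a partition of all columns. All function and parameter names (`normalize_columns`, `matching_pursuit`, `srvs_sketch`, `max_atoms`, `tol`) are hypothetical. Data are passed column-wise: `columns[j]` is the j-th column of X as a list of m values.

```python
import math
import random


def normalize_columns(columns):
    """Scale each column (a list of m values) to unit L2 norm."""
    out = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col))
        out.append([v / norm for v in col] if norm > 0 else list(col))
    return out


def matching_pursuit(y, cols, max_atoms, tol):
    """Greedy sparse fit of y on unit-norm cols (stand-in for the Lp step)."""
    residual = list(y)
    coef = [0.0] * len(cols)
    for _ in range(max_atoms):
        if math.sqrt(sum(r * r for r in residual)) <= tol:
            break
        # Pick the column most correlated with the current residual.
        scores = [sum(c * r for c, r in zip(col, residual)) for col in cols]
        best = max(range(len(cols)), key=lambda j: abs(scores[j]))
        coef[best] += scores[best]
        residual = [r - scores[best] * c for r, c in zip(residual, cols[best])]
    return coef


def srvs_sketch(columns, y, k, n_iter, max_atoms=3, tol=1e-8, seed=0):
    """Average sparse solutions over random column partitions of size k."""
    rng = random.Random(seed)
    cols = normalize_columns(columns)
    n = len(cols)
    delta = [0.0] * n                        # Step 1: delta^(0) = 0
    for _ in range(n_iter):
        order = list(range(n))
        rng.shuffle(order)                   # Step 2: random column order
        for start in range(0, n, k):
            idx = order[start:start + k]     # I_l: columns in this sub-matrix
            sub = [cols[j] for j in idx]
            delta_l = matching_pursuit(y, sub, max_atoms, tol)   # Step 3
            for j, d in zip(idx, delta_l):   # Step 4: accumulate into delta
                delta[j] += d
    return [d / n_iter for d in delta]       # Step 5: average over iterations
```

Variables can then be ranked by the absolute value of their averaged coefficient, for example `sorted(range(n), key=lambda j: -abs(delta[j]))[:1000]` for a top-1000 list.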


In Step 2, the column number k of the sub-matrices \( X_l \) is chosen according to Cao et al. [10]. In Step 5, we set the following 2 stopping rules: (a) \( \left\Vert \delta^{(l)}/l - \delta^{(l-1)}/(l-1) \right\Vert_2 < \alpha \), where α is a predefined threshold; and (b) the probability that each pair of column vectors in X has been compared should be greater than \( 1 - p_{stop} \). The algorithm terminates when both rules are satisfied, which determines the total number of iterations. The Matlab software toolbox for the proposed SRVS algorithm has been made available online: http://hongbaocao.gousinfo.com/Software4Download.html.
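
Stopping rule (a) can be checked with a few lines of Python. The sketch below assumes the running sums δ(l−1) and δ(l) are stored as plain lists; the function names and the α values in the example are hypothetical, since the threshold used in the study is not stated here.

```python
import math


def averaged_change(delta_sum_prev, delta_sum_curr, l):
    """L2 norm of delta^(l)/l - delta^(l-1)/(l-1), defined for l >= 2."""
    if l < 2:
        raise ValueError("rule (a) needs at least two iterations")
    return math.sqrt(sum((c / l - p / (l - 1)) ** 2
                         for p, c in zip(delta_sum_prev, delta_sum_curr)))


def rule_a_satisfied(delta_sum_prev, delta_sum_curr, l, alpha):
    """True when the averaged solution has moved by less than alpha."""
    return averaged_change(delta_sum_prev, delta_sum_curr, l) < alpha


# Example: running sums after iterations 1 and 2 (made-up values).
prev_sum = [2.0, 0.0]
curr_sum = [4.0, 0.2]
change = averaged_change(prev_sum, curr_sum, 2)
```

Because the rule compares averaged solutions, the change shrinks as l grows even if individual sparse solutions keep varying, so rule (a) alone cannot guarantee that all column pairs have been compared; that is the role of rule (b).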

Bioinformatics analysis

For each top gene selected, we used a biomedical data analysis tool, the Rat Genome Database (RGD), for bioinformatics analysis. The bioinformatics analysis was based on Human Genome Assembly GRCh37 (Genome Reference Consortium Human genome build 37) [13]. The inputs to RGD are the genes corresponding to the selected SNPs and expression probes. The outputs include the quantitative trait locus (QTL) study name, logarithm of odds (LOD) score, p value, trait and sub-trait. Significant variables (SNPs/gene expressions with LOD score > 3) were reported.
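
The LOD threshold can be applied as in the sketch below. A LOD score is a base-10 logarithm of an odds ratio, so LOD = 3 corresponds to odds of 1000 to 1. The record layout (dictionaries with `gene`, `trait` and `lod` keys) and the example values are hypothetical and do not reflect the actual RGD output format.

```python
def lod_to_odds(lod):
    """Convert a LOD score (log10 of an odds ratio) back to the odds."""
    return 10.0 ** lod


def significant_hits(records, threshold=3.0):
    """Keep records whose LOD score is strictly above the threshold."""
    return [r for r in records if r["lod"] > threshold]


# Made-up annotation records for illustration only.
toy_records = [
    {"gene": "gene_A", "trait": "blood pressure", "lod": 4.2},
    {"gene": "gene_B", "trait": "body weight", "lod": 2.5},
    {"gene": "gene_C", "trait": "left ventricular contractility", "lod": 3.0},
]
hits = significant_hits(toy_records)
```

Note that the comparison is strict, matching the "LOD score > 3" criterion, so a record with a score of exactly 3 is not reported.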


The impact of Age, Sex, MS and SS on BP

Table 2 lists the regression coefficients for Age, MS, SS and Sex. We obtained these coefficients by solving Eq. (1) using an LLS approach. Figure 1 presents the SBP and DBP measures on the 397 subjects before and after the regression.
Table 2

Regression coefficients between BP (SBP/DBP) and 4 clinical measures: Age, MS, SS and Sex

Corr before/after regression

m = 397

The regression coefficients were obtained from the linear regression model given by Eq. (1), fitted using the least squares approach. 'Corr' is the Pearson correlation coefficient

Fig. 1

Blood pressure phenotypes of 397 subjects. SBP-res and DBP-res are the residual y of regression problem given by Eq. (1) for SBP and DBP, respectively; x-axis represents the subjects; y-axis represents the blood pressure phenotypes at each subject

It can be seen from Fig. 1 that the residuals SBP-res and DBP-res were strongly correlated (Pearson correlation coefficient > 0.82). Therefore, to select BP related genetic variables (SNP/gene expression), we focused on the case using SBP-res as the phenotype y in Eq. (2).
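
For reference, the Pearson correlation used to compare the two residual vectors can be computed as in the following sketch. The function name `pearson` is hypothetical, and the example vectors are not study data.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


# Perfectly aligned and perfectly opposed toy vectors.
r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
```

Applied to the SBP and DBP residuals, a value above 0.82 indicates that the two adjusted phenotypes carry largely the same information, which motivates analyzing only one of them.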

Sparse representation-based variable selection

Figure 2 describes the variable selection results for the data set. Specifically, we analyzed the top 1000 variables (SNPs/gene expressions) selected using the SRVS method from the integrated data set consisting of 11522 gene expression probes and 354893 SNPs. Among those variables, 575 SNPs and 425 expressions were selected, corresponding to 756 genes in total. Figure 2 presents the number of SNPs and gene expressions selected in the top 100 to 1000 variables.
Fig. 2

Number of SNPs and gene expressions selected in the top 100 to 1000 variables selected

Bioinformatics analysis

For each of the top 1000 variables (SNPs/gene expressions = 575/425), we performed a bioinformatics analysis using RGD as a validation effort, aiming to explore the biological relevance of the selected SNPs and expression signals. Here we define "significant association between genes and disease" as an LOD score greater than 3. Figure 3 presents the detailed analysis results. Among those 1000 variables, 302 were plausibly linked to BP (LOD score > 3), 173 were linked to body weight and 84 were associated with left ventricular contractility. Together, 55.9 % of the top 1000 variables revealed association with BP related disease (SNP/gene expression = 348/211), corresponding to 330 genes.
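
The reported totals can be checked directly. Assuming the three trait categories are counted without overlap (an assumption, although the matching totals suggest it), the counts add up as follows:

$$ 302 + 173 + 84 = 559 = 348 + 211, \qquad \frac{559}{1000} = 55.9\,\% $$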
Fig. 3

LOD analysis results for the top 1000 variables selected. a Pie plot of the variable distribution for the top 1000 variables. b Bar plot for the number of variables linked to left ventricular, body weight and blood pressure in the top 100 to 1000 variables selected


In this study, we integrated gene expression and SNP data to select BP related biomarkers using a sparse representation based method, SRVS [10]. The potential influence of 4 covariates on SBP was regressed out and the residuals were then used as the phenotype vector for genomic variable selection. Bioinformatics analysis [13] was performed to study the association of the selected markers/genes with BP-related disease.

In addition to genomic factors, environmental factors also play an important role in BP. Therefore, regressing out their potential influence on BP is necessary for the genomic analysis. In this study, we first calculated the regression coefficients for the regression of SBP on 4 confounders: Age, MS, SS and Sex. The results (see Table 2) indicated that BP was positively associated with age, SS and sex, and negatively associated with MS. Nevertheless, age had a weaker impact on BP compared with the other 3 measures, whereas sex seemed to play the most important role among the 4 factors. In addition, the correlations between SBP and DBP before and after regressing out the effect of those influential factors (0.25 vs. 0.82) may indicate that those measures had different influence on SBP and DBP. Because the residuals of SBP and DBP after regression showed strong correlation (Pearson correlation coefficient > 0.82; see Fig. 1), we chose to focus on SBP residual based analysis.

Using the SBP residual and integrated data as inputs, the SRVS algorithm ranked the 366415 variables (11,522 gene expression signals and 354,893 SNPs) in descending order, based on their contribution to SBP. We focused on the top 1000 variables. Interestingly, although there are many more SNPs than gene expression probes (354,893 vs. 11,522), a similar number of SNPs and expression signals were selected (SNPs/gene expressions = 575/425). Moreover, the selected gene expression signals dominated the top 400 variables (>90 %), as shown in Fig. 2. This may suggest that gene expression signals are more closely related to the disease phenotype in this data set. However, we would like to point out that non-independence may raise false positive rates in analysis of both SNP data and expression data.
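
To put the near-equal counts in perspective, the selection rate within each data type can be computed from the numbers above (a simple illustration, not an analysis reported in the study):

$$ \frac{425}{11522} \approx 3.69\,\%, \qquad \frac{575}{354893} \approx 0.16\,\%, \qquad \frac{425/11522}{575/354893} \approx 23 $$

That is, an expression probe was roughly 23 times more likely than a SNP to appear among the top 1000 variables.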

For each of the top 1000 variables selected, we used an online bioinformatics tool, RGD, to validate the selected variables and identify the biologically meaningful SNPs and expression signals. Among the 425 gene expression signals selected, RGD provided evidence of strong association with BP phenotypes (i.e., body weight, BP and left ventricular contraction) for approximately half (211/425), as depicted in Fig. 3. It has been reported that obesity can lead to increased risk of heart disease and high BP [14], while the left ventricle influences BP directly.

Among the top 500 to 1000 selected variables, more SNPs than gene expression signals were selected, as shown in Fig. 2. In addition, more left ventricular contractility related genes were identified. In total, approximately 60 % of the selected SNPs were identified as "BP related" (348/575) (LOD score > 3). This observation may suggest that, although SNPs are unlikely to directly cause the disease phenotypes, they may affect the development of BP related diseases by regulating RNA expression.

It should be noted that, while most genes were identified using 1 marker (either SNP or gene expression), some newly identified genes were selected multiple times by different markers. Those genes include GNB1, MEGF6, MMEL1, MORN1, PANK4, PLCH2, PRDM16, PRKCZ, and TP73. These genes are worth further study.

Among the top 1000 variables selected, 44 % do not show strong association with BP (Enrichment LOD score < 3). However, for many of the remaining genes there was evidence of weak linkage (Enrichment LOD > 2) and some demonstrated strong linkage to BP in rat studies [13]. Because of the lack of space, we did not include a detailed discussion of these variables.

Of note, both case and control groups included family members. Although the shared genetic factors may enrich true signals and therefore help to detect potential biomarkers that may be missed in independent subject analysis, this familial correlation may also lead to increased false positives. Therefore, further analysis using independent samples of larger size should be performed to validate the results reported here and to study the correlations between the selected variables. We would like to note that this work focuses more on the feasibility of our sparse algorithm than the discovery of true biomarkers.


Using our SRVS based integrated analysis of gene expression and SNP data sets, we ranked 11522 gene expression measurements and 354893 SNPs and then performed bioinformatics analysis on each of the top 1000 variables selected. Results showed that 559 variables (SNPs/gene expressions), corresponding to 330 genes, may serve as potential biomarkers for BP related disease (LOD score > 3). Nevertheless, a portion of the selected variables are likely to be false positives. Molecular validation is needed before any solid conclusions can be made. However, results of the current study demonstrate the feasibility of the SRVS algorithm for a comprehensive analysis of multiple data sets of different structure.



CH, GW, HQ, XM, YT, LB and SYY are supported by the Intramural Program of National Institute of Mental Health (NIMH) (MH002930-03). The views expressed in this manuscript do not necessarily represent the views of the NIMH, National Institutes of Health, Health and Human Services, or the United States Government.


This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.

Authors’ contributions

CH and SYY conceived of and designed the study, and performed data annotation and analysis. The manuscript was analyzed by YYS and written by CH, GW, HQ, XM, LB, YT and SYY. All authors read and approved the manuscript.

Competing interests

The authors declare they have no competing interests.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

Unit on Statistical Genomics, Division of Intramural Research Programs, National Institute of Mental Health, National Institutes of Health


  1. Buck CW, Donner AP. Factors affecting the incidence of hypertension. CMAJ. 1987;136(4):357–60.
  2. Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, et al. Genome-wide association study of blood pressure and hypertension. Nat Genet. 2009;41(6):677–87.
  3. Flister MJ, Tsaih SW, O’Meara CC, Endres B, Hoffman MJ, Geurts AM, et al. Identifying multiple causative genes at a single GWAS locus. Genome Res. 2013;23(12):1996–2002.
  4. International Consortium for Blood Pressure Genome-Wide Association Studies, Ehret GB, Munroe PB, Rice KM, Bochud M, Johnson AD, et al. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011;478(7367):103–9.
  5. Kochunov P, Glahn D, Lancaster J, Winkler A, Kent Jr JW, Olvera RL, et al. Whole brain and regional hyperintense white matter volume and blood pressure: overlap of genetic loci produced by bivariate, whole-genome linkage analyses. Stroke. 2010;41(10):2137–42.
  6. Adeyemo A, Gerry N, Chen G, Herbert A, Doumatey A, Huang H, et al. A genome-wide association study of hypertension and blood pressure in African Americans. PLoS Genet. 2009;5(7):e1000564.
  7. Hoffmann J, Wilhelm J, Marsh LM, Ghanim B, Klepetko W, Kovacs G, et al. Distinct differences in gene expression patterns in pulmonary arteries of patients with chronic obstructive pulmonary disease and idiopathic pulmonary fibrosis with pulmonary hypertension. Am J Respir Crit Care Med. 2014;190(1):98–111.
  8. Padmanabhan S, Newton-Cheh C, Dominiczak AF. Genetic basis of blood pressure and hypertension. Trends Genet. 2012;28(8):397–408.
  9. Cao H, Duan J, Lin D, Calhoun V, Wang Y. Integrating fMRI and SNP data for biomarker identification for Schizophrenia with a sparse representation based variable selection method. BMC Med Genomics. 2013;6 Suppl 3:S2.
  10. Cao H, Duan J, Lin D, Shugart YY, Calhoun V, Wang Y. Sparse Representation Based Biomarker Selection for Schizophrenia with Integrated Analysis of fMRI and SNPs. Neuroimage. 2014;102(Pt 1):220–8.
  11. Cao H, Lei S, Deng HW, Wang YP. Identification of genes for complex diseases using integrated analysis of multiple types of genomic data. PLoS One. 2012;7(9):e42755.
  12. Fisher RA, Yates F. Statistical tables for biological, agricultural and medical research. OCLC 14222135. London: Oliver & Boyd; 1948. p. 26–7.
  13. Laulederkind SJ, Hayman GT, Wang SJ, Smith JR, Lowry TF, Nigam R, et al. The Rat Genome Database 2013--data, tools and users. Brief Bioinform. 2013;14(4):520–6.
  14. Haslam DW, James WP. Obesity. Lancet. 2005;366(9492):1197–209.


© The Author(s). 2016