Predicting qualitative phenotypes from microarray data – the Eadgene pig data set
© Robert-Granié et al; licensee BioMed Central Ltd. 2009
Published: 16 July 2009
The aim of this work was to study the performance of 2 predictive statistical tools on a data set that was given to all participants of the Eadgene-SABRE Post Analyses Working Group, namely the pig data set of Hazard et al. (2008). The data consisted of the expression levels of 3686 genes measured on 24 animals from 2 genotypes and 2 treatments. The objective was to find, among the whole set of genes, biomarkers that characterized the genotypes and the treatments.
We first considered the Random Forest approach that enables the selection of predictive variables. We then compared the classical Partial Least Squares regression (PLS) with a novel approach called sparse PLS, a variant of PLS that adapts lasso penalization and allows for the selection of a subset of variables.
All methods performed well on this data set. The sparse PLS outperformed the PLS in terms of prediction performance and improved the interpretability of the results.
We recommend the use of machine learning methods such as Random Forest and multivariate methods such as sparse PLS for prediction purposes. Both approaches are well adapted to transcriptomic data where the number of features is much greater than the number of individuals.
Often, an important goal of transcriptomic analyses is to identify differentially expressed genes: the expression level of each gene is explained by the phenotype in a linear model setting (regression for a quantitative phenotype, ANOVA for a qualitative one). Another important goal is to find biomarkers, i.e. genes that have a high predictive value for the phenotype. One statistical method that can be considered is discriminant analysis, where the phenotype is modelled as a linear combination of a subset of gene expressions. However, in transcriptomic data, gene expressions are highly correlated, which leads to multicollinearity problems in discriminant analysis. Furthermore, the number of variables is much greater than the number of individuals, which limits the use of this method. To circumvent these problems, one can pre-select the variables, or use approaches that can deal with a large number of variables.
In this study, the objective was to assess the behaviour of 2 prediction methods on one data set [1, 2], where the phenotype is a set of factors, genotype or breed (Large White denoted LW, and Meishan denoted MS), and treatment (control and ACTH, coded c and a). Expression levels were available for 3686 genes on 24 animals. We first focus on the machine learning approach Random Forest (RF), and then on the modelling approach sparse Partial Least Squares (sPLS) to analyse this data set.
Random forests (Breiman, 2001) is a classification algorithm that exploits the instability and limited accuracy of individual Classification and Regression Trees (CART) classifiers (Breiman et al., 1984) by aggregating many of them. It combines two sources of randomness that improve the prediction accuracy: bagging (bootstrap aggregating) and random feature selection to construct each CART. This results in a low correlation between the individual trees as well as low bias and low variance. The individual trees Tk are constructed as follows:
- N bootstrap samples (B1,..., BN) are drawn from the original data.
- Each sample Bk (k = 1,..., N) is used as a training set to construct an unpruned tree Tk. Let p be the number of input variables. At each node of Tk, m variables are randomly selected (m << p), where m is held constant while the forest is grown, and the best split among these m predictors is used to split the node.
While constructing each tree Tk, about one-third of the cases are left out of the bootstrap sample and are not used in its construction. These data are called "Out-of-bag" or OOB data and are used as an "internal" test set for each tree that is grown.
The OOB predictions are then aggregated, and the resulting error rate, called the "OOB error estimate", is computed for the whole forest; it should provide an accurate and unbiased estimate of the generalisation error.
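The two sources of randomness and the out-of-bag fraction can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the seed, the number of simulated trees and the choice m = 60 (roughly the square root of 3686, a common default) are assumptions made for illustration and are not taken from the study.

```python
import random

def bootstrap_with_oob(n_cases, rng):
    """Draw one bootstrap sample of size n_cases (with replacement).

    Returns the in-bag indices and the out-of-bag (OOB) indices,
    i.e. the cases that were never drawn.
    """
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    oob = [i for i in range(n_cases) if i not in drawn]
    return in_bag, oob

def random_feature_subset(p, m, rng):
    """Pick m distinct candidate variables out of p for one node split."""
    return rng.sample(range(p), m)

rng = random.Random(0)          # fixed seed, illustrative only
n_cases, n_trees = 24, 2000     # 24 animals as in the data set; 2000 is arbitrary

oob_fractions = []
for _ in range(n_trees):
    _, oob = bootstrap_with_oob(n_cases, rng)
    oob_fractions.append(len(oob) / n_cases)
mean_oob = sum(oob_fractions) / n_trees

# Candidate variables examined at a single node (m = 60 is an assumption).
candidates = random_feature_subset(p=3686, m=60, rng=rng)
print(f"mean OOB fraction: {mean_oob:.3f}")
```

With 24 cases, the expected OOB fraction is (1 - 1/24)^24, about 0.36, which is the "about one-third" mentioned above.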
The Mean Decrease Accuracy measure was used in this study as a feature selection criterion: the OOB data are used to estimate the importance of each variable by evaluating its contribution to the prediction accuracy. The values of each variable in the OOB cases are randomly permuted, and the permuted cases are run down the tree. The proportion of correctly classified cases with the permuted OOB data is then subtracted from the proportion of correctly classified cases with the intact OOB data. The Mean Decrease Accuracy averages the difference between these two accuracies over all trees in the forest and normalizes it by the standard error.
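The permutation step can be sketched as follows. This is not the randomForest implementation: a hypothetical one-split "stump" stands in for a grown tree, the toy data are simulated, repeated permutations stand in for the trees of a forest, and the normalization by the standard error is omitted.

```python
import random

def accuracy(predict, rows, labels):
    """Proportion of cases assigned to the correct class."""
    correct = sum(1 for x, y in zip(rows, labels) if predict(x) == y)
    return correct / len(rows)

def permutation_decrease(predict, rows, labels, var, rng):
    """Accuracy on intact OOB cases minus accuracy after permuting column `var`."""
    baseline = accuracy(predict, rows, labels)
    column = [x[var] for x in rows]
    rng.shuffle(column)
    permuted = [x[:var] + [v] + x[var + 1:] for x, v in zip(rows, column)]
    return baseline - accuracy(predict, permuted, labels)

def stump(x):
    """Hypothetical stand-in for one tree: a single split on variable 0."""
    return "MS" if x[0] > 0.0 else "LW"

rng = random.Random(1)
labels = ["LW"] * 6 + ["MS"] * 6
# Variable 0 separates the two classes; variable 1 is pure noise.
rows = [[(-1.0 if y == "LW" else 1.0) + rng.gauss(0, 0.1), rng.gauss(0, 1)]
        for y in labels]

n_repeats = 200  # repeated permutations, standing in for trees
decrease = []
for var in range(2):
    diffs = [permutation_decrease(stump, rows, labels, var, rng)
             for _ in range(n_repeats)]
    decrease.append(sum(diffs) / n_repeats)

print(decrease)
```

Permuting the informative variable roughly halves the accuracy of the stump, whereas permuting the variable it never uses leaves the accuracy unchanged, so only the first variable receives a non-zero importance.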
PLS and Sparse PLS
The loading vectors uh and vh are the p- and q-dimensional vectors associated, for each PLS dimension h, with the X and Y data sets respectively, where X is the n × p matrix of gene expressions and Y the n × q phenotype matrix. The associated latent variables are defined as ξh = Xuh and ωh = Yvh. As in PCA, the loading vectors uh and vh are directly interpretable, as they indicate the importance of the variables from both data sets in relation to each other. The latent variables ξh and ωh, which are n-dimensional vectors, contain the information regarding the similarities or dissimilarities between the individuals or samples. PLS is an iterative method that is suitable for high dimensional data sets and has a valuable stability property. However, it does not perform feature selection, which makes the results difficult to interpret in the n << p setting. Lê Cao et al. (2008) proposed a sparse version of PLS that combines variable selection and modelling in a one-step procedure for such problems. The sparse PLS (denoted sPLS) applies a Lasso penalty (Tibshirani, 1996) to the loading vectors, with the PLS problem solved through a Singular Value Decomposition (Shen and Huang, 2008).
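The effect of a Lasso penalty on a loading vector can be sketched with the soft-thresholding operator. This shows the penalization step only, not the full sPLS algorithm; the helper names, the toy loading vector and the rule of setting the threshold from the number of variables to keep are assumptions made for illustration.

```python
import math

def soft_threshold(value, lam):
    """Lasso soft-thresholding: sign(value) * max(|value| - lam, 0)."""
    magnitude = abs(value) - lam
    if magnitude <= 0:
        return 0.0
    return math.copysign(magnitude, value)

def sparsify_loading(loading, n_keep):
    """Shrink a loading vector so that at most n_keep entries stay non-zero.

    The threshold is set to the (n_keep + 1)-th largest absolute loading
    (an illustrative rule); the result is rescaled to unit norm.
    With ties at the threshold, fewer than n_keep entries may remain.
    """
    abs_sorted = sorted((abs(v) for v in loading), reverse=True)
    lam = abs_sorted[n_keep] if n_keep < len(loading) else 0.0
    shrunk = [soft_threshold(v, lam) for v in loading]
    norm = math.sqrt(sum(v * v for v in shrunk))
    if norm == 0.0:
        return shrunk
    return [v / norm for v in shrunk]

# Toy loading vector over 6 hypothetical genes.
u = [0.70, -0.05, 0.10, -0.60, 0.02, 0.35]
sparse_u = sparsify_loading(u, n_keep=3)
selected = [j for j, v in enumerate(sparse_u) if v != 0.0]
print(selected)
```

The genes with non-zero entries in the sparse loading vector are the ones selected on that dimension, which is what makes the sparse loadings easier to interpret than the dense PLS loadings.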
Two criteria were used to select the dimension size H and the number of predictive genes to select on each dimension of sPLS: the Root Mean Squared Error Prediction (RMSEP), and the Qh2 that measures the marginal contribution of each latent variable to the predictive power of the PLS model. Briefly, 1-Qh2 is the ratio of the average PRediction Error Sum of Squares to the average of the Residual Sum of Squares, over the variables (refer to  for more details).
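For reference, the two criteria can be written out as below, assuming the usual definitions (as in Tenenhaus), which the text above does not spell out. Here q is the number of Y variables, PRESS_{kh} is the cross-validated prediction error sum of squares for variable k with h dimensions, RSS_{k(h-1)} is the residual sum of squares for variable k with h-1 dimensions, and \hat{y}_i is the cross-validated prediction for sample i. Taking the RSS term at dimension h-1 is an assumption drawn from that standard definition.

```latex
Q_h^2 = 1 - \frac{\sum_{k=1}^{q} \mathrm{PRESS}_{kh}}{\sum_{k=1}^{q} \mathrm{RSS}_{k(h-1)}},
\qquad
\mathrm{RMSEP} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```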
The methods used to analyse this data set are implemented in the R software: Random Forest in the randomForest package (Liaw and Wiener, 2002), and sPLS in the R package "integrOmics", available at www.math.univ-toulouse.fr/biostat.
Random Forest (RF) does not require fine-tuning of its parameters. In this study, however, random forest classifications were run with 10000 trees in order to obtain stable results. When applied to the whole data set (3686 genes), RF gave reasonably high predictive power, with an OOB error estimate of 12.5%. After pre-selection of the differentially expressed genes (662 genes with an FDR < 20%), the prediction was perfect (OOB error estimate of 0%).
PLS and sPLS
Recall that Y is the phenotype matrix of the 4 indicators (LWa, LWc, MSa, MSc) for each animal. The number of dimensions H to be retained was estimated with the Qh2 criterion, for which a value at or above the threshold 0.0975 indicates that dimension h contributes significantly to the prediction [4, 9]. The Qh2 values calculated for each dimension of the PLS and the sPLS showed that 2 dimensions were enough to capture the information for both PLS and sPLS. An equivalent coding for Y is the 2-column matrix of genotype and treatment factors, which will be considered in the following.
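The two codings of Y can be sketched as follows. The 0/1 values chosen for the genotype and treatment columns, the helper names and the short list of example animals are assumptions for illustration; the actual group sizes are not given here.

```python
# The 4 phenotype classes: genotype (LW, MS) crossed with treatment (a = ACTH, c = control).
classes = ["LWa", "LWc", "MSa", "MSc"]

def indicator_row(label):
    """4-column indicator coding: a single 1 in the column of the animal's class."""
    return [1 if label == c else 0 for c in classes]

def factor_row(label):
    """2-column coding: [genotype, treatment], with MS = 1 and ACTH = 1 (assumed)."""
    genotype = 1 if label.startswith("MS") else 0
    treatment = 1 if label.endswith("a") else 0
    return [genotype, treatment]

# Hypothetical class labels for a handful of animals.
animals = ["LWa", "MSc", "MSa", "LWc", "LWa"]
Y_indicator = [indicator_row(a) for a in animals]
Y_factors = [factor_row(a) for a in animals]

for a, yi, yf in zip(animals, Y_indicator, Y_factors):
    print(a, yi, yf)
```

Each of the 4 classes maps to a distinct pair of factor values, so both codings carry the same class information.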
With the number of dimensions fixed at 2, the optimal number of genes to select on each dimension (the same number on both dimensions, for the sake of simplicity) was determined with the RMSEP for both sPLS and PLS. The lowest RMSEP was obtained with 10 genes on each dimension. In this case, sPLS gave better predictions than PLS (not shown).
Figure 4 clearly illustrates the superiority of sPLS over PLS in terms of interpretability, as PLS does not allow for variable selection.
The list of significantly differentially expressed genes (t test) did not exactly match the list of sPLS predictive genes. This shows that the 2 approaches capture information that may be complementary as well as relevant.
Due to the clear structuring of the data, it is difficult to compare the performances of the statistical prediction approaches. A thorough biological interpretation of the results is now needed to validate the use of these methods.
In the case where q > 1 in the Y matrix, few other approaches have been developed for variable selection and integration of two-block data sets, based on the elastic net procedure (Zou and Hastie, 2005) or on shrinkage methods (Bondell and Reich, 2008). However, most of them focus on canonical analysis, i.e. a symmetric relationship between the data sets, which is not the case in this study. The reader can refer to Waaijenborg et al. (2008) (Canonical Correlation Analysis with Elastic Net) or Culhane et al. (2003) (Co-inertia analysis, from Doledec and Chessel, 1994) for biological data sets.
In the case where q = 1, as with RF when we combined the phenotypes into a single class vector, we face a typical multiclass problem. Several feature selection approaches that can deal with more than 2 classes have been developed, among them Recursive Feature Elimination (Guyon et al., 2002), Nearest-Shrunken Centroid (Tibshirani et al., 2002) and Optimal Feature Weighting (Lê Cao et al., 2007, 2009).
The differential analysis in Hazard et al. (2008) and the 2 predictive approaches presented here gave coherent, similar but complementary insights. On this data set, however, expression patterns were so different among the 4 classes that the conclusions of the comparisons between the above statistical tools should not be generalised.
In microarray data, the statistical criteria are often limited by the small number of samples. Therefore, it is strongly recommended to combine statistical assessments with a sound biological interpretation of the data, as was shown for example in . They showed the importance of the interpretation of the results and found interesting complementarities between predictive approaches in several data sets, in terms of biological processes. Therefore, we also recommend the use of various predictive statistical tools when searching for biomarkers.
We thank the Eadgene network of excellence.
This article has been published as part of BMC Proceedings Volume 3 Supplement 4, 2009: EADGENE and SABRE Post-analyses Workshop. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S4.
- Hazard D, Liaubet L, SanCristobal M, Mormede P: Gene array and real time PCR analysis of the adrenal sensitivity to adrenocorticotropic hormone in pig. BMC Genomics. 2008, 9: 101. 10.1186/1471-2164-9-101.
- Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Belmont, CA. 1984.
- Wold H: Multivariate Analysis. 1966, New York: Wiley.
- Wold H, Eriksson L, Trygg J, Kettaneh N: The PLS method – Partial Least Squares projection to latent structures – and its applications in industrial RDP (research, development, and production). 2004, Umeå University.
- Lê Cao KA, Rossouw D, Robert-Granie C, Besse P: A Sparse PLS for Variable Selection when Integrating Omics Data. Stat Appl Genet Mol Biol. 2008, 7 (1): Article 35.
- Tibshirani R: Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met. 1996, 58 (1): 267-288.
- Shen H, Huang JZ: Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis. 2008, 99 (6): 1015-1034. 10.1016/j.jmva.2007.06.007.
- Tenenhaus M: La régression PLS: théorie et pratique. Technip. 1998.
- Liaw A, Wiener M: Classification and Regression by randomForest. R News. 2002, 2: 18-22.
- Bonnet A, Le Cao KA, SanCristobal M, Benne F, Robert-Granie C, Law-So G, Fabre S, Besse P, De Billy E, Quesnel H, et al: In vivo gene expression in granulosa cells during pig terminal follicular development. Reproduction. 2008, 136 (2): 211-224. 10.1530/REP-07-0312.
- Baccini A, Besse P, Déjean S, Martin PGP, Robert-Granie C, San Cristobal M: Stratégies pour l'analyse de données transcriptomiques. Journal de la Société Française de Statistique. 2005, 146: 5-44.
- Zou H, Hastie T: Regularization and variable selection via the elastic net. J Roy Stat Soc B. 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.x.
- Bondell HD, Reich BJ: Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008, 64 (1): 115-123. 10.1111/j.1541-0420.2007.00843.x.
- Waaijenborg S, Hamer PCVDW, Zwinderman AH: Quantifying the association between gene expressions and DNA-Markers by penalized canonical correlation analysis. Stat Appl Genet Mol Biol. 2008, 7 (1): Article 3.
- Culhane AC, Perriere G, Higgins DG: Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003, 4: 59. 10.1186/1471-2105-4-59.
- Doledec S, Chessel D: Co-Inertia Analysis – an Alternative Method for Studying Species Environment Relationships. Freshwater Biol. 1994, 31 (3): 277-294. 10.1111/j.1365-2427.1994.tb01741.x.
- Guyon I, Weston J, Barnhill S, Vapnik V: Support vector machine with recursive feature selection. Mach Learn. 2002, 46: 389-422. 10.1023/A:1012487302797.
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA. 2002, 99 (10): 6567-6572. 10.1073/pnas.082099299.
- Lê Cao K-A, Bonnet A, Gadat S: Multiclass classification and gene selection with a stochastic algorithm. Computational Statistics and Data Analysis. 2009.
- Lê Cao KA, Goncalves O, Besse P, Gadat S: Selection of biologically relevant genes with a wrapper stochastic algorithm. Stat Appl Genet Mol Biol. 2007, 6 (1): Article 29.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.