Predicting qualitative phenotypes from microarray data – the Eadgene pig data set
 Christèle Robert-Granié^{1},
 Kim-Anh Lê Cao^{1} and
 Magali SanCristobal^{2}
https://doi.org/10.1186/1753-6561-3-S4-S13
© Robert-Granié et al; licensee BioMed Central Ltd. 2009
Published: 16 July 2009
Abstract
Background
The aim of this work was to study the performance of 2 predictive statistical tools on a data set that was given to all participants of the Eadgene-SABRE Post Analyses Working Group, namely the Pig data set of Hazard et al. (2008). The data consisted of expression levels for 3686 genes measured on 24 animals partitioned into 2 genotypes and 2 treatments. The objective was to find, among the whole set of genes, biomarkers that characterized the genotypes and the treatments.
Methods
We first considered the Random Forest approach that enables the selection of predictive variables. We then compared the classical Partial Least Squares regression (PLS) with a novel approach called sparse PLS, a variant of PLS that adapts lasso penalization and allows for the selection of a subset of variables.
Results
All methods performed well on this data set. The sparse PLS outperformed the PLS in terms of prediction performance and improved the interpretability of the results.
Conclusion
We recommend the use of machine learning methods such as Random Forest and multivariate methods such as sparse PLS for prediction purposes. Both approaches are well adapted to transcriptomic data where the number of features is much greater than the number of individuals.
Background
Often, an important goal of transcriptomic analyses is to identify differentially expressed genes; the expression level of each gene is explained by the phenotype in a linear model setting (either regression or ANOVA for a quantitative or a qualitative phenotype). Another important goal is to find biomarkers, i.e. genes that have a high predictive value for the phenotype. One statistical method that can be considered is discriminant analysis, where the phenotype is modelled as a linear combination of a subset of gene expressions. However, in the case of transcriptomic data, gene expressions are highly correlated, leading to multicollinearity problems in discriminant analysis. Furthermore, the high number of variables limits the use of this method. To circumvent this problem, one can preselect the variables, or use approaches that can deal with a large number of variables.
In this study, the objective was to assess the behaviour of 2 prediction methods on one data set [1, 2], where the phenotype is a set of factors, genotype or breed (Large White denoted LW, and Meishan denoted MS), and treatment (control and ACTH, coded c and a). Expression levels were available for 3686 genes on 24 animals. We first focus on the machine learning approach Random Forest (RF), and then on the modelling approach sparse Partial Least Squares (sPLS) to analyse this data set.
Methods
Random forests
Random forests [2] is a classification algorithm that takes advantage of the unstable property of Classification and Regression Trees (CART) classifiers [3] and their lack of accuracy by aggregating them. It combines two sources of randomness that improve the prediction accuracy: bagging (bootstrap aggregating) and random feature selection to construct each CART. This results in a low correlation of the individual trees as well as low bias and low variance. The individual trees T_{k} are constructed as follows:
 N bootstrap samples (B_{1},..., B_{N}) are drawn from the original data.
 Each sample B_{k} (k = 1,..., N) is used as a training set to construct an unpruned tree T_{k}. Let p be the number of input variables of the tree. For each node of T_{k}, m variables are randomly selected (m << p) to determine the decision at the node, where m is kept constant while the forest is grown. The best split among these m predictors is then chosen to split the node.

While constructing each tree T_{k}, about one-third of the cases are left out of the bootstrap sample and are not used in its construction. These data are called "Out-of-bag" or OOB data and are used as an "internal" test set for each tree that is grown.

The OOB predictions are then aggregated and the error rate, called the "OOB error estimate", is computed for the whole forest; it should provide an accurate and unbiased estimate of the generalisation error [2].
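The sampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the randomForest implementation: the function names are hypothetical, and the choice m = floor(sqrt(p)) is a common default for classification that the text does not specify. The sketch draws bootstrap samples of the 24 animals, records which animals are out-of-bag, and checks that the left-out fraction is close to (1 - 1/n)^n, i.e. roughly one-third.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob

def expected_oob_fraction(n):
    """Probability that a given case is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

def candidate_features(p, m, rng):
    """Pick, without replacement, the m of p variables examined at one node."""
    return rng.sample(range(p), m)

rng = random.Random(42)
n_animals, n_genes = 24, 3686      # sizes of the data set described in the text
m = int(n_genes ** 0.5)            # assumption: sqrt(p) rule, not stated in the text

# Average the OOB fraction over many bootstrap samples (one per tree).
n_trees = 2000
fractions = []
for _ in range(n_trees):
    _, oob = bootstrap_indices(n_animals, rng)
    fractions.append(len(oob) / n_animals)
mean_oob = sum(fractions) / n_trees

print("simulated OOB fraction:", round(mean_oob, 3))
print("theoretical OOB fraction:", round(expected_oob_fraction(n_animals), 3))
print("variables tried at one node:", len(candidate_features(n_genes, m, rng)))
```

With only 24 animals, each tree is therefore tested on a handful of OOB animals, which is why a large number of trees is needed before the aggregated OOB error stabilises.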
The Mean Decrease Accuracy measure was used in this study as a feature selection criterion, where the OOB data are used to obtain estimates of variable importance by evaluating their contribution to the prediction accuracy. The values of each variable in the OOB cases are randomly permuted and are run along the tree. The proportion of cases in the correct classes with permuted OOB data is then subtracted from the proportion of cases in the correct classes where OOB data have not been permuted. The Mean Decrease Accuracy averages the difference between these two accuracies over all trees in the forest and normalizes it by the standard error.
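As a rough illustration of this permutation measure, the following Python sketch computes the accuracy decrease for one tree and then scales the average over trees by its standard error. The helper names and the toy predictor are hypothetical, and returning the unscaled mean when the standard error is zero is a choice made here for the sketch, not something taken from the text.

```python
import random
import statistics

def accuracy(predict, rows, labels):
    """Proportion of cases assigned to the correct class."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)

def permutation_decrease(predict, oob_rows, oob_labels, feature, rng):
    """Accuracy on intact OOB data minus accuracy after permuting one variable."""
    baseline = accuracy(predict, oob_rows, oob_labels)
    shuffled = [row[feature] for row in oob_rows]
    rng.shuffle(shuffled)
    permuted = [row[:feature] + [v] + row[feature + 1:]
                for row, v in zip(oob_rows, shuffled)]
    return baseline - accuracy(predict, permuted, oob_labels)

def mean_decrease_accuracy(decreases):
    """Average the per-tree decreases and divide by their standard error."""
    mean = statistics.fmean(decreases)
    if len(decreases) < 2:
        return mean
    se = statistics.stdev(decreases) / len(decreases) ** 0.5
    return mean / se if se > 0 else mean  # assumption: no scaling when se is 0

# Toy "tree" that only looks at variable 0 (hypothetical predictor).
toy_tree = lambda row: int(row[0] > 0.5)
oob_rows = [[0.0, 5.0], [1.0, 3.0], [0.0, 1.0], [1.0, 7.0], [0.0, 2.0], [1.0, 9.0]]
oob_labels = [0, 1, 0, 1, 0, 1]
rng = random.Random(0)
print("decrease for unused variable 1:",
      permutation_decrease(toy_tree, oob_rows, oob_labels, 1, rng))
```

Permuting a variable that the tree never uses leaves the predictions unchanged, so its decrease is exactly zero; informative variables are those whose permutation degrades the OOB accuracy.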
PLS and Sparse PLS
Let X (n × p) denote the matrix of gene expressions and Y (n × q) the matrix of phenotypes, both measured on the same n samples. The loading vectors are the p- and q-dimensional vectors u_{h} and v_{h} for each PLS dimension h, associated with the X and Y data sets respectively. The associated latent variables are defined as ξ_{h} = Xu_{h} and ω_{h} = Yv_{h}. As in PCA, the loading vectors u_{h} and v_{h} are directly interpretable, as they indicate the importance of the variables from both data sets in relation to each other. The latent variables ξ_{h} and ω_{h}, which are n-dimensional vectors, contain the information regarding the similarities or dissimilarities between the individuals or samples [5]. PLS is an iterative method that is suitable for high-dimensional data sets and has a valuable stability property. However, it does not allow feature selection, which makes the results difficult to interpret when n << p. Lê Cao et al. [6] proposed a sparse version of PLS that combines variable selection and modelling in a one-step procedure for such problems. The sparse PLS (denoted sPLS) applies a Lasso penalty [7] to the loading vectors, using a Singular Value Decomposition formulation to solve the PLS [8].
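To make the effect of the penalty concrete, here is a minimal Python sketch of how a Lasso penalty produces sparse loadings. It assumes the usual soft-thresholding form of the penalty, as used in regularized SVD approaches such as [8]; the function names, the threshold value and the toy numbers are illustrative only and are not taken from the integrOmics package.

```python
import math

def soft_threshold(values, lam):
    """Elementwise soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    out = []
    for x in values:
        shrunk = max(abs(x) - lam, 0.0)
        out.append(0.0 if shrunk == 0 else math.copysign(shrunk, x))
    return out

def latent_variable(X, u):
    """xi = X u: one score per sample (row of X)."""
    return [sum(x_ij * u_j for x_ij, u_j in zip(row, u)) for row in X]

# Toy loading vector over 4 genes; small loadings are set exactly to zero.
u_dense = [0.9, -0.05, 0.3, -0.6]
u_sparse = soft_threshold(u_dense, 0.1)
selected = [j for j, u_j in enumerate(u_sparse) if u_j != 0.0]
print("sparse loadings:", u_sparse)
print("selected genes (column indices):", selected)

# Latent variable for 2 samples measured on the same 4 genes.
X = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
print("latent variable:", latent_variable(X, u_sparse))
```

Genes whose loading is shrunk to zero no longer contribute to the latent variable, which is how variable selection and modelling happen in the same step.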
Two criteria were used to select the number of dimensions H and the number of predictive genes to select on each dimension of sPLS: the Root Mean Squared Error of Prediction (RMSEP), and the Q_{h}^{2} criterion, which measures the marginal contribution of each latent variable to the predictive power of the PLS model. Briefly, 1 - Q_{h}^{2} is the ratio of the average PRediction Error Sum of Squares (PRESS) to the average Residual Sum of Squares (RSS), over the variables (refer to [9] for more details).
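Both criteria are simple to compute once the cross-validated predictions are available. The sketch below is illustrative only: the function names are hypothetical, and it assumes the usual definition in which the PRESS of dimension h is compared with the RSS of the model with h - 1 dimensions, summed over the q columns of Y (the ratio of sums equals the ratio of averages).

```python
import math

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction for one response variable."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def q2(press_h, rss_prev):
    """Q_h^2 = 1 - sum_k PRESS_kh / sum_k RSS_k(h-1), k indexing the columns of Y.

    press_h  : PRESS of dimension h, one value per column of Y
    rss_prev : RSS of the model with h - 1 dimensions, one value per column of Y
    """
    return 1.0 - sum(press_h) / sum(rss_prev)

# Toy values (not from the pig data set).
print("RMSEP:", rmsep([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
print("Q2:", q2([2.0, 2.0], [10.0, 10.0]))
```

A Q_{h}^{2} close to 1 means that adding dimension h leaves little prediction error relative to what the previous dimensions left unexplained.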
The methods used to analyse this data set are implemented in the R software: Random Forest in the randomForest package [10], and sPLS in the "integrOmics" package, available at www.math.univ-toulouse.fr/biostat.
Results
Random Forest
Random Forest (RF) does not require fine-tuning of its parameters. In this study, however, random forest classifications with 10000 trees were performed in order to obtain stable results. When applied to the whole data set (3686 genes), RF gave a reasonably high predictive power, with an OOB error estimate of 12.5%, i.e. 3 of the 24 animals misclassified. After a preselection of differentially expressed genes (662 genes with an FDR < 20%), the prediction was perfect (OOB error estimate of 0%).
PLS and sPLS
Recall that Y is the phenotype matrix of the 4 indicators (LWa, LWc, MSa, MSc) for each animal. The number of dimensions H to be retained was estimated with the Q_{h}^{2} criterion, for which a value at or above the threshold 0.0975 indicates a significant contribution of dimension h to the prediction [4, 9]. The Q_{h}^{2} values calculated for each dimension of the PLS and the sPLS showed that 2 dimensions were enough to capture the whole information for both PLS and sPLS. An equivalent coding for Y is the 2-column matrix of genotype and treatment factors, which is the coding considered in the following.
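The two codings of Y can be written down explicitly. In this minimal Python sketch the class labels come from the text, while the function names and the numeric conventions (LW = 0, MS = 1 for genotype; c = 0, a = 1 for treatment) are assumptions made for illustration.

```python
CLASSES = ["LWa", "LWc", "MSa", "MSc"]

def indicator_coding(labels):
    """n x 4 matrix with one indicator column per genotype-treatment class."""
    return [[1 if lab == c else 0 for c in CLASSES] for lab in labels]

def factor_coding(labels):
    """n x 2 matrix: genotype (LW=0, MS=1) and treatment (c=0, a=1)."""
    return [[int(lab[:2] == "MS"), int(lab[2] == "a")] for lab in labels]

# Four hypothetical animals, one per class.
animals = ["LWa", "LWc", "MSa", "MSc"]
for lab, ind, fac in zip(animals, indicator_coding(animals), factor_coding(animals)):
    print(lab, ind, fac)
```

Each of the 4 classes maps to a distinct pair (genotype, treatment), so no class information is lost when moving from the 4 indicator columns to the 2 factor columns.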
With the number of dimensions fixed at 2, the optimal number of genes to select on each dimension (the same number on both dimensions, for the sake of simplicity) was determined with the RMSEP for both sPLS and PLS. The lowest RMSEP was obtained with 10 genes on each dimension. In this case, sPLS gave better predictions than PLS (not shown).
Figure 4 clearly illustrates the superiority of sPLS over PLS in terms of interpretability, as PLS does not allow for variable selection.
The list of genes found significant by the t-test did not exactly match the list of sPLS predictive genes. This suggests that the 2 approaches capture complementary as well as relevant information.
Discussion
Due to the clear structuring of the data, it is difficult to compare the performances of the statistical prediction approaches. A thorough biological interpretation of the results is now needed to validate the use of these methods.
In the case where q > 1 in the Y matrix, a few other approaches have been developed for variable selection and integration of two-block data sets, based on the elastic net procedure [13] or on shrinkage methods [14]. However, most of them focus on a canonical analysis, i.e. a symmetric relationship between the data sets, which is not the case in this study. The reader can refer to [15] (Canonical Correlation Analysis with Elastic Net) or [16] (Co-inertia analysis from [17]) for applications to biological data sets.
In the case where q = 1, as with RF when we combined the phenotypes into one class vector, we face a typical multiclass problem. Several approaches have been developed for feature selection that can deal with more than 2 classes; among them, the reader can refer to Recursive Feature Elimination [18], Nearest Shrunken Centroids [19] or Optimal Feature Weighting [20].
Conclusion
The differential analysis in [1], and the 2 predictive approaches presented here gave coherent, similar but complementary insights. On this data set however, expression patterns were so different in the 4 classes that the conclusions of the comparisons between the above statistical tools are not to be generalised.
In microarray data, the statistical criteria are often limited by the small number of samples. Therefore, it is strongly recommended to combine statistical assessments with a sound biological interpretation of the data, as was shown for example in [21]. The authors of [21] showed the importance of interpreting the results and found interesting complementarities between predictive approaches in several data sets, in terms of biological processes. We therefore also recommend the use of various predictive statistical tools when searching for biomarkers.
Declarations
Acknowledgements
We thank the Eadgene network of excellence.
This article has been published as part of BMC Proceedings Volume 3 Supplement 4, 2009: EADGENE and SABRE Post-analyses Workshop. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S4.
References
1. Hazard D, Liaubet L, SanCristobal M, Mormede P: Gene array and real time PCR analysis of the adrenal sensitivity to adrenocorticotropic hormone in pig. BMC Genomics. 2008, 9: 101. 10.1186/1471-2164-9-101.
2. Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
3. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Belmont, CA. 1984.
4. Wold H: Multivariate Analysis. 1966, New York: Wiley.
5. Wold H, Eriksson L, Trygg J, Kettaneh N: The PLS method – Partial Least Squares projection to latent structures – and its applications in industrial RDP (research, development, and production). 2004, Umeå University.
6. Lê Cao KA, Rossouw D, Robert-Granié C, Besse P: A Sparse PLS for Variable Selection when Integrating Omics Data. Stat Appl Genet Mol Biol. 2008, 7 (1): Article 35.
7. Tibshirani R: Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met. 1996, 58 (1): 267-288.
8. Shen H, Huang JZ: Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis. 2008, 99 (6): 1015-1034. 10.1016/j.jmva.2007.06.007.
9. Tenenhaus M: La régression PLS: théorie et pratique. Technip. 1998.
10. Liaw A, Wiener M: Classification and Regression by randomForest. R News. 2002, 2: 18-22.
11. Bonnet A, Lê Cao KA, SanCristobal M, Benne F, Robert-Granié C, Law-So G, Fabre S, Besse P, De Billy E, Quesnel H, et al: In vivo gene expression in granulosa cells during pig terminal follicular development. Reproduction. 2008, 136 (2): 211-224. 10.1530/REP-07-0312.
12. Baccini A, Besse P, Déjean S, Martin PGP, Robert-Granié C, San Cristobal M: Stratégies pour l'analyse de données transcriptomiques. Journal de la Société Française de Statistique. 2005, 146: 5-44.
13. Zou H, Hastie T: Regularization and variable selection via the elastic net. J Roy Stat Soc B. 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.x.
14. Bondell HD, Reich BJ: Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008, 64 (1): 115-123. 10.1111/j.1541-0420.2007.00843.x.
15. Waaijenborg S, Hamer PCVDW, Zwinderman AH: Quantifying the association between gene expressions and DNA-Markers by penalized canonical correlation analysis. Stat Appl Genet Mol Biol. 2008, 7 (1): Article 3.
16. Culhane AC, Perriere G, Higgins DG: Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003, 4: 59. 10.1186/1471-2105-4-59.
17. Doledec S, Chessel D: Co-Inertia Analysis – an Alternative Method for Studying Species Environment Relationships. Freshwater Biol. 1994, 31 (3): 277-294. 10.1111/j.1365-2427.1994.tb01741.x.
18. Guyon I, Weston J, Barnhill S, Vapnik V: Support vector machine with recursive feature selection. Mach Learn. 2002, 46: 389-422. 10.1023/A:1012487302797.
19. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA. 2002, 99 (10): 6567-6572. 10.1073/pnas.082099299.
20. Lê Cao KA, Bonnet A, Gadat S: Multiclass classification and gene selection with a stochastic algorithm. Computational Statistics and Data Analysis. 2009.
21. Lê Cao KA, Goncalves O, Besse P, Gadat S: Selection of biologically relevant genes with a wrapper stochastic algorithm. Stat Appl Genet Mol Biol. 2007, 6 (1): Article 29.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.