Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

Background Genome wide association studies are now widely used in the livestock sector to estimate the association among single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more trait. As computational power increases, the use of machine learning techniques to analyze large genome wide datasets becomes possible. Methods The objective of this study was to identify SNPs associated with the three traits simulated in the 16th MAS-QTL workshop dataset using the Random Forest (RF) approach. The approach was applied to single and multiple trait estimated breeding values, and on yield deviations and to compare them with the results of the GRAMMAR-CG method. Results The two QTL mapping methods used, GRAMMAR-CG and RF, were successful in identifying the main QTLs for trait 1 on chromosomes 1 and 4, for trait 2 on chromosomes 1, 4 and 5 and for trait 3 on chromosomes 1, 2 and 3. Conclusions The results of the RF approach were confirmed by the GRAMMAR-CG method and validated by the effective QTL position, even if their approach to unravel cryptic genetic structure is different. Furthermore, both methods showed complementary findings. However, when the variance explained by the QTL is low, they both failed to detect significant associations.


Background
Genome wide association studies (GWAs) are now widely used in the livestock sector to estimate the association among multiple single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more trait. GWAs are typically carried out on a singlepoint by performing a marginal chi-square test or regression. However, these methods do not take into account linkage disequilibrium between markers and the genetic structure of the population that may have a large impact on structured populations (e.g. cattle populations). Approaches for genome wide pedigree-based quantitative trait loci (QTL) analysis have been developed (e.g. GRAMMAR-CG), which are based on mixed model and regression, where the genomic kinship matrix estimated through genomic marker data can be used to correct for familiar correlation and cryptic relatedness [1].
As computational power increases, the use of more advanced machine learning techniques to analyze large genome wide datasets becomes possible [2], these techniques include Support Vector Machines [3], Bayesian Networks [4] and Random Forest [5].
The Random Forests (RF) algorithm [6] is a machinelearning method that has been widely applied to classification and regression problems, and is particularly well suited to circumstances in which the number of potential explanatory variables exceeds the number of observations, as is the case for GWAs. The RF algorithm produces a collection of trees (forest), each grown on a different bootstrap sample of observations, and at each split (node) of a tree, a different random subset of predictors (SNP) is evaluated to identify the best split. The final scores are then calculated by aggregating predictions resulting from all the trees grown in the forest.
RF embraces a combination of characteristics that makes it appropriate for genetic applications: it is well suited for very large datasets; it is non-parametric, thus does not require a causal model to be specified, it is highly parallelizable and considers interactions between predictors.
The objective of this study was to identify SNPs associated to the three traits simulated in the 16th MAS-QTL workshop dataset using the Random Forest approach and to compare them with the results obtained by the Grammar-CG method. SNPs identified by both methods were verified with the actual QTL positions.

Dataset
The dataset used was provided by the organisers of the 16th QTLMAS workshop and consisted of 4080 individuals (G0 to G4). The simulated genome was 499.750 Mb consisting of 5 chromosomes carrying 2,000 equally distributed SNPs. The GWA analysis was conducted on 3000 samples, all females belonging to generations G1 to G3, for which phenotypic information for three traits (yield deviations) was provided. The analysis was performed on: yield deviations (YD1, YD2 and YD3), the estimated breeding values (EBV) obtained from a single trait model (tr1_ST, tr2_ST, tr3_ST) and the EBVs obtained from a multiple trait model (tr1_MT, tr2_MT, tr3_MT).

Variance components and EBV estimation
Variance components and EBVs were obtained separately, using REMLF90 and BLUPF90 programs, respectively [7]. The model used to estimate variance components and EBVs was: where μ is a general mean for the k th trait, GEN is a fixed effect for i generations (i = 1 to 3), Animal is a random animal effect with distribution~N(0,σ 2 a ), where σ 2 a is the additive genetic variance, and e is the random residual with distribution~N(0, σ 2 e ), where σ 2 e is the residual variance. Covariance between traits was considered only in multiple-trait analysis.

Random Forest
Feature selection (SNPs) analysis was performed with the randomForest package in R [8] using 3000 individuals and the 9042 SNPs that passed quality control checks out of the total 10000 SNP. The minimum size of the terminal nodes was set to 5. The number of trees grown was set to 1000. The subset of samples evaluated at each tree was 70% of the total number of samples (n = 2100). The number of variables evaluated at each node was set to the square root of the number of predictors (p = 94). All SNPs were ordered by Mean decrease Gini index [6] and the most strongly associated SNPs are at the top of the lists shown in Table 2, 3 and 4.

Grammar-CG
Genome-wide association analysis was performed with the GenABEL package in R using a three step GRAM-MAR-CG (Genome wide Association using Mixed Model and Regression -Genomic Control) approach [1,9].

Variance components and EBV estimation
Mean and standard deviations of the nine phenotypes used are shown in Table 1. The heritability estimates resulting from the single trait model were 0.38, 0.38 and 0.50 for trait 1, 2 and 3, respectively. Large genetic correlations between traits 1 and 2 were observed (0.83), whereas lower genetic correlation was observed for trait 2 and 3 (0.12). Negative correlation was observed between traits 1 and 3 (-0.44).

Association mapping
The two QTL mapping methods used, GRAMMAR-CG and RF, were successful in identified the largest QTLs  Tables 2, 3 and 4. Both methods showed good precision in the identification of the QTL in comparison with the "real" QTL     position provided by [10]. Interestingly the exact markers flankingthe QTL were identified for all traits. Differences however were observed depending on i) the phenotype analysed, YD, single trait EBV and Multiple trait EBV and on ii) the method used RF or GRAMMAR-CG.
With regard to Trait 1, the GRAMMAR-CG method identified 8 significant associations for multiple trait EBV, 6 for single trait EBVs and 10 for YD, only 4 of which are common between the three phenotypes. The RF approach identified the same number of markers per phenotype, but only 2 markers were in common between the methods of analysis and phenotype. The two markers identified by both approaches were the QTL which explained the largest variance, however, the other markers are all true associations and indicate that using different types of phenotypes for the same trait and different analysis methods may overlap, but may also show some differences in QTLs and positions. Traits 2 and 3 share the same pattern as observed for trait 1. Several QTL were identified in common between phenotypes and methods but just a few were in common between analysis methods: 2 markers for trait 2 and 3 markers for trait 3. When the YD phenotype was used, a larger number of significant SNPs were detected. This may be due to the larger variability of the YD compared to the more regressed EBV phenotypes (Table 1).
Interestingly both methods failed to identify the QTLs on chromosomes 4 and 5 for Trait 3. The variance explained by the markers is low, suggesting that both methods are not able to detect QTLs which explain a small amount of variance. The RF approach, however, detect the QTL on chromosomes 5 and 3 for Trait 1.
Overall the results of the RF were confirmed by the results of the GRAMMAR-CG method and were validated by the effective positions given the QTL. Interestingly, even though the RF approach does not directly use family structure information through a relationship matrix (genomic or additive), as is the case in the GRAMMAR-CG approach, correct identification of QTL positions is achieved.

Conclusions
In this study we proposed the use of recursive partitioning approaches such as Random Forest, as an alternative to traditional regression methods to detect the genetic loci. The results of the RF approach were consistent with those of the GRAMMAR-CG method and validated by the effective positions given for the QTL. However, when the variance explained by the QTL was low, both failed to detect a significant association.