### Data

The simulated data set contained 3226 individuals spanning five generations, of which 2326, constituting the first four generations, were phenotyped and genotyped for 10031 biallelic SNPs arrayed on a genome encompassing five chromosomes. The remaining 900 individuals, representing the fifth generation, had genotypic records but lacked phenotypic records for the single quantitative trait. The covariate for each genotype with alleles *A*_{1} and *A*_{2} was set to 1 for *A*_{1}*A*_{1}, -1 for *A*_{2}*A*_{2} and 0 for *A*_{1}*A*_{2} or *A*_{2}*A*_{1}.
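As an illustration, a minimal R sketch of this coding, assuming a hypothetical character matrix `geno` of genotype calls (individuals in rows, SNPs in columns); the matrix name and genotype labels are assumptions, not part of the original data set:

```r
## Minimal sketch (assumed input format): map genotype strings to the
## 1/0/-1 covariate coding described above.
code_snp <- function(g) switch(g,
                               "A1A1" =  1,  # homozygote for allele A1
                               "A2A2" = -1,  # homozygote for allele A2
                               "A1A2" =  0,  # heterozygote
                               "A2A1" =  0)  # heterozygote (reverse order)

## geno is a hypothetical character matrix of genotype calls;
## X becomes the individuals x SNPs covariate matrix used by all methods.
X <- apply(geno, c(1, 2), code_snp)
```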

### Random forests

RF regression uses an ensemble of unpruned decision trees, each grown using a bootstrap sample of the training data and a randomly selected subset of predictor variables as candidates for splitting tree nodes. The RF regression prediction for a new observation *x* is made by averaging the output of the ensemble of *B* trees as [5]:

$$\hat{f}_{\mathrm{rf}}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T(x;\, \Psi_{b}),$$

where *T*(*x*, Ψ_{b}) is the prediction of the *b*th tree and Ψ_{b} characterizes the *b*th RF tree in terms of split variables, cutpoints at each node, and terminal node values.

We implemented RF in the R package *randomForest* with decision trees as base learners [3]. Following various recommendations [2, 3], we evaluated different combinations of values of the number of trees to grow, *ntree* = {500, 1000, 2000}; the number of SNPs randomly selected at each tree node, *mtry* = {0.5, 1, 2} × the default value of *mtry* for regression (number of predictors/3); and the minimum size of terminal nodes of trees, below which no split is attempted, *nodesize* = 1. The parameter configuration with the highest prediction accuracy was *ntree* = 1000, *mtry* = 3000 and *nodesize* = 1. We ranked SNPs by the relative importance of their contributions to predictive accuracy, quantified by how much the prediction error increased when the out-of-bag data (the observations left out of the bootstrap sample) for a SNP were randomly permuted while the data for all other SNPs were left unchanged [2, 3].
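A minimal sketch of such a fit with *randomForest*, using the selected tuning values; the covariate matrix `X`, phenotype vector `y` and the seed are assumptions, not the authors' script:

```r
library(randomForest)

set.seed(1)  # arbitrary seed; RF draws random bootstrap samples and SNP subsets
rf_fit <- randomForest(x = X, y = y,
                       ntree      = 1000,  # number of trees to grow
                       mtry       = 3000,  # SNPs sampled as split candidates per node
                       nodesize   = 1,     # minimum terminal node size
                       importance = TRUE)  # compute permutation importance

## Permutation importance (%IncMSE): increase in out-of-bag prediction
## error when a SNP's values are permuted; used here to rank the SNPs.
snp_imp  <- importance(rf_fit, type = 1)
snp_rank <- order(snp_imp, decreasing = TRUE)
```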

### Stochastic Gradient Boosting

Boosting is an ensemble learning method for improving the predictive performance of classification or regression procedures, such as decision trees [5]. Gradient-boosted models can also handle interactions, automatically select variables, are robust to outliers, missing data and numerous correlated and irrelevant variables, and can construct variable importance in exactly the same way as RF [5]. Boosting iteratively adds basis functions in a greedy fashion such that each additional basis function further reduces the selected loss (error) function [5, 9]:

$$f(x) = \sum_{m=1}^{M} \beta_{m}\, b(x;\, \gamma_{m}),$$

where *β*_{m}, *m* = 1, 2, …, *M* are the basis expansion coefficients, and *b*(*x*, *γ*) are simple functions of the multivariate argument *x*, with a set of parameters *γ* = (*γ*_{1}, *γ*_{2}, …, *γ*_{M}).

We used regression trees as basis functions. Boosting regression trees involves generating a sequence of trees, each grown on the residuals of the previous tree [5, 9]. Prediction is accomplished by weighting the ensemble outputs of all the regression trees. We used stochastic gradient boosting with a Gaussian distribution for minimizing squared-error loss in the R package *gbm* [9]. We determined the main tuning parameter, the optimal number of iterations (or trees), using an out-of-bag estimate of the improvement in predictive performance, which evaluates the reduction in deviance on the observations not used in selecting the next regression tree. The minimum number of observations in the trees' terminal nodes was set to 1, the shrinkage factor applied to each tree in the expansion to 0.001, and the fraction of the training set observations randomly selected to propose the next tree in the expansion to 0.5. With these settings, boosting regression trees with at most 8-way interactions between SNPs required 3656 iterations for the training dataset, based on inspecting graphical plots of the out-of-bag change in squared-error loss against the number of iterations [9].
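A minimal sketch of such a fit with *gbm* under the stated settings; the data frame `dat` (phenotype `y` plus SNP covariates) and the total number of iterations grown (5000, an upper bound from which the OOB criterion selects) are assumptions:

```r
library(gbm)

set.seed(1)  # arbitrary seed; bag.fraction < 1 makes the fit stochastic
gbm_fit <- gbm(y ~ ., data = dat,
               distribution      = "gaussian",  # squared-error loss
               n.trees           = 5000,        # assumed upper bound on iterations
               interaction.depth = 8,           # allows up to 8-way SNP interactions
               shrinkage         = 0.001,       # shrinkage factor per tree
               bag.fraction      = 0.5,         # fraction of training set per tree
               n.minobsinnode    = 1)           # minimum observations per terminal node

## Out-of-bag estimate of the optimal number of iterations; gbm.perf
## also plots the OOB change in squared-error loss against iterations.
best_iter <- gbm.perf(gbm_fit, method = "OOB")
```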

### Support Vector Machines (SVMs)

SVMs perform robustified regression for quantitative responses by exploiting the relationships between observations, representing the predictors in observation space through a set of inner products. For regression with a quantitative response, SVM uses the model

$$f(x) = h(x)^{T}\beta + \beta_{0},$$

where the basis functions, *h*(*x*)^{T}, which can be linear (or nonlinear) transformations of one (or more) predictors (*x*), are additively combined with the vector of weights (*β*). We used the “*ε*-insensitive” SVM regression, which ignores residuals smaller in absolute value than some constant (*ε*) and applies a linear loss function to larger residuals. This is a robustified regression for which the minimization exercise can be written in regularized sum of squares form [5, 6] as:

$$\min_{\beta_{0},\, \beta}\; \sum_{i=1}^{N} V_{\varepsilon}\big(y_{i} - f(x_{i})\big) + \frac{\lambda}{2}\, \lVert \beta \rVert^{2},$$

where

$$V_{\varepsilon}(r) = \begin{cases} 0 & \text{if } |r| < \varepsilon, \\ |r| - \varepsilon & \text{otherwise}, \end{cases}$$

is an “*ε*-insensitive” error measure, ignoring errors less than *ε*, *λ* is a positive constant that controls the trade-off between the approximation error and the amount up to which deviations larger than *ε* are tolerated in the solutions of the SVM regression problem, *y* is a quantitative response and $\lVert \cdot \rVert$ denotes the norm under a Hilbert space. The SVM optimization procedure produces solution functions of the form [5, 6]:

$$\hat{f}(x) = \sum_{i=1}^{N} \big(\hat{\alpha}_{i}^{*} - \hat{\alpha}_{i}\big)\, K(x, x_{i}),$$

where the $\hat{\alpha}_{i}$, $\hat{\alpha}_{i}^{*}$ are positive weights given to each observation and estimated from the data, and the inner product kernel *K*(*x*_{i}, *x*_{j}) is an *N* × *N* symmetric and positive definite matrix [5]. Typically only a subset of the weights are nonzero, and the associated observations are called support vectors, hence the name support vector machines. Since the solution depends on the input values only through the inner products *K*(*x*_{i}, *x*_{j}), a flexible fit is achieved by transforming the cross-products with a kernel function (*K*(*x*_{i}, *x*_{j})) that alters how two observations are related to each other.

We used the *ε*-insensitive SVM regression with a linear kernel to predict GEBVs in the R package *e1071* [8], with an insensitivity zone of *ε* = 10 and a regularization (cost) parameter (*λ* > 0) of *λ* = 0.001 determined by grid search.
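A minimal sketch of such a fit with *e1071*; the covariate matrix `X`, phenotype vector `y` and validation genotypes `X_val` are assumed names, not from the original study:

```r
library(e1071)

svm_fit <- svm(x = X, y = y,
               type    = "eps-regression",  # epsilon-insensitive regression
               kernel  = "linear",
               epsilon = 10,                # insensitivity zone
               cost    = 0.001)             # regularization (cost) parameter

gebv_pred <- predict(svm_fit, X_val)  # predicted GEBVs for new genotypes
```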

### Assessing prediction performance

We quantified the predictive accuracy of each method using 5-fold cross-validation, computing the Pearson correlation between the simulated values and the predicted GEBVs in the validation set, and between the predicted and true breeding values (TBVs) for the non-phenotyped individuals constituting the fifth generation. The training and validation sets contained 60 and 15 crosses, respectively, and together encompassed all phenotyped individuals except the 20 founders.
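A minimal sketch of the accuracy computation; `fit_fun` is a hypothetical wrapper around any of the three fitted methods, and folds are assigned per individual here for simplicity, whereas the study assigned whole crosses to folds:

```r
## Minimal sketch: 5-fold cross-validated accuracy as the Pearson
## correlation between observed values and predictions.
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(X)))  # random fold labels
acc <- numeric(k)
for (f in seq_len(k)) {
  train  <- folds != f
  fit    <- fit_fun(X[train, ], y[train])               # e.g. RF, gbm or SVM
  acc[f] <- cor(y[!train], predict(fit, X[!train, ]))   # Pearson correlation
}
mean(acc)  # average cross-validated accuracy
```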