A two-step approach combining the Gompertz growth model with genomic selection for longitudinal data.

BACKGROUND
We used the Gompertz growth curve to model a simulated longitudinal dataset provided by the QTLMAS2009 workshop and applied genomic evaluation to the derived model parameters and to a model-predicted trait value.


RESULTS
Prediction of phenotypic information from the Gompertz curve allowed us to obtain genomic breeding value estimates for a time point with no phenotypic records. Despite that the true model used to simulate the data was the logistic growth model, the Gompertz model provided a good fit of the data. Genomic breeding values calculated from predicted phenotypes were highly correlated with the breeding values obtained by directly using the respective observed phenotypes. The accuracies between the true and estimated breeding value at time 600 were above 0.93, even though t600 was outside the time range used when fitting the data. The analysis of the parameters of the Gompertz curve successfully discriminated regions with QTL affecting the asymptotic final value, but it was less successful in finding QTL affecting the other parameters of the logistic growth curve. In this study we estimated the proportion of SNPs affecting a given trait, in contrast with previously reported implementations of genomic selection in which this parameter was assumed to be known without error.


CONCLUSIONS
The two-step approach used to combine curve fitting and genomic selection on longitudinal data provided a simple way for combining these two complex tasks without any detrimental effect on breeding value estimation.

Results: Prediction of phenotypic information from the Gompertz curve allowed us to obtain genomic breeding value estimates for a time point with no phenotypic records. Despite that the true model used to simulate the data was the logistic growth model, the Gompertz model provided a good fit of the data. Genomic breeding values calculated from predicted phenotypes were highly correlated with the breeding values obtained by directly using the respective observed phenotypes. The accuracies between the true and estimated breeding value at time 600 were above 0.93, even though t600 was outside the time range used when fitting the data. The analysis of the parameters of the Gompertz curve successfully discriminated regions with QTL affecting the asymptotic final value, but it was less successful in finding QTL affecting the other parameters of the logistic growth curve. In this study we estimated the proportion of SNPs affecting a given trait, in contrast with previously reported implementations of genomic selection in which this parameter was assumed to be known without error.

Conclusions:
The two-step approach used to combine curve fitting and genomic selection on longitudinal data provided a simple way for combining these two complex tasks without any detrimental effect on breeding value estimation.

Background
A longitudinal trait is a composite of phenotypes recorded over time which have a complex genetic correlation structure. Different types of non-linear functions have been used to model a time-dependent trait and dissect its genetic components. For instance, the Gompertz model has been used for analysing the polygenic components [1] and growth QTL [2] for live weight in sheep. Genomic selection (GS) commonly refers to a new class of methods for genetic evaluation using very dense marker maps covering the entire genome [3]. The overall trend so far has been that GS increases the accuracy of the breeding values, especially for those individuals without phenotypic information.
The objective of this study was to estimate genomic breeding values for the trait at time 600 (t600), which resided outside the range of longitudinal yield data provided by the QTLMAS2009 workshop. We implemented a two-step procedure in which first the Gompertz function was fitted to the data for each individual and, then genomic selection was performed on the predicted phenotype at t600 and on the parameter estimates derived from the fitted Gompertz curve.

Data
The data provided by QTLMAS2009 is fully described in [4]. It consisted of 100 full-sib families, each with 20 offspring. Half of the offspring (training set) have both phenotype information of yield at 5 distinct time points (0, 132, 25, 397, 530) and genotype data on 453 SNP markers across 5 Morgans. The remaining offspring (candidate set) had only genotype information.

Procedure
To obtain genomic breeding values for t600, we used an approach composed by two independent steps: Firstly, a Gompertz growth curve was used to model the performance records across time, and to estimate the model descriptors (A, B, C) which best fit the phenotypes of each individual. Secondly, genomic evaluation was applied to obtain genomic estimated breeding values (GEBVs) for t600 using two different methods: I) estimating GEBVs for the model parameters (A, B, C; i.e. 3 GEBVs per individual) and using them to estimate the breeding value for t600 from the Gompertz function; II) predicting the phenotypes at t600 from evaluating the Gompertz function with the estimated parameters and later applying genomic selection on the predicted t600 phenotypes.

Growth model
The Gompertz equation is of the form: y(t) = Ae {-e[Be(C-t)/A]} , where y(t) is the yield at time t; A the final yield; B the maximum growth rate and C the age at maximum growth rate. The curve fitting was implemented using nonlinear regression in SAS [5]. The Gompertz function was fitted to each individual separately to estimate individual model parameters A, B, C. Subsequently, the fitted individual equations were used to predict the trait at t600 (or t600 GEBVs if using the parameter GEBVs).

Genomic evaluation
A Bayes B type of analysis was used as first described by Meuwissen et al. [3]. Under a Bayesian framework the model accounts for the fact that not all SNPs affect the trait in question. The model assumed in the method is: where y is the vector of phenotypes; b contains the fixed effects and X is its incidence matrix; α i is the allelic substitution effect for SNP i; g i is the vector of genotypes (1, 2 & 3 for genotypes 00, 10/01 and 11, respectively) for SNP i; and e the vector of residuals distributed N(0,  e 2 ). The allelic substitution effects a for each SNP are assumed to be from a mixture distribution with probability π of having an effect on the trait and with probability (1-π) of not affecting the trait at all. If the SNP is affecting the trait, its allelic substitution is The implementation of the model was done using Gibbs sampling. The parameters  e 2 ,  snp 2 and π were also calculated in the analysis using flat priors. So far, the implementations of Bayes B reported in the literature have not estimated π, but assumed it was known without error.
For each analysis, a MCMC chain was run and the first 10000 cycles were discarded as burn-in period. Following this, 10000 realisations were collected, each separated by 50 cycles between consecutive realisations. The posterior mean was used as the estimate for each parameter of interest.

Growth model parameters
The Gompertz model provided a good fit of the data (see additional files 1 and 2) with the curve fitted for each individual being statistically significant. To further test how well the Gompertz curve fitted the phenotypic data, phenotypic values were predicted at all 5 time points for which observed phenotypic data was available. The Pearson and Spearman correlations between the true and predicted phenotypic values at t530 were above 0.99, with similar high correlations obtained for the other 4 time points. These high correlations remained when comparing the GEBVs calculated for both the true and predicted phenotypes.

Estimation of GEBVs for the parameters of the Gompertz curve
Univariate analyses were performed to each of the three parameters of the Gompertz function. The correlations between the univariate GEBVs for the three parameters were high (correlations between GEBVs for A-B, A-C and B-C were 0.97, 0.71 and 0.59, respectively. The posterior means of π for A, B and C were 0.059, 0.082 and 0.219, respectively. The posterior probabilities for the SNPs having an effect on the parameters A, B and C, and their estimated allelic substitution effects are shown in Figures 1 and 2, respectively. The results suggest that parameters A and B are affected by the same SNPs, with some others affecting parameter C. This is consistent with the high correlation between GEBV for A and B

Estimation of GEBVs for the trait at a given time point
The GEBVs for t600 obtained by evaluating the Gompertz function with GEBVs for A, B and C (method I) were very similar to those calculated from method II which evaluated the predicted performance at t600 (see Figure 3). The correlation between both approaches for calculating GEBVs was 0.99, with GEBVs from method I having slightly larger variance. The GEBVs obtained from the genomic selection on the predicted trait at t600 show a very similar trend as found for parameters A and B, with the same SNPs of large effect found for A and B also affecting t600 (see additional file 3). The estimate of π for t600 was 0.048.
Comparison with the true model used to simulate the data The true model used to simulate the data was the logistic growth curve (see [4]) described as     (Table 1. For correlations within training and candidate sets see additional file 4). Moreover, the analysis of parameters A and B showed that the SNPs with the highest probability of having an effect on the trait were located at the positions where the six QTL affecting F 1 were simulated. Locations containing QTL for parameters F 2 and F 3 were less associated with QTL affecting the parameters of the Gompertz curve.
Despite that the Gompertz curve was not the true model, its use provided very accurate GEBVs for t600. The correlation between the true breeding values and GEBVs are presented in Table 1 and additional file 4. Both methods of estimating GEBVs yielded similar accuracy. The Pearson and Spearman correlations between true and estimated breeding value with methods I and II for all individuals in the pedigree were above 0.93.
In this study, the proportion of SNPs affecting the trait, π, was estimated in the analysis. This contrasts with previously reported implementations of Bayes B where π was assumed to be known without error. The π values were slightly overestimated, partly due to the low linkage disequilibrium between SNPs (average r 2 between consecutive SNP was 0.15) and also to the fact that the Gompertz function was not the true model. However, the success in estimating such an important parameter from the data itself, even when assuming a uniform prior, provides an improvement in genomic evaluation relative to assuming that π is known without error.

Conclusions
The two-step approach of growth model fitting and genomic selection on model parameters and on predicted phenotype appeared to be a simple and reliable strategy. Despite that the Gompertz curve was not the true model used to simulate the data, the correlations between true and estimated breeding values at t600 were very high (Pearson and Spearman correlations above 0.93). The approach of estimating GEBVs for phenotype at a time of interest using GEBVs of the three parameters and evaluating the Gompertz function could be beneficial when GEBVs are needed for different time points. In this study, the proportion of SNP affecting Figure 3 Scatter plot of t600 GEBVs calculated from evaluating the Gompertz function using GEBVs of A, B, C (x-axis) and calculated from Genomic selection on the predicted phenotype at t600 (y-axis). The GEBV not scaled on the same mean. the trait was estimated from the data, contrasting with previous implementation of genomic selection where this proportion has been assumed to be known without error. The results from this study showed that separate implementation of the growth modelling process and genomic evaluation provided huge simplification of the methodology with no detrimental effect on the final results.
Additional file 1: Estimated means and standard errors for Gompertz model parameters and predicted weight at point 600 [ http://www.biomedcentral.com/content/supplementary/1753-6561-4-S1-S4-S1.pdf ] Additional file 2: Non-linear distribution of yield of individuals across time (dots) and the average growth curve obtained after fitting the Gompertz model to the yield data (solid line).