QTLMAS 2009: simulated dataset
© Coster et al; licensee BioMed Central Ltd. 2010
Published: 31 March 2010
Skip to main content
© Coster et al; licensee BioMed Central Ltd. 2010
Published: 31 March 2010
The simulation of the data for the QTLMAS 2009 Workshop is described. Objective was to simulate observations from a growth curve which was influenced by a number of QTL.
The data consisted of markers, phenotypes and pedigree. Genotypes of 453 markers, distributed over 5 chromosomes of 1 Morgan each, were simulated for 2,025 individuals. From those, 25 individuals were parents of the other 2,000 individuals. The 25 parents were genetically related. Phenotypes were simulated according to a logistic growth curve and were made available for 1,000 of the 2,000 offspring individuals. The logistic growth curve was specified by three parameters. Each parameter was influenced by six Quantitative Trait Loci (QTL), positioned at the five chromosomes. For each parameter, one QTL had a large effect and five QTL had small effects. Variance of large QTL was five times the variance of small QTL. Simulated data was made available at http://www.qtlmas2009.wur.nl/UK/Dataset/.
In this article, we describe the simulation and the resulting data for the QTLMAS (Quantitative Trait Loci mapping and Marker Assisted Selection) Workshop 2009. An objective of the Workshop was to compare methods for detection of Quantitative Trait Loci (QTL) and methods for calculating breeding values with markers distributed over the whole genome, MEBV. The data represented measurements of a time-dependent trait, influenced by QTL, which could represent body mass of growing animals or biomass accumulated during growth of a crop.
The data consisted of phenotypes, biallelic single nucleotide polymorphism (SNP) genotypes, and family information. Phenotypes were simulated and made available for a subset of the simulated individuals at five different time points along the growth trajectory. SNP were distributed over the whole simulated genome. Some SNP were in linkage disequilibrium (LD) with QTL, but QTL information was not provided in the dataset. SNP genotypes were made available for the phenotyped and for the non phenotyped individuals. Simulated individuals were genetically related due to the small number of parents used. Data was made available at http://www.qtlmas2009.wur.nl/UK/Dataset/. The simulated dataset is available to be used as benchmark for methods that attempt to model QTL or breeding values related to growth functions.
Simulated genomes consisted of five chromosomes of 1 Morgan each. At each Morgan, 10,000 loci were simulated. In the base population, allele frequency of 2,000 loci, equally distributed over the five chromosomes, was set at 0.5 and the remaining 48,000 loci were monomorphic in the base population. Fifty gametes were simulated for the base population according to these allele frequencies and these gametes were randomly combined into 25 genotypes.
Then, a thousand generations were simulated to create Linkage Disequilibrium (LD) between loci and to achieve a situation of mutation drift equilibrium. Two meioses were simulated for each genotype in a previous generation to maintain an effective population size of 49 . Genotypes for the next generation were formed by combining random pairs of these gametes, while avoiding selfing.
The total number of recombinations in each meiosis event was drawn from a Poisson(5) distribution and recombination positions were distributed over the whole genome without interference. Throughout the 1,000 generations, mutation rate for all loci was 1 · 10-5, regardless of whether a mutation had occurred previously at this locus or not. A mutation did always reverse the allele: mutation of a 1 allele produced a 0 allele and mutation of a 0 allele produced a 1 allele.
Eighteen QTL were assigned to loci with MAF above 0.10 in generation 1,001. Additive effects for each QTL were calculated according to the variance required for that QTL and the allele frequency of that locus . Variance of three QTL was set five times larger than variance of the remaining 15 QTL. Loci polymorphic in generation 1,001 to which no QTL was assigned became biallelic markers.
where Ø1 is the asymptotic yield, Ø2 is the inflection point of the curve and Ø3 is the slope of the curve. Figure 2 displays a growth curve and includes a graphical interpretation of the three parameters.
There were 453 polymorphic markers in the data, distributed over the genome. Average MAF of these polymorphic markers was 0.14 and average LD between flanking markers measured as r2 was 0.14. The fraction of QTL variance explained by the 10 markers in highest LD with each QTL was 0.83.
Objective of this simulation was to provide time-dependent phenotypes and marker data to the participants of the QTLMAS 2009 Workshop. Population size of 50 was used to fit with estimated population sizes of some common crop species. The density of polymorphic markers was much lower than currently assumed appropriate for genomic selection purposes [5, 6], and was aimed not to impede other types of analysis. Simulations were performed in R using package HaploSim. Integration of simulations in a computation environment as R facilitates evaluation of simulation results using graphical and statistical functions provided in the environment. Programming simulations in using package HaploSim is relatively straightforward and the programmer has a high level of control over simulation results. Gametes were the basic simulation unit in this simulation because of the design of HaploSim.
With the objective to achieve a mutation drift equilibrium in relatively few generations of random mating, allele frequency of 2,000 loci was set at 0.5 in the base population. The number of generations of random mating required before reaching mutation drift equilibrium would be importantly higher if all loci were monomorphic in the base generation. On the other hand, setting allele frequency of all loci at 0.5 was not practical because computation time of HaploSim increases with the number of polymorphic loci.
Mutation rate was 1 · 10-5 throughout the generations of random mating and this was justified as follows. In our simulations, loci correspond to base pairs. As an approximation, one centimorgan corresponds to 1 · 106 base pairs ; one locus in our simulations thus corresponds to 1 · 104 base pairs. Estimates for mutation rates in human are in the range 1 · 10-8 - 1 · 10-7[8, 9]. Mutation rate in our simulations was therefore a factor 10 to 100 lower than base pair mutation rate estimated in human, with the objective to maintain a relatively low number of heterozygous SNP loci (equation 7.2.4, page 323, ). Mutation rate was set to 0 after generation 1,001 to avoid a large number of SNP with MAF equal to 1/4,000 (4,000 haplotypes were simulated for generation 1,002).
Phenotype data were calculated according to a logistic growth curve. A correct identification of this growth curve is expected to be crucial for successful identification of QTL involved in the simulated phenotypes. Conclusions about QTL position and QTL effect can be expected to be dependent upon this identification method . This issue is further discussed in , in this issue.
Only additive QTL effects were simulated because methodology for QTL analyses with response curves is still limited (e.g. [10, 12–18]). Simulations could easily be extended to scenarios involving dominance or epistatic interactions, however the objective of this Workshop was to look at the impact of having a time dependent trait on QTL mapping and estimation of breeding values and we did not want to mix complications of variable nature.
The code used to simulate the data can be obtained from the authors. Package HaploSim can be downloaded from the repository of R packages CRAN, http://cran.r-project.org/package=HaploSim, following the usual method to install R packages.
This article has been published as part of BMC Proceedings Volume 4 Supplement 1, 2009: Proceedings of 13th European workshop on QTL mapping and marker assisted selection.
The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/4?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.