A selective genotyping approach identifies QTL in a simulated population

Background: Identifying QTLs for important phenotypic traits with medium-density genome-wide SNP panels is one of the most challenging areas in animal genetics. It avoids the time-consuming direct sequencing of putative candidate genes when searching for the mutations that affect a trait. Appropriate statistical analyses allow the identification of genomic regions associated with the investigated trait in the genotyped population.
Methods: The selective genotyping technique was applied to 1000 genotyped animals with known phenotype. Sliding windows of five consecutive SNPs were created for each chromosome; we assumed that the QTLs were located in the windows showing the largest difference in allele frequencies between the most divergent productive groups (the two tails of the distribution).
Results: Ten windows affected at least one trait. For five of these windows, the largest significant effect was due to a single SNP, which could therefore be taken as the QTL itself.
Conclusions: In this study we proposed a simple method to identify genomic regions associated with the phenotype under study. Identifying the DNA region is the first step in the search for the mutation that is actually responsible for the trait variability, through direct sequencing of the genomic regions that harbor the QTL.


Background
The recent availability of genome-wide SNP panels, which make it possible to evaluate the variation in SNP allele frequencies between populations, has enabled the detection of genomic regions subject to positive selection in humans and cattle [1][2][3][4][5]. For the identification of selective sweeps for milk traits, the selective genotyping strategy has been applied efficiently to QTL mapping in dairy cattle [6], swine [7] and sheep [8]. In these cases, the most divergent individuals for a trait (the two tails of the distribution) are chosen and genotyped. Boligon et al. [9] compared selective genotyping strategies for the prediction of breeding values in a population undergoing selection, and concluded that animals with extreme yield deviation values in a reference population are the most informative when training genomic selection models. Using the selective genotyping approach, Moioli et al. [8] identified two novel non-synonymous mutations associated with milk yield in sheep, and demonstrated their effect in independent populations as well.
In the present study, we hypothesized that selective sweeps detected in a simulated population would be useful for mapping QTLs for the trait under selection in the whole population.

Dataset
Three milk production traits were simulated in a population of 3,000 females, included in a data set of 4,100 individuals of 4 different generations (G0 to G4) with known pedigree. Genotypes of the females and of their parents at 10,000 SNPs, evenly distributed over 5 chromosomes, were available. A detailed description of the population is reported by Usai et al. [10].

Statistical analysis
The selective genotyping technique was simulated on the females of generation 3 (1000 females), on the assumption that they had benefited most from the selection. Their production is reported in Table 1. Allele frequencies at each SNP of each chromosome were calculated separately, for each trait, for the group whose production was below -1 standard deviation and for the group whose production was above +1 standard deviation. The number of animals in each group is also reported in Table 1. The QTLs so hypothesized might be affected by the number of individuals included in the production tails, and this effect depends on the additive relationship between them, which might not represent the average relationship of the whole population. Habier et al. [11], in the context of predicting genomic breeding values (GEBV), reported that the additive-genetic relationships between the training individuals and a selection candidate, as captured by SNPs, affect the GEBV accuracy of that candidate. Therefore, in the present study, the coefficients of relationship between the individuals of each tail, as well as those of the whole population, were calculated as in Wright [12] using Proc Inbreeding in SAS [13].
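The tail selection and the per-group allele frequencies described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the toy data, the 0/1/2 genotype coding and the standardization of phenotypes against the sample mean and standard deviation are all assumptions made for illustration.

```python
from statistics import mean, stdev

def split_tails(phenotypes, threshold=1.0):
    """Return the indices of animals below -threshold SD and above +threshold SD.

    Assumption: phenotypes are standardized against the sample mean and SD.
    """
    mu = mean(phenotypes)
    sd = stdev(phenotypes)
    low = [i for i, y in enumerate(phenotypes) if (y - mu) / sd < -threshold]
    high = [i for i, y in enumerate(phenotypes) if (y - mu) / sd > threshold]
    return low, high

def allele_frequencies(genotypes, animals):
    """Frequency of the counted allele at each SNP within a group of animals.

    genotypes[i][j] is the number of copies (0, 1 or 2) of the counted
    allele carried by animal i at SNP j (hypothetical coding).
    """
    n_snps = len(genotypes[0])
    return [
        sum(genotypes[i][j] for i in animals) / (2 * len(animals))
        for j in range(n_snps)
    ]

# Toy data (hypothetical): 6 animals, 2 SNPs.
phenotypes = [5.0, 18.0, 20.0, 21.0, 22.0, 35.0]
genotypes = [[0, 2], [0, 1], [1, 1], [1, 1], [2, 0], [2, 1]]

low, high = split_tails(phenotypes)
freq_low = allele_frequencies(genotypes, low)
freq_high = allele_frequencies(genotypes, high)
print(low, high, freq_low, freq_high)
```

In this toy example only the first and the last animal fall in the tails, so each group frequency is based on a single animal; with real data each tail would contain the animals listed in Table 1.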
The QTL effect was subsequently estimated using sliding windows of five consecutive SNPs, calculated for each of the five chromosomes. The number of markers per window was chosen because the SNP density of the simulated population was similar to the average SNP density of the cattle panel used by Stella et al. [2], who suggested that sliding windows of 5, 9 and 19 SNPs give similar results when searching for selective sweeps in cattle.
For each window, the sum of the absolute differences in allele frequency between the two productive groups, over the SNPs of the window, was calculated; the sliding windows were then ranked by this parameter within each chromosome. We arbitrarily hypothesized that the potential QTL for the considered trait was located in the top-ranking window. Because the selective genotyping was performed separately for the three traits, the potential QTLs could be located in different windows; for this reason, more than one window on the same chromosome was considered in the subsequent analyses.
The top-ranking sliding windows, harboring the hypothesized QTLs, as well as the potentially affected traits, are reported in Table 2.
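The window score and ranking described above can be sketched as follows. This is an illustrative Python example with hypothetical allele frequencies for seven SNPs on one chromosome; the function names are assumptions and the windows overlap, advancing one SNP at a time.

```python
def window_scores(freq_low, freq_high, window=5):
    """Sum of absolute allele-frequency differences in each sliding window.

    Window s covers SNPs s .. s + window - 1 (0-based positions).
    """
    diffs = [abs(a - b) for a, b in zip(freq_low, freq_high)]
    return [sum(diffs[s:s + window]) for s in range(len(diffs) - window + 1)]

def top_window(scores):
    """Start position of the top-ranking window on the chromosome."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical allele frequencies in the two tails for 7 consecutive SNPs.
freq_low = [0.50, 0.40, 0.10, 0.20, 0.30, 0.60, 0.50]
freq_high = [0.50, 0.50, 0.60, 0.70, 0.80, 0.70, 0.50]

scores = window_scores(freq_low, freq_high)
best = top_window(scores)
print(scores, best)
```

Seven SNPs yield three windows of five SNPs; in this toy example the second window (starting at position 1) has the largest summed difference and would be retained as the candidate region.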
Estimation of the QTL effect for the whole window of 5 SNPs
The QTL effect was calculated on the whole recorded population as follows.
For each sliding window, the most probable haplotype alleles were calculated using the EM algorithm [14], through Proc Haplotype in SAS [13], and were assigned to each phenotyped individual (n = 3000).
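To illustrate the principle of the EM step, the following Python sketch estimates haplotype frequencies for a window reduced to two biallelic SNPs, where the only phase-ambiguous genotype is the double heterozygote. It is a simplification under stated assumptions, not a reimplementation of Proc Haplotype: the study used five-SNP windows, and the function name, the genotype coding and the fixed number of iterations are hypothetical.

```python
# Alleles carried at one SNP for genotype codes 0, 1 and 2.
ALLELES = {0: (0, 0), 1: (0, 1), 2: (1, 1)}

def em_haplotype_freqs(genotypes, n_iter=50):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each the count (0, 1 or 2) of allele '1'.
    Returns a dict with keys '00', '01', '10' and '11'.
    """
    haps = ["00", "01", "10", "11"]
    # Haplotype counts from genotypes whose phase is unambiguous.
    base = {h: 0.0 for h in haps}
    n_double_het = 0
    for g1, g2 in genotypes:
        if (g1, g2) == (1, 1):
            n_double_het += 1
            continue
        for a, b in zip(ALLELES[g1], ALLELES[g2]):
            base[f"{a}{b}"] += 1
    total = 2 * len(genotypes)
    freqs = {h: 0.25 for h in haps}
    for _ in range(n_iter):
        # E step: probability that a double heterozygote carries 00/11.
        cis = freqs["00"] * freqs["11"]
        trans = freqs["01"] * freqs["10"]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M step: expected haplotype counts, then updated frequencies.
        counts = dict(base)
        counts["00"] += n_double_het * p_cis
        counts["11"] += n_double_het * p_cis
        counts["01"] += n_double_het * (1 - p_cis)
        counts["10"] += n_double_het * (1 - p_cis)
        freqs = {h: counts[h] / total for h in haps}
    return freqs

# Toy data (hypothetical): 4 animals 0/0, 4 animals 2/2, 2 double heterozygotes.
genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
freqs = em_haplotype_freqs(genotypes)
print(freqs)
```

Because only the 00 and 11 haplotypes are observed in the unambiguous animals, the EM iterations assign the double heterozygotes almost entirely to the 00/11 phase.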
For each haplotype allele with a frequency ≥ 0.07 in the recorded population, the allelic substitution effect was estimated as a covariate on each trait, as in Sherman et al. [15], with the following model:
y = b(haplotype allele) + e
where y = trait1, trait2 or trait3, b is the allelic substitution effect and e is the residual. Alleles were coded as the number of copies carried by the individual: two copies of the same allele = 2; one copy = 1; no copy = 0.
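The regression of a trait on the number of copies of a haplotype allele can be sketched in Python as an ordinary least-squares fit. The example is illustrative only: the function name and the data are hypothetical, and the inclusion of an intercept is an assumption, since the model above is written without one.

```python
def substitution_effect(allele_counts, trait):
    """Least-squares fit of trait = a + b * allele_count.

    Returns (a, b); b is the estimated allelic substitution effect.
    Assumption: an intercept a is fitted alongside the slope.
    """
    n = len(trait)
    mean_x = sum(allele_counts) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(allele_counts, trait))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Toy data (hypothetical): copies of one haplotype allele and one trait.
allele_counts = [0, 1, 2, 0, 1, 2]
trait = [10.0, 12.0, 14.0, 10.5, 12.5, 14.5]

a, b = substitution_effect(allele_counts, trait)
print(a, b)
```

In this toy example each additional copy of the allele raises the trait by 2 units, so the estimated substitution effect b is 2.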
To account for multiple testing, the corrected probability of the effect was estimated using the False Discovery Rate test with Proc Multtest in SAS [13]. Under the hypothesis that one SNP of each haplotype was expected to have a major effect on the recorded trait, direct observation of the haplotype alleles that showed a highly significant effect (P < 0.00001) on one trait allowed the selection of one SNP at which the two alleles showed opposite effects on that trait. For each of those SNPs, the allelic substitution effect was estimated as a covariate on each trait, with the same model as for the estimation of the haplotype allele effect.
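The multiple-testing correction can be illustrated with the Benjamini-Hochberg step-up procedure. The text above names only the False Discovery Rate test, so the choice of this particular procedure is an assumption; the function name and the P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that the adjusted
    # values never decrease as the raw P values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P values for four haplotype alleles.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

Each raw P value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone; here the two smallest raw values are both adjusted to about 0.02 and the two largest to about 0.04.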

Results
Because the selective genotyping strategy was performed separately for the three traits, the statistically significant windows varied depending on the considered trait (Table 2). The average additive relationship values of the selected tails, for each trait, were very similar to one another (Table 3), ranging from 4.26 to 4.37%, but they were higher than the corresponding value calculated for the whole population (3.01%). For all tested haplotypes, the FDR-corrected probabilities of the allelic substitution effects are reported in Table 4.
Direct observation of the haplotype alleles that showed a significant effect on one trait made it possible to identify which SNP, within the haplotype allele, might have been directly responsible for the trait variability. Table 5 reports only the SNPs that showed a highly significant (P < 0.0001) allelic substitution effect. These SNPs, located on chromosomes 1, 3 and 4, might themselves be considered the QTLs influencing the relevant trait.

Discussion
In this study, two assumptions were arbitrarily made. The first was that the selective genotyping strategy was successful for QTL mapping. Although the literature reports evidence of the suitability of this strategy [9], the decision as to which animals should be considered highly divergent for each trait was a choice of the authors. Therefore, the results obtained, both in the number and in the position of the QTLs, might have been different if more or less restrictive thresholds had been chosen. The additive relationship values of the selected tails, for each trait, were very similar to one another, ranging from 4.26 to 4.37%, but they were higher than the corresponding value calculated for the whole population (3.01%). To appraise the extent of the difference in average relationship between the tails and the whole population, it is useful to cite Vahlsten et al. [16], who reported that an increase in relationship of 0.96 percentage units per generation is to be considered slow; this value refers to Friesian bulls born over 40 years and belonging to a population of over 400,000 animals. It can therefore be inferred that the relationship differences observed in the present study merely reflect the generational trend.
The second assumption was that the QTL was located within a haplotype of 5 consecutive SNPs. Weller and Ron [17] underlined the importance of the extent of LD in the application of genome scans to breeding programs. These authors noted that population-wide LD extends, in dairy cattle, over less than 1 cM, i.e. a much shorter distance than the genetic linkage within families, which extends over tens of centimorgans. It is therefore possible that the hypothesis that the QTL was located within the haplotype with the highest effect on each trait was not the most appropriate for this study, given that the analyzed population was a simulated sample. However, because the sliding windows encompass consecutive markers, the choice of the top-ranking window for each trait seemed appropriate, because it allowed the identification of single SNPs (Table 5) with a very highly significant effect on one trait, the probability for some of them being < 1.0E-16.

Conclusions
In this study we proposed a simple method to identify genomic regions associated with the phenotype under study, regions that could therefore be taken into account as potential QTLs. Identifying the DNA region is the first step toward identifying the mutation that is actually responsible for the variability of the trait, through direct sequencing of the genomic regions that harbor the QTL. The precision of the QTL estimation can vary depending on the deviation values used in the reference population to define which animals are extremely divergent.