Efficient detection of QTL with large effects in a simulated pig-type pedigree using selective genotyping.

BACKGROUND
The ultimate goal of QTL studies is to find causative mutations, which requires additional expression studies. Given the limited amount of time and funds, the smart option is to identify the most important QTL with minimal effort. A cost-effective solution is to genotype only those animals with high or low phenotypic values, or DNA pools of these individuals. A two-stage genotyping strategy was applied to samples in the tails of the distribution of breeding values.


RESULTS
The tail-analysis approach identified eight of the 19 QTL in the first stage. The 19 QTL together explained 98% of the genetic variance, and the eight detected QTL accounted for about half of that. Four additional QTL with small effects were found in the second stage.


CONCLUSION
The two-stage genotyping strategy with selective genotyping detected regions with highly significant QTL useful for further fine-mapping. The large reduction in costs allows for follow-up expression and functional studies.


Background
Discovery and subsequent validation of causative mutations affecting complex traits require identification and fine-mapping of QTL followed by expression and functional studies. Given the limited amount of time and funds, the challenge is to identify the most important QTL with minimal effort.
A cost-effective strategy is to reduce genotyping costs by only genotyping individuals with high and low phenotypic values, or to genotype pools of these individuals. Tail analysis, bulked segregant analysis and selective DNA pooling have been advocated by Hillel et al. [1], Michelmore et al. [2] and Darvasi and Soller [3]. More recently Korol et al. [4] improved on the latter method by studying fractioned DNA pooling. Genotyping tails or pools has disadvantages: the number of traits that can be studied with the selected genotypes is limited, because separate high/low tails or pools have to be made for each trait, and haplotype information is not used optimally. Wang et al. [5] improved on statistical methods developed by Dekkers [6] for interpretation of results obtained by DNA pooling.
Commercial breeding pedigrees present a situation where phenotypes are abundant, across many generations. In such a situation, selective genotyping is an important step in setting up a cost-effective QTL study. This study implements a two-stage strategy. First, genotypes on a large SNP panel are obtained for highly informative individuals, that is, individuals with extreme breeding values. High and low phenotype animals are selected within each sire-dam pairing in order to control for stratification.
The objective is to identify major segregating QTL in a simulated pig-type pedigree with minimal effort both in terms of genotyping and analysis.

Methods
In a four-generation pedigree, 45 sires produced 100 offspring each. Each sire was mated to 10 dams with 10 progeny each. Sires and dams of the base generation were unknown. All 4665 animals were phenotyped for a quantitative trait (TRT). Six thousand equally spaced (0.1 cM) SNPs were available for genotyping, located on 6 chromosomes of 100 cM each. A full description of the dataset can be found at the website of the XIIth QTLmas workshop [7].

Stage 1
For each sire, the offspring with the highest and the lowest estimated breeding value (EBV) within a set of full sibs (i.e. per dam) were included in the high tail (H-tail) and the low tail (L-tail), respectively. Since there were 10 dams per sire, there were 10 animals in either tail for each sire. Only sires with progeny that have phenotype records were used.
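The tail selection described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the record layout `(animal_id, sire_id, dam_id, ebv)` and the function name `select_tails` are assumptions made for the example.

```python
from collections import defaultdict


def select_tails(records):
    """Pick the highest- and lowest-EBV progeny within each full-sib family.

    records: iterable of (animal_id, sire_id, dam_id, ebv) tuples
             (hypothetical layout, chosen for this sketch).
    Returns {sire_id: {"high": [animal_id, ...], "low": [animal_id, ...]}}.
    """
    # Group progeny by sire-dam pairing, i.e. by full-sib family.
    families = defaultdict(list)
    for animal, sire, dam, ebv in records:
        families[(sire, dam)].append((ebv, animal))

    tails = defaultdict(lambda: {"high": [], "low": []})
    for (sire, _dam), sibs in families.items():
        if len(sibs) < 2:
            continue  # a single sib cannot contribute to both tails
        tails[sire]["high"].append(max(sibs)[1])  # best sib of this dam
        tails[sire]["low"].append(min(sibs)[1])   # worst sib of this dam
    return dict(tails)
```

With 10 dams per sire, each sire ends up with 10 animals in the H-tail and 10 in the L-tail, which matches the design described in the text.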

Genotyping strategy
For each SNP and for each sire, the frequencies of the '1' and '2' alleles in the high and low tail were determined and submitted to a χ²(1) test. SNPs with a Pearson statistic exceeding 10 (nominal p-value < 0.0016) were considered putative. A Pearson χ² value exceeding 10 required that the counts of the allele in either tail differed by at least 10. A difference of 10 alleles suggested linkage between a QTL and this SNP in the sire, assuming equal contributions of the dam's alleles to both tails. A chi-square test was appropriate under the null hypothesis of no association and the assumption that both sires and dams were sampled randomly from the population with respect to their SNP genotypes.
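As a rough illustration of this test, the sketch below computes the Pearson statistic for a 2x2 table of allele counts (tail by allele) and its 1-df p-value. The function name and the default of 20 alleles per tail (10 diploid animals) are assumptions for the example; the study's own implementation is not given in the text.

```python
import math


def tail_chi_square(high_count, low_count, alleles_per_tail=20):
    """Pearson chi-square (1 df) comparing allele counts between two tails.

    high_count / low_count: copies of the '1' allele in the H-tail / L-tail.
    alleles_per_tail: 2 alleles x 10 animals = 20 in the design described.
    Returns (chi2, p_value).
    """
    observed = [
        [high_count, alleles_per_tail - high_count],
        [low_count, alleles_per_tail - low_count],
    ]
    total = 2 * alleles_per_tail
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    if 0 in col_totals:
        return 0.0, 1.0  # SNP is monomorphic across both tails: no information

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Both row totals equal alleles_per_tail.
            expected = alleles_per_tail * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 df the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For instance, counts of 15 and 5 copies of the '1' allele in the two tails give a statistic of exactly 10, the threshold quoted above, with a p-value just below 0.0016.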

Stage 2
When multiple segregating SNPs occurred in a small region, this region was considered likely to contain a QTL. Genotypes of all putative SNPs were subsequently obtained for all animals with phenotype records, and association was tested with model 1, which fitted each SNP in turn as a fixed effect together with a polygenic effect, to identify markers with a significant effect.
For fine-mapping, LDLA software was used [8]. This was done per region, where the regions were those identified in stage 1. In LDLA a QTL was fitted at the midpoint of each bracket formed by a pair of adjacent SNPs. Phased adjacent markers defined a haplotype. The genotypic data were already phased, but with 100 progeny per sire phasing should be straightforward in any case. LDLA uses the same model as described above, except that a random haplotype effect was fitted instead of a fixed individual SNP effect. Both linkage and segregation information from sires and dams contributed to indicating the best location per region: the covariance among founder haplotypes accounts for linkage information and the covariance among parent and offspring haplotypes accounts for segregation information [9]. At each bracket midpoint the likelihood of the model was compared to that of a model with a polygenic effect only to determine significance. Threshold values were corrected for multiple testing [10].

Stage 1
A total of 114 putative markers differed significantly (p < 0.0016) in frequency between the high and low tail in at least one sire family (Table 1). Five markers were significantly different between tails in 2 sire families, but all other putative markers were discovered from the difference between tails in only a single sire family. The putative markers were identified in tails from 21 sires, of which 8 sires segregated for only one putative marker. In 24 sire families no SNPs were identified as putative. Most of the 114 putative markers occurred in groups of positions, indicating regions where QTL might be segregating.
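Grouping putative markers into candidate regions could be sketched as below. The text does not state how "groups of positions" were delimited, so the `max_gap` threshold, the function name, and the assumption that SNPs are numbered consecutively 1 to 6000 with 1000 SNPs per chromosome are all hypothetical choices made for illustration.

```python
def group_into_regions(snp_indices, max_gap=50, snps_per_chrom=1000):
    """Cluster putative SNP indices into regions of nearby markers.

    snp_indices: SNP numbers, assumed to run 1..6000 across chromosomes.
    max_gap: largest index gap (in SNPs) allowed within one region;
             a hypothetical tuning value, not taken from the study.
    Returns a list of regions, each a sorted list of SNP indices.
    """
    def chrom(snp):
        # 0-based chromosome number under the consecutive-numbering assumption.
        return (snp - 1) // snps_per_chrom

    regions = []
    for snp in sorted(set(snp_indices)):
        if regions:
            last = regions[-1][-1]
            # Extend the current region only if the SNP is close enough
            # and lies on the same chromosome.
            if chrom(snp) == chrom(last) and snp - last <= max_gap:
                regions[-1].append(snp)
                continue
        regions.append([snp])
    return regions
```

Regions containing several putative markers would then be the natural candidates for the per-region analysis in stage 2, while isolated markers are more likely to be false positives.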

Stage 2
The next step was to obtain genotypes for all phenotyped animals for the putative markers identified in stage 1, in order to distinguish between truly associated markers and false positives. Individual marker association with the trait was calculated using model 1 (i.e. a model including each marker in turn as well as a polygenic effect). Table 2 summarizes these results. Table 3 shows the results of forward stepwise regression. In each subsequent round the four SNPs with the most significant associations (F-statistics obtained after correcting for the previously entered SNPs) were added. The polygenic variance decreased, indicating that 12 markers accounted for close to 30% of the genetic variance. The results of the third round indicate that on each of chromosomes 1, 2 and 4 there were regions with QTL. The size of the QTL can be deduced from the effects of the genotypes in round three.
Subsequently LDLA was applied to these 114 markers, and the profiles of the likelihood ratio test are shown in Figure 1 for chromosomes 1, 2, 3, and 4. Given these graphs and the results from Table 3, two QTL are expected on chromosome 1, one QTL on chromosome 2, three or four QTL on chromosome 4 and none on chromosomes 3, 5, and 6. Two very obvious candidates for further study were the regions between SNP 403 and SNP 466 on chromosome 1 and between SNP 3007 and SNP 3091 on chromosome 4. Both regions had a maximum log likelihood ratio greater than 80. QTL with smaller effects are expected on chromosome 1 (to the left of SNP 232), on chromosome 2 (between SNP 1326 and SNP 1483) and on chromosome 4 (between SNP 3646 and SNP 3766 and around SNP 3965).
The region on chromosome 3 around SNP 2185 did not show a peak in the LDLA analysis. In this region sires 1 and 1034 were segregating (Table 1). Unlike the other regions, analysis on all sire families combined indicated that a QTL did not segregate in this region.
Although the 2 sires segregated for 21 putative markers in a small region, the data did not support the presence of a QTL in this region. This is a clear example of a false positive putative QTL.

Discussion
The most critical part of a selective genotyping strategy is to decide which animals should be included in the high and low tails, as well as the number of tails that will be screened. In this data set it made only a marginal difference whether the choice of animals was based on absolute phenotypic value or on estimated breeding value. Under practical circumstances, however, the latter would be preferred. In this balanced data set the 10 best progeny (one per dam) were included in the high tail and the 10 worst (one per dam) were in the low tail. By choosing high/low within dam instead of across dams within sires, the chances of picking up false putative markers are reduced. Many more false putative markers were found when choosing across dams (data not shown). An illustration is the box-plot of estimated breeding values of progeny of sire 389 shown in Figure 2.
The data allowed for 45 high/low tails to be made because there were 45 sires with 100 progeny each. All tails were analyzed but only 21 sires showed segregation of at least one marker, nine of which were segregating for one marker only. A relevant question is whether the segregating sires could have been identified beforehand. This would decrease the workload for preparation and testing considerably. An analysis of higher-moment statistics in the distribution of the phenotypes in the offspring might prove useful.
True positions of QTL were revealed after the workshop had taken place [7]. In Table 5 the estimated and true positions are compared. Eight of the 19 QTL (explaining 98% of the genetic variation) were found using our two-stage selective genotyping approach. About 54% of the genetic variance associated with these 19 QTL was covered by these eight QTL. Four additional QTL with smaller effects were also identified: S1 at 296 cM, S3/S4 at 513 cM, S21 at 3033 cM and S22 at 3048 cM. Additive QTL effects were not very well estimated, which might explain why some of the QTL with a smaller effect were not identified. The QTL at the beginning of chromosome 3 (SNP 2185), which was considered to be a false positive because it did not reach the significance level in the LDLA analysis, was in fact a QTL (M8) with a small effect.
[Figure: Likelihood ratio profiles for chromosomes 1, 2, 3 and 4 with adjusted threshold]
The two-stage approach reduced the number of genotypes from 28 million in the whole data set to 5.4 million in stage 1 plus 0.43 million in stage 2, a reduction of almost 80%. If SNP genotyping allows for sufficiently accurate estimation of allele frequency in pooled DNA, then only 540,000 genotypes have to be determined in the first stage, reducing the genotyping effort by another order of magnitude. The number of individuals to put into a pool depends on the accuracy of determining the allele frequency, which in turn depends on the method applied.
With AFLP-markers the typical choice is to put 10 individuals in each pool [11].
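The genotype counts quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that stage 2 genotypes only the animals not already genotyped in stage 1, and that pooling uses one high and one low pool per sire; both are assumptions made here because they reproduce the reported 0.43 million and 540,000 figures, not statements from the text.

```python
def genotype_counts(n_animals=4665, n_snps=6000, n_sires=45,
                    animals_per_tail=10, n_putative=114):
    """Number of genotypes needed under each strategy (design values from the text)."""
    full = n_animals * n_snps                         # genotype everything
    stage1_animals = n_sires * 2 * animals_per_tail   # H-tail + L-tail per sire
    stage1 = stage1_animals * n_snps
    # Assumption: the tail animals are not re-genotyped in stage 2.
    stage2 = (n_animals - stage1_animals) * n_putative
    # Assumption: one high pool and one low pool per sire.
    pooled_stage1 = n_sires * 2 * n_snps
    return {
        "full": full,
        "stage1": stage1,
        "stage2": stage2,
        "reduction": 1 - (stage1 + stage2) / full,
        "pooled_stage1": pooled_stage1,
    }
```

Under these assumptions the full data set requires 27.99 million genotypes, stage 1 requires 5.4 million, stage 2 about 0.43 million, and the overall reduction is about 79%.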

Conclusion
The two-stage genotyping strategy with selective genotyping detected regions with highly significant QTL useful for further fine-mapping. The large reduction in genotyping effort saves costs, and the savings could be used for subsequent expression and functional analyses.