Look who is calling: a comparison of genotype calling algorithms

In genome-wide association studies, high-level statistical analyses rely on the validity of the called genotypes, and different genotype calling algorithms (GCAs) have been proposed. We compared the GCAs Bayesian robust linear modeling using Mahalanobis distance (BRLMM), Chiamo++, and JAPL using the autosomal single-nucleotide polymorphisms (SNPs) from the 500 k Affymetrix Array Set data of the Framingham Heart Study as provided for the Genetic Analysis Workshop 16, Problem 2, and prepared standard quality control (sQC) for each algorithm. Using JAPL, most individuals were retained for the analysis. The lowest number of SNPs that successfully passed sQC was observed for BRLMM and the highest for Chiamo++. All three GCAs fulfilled all sQC criteria for 79% of the SNPs but at least one GCA failed for 18% of the SNPs. Previously undetected errors in strand coding were identified by comparing genotype concordances between GCAs. Concordance dropped with the number of GCAs failing sQC. We conclude that JAPL and Chiamo++ are the GCAs of choice if the aim is to keep as many subjects and SNPs as possible, respectively.


Background
A crucial step in the data generation process of genomewide association studies is genotype calling. Here, qualitative genotypes are derived from measured signal intensities of the two alleles of a single-nucleotide polymorphism (SNP). Because missing or erroneous genotypes can flaw the high-level statistical association analysis, a series of different genotype-calling algorithms (GCAs) have been proposed [1].
The outcome of these GCAs can differ substantially [2]. We therefore compared different GCAs using the genotype data from participants of the Framingham Heart Study SNP Health Association Resource project. We investigated the influence of GCAs on autosomal SNPs that passed the filtering by errors in strand coding and standard quality control (sQC).

Methods
Hybridization probe intensity CEL data of 6,848 participants in the Framingham Heart Study was provided as Problem 2 for the Genetic Analysis Workshop 16 (GAW16) [3]. Genotyping was performed using the Affymetrix GeneChip® Human Mapping 500 k Array Set.

Open Access
We limited our analyses to the 2,466 participants of the 332 families with complete genotypes in the nuclear families.
Three different GCAs were considered for comparison. Bayesian robust linear modeling using Mahalanobis distance (BRLMM) has been recommended by the manufacturer for the 500 k Array Set [4]. Chiamo++ (Italian for "I call") uses a Bayesian hierarchical four-class mixture model [5]. JAPL (French for "I call") is based on an expectation-maximization (EM) clustering algorithm that was improved by Plagnol et al. [6]. Where probe intensities had to be normalized beforehand, CelQuantileNorm was used [7]. Normalization had to be split in two parts because of memory access errors when more than approximately 2,000 samples were used in one run. The data were split randomly in two batches of similar size. Chiamo++ and JAPL were run using default settings, BRLMM calls were used as provided for GAW16.
Only those SNPs provided in the GAW16 BRLMM data set were used for further analysis. Furthermore, Xchromosomal SNPs and SNPs with different strand codings in the GCAs were excluded.
For the remaining SNPs, deFinetti triangles presenting allele and genotype distributions were generated for all three GCAs (an example for a deFinetti triangle is given in Figure 1). In these, the two homozygote genotype frequencies for each SNP are read as length of the projections along the sides of the triangle, and the allele frequencies and the proportion of heterozygotes are given on the horizontal and vertical axes [8].
Samples with a call fraction <97% were excluded, and sQC was performed separately for all three GCAs. Specifically, SNPs were excluded if the exact lack-of-fit test for Hardy-Weinberg equilibrium (HWE) revealed p < 10 -4 , if the minor allele frequency (MAF) was <1%, or if the missing frequency (MiF) was <2%.
We defined seven different groups of SNPs after sQC according to Figure 2 and investigated the characteristics of SNPs in each group. A detailed analysis of all SNPs was computationally impossible because this would have required a comparison of >3 billion genotypes. We therefore drew a random sample of 10,000 SNPs from group p5 and a random sample of 1,000 SNPs for every other group.
We termed an individual to be concordant for the considered GCAs if the GCAs yielded the same result (genotype or missing) for the specific SNP. We then derived concordance fractions on the SNP level. Confidence intervals were estimated as 95% exact Blyth-Still-Casella confidence intervals (95% CI).
Analyses were performed in the statistical package R, version 2.7.1, with the GenABEL, version 1.4-1 library [9]. The analyses were carried out on an Intel Quad-Core Dual Xeon E5345 computer with a 2.33 GHz processor, 32 GB RAM, and a 64-bit SUSE Enterprise Linux operating system.

Results and discussion
A total of 486,605 SNPs were provided for GAW16. We excluded 63,950 SNPs because of different strand coding or allele flips, leaving 422,655 SNPs for further analysis. The deFinetti triangles ( Figure 3) give an overview of the genotype distribution of these SNPs. BRLMM and JAPL showed excess heterozygosity, i.e., more heterozygous subjects than expected under HWE, for a larger number of SNPs than Chiamo++ (BRLMM, 56.04%; Chiamo++, 53.66%; JAPL, 56.74%). Both algorithms revealed many SNPs with a high number of heterozygous subjects but low frequency for one of the homozygous genotypes.
In contrast, Chiamo++ more often led to genotype distributions at the boundary of the deFinetti triangle The deFinetti triangle. The deFinetti triangle shows genotype frequencies and allele frequencies for each SNP. For example, the purple point displays one SNP. The two homozygote genotype frequencies for a SNP are read as length of the projections along the sides of the triangle (shown in red and blue). The proportion of heterozygotes (green) is given either on the vertical axis or as difference between 1 and the homozygous frequencies. The allele frequencies are shown on the horizontal axis. The curve displays the genotype distributions that are exactly in HWE. Interestingly, Chiamo++ yielded SNPs with an extremely low heterozygosity (<2 × 10 -3 ) more often than BRLMM and JAPL (BRLMM, 1.84%; Chiamo++, 5.39%; JAPL, 1.95%). These SNPs fell in one of two groups: The first had a MAF lower than 20% but only 2.5% heterozygous subjects. For the second, one could imagine that they form the rudiment of a second curve with maximum at the point (0.5; 0.25). As in the first group of SNPs, this curve was only observed for SNPs with MAF<20%. Because this curve is usually seen for X-linked SNPs if males and females are pooled, we investigated the genotype frequencies for a series of these SNPs by sex but we detected no differences.
The call fraction was >0.97 for all individuals in JAPL. Five subjects were excluded using Chiamo++. BRLMM called these five participants and an additional four with a fraction of <0.97. This finding is in line with the conclusions drawn by the developers of JAPL, who state that their algorithm was specifically designed to deal with uncertain genotypes which are said to be missing by other GCAs [6].
Based on the subset of individuals who passed sQC for all three GCAs, the observations from the deFinetti triangles are confirmed by results of sQC (Table 1): for Chiamo++, twice as many SNPs failed the HWE criterion. Most SNPs failed due to MAF criterion in Chiamo++, but rates are comparable among all three algorithms. For  Interestingly, BRLMM removed almost twice as many SNPs as Chiamo++ through this criterion.
In total, the highest number of SNPs fulfilling all sQC criteria was obtained using Chiamo++ (77.36%) and the smallest number was obtained using BRLMM (74.05%). 351,207 SNPs (83.10%) passed the sQC in at least one algorithm. Of these SNPs, 78.55% fulfilled all sQC criteria for all three GCAs jointly ( Figure 2).
In summary, if the aim is to keep as many subjects as possible for analysis, which is of interest in genome-wide association studies with a small sample size or in familybased genome-wide association studies, JAPL would be the GCA of choice. Chiamo++ would be preferred if one aims at keeping a high number of SNPs for further analysis.
Results of the concordance estimation are summarized in In general, estimating concordance with one or more GCAs failing sQC led to considerably lower values. Specifically, we found dramatically low concordance fractions (minimum concordance fractions between 18% and 78%) for SNPs that did not pass sQC in all considered GCAs. This might be due to the fact of disagreement in calling genotypes as "missing".

Conclusion
Among the investigated GCAs, JAPL is recommended if the aim is to keep as many subjects as possible for analysis. Chiamo++ would be preferred if the number of SNPs for further analysis needs to be high. By comparing the concordances between different calling algorithms, otherwise-undetected errors in strand coding were identified. Considering SNPs that did not pass the sQC in at least one of the considered algorithms, the concordance frequency is considerably lower.