Evaluating the concordance between sequencing, imputation and microarray genotype calls in the GAW18 data

Genotype errors are well known to increase type I errors and/or decrease power in related tests of genotype-phenotype association, depending on whether the genotype error mechanism is associated with the phenotype. These relationships hold for both single and multimarker tests of genotype-phenotype association. To assess the potential for genotype errors in Genetic Analysis Workshop 18 (GAW18) data, where no gold standard genotype calls are available, we explored concordance rates between sequencing, imputation, and microarray genotype calls. Our analysis shows that missing data rates for sequenced individuals are high and that there is a modest amount of called genotype discordance between the 2 platforms, with discordance most common for lower minor allele frequency (MAF) single-nucleotide polymorphisms (SNPs). Some evidence for discordance rates that were different between phenotypes was observed, and we identified a number of cases where different technologies identified different bases at the variant site. Type I errors and power loss is possible as a result of missing genotypes and errors in called genotypes in downstream analysis of GAW18 data.


Background
Over the past decade, a large body of literature has been amassed related to genotype errors for SNP microarrays. We now have a clear understanding of the prevalence of such errors and of many potential sources of the errors, as well as an understanding of the downstream implications of genotype errors on the type I error rate and power of related single SNP tests of genotype-phenotype association [1]. In particular, nondifferential genotyping errors, that is, errors that are the result of a random process unrelated to the phenotype, decrease power [2][3][4]. However, differential genotyping errors, errors that occur according to different random processes according to the value of the phenotype, may inflate the type I error rate [5,6]. Additional work has confirmed that similar results hold for analysis of imputed genotypes using standard single-marker tests of genotypephenotype association [7].
With the advent of next-generation sequencing (NGS), multimarker analysis methods have increased in popularity. Recent papers demonstrate similar results (i.e., decreased power and increased type I error for nondifferential and differential genotyping errors) are true for multimarker tests as well. In particular, for collapsing tests [e.g., [8][9][10]], the effects of both differential and nondifferential genotyping errors can be exacerbated by the cumulative nature of genotyping errors across a set of markers [11,12]. The relationship for particular collapsing tests is anticipated to hold for the larger set of all collapsing (burden) and variance components tests based on structural similarities in these classes of tests [13]. To date, large error rates have been observed for sequence data [14][15][16], much larger than were typical in the early days of SNP microarrays [17]. Thus, there is the potential for substantial power loss and inflated type I error for multimarker tests involving NGS data.
For the typical researcher, it is often costly and impractical to invest in large-scale quality control studies to obtain study-specific estimates of genotype reliability. However, as was seen in the GAW18 data, it is reasonable to think that as more and more studies sequence existing samples, a typical quality control approach may involve evaluating the concordance between genotypes obtained on the samples using SNP microarrays with genotypes obtained using the new NGS technology. We conducted our analysis using sequencing data (measured with NGS technology or through imputation) and SNP microarray data. After evaluating the overall concordance levels between genotype calls, we evaluated which types of discordance are most common and the potential for concordance rates, which are related to the phenotype.

Methods
We used the following procedure to evaluate the concordance of sequence and microarray data. First, we considered all SNPs for which both sequence and microarray data were available in the distributed GAW18 files by matching SNP identification (rs) numbers. Prior to our analysis, each set of data went through separate data cleaning pipelines, which included cleaning observed mendelian errors within the pedigrees for both the sequence and microarray data and which are described in detail elsewhere [18]. This yielded a preliminary data set containing 297,197 SNPs. After eliminating SNPs for which the major and minor alleles present at the variant site differed between the 2 technologies (56,741 SNPs), the resulting final analysis data set consisted of 240,456 SNPs, spread across all oddnumbered autosomes. Even-numbered autosomes and sex chromosomes were not part of the GAW18 data release. Next, for each of the 240,456 SNPs in the analysis data set, we identified and recorded both the genome-wide association studies (GWAS) and NGS genotypes (including missing) for each of 959 people for whom both GWAS and NGS data was available. The 959 people include 464 individuals who were actually sequenced and 495 individuals for whom sequence data was imputed using MaCH as described elsewhere [18].

Statistical analysis
For each SNP in the analysis we computed a variety of statistics evaluating the concordance between genotype calls on the 3 different platforms (NGS, imputed, and SNP microarray). We started by counting the overall number of concordant and discordant genotypes for sequenced and microarray data. There are 16 possibilities for each individual-SNP combination: AA-AA, AA-AB, AA-BB, AA-XX, AB-AA, AB-AB, AB-BB, AB-XX, BB-AA, BB-AB, BB-BB, BB-XX, XX-AA, XX-AB, XX-BB, and XX-XX, where i-j indicates that the individual is identified as genotype i for sequence data and genotype j for microarray data. (Note that we use "A" to represent the reference allele for the NGS technology, "B" to represent the nonreference allele for the NGS technology, and × to represent missing throughout this article.) In addition to overall concordance, concordance rates were computed conditional on the observed genotype for the microarray technology. Concordance rates were also computed for individuals with different phenotype groups (males vs. females; hypertensive [systolic blood pressure >140 mm Hg or diastolic blood pressure >90 mm Hg at any of 4 exams]; vs. nonhypertensive smokers [self-identified at any of 4 waves] vs. nonsmokers). T-tests were used to compare average concordance rates between technologies and between phenotypes across the set of all SNPs. Table 1 cross-classifies all 240,456 SNPs for which both SNP microarray and sequence data are available, and which met our initial screening criteria (see Methods for details). Across the 230,597,304 (240,456 SNPs × 959 individuals) possible genotype calls, there are more than 500,00 discordant genotypes (both technologies call a genotype, and the genotypes are different), and more than 5 million genotypes that are missing on at least 1 of the 2 platforms. This means that the overall proportion of discordant genotypes (including missing) is 2.63%, while the proportion of discordant called genotypes is 0.23%.

Results
To gain a better understanding of the distribution of the discordant genotypes noted in Table 1, we computed conditional concordance rates (Table 2). In particular, we examined the probabilities that the sequence technology yields each different genotype (or missing) conditional on the genotype identified by the microarray technology. Table 2 provides separate conditional concordance rates for NGS and imputed genotypes.
As Table 2 shows, in a number of cases, the average concordance rates are substantially different between the 2 sequencing technologies. In fact, except for AA given AA, XX (missing) given AA and BB given XX (missing), all p-values from t-tests comparing the average rates between the 2 technologies are less than 2 × 10 −16 . When the microarray platform identifies the genotype as a nonreference allele homozygote, imputed sequence data shows higher concordance than NGS data. When the microarray identifies the genotype as a homozygote reference allele, rates of discordance for the homozygous nonreference allele genotype are also higher for imputed data compared to NGS genotypes. However, when the microarray platform calls the genotype a heterozygote, the NGS sequence genotypes are more concordant, as is the case when the microarray platform identifies the genotype as a reference allele homozygote and the discordance rates for the heterozygote genotype are compared. When the SNP microarray genotype is missing, the sequence data often identifies at least 1 reference allele at the site. There is also a strong association between MAF and concordance rates. In particular, SNPs with lower MAF have substantially lower concordance between platforms than do SNPs with larger MAFs (detailed results not shown).
Finally, Table 3 illustrates an analysis comparing the average conditional concordance rates for hypertensive individuals compared to nonhypertensive individuals. The strongest evidence for significant differences (p <2 × 10 −16 ) in average conditional concordance were observed between hypertensive and nonhypertensives for sequence AA or BB given GWAS AA, sequence AA or AB given GWAS AB, and sequence AA, AB, or BB given GWAS BB. Some evidence of differential average conditional concordance rates between males and females and smokers and nonsmokers also exists (detailed results not shown).

Discussion
Although most genotypes are the same for both technologies, there are still substantial numbers of discordant genotype pairs. The most common type of discordance comes from missing genotypes on the sequence technology, which occurred most frequently when the microarray technology identified at least 1 reference allele at the variant site. Power loss will occur when genotypes are not called, and so using sequence technology genotypes will, when analyzing single SNPs or sets of SNPs considered in this analysis, yield lower power overall than using microarray genotypes. We note, however, that overall power may still be higher when using sequenced genotypes as our analysis necessarily precludes the inclusion of less common SNPs, which are not measured by the microarray technology.   Although discordance between called genotypes is less common than with missing genotypes, the amount of discordance (500,000 discordant genotypes; 0.23% overall discordance rate) is still notable. Of particular note are the high proportions of heterozygote (microarray) to nonreference allele homozygote and reference allele homozygote (microarray) to heterozygote (sequence) discrepancies; more than 80% of all called genotype discordance is from these 2 types of discrepancies. The conditional discordance rates in Table 2 suggest that the majority of the discordance occurs in the imputed sequence (not the NGS sequence) data. Given that the NGS data comes from 60× coverage, the microarray genotype calling pipeline is well established, and the imputation procedure used in the GAW18 data is both novel and complex, we conclude it likely that aspects of the imputation procedure is what yielded the majority of observed discordant genotypes.
It is reasonable to view the conditional discordant genotype rates in Table 2 as conservative estimates of the genotype error rate because, if the technologies are applied independently, the vast majority of genotype errors of each technology will appear in a discordant genotype pair. However, if genotype errors are correlated between the 2 technologies (eg, at a particular variant site, similar samples are prone to error on both technologies), using the conditional discordant rate as an estimate of the genotype error rate may be substantially lower than the true genotype error rate.
As documented for single-marker tests, genotyping errors from the major homozygote to the minor homozygote are the most costly (in terms of power loss) [2,4,19]. Recently, Powers et al [11] documented potentially large declines in statistical power for collapsing tests in case-control designs, when genotype errors (particularly from more common to less common genotypes) are present, as is the case here. For example, with genotype error rates of 0.2% to 0.5% for more common to less common genotypes (as estimated in Table 2), power loss between 2% and 5% for most collapsing tests will occur. Genotype error rates of up to 2% from less common to more common genotypes have only modest impact on power (<0.5% decline). Because these results were for case-control studies, further research is needed to demonstrate similar effects of genotyping errors on family based collapsing tests. These papers [2,4,11,19] also found that power loss increases as the MAF decreases; because discordance was larger for lower MAF SNPs, power loss will be larger for lower MAF SNPs in GAW18 data. Furthermore, our analysis only considered SNPs with MAF above 5%, as rarer SNPs were not genotyped using the microarray technology. If the trend we observed continues, these SNPs may have even larger error rates and, hence, even more dramatic power loss.
These problems are further compounded when we consider that there was some evidence of differential discordance between phenotypes. Differential discordance can lead to inflated type I errors; for example, in line with our observation, for collapsing tests, error rates of 0.2% in cases and 0.1% in controls (or vice versa) can inflate the type I error rate from 5% to between 15% and 25% for most tests [5,6,12]. Although quality-control approaches (eg, Q-Q plots) can detect large-scale type I error deviations, isolated differential genotyping errors may escape typical quality control and manifest themselves as false positives. To minimize the effect of differential genotyping errors, random assignment of subjects across genotyping laboratories, laboratory assistants and plates should always be practiced. Although the precise cause of the modest differential genotyping errors observed here is unknown, it may be a result of familial aggregation of hypertension and nonrandom assignment of family members to sequencing runs, imputation quality differences between families, or other unmeasured covariates.
We note that our analysis did not consider many of the most egregious inconsistencies: namely, 56,741 SNPs where the allele calls were different. Further analysis is needed to identify whether the allele call differences yield substantially different genotype distributions in the cases and controls between the 2 technologies, or it is simply a matter of exchanging one base for another in the calling algorithms (eg, reverse and forward strand differences between the 2 technologies).
It is likely that, increasingly, sequence and microarray data will be available for the same sample as was the case here. In addition to providing an opportunity to empirically evaluate data quality, as was done here, discordant genotypes present an opportunity to utilize better quality data in downstream analyses. Recently, we showed that, under modest assumptions of independence of the 2 independent genotype mechanisms, a 50-50 weighting strategy of the 2 discordant genotypes should be used in analyses of phenotype-genotype association [20]. Thus, a reasonable choice for the genotype to be analyzed in cases of discordance is the dosage, where dosage = 0.5 is used for major homozygote-heterozygote discordant pairs and dosage = 1.5 is used for heterozygote-minor homozygote pairs. Further work is needed to confirm the optimality of this result for multimarker rare variant tests of association.

Conclusions
Despite sophisticated data-cleaning pipelines for all 3 technologies, a noticeable number of discordant genotypes remain in the GAW18 data. It is encouraging that the majority of discordant genotypes were identified as missing by one of the technologies; however, a substantial number of discordant called genotypes were still observed. Although the amounts and types of discordance observed here will likely lead to power loss and/or type I errors in downstream analysis, further research is needed to understand the impact of errors on tests of association in family based studies using sets of markers. However, it is reasonable to expect that errors and missing data will likely impact the type I and/or power for family based tests as they do for case-control tests.