Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data
- Dörthe Malzahn^{1},
- Stefanie Friedrichs^{1} and
- Heike Bickeböller^{1}
Published: 18 October 2016
Abstract
We used our extension of the kernel score test to family data to analyze real and simulated baseline systolic blood pressure in extended pedigrees. We compared the power for different kernels and for different weightings of genetic markers. Moreover, we compared the power of rare and common markers with 3 strategies for joint testing and on marker panels with different densities. Marker weights had much greater influence on power than the kernel chosen. Inverse minor allele frequency weights often increased power on common markers but could decrease power on rare markers. Furthermore, defining the gene region based on linkage disequilibrium blocks often yielded robust power of joint tests of rare and common markers.
Background
The kernel score test is a global, covariate-adjusted multilocus procedure that tests for overall association of sets of markers (see Schaid [1] for a review). Testing sets rather than single markers reduces the multiple-testing burden. Tested marker sets can, for example, belong to a pathway or a candidate gene. The kernel score test can be applied to common and rare variants alike, and to genome-wide association study (GWAS) data as well as to sequence data, where it is known as the sequence kernel association test (SKAT). The kernel score test was developed for independent subjects [1]. Recent contributions by others and ourselves [2–6] extended the kernel score test to family data.
The kernel is chosen to describe genetic correlation among subjects. Different kernels have been suggested for genetic epidemiological applications. These kernels differ in whether marker–marker interactions are modeled and in how complex the interaction effects may be. A frequent choice is to apply the kernel function to weighted minor allele dosage data (thus using an additive coding of minor allele effects). The dosage weights increase with decreasing minor allele frequency, corresponding to the a priori assumption that less-frequent variants may have larger effects. Weighting allows rarer variants to contribute more to the overall test despite their low frequencies.
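As an illustration of applying a kernel function to weighted minor allele dosages, the following minimal sketch builds a linear kernel from a subjects × SNPs dosage matrix. The function names are hypothetical, and the weight form 1/MAF is an assumption made for illustration; the exact weight definition used in this article is not reproduced in the text above.

```python
from typing import List, Sequence


def minor_allele_frequency(dosages: Sequence[float]) -> float:
    """Estimate the minor allele frequency of one SNP from dosages in [0, 2]."""
    return sum(dosages) / (2.0 * len(dosages))


def weighted_linear_kernel(
    dosage_matrix: Sequence[Sequence[float]],
    inverse_maf_weights: bool = True,
) -> List[List[float]]:
    """Linear kernel on weighted dosages (rows = subjects, columns = SNPs).

    Assumption: weights are 1/MAF when inverse_maf_weights is True,
    otherwise all SNPs get equal weight 1.
    """
    n_snps = len(dosage_matrix[0])
    weights = []
    for j in range(n_snps):
        maf = minor_allele_frequency([row[j] for row in dosage_matrix])
        if maf <= 0.0:
            raise ValueError(f"SNP {j} is monomorphic; filter it before weighting")
        weights.append(1.0 / maf if inverse_maf_weights else 1.0)

    # Scale each dosage by its SNP weight, then take inner products of subjects.
    weighted = [[weights[j] * row[j] for j in range(n_snps)] for row in dosage_matrix]
    return [
        [sum(a * b for a, b in zip(row_i, row_k)) for row_k in weighted]
        for row_i in weighted
    ]
```

Because the weights multiply the dosages before the inner product is taken, each SNP enters the linear kernel with its squared weight, which is why rare SNPs can dominate the similarity between two carriers.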
With appropriate weighting, rare and common variants may be entered together into the kernel for joint testing. Recently, however, Ionita-Laza et al. [7] proposed alternatives that can be more powerful. We explored these alternative joint tests on rare and common variants in the Genetic Analysis Workshop 19 (GAW19) family data. Moreover, we compared the power of different marker weights and kernels on sequence and GWAS panels. As we focused on genes, we also explored how the size or positioning of a flanking region affects the power of the test.
Methods
Data
We analyzed baseline systolic blood pressure (SBP) and dosage data in the extended Mexican American pedigrees of the GAW19 family data, which are identical to the Genetic Analysis Workshop 18 data [8]. As before [6], we considered subjects with known baseline SBP and baseline diastolic blood pressure, sex, and age, who were not on blood pressure medication (real SBP: 706 subjects, excluding the first listed monozygotic twin of 2 observed twin pairs; simulated SBP: 740 to 781 subjects, numbers vary for 200 simulated study replicates because of inclusion criteria). For real SBP, we considered candidate gene AGTR1 [9] on chromosome (chr) 3 that tends to associate with SBP in the present family sample [6]. For simulated SBP, we selected from the simulation answers 5 strongly associated genes with various linkage disequilibrium (LD) structures: MAP4 (very homogeneous LD, chr3) and, in the order of increasing variability of LD, TNN (chr1), FLT3 (chr13), LEPR (chr1), and GSN (chr9). We used NCBI build 37, International Haplotype Map Project (HapMap) [10] reference data for Mexican Americans and the default algorithm in Haploview 4.2 [11] with a required fraction of strong LD of 0.7 and confidence interval limits of 0.5 and 0.8 to determine LD-blocks based on the D’ measure. Gene regions were defined as the LD-block(s) that contained the gene. For AGTR1, we also considered the region from the first to the last exonic position and flanking regions of 30 kb or 500 kb. For the same subjects, we used 2 single-nucleotide polymorphism (SNP) panels: sequence (allele dosage data) and GWAS (allele dosage data reduced to GWAS SNPs). Biallelic SNPs were included for testing if their Hardy-Weinberg equilibrium test p values were equal to or greater than 10^{−5} (rounding imputed dosages for this purpose only) and if at least 7 observations of the minor allele were present in the sample. The latter parallels minimum data requirements in parametric regression.
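The two SNP inclusion criteria above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the text does not state which Hardy-Weinberg test was used (a 1-degree-of-freedom chi-square test is assumed here), and counting minor allele observations on rounded genotypes is likewise an assumption.

```python
import math
from typing import Sequence


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """p value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    q = (2 * n_bb + n_ab) / (2 * n)  # frequency of allele b
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0.0:
        return 1.0  # monomorphic SNP: the test is not informative
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def passes_snp_filter(
    dosages: Sequence[float],
    hwe_threshold: float = 1e-5,
    min_minor_obs: int = 7,
) -> bool:
    """Apply the HWE and minimum minor-allele-observation criteria to one SNP."""
    # Imputed dosages are rounded to genotypes for this purpose only.
    genotypes = [int(math.floor(d + 0.5)) for d in dosages]
    n_aa, n_ab, n_bb = (genotypes.count(g) for g in (0, 1, 2))
    minor_observations = n_ab + 2 * n_bb
    if minor_observations < min_minor_obs:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_threshold
```

The unrounded dosages would still be used for the association test itself; only the filter works on rounded genotypes.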
Kernel score test for family data
X and Z are the design matrices for fixed covariate effects and random family effects, respectively. h(G) = Ka^{T} depends on an n × n kernel matrix K of genetic similarities between the n subjects on markers G, and on multivariate normally distributed random effects a ~ N(0, τK) [1]. One tests for a genetic covariance component τ.
R = P_{o}^{1/2} Y are standard normally distributed residuals, and the matrix M = (P_{o}^{1/2} K P_{o}^{1/2})/2 incorporates the kernel [6]. P_{o} = V_{o}^{−1} − V_{o}^{−1} X (X^{T} V_{o}^{−1} X)^{−1} X^{T} V_{o}^{−1} is the null projection matrix with V_{o} = s^{2}_{o} I + s^{2}_{fam} Z Z^{T}. The p values for test statistic (2) were calculated by Davies’ exact method [13] with the R package CompQuadForm from the sample estimate Q and all eigenvalues of matrix M.
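For orientation, the definitions of R and M given here imply a quadratic-form statistic of the standard kernel score type. The display below is a sketch under that assumption, since the numbered equation itself is not reproduced in this text; the symbols λ_i (the eigenvalues of M) and the independent chi-square variables χ²_{1,i} are notation introduced for illustration.

```latex
\begin{align}
Q &= R^{T} M R \;=\; \tfrac{1}{2}\, Y^{T} P_{o}\, K\, P_{o}\, Y, \\
Q &\sim \sum_{i} \lambda_{i}\, \chi^{2}_{1,i} \quad \text{under } H_{0}\colon \tau = 0 .
\end{align}
```

The second line is the reason the p value calculation needs both the observed Q and all eigenvalues of M: Davies' method evaluates the tail probability of such a weighted sum of chi-square variables.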
Kernels and single-nucleotide polymorphism weights
Strategies for combined testing of common and rare variants
By default, the kernel score test, Eq. (2), is performed with a kernel matrix K_{all} computed on all dosages with a weighting of common and rare SNPs.
Under H_{0}, Q_{FISHER}/(1 + 0.25∙cov) is chi-square distributed with 16/(4 + cov) degrees of freedom [7]. With r = tr(M_{rare}∙M_{common})/(tr(M_{rare}∙M_{rare})∙tr(M_{common}∙M_{common}))^{1/2}, the covariance between p_{rare} and p_{common} is cov ≈ r∙(3.25 + 0.75∙r) for 0 ≤ r ≤ 1 and cov ≈ r∙(3.27 + 0.71∙r) for −0.5 ≤ r ≤ 0. Only test statistic (6) yields approximate p values; all other p values are obtained with Davies’ method and are exact.
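The scaling described here can be sketched in a few lines. The function names are hypothetical, and the form Q_{FISHER} = −2 ln p_{rare} − 2 ln p_{common} is an assumption (the usual Fisher combination), because the defining equation is not reproduced in this text. The sketch returns the scaled statistic and its degrees of freedom rather than a p value.

```python
import math
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def trace_of_product(a: Matrix, b: Matrix) -> float:
    """tr(A B) for two square matrices of equal size."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def kernel_correlation(m_rare: Matrix, m_common: Matrix) -> float:
    """r = tr(M_rare M_common) / sqrt(tr(M_rare M_rare) tr(M_common M_common))."""
    r = trace_of_product(m_rare, m_common) / math.sqrt(
        trace_of_product(m_rare, m_rare) * trace_of_product(m_common, m_common)
    )
    return max(-1.0, min(1.0, r))  # guard against floating-point overshoot


def pvalue_covariance(r: float) -> float:
    """Piecewise approximation of the covariance term given in the text."""
    if 0.0 <= r <= 1.0:
        return r * (3.25 + 0.75 * r)
    if -0.5 <= r < 0.0:
        return r * (3.27 + 0.71 * r)
    raise ValueError("approximation is only stated for -0.5 <= r <= 1")


def fisher_combination(p_rare: float, p_common: float, r: float) -> Tuple[float, float]:
    """Return (scaled statistic, degrees of freedom) for the Fisher joint test.

    Assumption: Q_FISHER = -2 ln(p_rare) - 2 ln(p_common).
    """
    q_fisher = -2.0 * (math.log(p_rare) + math.log(p_common))
    cov = pvalue_covariance(r)
    scaled_statistic = q_fisher / (1.0 + 0.25 * cov)
    degrees_of_freedom = 16.0 / (4.0 + cov)
    return scaled_statistic, degrees_of_freedom
```

Two limiting cases make the scaling plausible: for r = 0 the statistic is unscaled with 4 degrees of freedom (the classical Fisher test for two independent p values), and for r = 1 it is halved with 2 degrees of freedom, as expected when both p values carry the same information. The approximate p value is the upper tail probability of a chi-square distribution with the returned, generally non-integer, degrees of freedom, for example via `scipy.stats.chi2.sf(scaled_statistic, degrees_of_freedom)`.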
Results and discussion
Analysis of real data: real SBP and candidate gene AGTR1
SNP panel | Weight | Common SNPs (MAF >5 %), N_{SNP} | Common SNPs (MAF >5 %), p value | Rare SNPs (MAF ≤5 %), N_{SNP} | Rare SNPs (MAF ≤5 %), p value | Joint test, default, p value | Joint test, WS, p value | Joint test, Fisher, p value |
---|---|---|---|---|---|---|---|---|
AGTR1 with no flanking region, positions 148415571–148460795 | | | | | | | | |
GWAS | equal | 11 | 0.189 | 7 | 0.097 | 0.177 | 0.102 | 0.101 |
GWAS | 1/ν | 11 | 0.113 | 7 | 0.050 | 0.054 | 0.044 | 0.043 |
SEQ | equal | 74 | 0.203 | 138 | 0.060 | 0.173 | 0.076 | 0.076 |
SEQ | 1/ν | 74 | 0.160 | 138 | 0.098 | 0.083 | 0.088 | 0.090 |
AGTR1 with 30 kb flanking region, positions 148385571–148490795 | | | | | | | | |
GWAS | equal | 30 | 0.100 | 12 | 0.072 | 0.092 | 0.050 | 0.052 |
GWAS | 1/ν | 30 | 0.045 | 12 | 0.069 | 0.030 | 0.029 | 0.029 |
SEQ | equal | 198 | 0.053 | 300 | 0.067 | 0.047 | 0.030 | 0.032 |
SEQ | 1/ν | 198 | 0.039 | 300 | 0.172 | 0.045 | 0.044 | 0.050 |
AGTR1 with 500 kb flanking region, positions 147915571–148960795 | | | | | | | | |
GWAS | equal | 277 | 0.206 | 51 | 0.048 | 0.196 | 0.061 | 0.065 |
GWAS | 1/ν | 277 | 0.151 | 51 | 0.064 | 0.102 | 0.059 | 0.066 |
SEQ | equal | 2170 | 0.192 | 2244 | 0.069 | 0.173 | 0.080 | 0.085 |
SEQ | 1/ν | 2170 | 0.157 | 2244 | 0.051 | 0.062 | 0.057 | 0.060 |
AGTR1 containing LD-block, positions 148344702–148568958 | | | | | | | | |
GWAS | equal | 80 | 0.058 | 19 | 0.076 | 0.055 | 0.035 | 0.036 |
GWAS | 1/ν | 80 | 0.040 | 19 | 0.114 | 0.034 | 0.036 | 0.039 |
SEQ | equal | 499 | 0.029 | 592 | 0.106 | 0.027 | 0.027 | 0.030 |
SEQ | 1/ν | 499 | 0.027 | 592 | 0.112 | 0.025 | 0.026 | 0.030 |
Conclusions
As the power of kernel methods increases through the exploitation of SNP correlations [2], this ability should be utilized fully by analyzing LD-blocks. SNP weights have a far greater impact on test power than the kernel chosen. Currently, the benefit of 1/ν-weights may be underestimated for common SNPs. On rare SNPs, 1/ν-weights often improve power but can also be detrimental. These findings are consistent across real and simulated data. Our results suggest using 1/ν-weights on all SNPs in a single kernel K_{all}, testing LD-blocks, and including only SNPs with sufficient minor allele observations. Alternatively, one may use WS with 1/ν-weights on common SNPs and equal weights on rare SNPs in the kernels. WS upweights the rare variant contribution globally; see Eq. (5).
Declarations
Acknowledgements
This work was supported by the Deutsche Forschungsgemeinschaft DFG (grant Klinische Forschergruppe [KFO] 241: TP5, BI 576/5-1; grant Research Training Group “Scaling Problems in Statistics” RTG 1644).
This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
Authors’ contributions
Authors contributed as follows: study concept, DM and HB; data extraction and analysis, DM and SF; SNP mapping with NCBI build 37 and LD calculations, SF; and writing of the manuscript, DM. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered. 2010;70(2):109–31.
2. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser P, Lin X. SNP set association analysis for familial data. Genet Epidemiol. 2012;36(8):797–810.
3. Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37(2):196–204.
4. Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, Richards JB, Ciampi A, Greenwood CM. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol. 2013;37(4):366–76.
5. Huang J, Chen Y, Swartz MD, Ionita-Laza I. Comparing the power of family-based association test for sequence data with applications in the GAW18 simulated data. BMC Proc. 2014;8 Suppl 1:S27.
6. Malzahn D, Friedrichs S, Rosenberger A, Bickeböller H. Kernel score statistic for dependent data. BMC Proc. 2014;8 Suppl 1:S41.
7. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92(6):841–53.
8. Almasy L, Dyer TD, Peralta JM, Jun G, Wood AR, Fuchsberger C, Almeida MA, Kent Jr JW, Fowler S, Blackwell TW, et al. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014;8 Suppl 1:S2.
9. Baudin B. Polymorphism in angiotensin II receptor genes and hypertension. Exp Physiol. 2005;90(3):277–82.
10. The International HapMap Consortium. The International HapMap project. Nature. 2003;426(6968):789–96.
11. Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263–5.
12. Blom G. Statistical estimates and transformed beta variables. New York: John Wiley & Sons; 1958.
13. Davies RB. Algorithm AS 155: the distribution of a linear combination of chi-2 random variables. J R Stat Soc: Ser C: Appl Stat. 1980;29(3):323–33.
14. Cheverud JM. A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001;87(Pt 1):52–8.