Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data

Malzahn, Dörthe; Friedrichs, Stefanie; Bickeböller, Heike

doi:10.1186/s12919-016-0042-9

Volume 10 Supplement 7

Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Proceedings.

Proceedings
Open access
Published: 18 October 2016

Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data

Dörthe Malzahn¹,
Stefanie Friedrichs¹ &
Heike Bickeböller¹

BMC Proceedings volume 10, Article number: 17 (2016) Cite this article

2274 Accesses
1 Citations
Metrics details

Abstract

We used our extension of the kernel score test to family data to analyze real and simulated baseline systolic blood pressure in extended pedigrees. We compared the power for different kernels and for different weightings of genetic markers. Moreover, we compared the power of rare and common markers with 3 strategies for joint testing and on marker panels with different densities. Marker weights had much greater influence on power than the kernel chosen. Inverse minor allele frequency weights often increased power on common markers but could decrease power on rare markers. Furthermore, defining the gene region based on linkage disequilibrium blocks often yielded robust power of joint tests of rare and common markers.

Background

The kernel score test is a global covariate-adjusted multilocus procedure that tests for overall association of sets of markers (see Schaid [1] for a review). This reduces the multiple-testing burden. Tested marker sets can, for example, belong to a pathway or candidate gene. The kernel score test can be applied to common and rare variants alike, as well as to data of genome-wide association studies (GWAS) or sequence data where it is named SKAT (sequence kernel association test). The kernel score test was developed for independent subjects [1]. Recent contributions by others and ourselves [2–6] extended the kernel score test to family data.

The kernel is chosen to describe genetic correlation among subjects. Different kernels have been suggested for genetic epidemiological applications. These kernels differ in whether marker–marker interactions are modeled and how complex the interaction effects may be. A frequent choice is to apply the kernel function on weighted minor allele dosage data (thus using an additive coding of minor allele effects). The dosage weights increase with decreasing minor allele frequency corresponding to the a priori assumption that less-frequent variants may have larger effects. Weighting allows rarer variants to contribute more to the overall test despite of their low frequencies.

With appropriate weighting, rare and common variants may be entered together into the kernel for joint testing. Recently however, Ionita-Laza et al. [7] proposed alternatives that can be more powerful. We explored these alternative joint tests on rare and common variants in the Genetic Analysis Workshop 19 (GAW19) family data. Moreover, we compared the power of different marker weights and kernels on sequence and GWAS panels. As we focused on genes, we also explored how size or positioning of a flanking region affects the test power.

Methods

Data

We analyzed baseline systolic blood pressure (SBP) and dosage data in the extended Mexican American pedigrees of the GAW19 family data, which are identical to the Genetic Analysis Workshop 18 data [8]. As before [6], we considered subjects with known baseline SBP and baseline diastolic blood pressure, sex, and age, who were not on blood pressure medication (real SBP: 706 subjects, excluding the first listed monozygotic twin of 2 observed twin pairs; simulated SBP: 740 to 781 subjects, numbers vary for 200 simulated study replicates because of inclusion criteria). For real SBP, we considered candidate gene AGTR1 [9] on chromosome (chr) 3 that tends to associate with SBP in the present family sample [6]. For simulated SBP, we selected from the simulation answers 5 strongly associated genes with various linkage disequilibrium (LD) structures: MAP4 (very homogeneous LD, chr3) and, in the order of increasing variability of LD, TNN (chr1), FLT3 (chr13), LEPR (chr1), and GSN (chr9). We used NCBI build 37, International Haplotype Map Project (HapMap) [10] reference data for Mexican Americans and the default algorithm in Haploview 4.2 [11] with a required fraction of strong LD of 0.7 and confidence interval limits of 0.5 and 0.8 to determine LD-blocks based on the D’ measure. Gene regions were defined as the LD-block(s) that contained the gene. For AGTR1, we also considered the region from the first to the last exonic position and flanking regions of 30 kb or 500 kb. For the same subjects, we used 2 single-nucleotide polymorphism (SNP) panels: sequence (allele dosage data) and GWAS (allele dosage data reduced to GWAS SNPs). Biallelic SNPs were included for testing if their Hardy-Weinberg equilibrium test p values were equal to or greater than 10⁻⁵ (rounding imputed dosages for this purpose only) and if at least 7 observations of the minor allele were present in the sample. The latter parallels minimum data requirements in parametric regression.

Kernel score test for family data

Here we briefly summarize our method introduced in [6], denoting vectors and matrices by bold letters. Baseline SBP is right-skewed distributed and was therefore rank-normalized by Blom transformation [12] to standard normally distributed target variables Y = (Y₁,…,Y_n). Y depend on fixed covariate effects b (intercept, age, sex, age × sex interaction), random effects c that adjust for familial polygenic background, a semiparametric model h(G) of genetic markers G, and regression residuals e ~ N(0,s² I) with residual variance s².

$$ \mathbf{Y}=\mathbf{X}{\mathbf{b}}^{\mathrm{T}}+\mathbf{Z}{\mathbf{c}}^{\mathrm{T}}+\mathbf{h}\left(\mathbf{G}\right)+\mathbf{e} $$

(1)

X, Z are the design matrices for fixed covariate effects and random family effects. h(G) = Ka ^T depends on a n × n dimensional kernel matrix K of genetic similarities between n subjects on markers G, and multivariate normally distributed random effects a ~ N(0,τK) [1]. One tests for a genetic covariance component τ.

The kernel score test is computed from restricted maximum likelihood parameter estimates of the genetic null model (where h(G) = 0). Thus, the null model estimates fixed covariate effects b _o, random pedigree effects c _o, the variance s² _fam of the polygenic familial component, and the residual variance s² _o. The null model was adjusted for polygenic familial background based on the kinship coefficient matrix Φ_kin = ZZ ^T using R-packages kinship2 and coxme with R-function lmekin. The kernel score test statistic is.

$$ \mathrm{Q} = {\mathbf{R}}^{\mathrm{T}}\mathbf{M}\mathbf{R} $$

(2)

R = P _o ^1/2 Y are standard normally distributed residuals and matrix M = (P _o ^1/2 K P _o ^1/2)/2 incorporates the kernel [6]. P _o = V _o ⁻¹–V _o ⁻¹ X(X ^T V _o ⁻¹ X) ⁻¹ X ^T V _o ⁻¹ is the null projection matrix with V _o = s² _o I + s² _fam ZZ ^T. The p values for test statistic (2) were calculated by Davies’ exact method [13] with the R package CompQuadForm from sample estimates Q and all eigenvalues of matrix M.

Kernels and single-nucleotide polymorphism weights

We applied all kernel functions on allele dosage data g _i, g _j (for pairs of subjects i, j) on N_SNP biallelic SNP markers. The kernel matrix entries are

$$ \mathrm{Linear}\ \mathrm{kernel}\kern0.5em {\mathbf{K}}_{\mathrm{i}\mathrm{j}}={{\mathbf{g}}_{\mathrm{i}}}^{\mathrm{T}}\mathbf{W}{\mathbf{g}}_{\mathrm{j}} $$

(3)

$$ \mathrm{Radial}\ \mathrm{basis}\ \mathrm{function}\ \left(\mathrm{R}\mathrm{B}\mathrm{F}\right)\ \mathrm{kernel}\kern0.5em {\mathbf{K}}_{\mathrm{i}\mathrm{j}} = \exp \left(-{\upmu^{-}}^1\cdotp {\left({\mathbf{g}}_{\mathrm{i}}-{\mathbf{g}}_{\mathrm{j}}\right)}^{\mathrm{T}}\mathbf{W}\left({\mathbf{g}}_{\mathrm{i}}-{\mathbf{g}}_{\mathrm{j}}\right)\right) $$

(4)

with diagonal weight matrix W. The linear kernel (3) does not allow for SNP interactions opposed to the RBF kernel (4), which yields polynomial models. Dosage weights are normed W _mm = f(ν_m)/∑_mf(ν_m) for any chosen SNP set m = 1,…,N_SNP and depend on the minor allele frequency (MAF) ν of the respective SNP. We considered: f(ν_m) = 1 (treating SNPs alike), f(ν_m) = 1/ν_m, as well as f(ν_m) = Beta(ν_m,1,25) for ν_m equal to or less than 5 % and f(ν_m) = Beta(ν_m,0.5,0.5) for ν_m greater than 5 % as suggested earlier [7]. Beta-density weights distinguish MAFs more moderately than 1/ν-weights. For the RBF kernel (4), the scale parameter μ was the average weighted squared genetic difference between subjects Σ_i,j((g _i-g _j)^T W(g _i-g _j))/n² multiplied by the effective number of independent SNPs in the tested set [14].

Strategies for combined testing of common and rare variants

By default, the kernel score test, Eq. (2), is performed with a kernel matrix K _all computed on all dosages with a weighting of common and rare SNPs.

In contrast, Ionita-Laza et al. [7] recently suggested computing the kernel separately for rare SNPs (K _rare) and for common SNPs (K _common), respectively, in a region of interest. Analogous to Eq. (2), this yields matrices M _rare, M _common, test statistics Q_rare, Q_common, and p values p _rare, p _common. The null model, P _o and R were always the same. The weighted sum test (WS) on common and rare variants has test statistic [7].

$$ {\mathrm{Q}}_{\mathrm{WS}} = \left(1\hbox{--} \upvarphi \right)\cdot {\mathrm{Q}}_{\mathrm{rare}} + \upvarphi \cdot {\mathrm{Q}}_{\mathrm{common}} $$

(5)

Weight φ = (tr(M _rare ∙M _rare)/(tr(M _rare ∙M _rare) + tr(M _common ∙M _common)))^1/2 may be chosen such that (1 − φ)∙Q_rare and φ∙Q_common have the same variance. P values are obtained by Davies’ exact method from sample estimates Q_WS and all eigenvalues of matrix ((1 − φ)∙M _rare + φ∙M _common). Alternatively, Fishers p value pooling can be applied.

$$ {\mathrm{Q}}_{\mathrm{FISHER}}=-2 \ln \left({p}_{\mathrm{rare}}\right)-2 \ln \left({p}_{\mathrm{common}}\right) $$

(6)

Under H₀, Q_FISHER/(1 + 0.25∙cov) is chi-square distributed with 16/(4 + cov) degrees of freedom [7]. With r = tr(M _rare∙M _common)/(tr(M _rare∙M _rare)∙tr(M _common∙M _common))^1/2, the covariance between p _rare and p _common is cov ≈ r∙(3.25 + 0.75∙r) for 0 ≤ r ≤1 and cov ≈ r∙(3.27 + 0.71∙r) for −0.5 ≤ r ≤0. Only test statistic (6) yields approximate p values; all other p values are obtained with Davies’ method and are exact.

Results and discussion

Our test extension to families holds the nominal significance level and correctly adjusts for a polygenic familial variance component (as demonstrated in [6]). Table 1 lists the p values obtained for association testing of AGTR1 on real SBP, considering common SNPs (MAF >5 %) and rare SNPs (MAF ≤5 %) as well as 3 joint tests (default test K _all, WS, Fisher). Beta-weights (not shown) performed between equal weights and 1/ν-weights. The 1/ν-weight lowered p values particularly on common SNPs. AGTR1 association is suggested by common as well as rare SNPs. Joint testing of rare and common SNPs was beneficial. In particular, WS and Fisher test p values were often smaller (and otherwise close to) the smallest p value of the separate rare and common SNP tests. When using ad hoc definitions of the AGTR1 flanking region, Fisher and WS p values remained relatively stable and were also smaller compared to the default test K _all. However, on the AGTR1 containing LD-block all joint tests performed highly similar, p values were the smallest and also relatively stable regardless of SNP weights and SNP density.

Table 1 Analysis of real data: real SBP and candidate gene AGTR1

Full size table

Next, we analyzed LD-blocks that contain the genes MAP4, TNN, LEPR, GSN, or FLT3. Figure 1 displays the average test power on 200 data replicates of simulated SBP. Sequence-derived variants were often more powerful than GWAS with some exceptions (Fig. 1 left and middle panels, black solid lines vs. gray dashed lines). The best were often 1/ν-weights (circle), otherwise equal weights (diamond) were favored. Particularly 1/ν-weights may be beneficial on common SNPs (LEPR) and occasionally detrimental on rare SNPs (MAP4). The latter is an exceptional finding but consistent with Table 1 on candidate gene AGTR1. On rare MAP4 SNPs, 1/ν-weights lowered the power, especially when testing also extremely rare SNPs (encircled plus), but less so when testing only MAF equal to or less than 5 % SNPs that had at least 7 observations of the minor allele (filled circle; sequence data). On gene-containing LD-blocks, all joint tests (default test K _all, WS, Fisher) often had similar power (Fig. 1, right panel: LEPR, FLT3, TNN with highly similar results [only TNN shown]; GSN sequence). However, default test K _all was the most powerful test on the gene with homogeneous strong LD (MAP4: sequence [Fig. 1, right] and GWAS [not shown]) and on the gene with the most variable LD structure (GSN: when using GWAS SNPs, not shown). Then, K _all likely exploited SNP correlations better. When LD-blocks were enlarged by flanking regions, WS and Fisher often were slightly more powerful than K _all (results not shown). The linear kernel had always similar or better power than the RBF kernel (results not shown).

Conclusions

As the power of kernel methods increases through the exploitation of SNP correlations [2], this ability should be utilized fully by analyzing LD-blocks. SNP weights have a far greater impact on test power than the kernel chosen. Currently, the benefit of 1/ν-weights may be underestimated for common SNPs. On rare SNPs, 1/ν-weights often improve power, but can also be detrimental. Findings are consistent with both real and simulated data. Our results suggest using 1/ν-weights on all SNPs in a single kernel K _all testing LD-blocks and only SNPs with sufficient minor allele observations. Alternatively, one may use WS with 1/ν-weights on common SNPs and equal weights on rare SNPs in the kernels. WS upweights the rare variant contribution globally; see Eq. (5).

References

Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered. 2010;70(2):109–31.
Article PubMed Google Scholar
Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser P, Lin X. SNP set association analysis for familial data. Genet Epidemiol. 2012;36(8):797–810.
PubMed PubMed Central Google Scholar
Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37(2):196–204.
Article PubMed Google Scholar
Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, Richards JB, Ciampi A, Greenwood CM. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol. 2013;37(4):366–76.
Article PubMed Google Scholar
Huang J, Chen Y, Swartz MD, Ionita-Laza I. Comparing the power of family-based association test for sequence data with applications in the GAW18 simulated data. BMC Proc. 2014;8 Suppl 1:S27.
Article PubMed PubMed Central Google Scholar
Malzahn D, Friedrichs S, Rosenberger A, Bickeböller H. Kernel score statistic for dependent data. BMC Proc. 2014;8 Suppl 1:S41.
Article PubMed PubMed Central Google Scholar
Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92(6):841–53.
Article CAS PubMed PubMed Central Google Scholar
Almasy L, Dyer TD, Peralta JM, Jun G, Wood AR, Fuchsberger C, Almeida MA, Kent Jr JW, Fowler S, Blackwell TW, et al. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014;8 Suppl 1:S2.
Article PubMed PubMed Central Google Scholar
Baudin B. Polymorphism in angiotensin II receptor genes and hypertension. Exp Physiol. 2005;90(3):277–82.
Article CAS PubMed Google Scholar
The International HapMap Consortium. The International HapMap project. Nature. 2003;426(6968):789–96.
Article Google Scholar
Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263–5.
Article CAS PubMed Google Scholar
Blom G. Statistical estimates and transformed beta variables. New York: John Wiley & Sons; 1958.
Google Scholar
Davies RB. Algorithm AS 155: the distribution of a linear combination of chi-2 random variables. J R Stat Soc: Ser C: Appl Stat. 1980;29(3):323–33.
Google Scholar
Cheverud JM. A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001;87(Pt 1):52–8.
Article CAS Google Scholar

Download references

Acknowledgements

This work was supported by the Deutsche Forschungsgemeinschaft DFG (grant Klinische Forschergruppe [KFO] 241: TP5, BI 576/5-1; grant Research Training Group “Scaling Problems in Statistics” RTG 1644).

Declarations

This article has been published as part of BMC Proceedings Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.

Authors’ contributions

Authors contributed as follows: study concept, DM and HB; data extraction and analysis, DM and SF; SNP mapping with NCBI build 37 and LD calculations, SF; and writing of the manuscript, DM. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073, Göttingen, Germany
Dörthe Malzahn, Stefanie Friedrichs & Heike Bickeböller

Authors

Dörthe Malzahn
View author publications
You can also search for this author in PubMed Google Scholar
Stefanie Friedrichs
View author publications
You can also search for this author in PubMed Google Scholar
Heike Bickeböller
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dörthe Malzahn.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Malzahn, D., Friedrichs, S. & Bickeböller, H. Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data. BMC Proc 10 (Suppl 7), 17 (2016). https://doi.org/10.1186/s12919-016-0042-9

Download citation

Published: 18 October 2016
DOI: https://doi.org/10.1186/s12919-016-0042-9

Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Proceedings.

Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data

Abstract

Background

Methods

Data

Kernel score test for family data

Kernels and single-nucleotide polymorphism weights

Strategies for combined testing of common and rare variants

Results and discussion

Conclusions

References

Acknowledgements

Declarations

Authors’ contributions

Competing interests

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

BMC Proceedings

Contact us

Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Proceedings.

Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data

Abstract

Background

Methods

Data

Kernel score test for family data

Kernels and single-nucleotide polymorphism weights

Strategies for combined testing of common and rare variants

Results and discussion

Conclusions

References

Acknowledgements

Declarations

Authors’ contributions

Competing interests

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Proceedings

Contact us