- Proceedings
- Open access
- Published:

# Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data

*BMC Proceedings*
**volume 10**, Article number: 17 (2016)

## Abstract

We used our extension of the kernel score test to family data to analyze real and simulated baseline systolic blood pressure in extended pedigrees. We compared the power for different kernels and for different weightings of genetic markers. Moreover, we compared the power of rare and common markers with 3 strategies for joint testing and on marker panels with different densities. Marker weights had much greater influence on power than the kernel chosen. Inverse minor allele frequency weights often increased power on common markers but could decrease power on rare markers. Furthermore, defining the gene region based on linkage disequilibrium blocks often yielded robust power of joint tests of rare and common markers.

## Background

The kernel score test is a global covariate-adjusted multilocus procedure that tests for overall association of sets of markers (see Schaid [1] for a review). This reduces the multiple-testing burden. Tested marker sets can, for example, belong to a pathway or candidate gene. The kernel score test can be applied to common and rare variants alike, as well as to data of genome-wide association studies (GWAS) or sequence data where it is named SKAT (sequence kernel association test). The kernel score test was developed for independent subjects [1]. Recent contributions by others and ourselves [2–6] extended the kernel score test to family data.

The kernel is chosen to describe genetic correlation among subjects. Different kernels have been suggested for genetic epidemiological applications. These kernels differ in whether marker–marker interactions are modeled and how complex the interaction effects may be. A frequent choice is to apply the kernel function on weighted minor allele dosage data (thus using an additive coding of minor allele effects). The dosage weights increase with decreasing minor allele frequency corresponding to the *a priori* assumption that less-frequent variants may have larger effects. Weighting allows rarer variants to contribute more to the overall test despite of their low frequencies.

With appropriate weighting, rare and common variants may be entered together into the kernel for joint testing. Recently however, Ionita-Laza et al. [7] proposed alternatives that can be more powerful. We explored these alternative joint tests on rare and common variants in the Genetic Analysis Workshop 19 (GAW19) family data. Moreover, we compared the power of different marker weights and kernels on sequence and GWAS panels. As we focused on genes, we also explored how size or positioning of a flanking region affects the test power.

## Methods

### Data

We analyzed baseline systolic blood pressure (SBP) and dosage data in the extended Mexican American pedigrees of the GAW19 family data, which are identical to the Genetic Analysis Workshop 18 data [8]. As before [6], we considered subjects with known baseline SBP and baseline diastolic blood pressure, sex, and age, who were not on blood pressure medication (real SBP: 706 subjects, excluding the first listed monozygotic twin of 2 observed twin pairs; simulated SBP: 740 to 781 subjects, numbers vary for 200 simulated study replicates because of inclusion criteria). For real SBP, we considered candidate gene *AGTR1* [9] on chromosome (chr) 3 that tends to associate with SBP in the present family sample [6]. For simulated SBP, we selected from the simulation answers 5 strongly associated genes with various linkage disequilibrium (LD) structures: *MAP4* (very homogeneous LD, chr3) and, in the order of increasing variability of LD, *TNN* (chr1), *FLT3* (chr13), *LEPR* (chr1), and *GSN* (chr9). We used NCBI build 37, International Haplotype Map Project (HapMap) [10] reference data for Mexican Americans and the default algorithm in Haploview 4.2 [11] with a required fraction of strong LD of 0.7 and confidence interval limits of 0.5 and 0.8 to determine LD-blocks based on the D’ measure. Gene regions were defined as the LD-block(s) that contained the gene. For *AGTR1*, we also considered the region from the first to the last exonic position and flanking regions of 30 kb or 500 kb. For the same subjects, we used 2 single-nucleotide polymorphism (SNP) panels: sequence (allele dosage data) and GWAS (allele dosage data reduced to GWAS SNPs). Biallelic SNPs were included for testing if their Hardy-Weinberg equilibrium test *p* values were equal to or greater than 10^{−5} (rounding imputed dosages for this purpose only) and if at least 7 observations of the minor allele were present in the sample. The latter parallels minimum data requirements in parametric regression.

### Kernel score test for family data

Here we briefly summarize our method introduced in [6], denoting vectors and matrices by bold letters. Baseline SBP is right-skewed distributed and was therefore rank-normalized by Blom transformation [12] to standard normally distributed target variables **Y** = (Y_{1},…,Y_{n}). **Y** depend on fixed covariate effects **b** (intercept, age, sex, age × sex interaction), random effects **c** that adjust for familial polygenic background, a semiparametric model **h**(**G**) of genetic markers **G**, and regression residuals **e** ~ *N*(0,s^{2}
**I**) with residual variance s^{2}.

**X**, **Z** are the design matrices for fixed covariate effects and random family effects. **h**(**G**) = **Ka**
^{T} depends on a *n* × *n* dimensional kernel matrix **K** of genetic similarities between *n* subjects on markers **G**, and multivariate normally distributed random effects **a** ~ *N*(0,τ**K**) [1]. One tests for a genetic covariance component τ.

The kernel score test is computed from restricted maximum likelihood parameter estimates of the genetic null model (where **h**(**G**) = **0**). Thus, the null model estimates fixed covariate effects **b**
_{
o
}, random pedigree effects **c**
_{
o
}, the variance s^{2}
_{fam} of the polygenic familial component, and the residual variance s^{2}
_{o}. The null model was adjusted for polygenic familial background based on the kinship coefficient matrix Φ_{kin} = **ZZ**
^{T} using R-packages kinship2 and coxme with R-function lmekin. The kernel score test statistic is.

**R** = **P**
_{o}
^{1/2}
**Y** are standard normally distributed residuals and matrix **M =** (**P**
_{o}
^{1/2}
**K P**
_{o}
^{1/2})/2 incorporates the kernel [6]. **P**
_{
o
}
**= V**
_{
o
}
^{−1}–**V**
_{
o
}
^{−1}
**X(X**
^{T}
**V**
_{
o
}
^{−1}
**X)**
^{−1}
**X**
^{T}
**V**
_{
o
}
^{−1} is the null projection matrix with **V**
_{o} = s^{2}
_{o}
**I** + s^{2}
_{fam}
**ZZ**
^{T}. The *p* values for test statistic (2) were calculated by Davies’ exact method [13] with the R package CompQuadForm from sample estimates Q and all eigenvalues of matrix **M**.

### Kernels and single-nucleotide polymorphism weights

We applied all kernel functions on allele dosage data **g**
_{i,}
**g**
_{j} (for pairs of subjects i, j) on N_{SNP} biallelic SNP markers. The kernel matrix entries are

with diagonal weight matrix **W**. The linear kernel (3) does not allow for SNP interactions opposed to the RBF kernel (4), which yields polynomial models. Dosage weights are normed **W**
_{mm} = f(ν_{m})/∑_{m}f(ν_{m}) for any chosen SNP set m = 1,…,N_{SNP} and depend on the minor allele frequency (MAF) ν of the respective SNP. We considered: f(ν_{m}) = 1 (treating SNPs alike), f(ν_{m}) = 1/ν_{m}, as well as f(ν_{m}) = *Beta*(ν_{m},1,25) for ν_{m} equal to or less than 5 % and f(ν_{m}) = *Beta*(ν_{m},0.5,0.5) for ν_{m} greater than 5 % as suggested earlier [7]. *Beta*-density weights distinguish MAFs more moderately than 1/ν-weights. For the RBF kernel (4), the scale parameter μ was the average weighted squared genetic difference between subjects Σ_{i,j}((**g**
_{i}-**g**
_{j})^{T}
**W**(**g**
_{i}-**g**
_{j}))/n^{2} multiplied by the effective number of independent SNPs in the tested set [14].

### Strategies for combined testing of common and rare variants

By default, the kernel score test, Eq. (2), is performed with a kernel matrix **K**
_{
all
} computed on all dosages with a weighting of common and rare SNPs.

In contrast, Ionita-Laza et al. [7] recently suggested computing the kernel separately for rare SNPs (**K**
_{
rare
}) and for common SNPs (**K**
_{
common
}), respectively, in a region of interest. Analogous to Eq. (2), this yields matrices **M**
_{
rare
}, **M**
_{
common
}, test statistics Q_{rare}, Q_{common}, and *p* values *p*
_{rare}, *p*
_{common}. The null model, **P**
_{
o
} and **R** were always the same. The weighted sum test (WS) on common and rare variants has test statistic [7].

Weight φ = (tr(**M**
_{
rare
}
**∙M**
_{
rare
})/(tr(**M**
_{
rare
}
**∙M**
_{
rare
}) + tr(**M**
_{
common
}
**∙M**
_{
common
})))^{1/2} may be chosen such that (1 − φ)∙Q_{rare} and φ∙Q_{common} have the same variance. *P* values are obtained by Davies’ exact method from sample estimates Q_{WS} and all eigenvalues of matrix ((1 − φ)∙**M**
_{
rare
} + φ∙**M**
_{
common
}). Alternatively, Fishers *p* value pooling can be applied.

Under H_{0}, Q_{FISHER}/(1 + 0.25∙*cov*) is chi-square distributed with 16/(4 + *cov*) degrees of freedom [7]. With r = tr(**M**
_{
rare
}∙**M**
_{
common
})/(tr(**M**
_{
rare
}∙**M**
_{
rare
})∙tr(**M**
_{
common
}∙**M**
_{
common
}))^{1/2}, the covariance between *p*
_{rare} and *p*
_{common} is *cov* ≈ r∙(3.25 + 0.75∙r) for 0 ≤ r ≤1 and *cov* ≈ r∙(3.27 + 0.71∙r) for −0.5 ≤ r ≤0. Only test statistic (6) yields approximate *p* values; all other *p* values are obtained with Davies’ method and are exact.

## Results and discussion

Our test extension to families holds the nominal significance level and correctly adjusts for a polygenic familial variance component (as demonstrated in [6]). Table 1 lists the *p* values obtained for association testing of *AGTR1* on real SBP, considering common SNPs (MAF >5 %) and rare SNPs (MAF ≤5 %) as well as 3 joint tests (default test **K**
_{
all
}, WS, Fisher). *Beta*-weights (not shown) performed between equal weights and 1/ν-weights. The 1/ν-weight lowered *p* values particularly on common SNPs. *AGTR1* association is suggested by common as well as rare SNPs. Joint testing of rare and common SNPs was beneficial. In particular, WS and Fisher test *p* values were often smaller (and otherwise close to) the smallest *p* value of the separate rare and common SNP tests. When using ad hoc definitions of the *AGTR1* flanking region, Fisher and WS *p* values remained relatively stable and were also smaller compared to the default test **K**
_{
all
}. However, on the *AGTR1* containing LD-block all joint tests performed highly similar, *p* values were the smallest and also relatively stable regardless of SNP weights and SNP density.

Next, we analyzed LD-blocks that contain the genes *MAP4*, *TNN*, *LEPR*, *GSN*, or *FLT3*. Figure 1 displays the average test power on 200 data replicates of simulated SBP. Sequence-derived variants were often more powerful than GWAS with some exceptions (Fig. 1 left and middle panels, black solid lines vs. gray dashed lines). The best were often 1/ν-weights (circle), otherwise equal weights (diamond) were favored. Particularly 1/ν-weights may be beneficial on common SNPs *(LEPR)* and occasionally detrimental on rare SNPs *(MAP4)*. The latter is an exceptional finding but consistent with Table 1 on candidate gene *AGTR1*. On rare *MAP4* SNPs, 1/ν-weights lowered the power, especially when testing also extremely rare SNPs (encircled plus), but less so when testing only MAF equal to or less than 5 % SNPs that had at least 7 observations of the minor allele (filled circle; sequence data). On gene-containing LD-blocks, all joint tests (default test **K**
_{
all
}, WS, Fisher) often had similar power (Fig. 1, right panel: *LEPR*, *FLT3*, *TNN* with highly similar results [only *TNN* shown]; *GSN* sequence). However, default test **K**
_{
all
} was the most powerful test on the gene with homogeneous strong LD (*MAP4*: sequence [Fig. 1, right] and GWAS [not shown]) and on the gene with the most variable LD structure (*GSN*: when using GWAS SNPs, not shown). Then, **K**
_{
all
} likely exploited SNP correlations better. When LD-blocks were enlarged by flanking regions, WS and Fisher often were slightly more powerful than **K**
_{
all
} (results not shown). The linear kernel had always similar or better power than the RBF kernel (results not shown).

## Conclusions

As the power of kernel methods increases through the exploitation of SNP correlations [2], this ability should be utilized fully by analyzing LD-blocks. SNP weights have a far greater impact on test power than the kernel chosen. Currently, the benefit of 1/ν-weights may be underestimated for common SNPs. On rare SNPs, 1/ν-weights often improve power, but can also be detrimental. Findings are consistent with both real and simulated data. Our results suggest using 1/ν-weights on all SNPs in a single kernel **K**
_{
all
} testing LD-blocks and only SNPs with sufficient minor allele observations. Alternatively, one may use WS with 1/ν-weights on common SNPs and equal weights on rare SNPs in the kernels. WS upweights the rare variant contribution globally; see Eq. (5).

## References

Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered. 2010;70(2):109–31.

Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser P, Lin X. SNP set association analysis for familial data. Genet Epidemiol. 2012;36(8):797–810.

Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37(2):196–204.

Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, Richards JB, Ciampi A, Greenwood CM. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol. 2013;37(4):366–76.

Huang J, Chen Y, Swartz MD, Ionita-Laza I. Comparing the power of family-based association test for sequence data with applications in the GAW18 simulated data. BMC Proc. 2014;8 Suppl 1:S27.

Malzahn D, Friedrichs S, Rosenberger A, Bickeböller H. Kernel score statistic for dependent data. BMC Proc. 2014;8 Suppl 1:S41.

Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92(6):841–53.

Almasy L, Dyer TD, Peralta JM, Jun G, Wood AR, Fuchsberger C, Almeida MA, Kent Jr JW, Fowler S, Blackwell TW, et al. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014;8 Suppl 1:S2.

Baudin B. Polymorphism in angiotensin II receptor genes and hypertension. Exp Physiol. 2005;90(3):277–82.

The International HapMap Consortium. The International HapMap project. Nature. 2003;426(6968):789–96.

Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263–5.

Blom G. Statistical estimates and transformed beta variables. New York: John Wiley & Sons; 1958.

Davies RB. Algorithm AS 155: the distribution of a linear combination of chi-2 random variables. J R Stat Soc: Ser C: Appl Stat. 1980;29(3):323–33.

Cheverud JM. A simple correction for multiple comparisons in interval mapping genome scans. Heredity (Edinb). 2001;87(Pt 1):52–8.

## Acknowledgements

This work was supported by the Deutsche Forschungsgemeinschaft DFG (grant Klinische Forschergruppe [KFO] 241: TP5, BI 576/5-1; grant Research Training Group “Scaling Problems in Statistics” RTG 1644).

### Declarations

This article has been published as part of *BMC Proceedings* Volume 10 Supplement 7, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at http://bmcproc.biomedcentral.com/articles/supplements/volume-10-supplement-7. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.

### Authors’ contributions

Authors contributed as follows: study concept, DM and HB; data extraction and analysis, DM and SF; SNP mapping with NCBI build 37 and LD calculations, SF; and writing of the manuscript, DM. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

## Author information

### Authors and Affiliations

### Corresponding author

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Cite this article

Malzahn, D., Friedrichs, S. & Bickeböller, H. Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data.
*BMC Proc* **10**
(Suppl 7), 17 (2016). https://doi.org/10.1186/s12919-016-0042-9

Published:

DOI: https://doi.org/10.1186/s12919-016-0042-9