Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data

We used our extension of the kernel score test to family data to analyze real and simulated baseline systolic blood pressure in extended pedigrees. We compared the power for different kernels and for different weightings of genetic markers. Moreover, we compared the power of rare and common markers with 3 strategies for joint testing and on marker panels with different densities. Marker weights had much greater influence on power than the kernel chosen. Inverse minor allele frequency weights often increased power on common markers but could decrease power on rare markers. Furthermore, defining the gene region based on linkage disequilibrium blocks often yielded robust power of joint tests of rare and common markers.


Background
The kernel score test is a global covariate-adjusted multilocus procedure that tests for overall association of sets of markers (see Schaid [1] for a review). This reduces the multiple-testing burden. Tested marker sets can, for example, belong to a pathway or candidate gene. The kernel score test can be applied to common and rare variants alike, as well as to data of genome-wide association studies (GWAS) or sequence data where it is named SKAT (sequence kernel association test). The kernel score test was developed for independent subjects [1]. Recent contributions by others and ourselves [2][3][4][5][6] extended the kernel score test to family data.
The kernel is chosen to describe genetic correlation among subjects. Different kernels have been suggested for genetic epidemiological applications. These kernels differ in whether marker-marker interactions are modeled and how complex the interaction effects may be. A frequent choice is to apply the kernel function on weighted minor allele dosage data (thus using an additive coding of minor allele effects). The dosage weights increase with decreasing minor allele frequency corresponding to the a priori assumption that lessfrequent variants may have larger effects. Weighting allows rarer variants to contribute more to the overall test despite of their low frequencies.
With appropriate weighting, rare and common variants may be entered together into the kernel for joint testing. Recently however, Ionita-Laza et al. [7] proposed alternatives that can be more powerful. We explored these alternative joint tests on rare and common variants in the Genetic Analysis Workshop 19 (GAW19) family data. Moreover, we compared the power of different marker weights and kernels on sequence and GWAS panels. As we focused on genes, we also explored how size or positioning of a flanking region affects the test power.

Data
We analyzed baseline systolic blood pressure (SBP) and dosage data in the extended Mexican American pedigrees of the GAW19 family data, which are identical to the Genetic Analysis Workshop 18 data [8]. As before [6], we considered subjects with known baseline SBP and baseline diastolic blood pressure, sex, and age, who were not on blood pressure medication (real SBP: 706 subjects, excluding the first listed monozygotic twin of 2 observed twin pairs; simulated SBP: 740 to 781 subjects, numbers vary for 200 simulated study replicates because of inclusion criteria). For real SBP, we considered candidate gene AGTR1 [9] on chromosome (chr) 3 that tends to associate with SBP in the present family sample [6]. For simulated SBP, we selected from the simulation answers 5 strongly associated genes with various linkage disequilibrium (LD) structures: MAP4 (very homogeneous LD, chr3) and, in the order of increasing variability of LD, TNN (chr1), FLT3 (chr13), LEPR (chr1), and GSN (chr9). We used NCBI build 37, International Haplotype Map Project (HapMap) [10] reference data for Mexican Americans and the default algorithm in Haploview 4.2 [11] with a required fraction of strong LD of 0.7 and confidence interval limits of 0.5 and 0.8 to determine LD-blocks based on the D' measure. Gene regions were defined as the LD-block(s) that contained the gene. For AGTR1, we also considered the region from the first to the last exonic position and flanking regions of 30 kb or 500 kb. For the same subjects, we used 2 single-nucleotide polymorphism (SNP) panels: sequence (allele dosage data) and GWAS (allele dosage data reduced to GWAS SNPs). Biallelic SNPs were included for testing if their Hardy-Weinberg equilibrium test p values were equal to or greater than 10 −5 (rounding imputed dosages for this purpose only) and if at least 7 observations of the minor allele were present in the sample. The latter parallels minimum data requirements in parametric regression.

Kernel score test for family data
Here we briefly summarize our method introduced in [6], denoting vectors and matrices by bold letters. Baseline SBP is right-skewed distributed and was therefore rank-normalized by Blom transformation [12] to standard normally distributed target variables Y = (Y 1 ,…,Y n ). Y depend on fixed covariate effects b (intercept, age, sex, age × sex interaction), random effects c that adjust for familial polygenic background, a semiparametric model h(G) of genetic markers G, and regression residuals eÑ (0,s 2 I) with residual variance s 2 .
X, Z are the design matrices for fixed covariate effects and random family effects. h(G) = Ka T depends on a n × n dimensional kernel matrix K of genetic similarities between n subjects on markers G, and multivariate normally distributed random effects a~N(0,τK) [1]. One tests for a genetic covariance component τ.
The kernel score test is computed from restricted maximum likelihood parameter estimates of the genetic null model (where h(G) = 0). Thus, the null model estimates fixed covariate effects b o , random pedigree effects c o , the variance s 2 fam of the polygenic familial component, and the residual variance s 2 o . The null model was adjusted for polygenic familial background based on the kinship coefficient matrix Φ kin = ZZ T using R-packages kinship2 and coxme with R-function lmekin. The kernel score test statistic is.
The p values for test statistic (2) were calculated by Davies' exact method [13] with the R package CompQuadForm from sample estimates Q and all eigenvalues of matrix M.

Kernels and single-nucleotide polymorphism weights
We applied all kernel functions on allele dosage data g i, g j (for pairs of subjects i, j) on N SNP biallelic SNP markers. The kernel matrix entries are Radial basis function RBF ð with diagonal weight matrix W. The linear kernel (3) does not allow for SNP interactions opposed to the RBF kernel (4), which yields polynomial models. Dosage weights are normed W mm = f(ν m )/∑ m f(ν m ) for any chosen SNP set m = 1,…,N SNP and depend on the minor allele frequency (MAF) ν of the respective SNP. We considered: f(ν m ) = 1 (treating SNPs alike), f(ν m ) = 1/ν m , as well as f(ν m ) = Beta(ν m ,1,25) for ν m equal to or less than 5 % and f(ν m ) = Beta(ν m ,0.5,0.5) for ν m greater than 5 % as suggested earlier [7]. Beta-density weights distinguish MAFs more moderately than 1/ν-weights. For the RBF kernel (4), the scale parameter μ was the average weighted squared genetic difference between subjects Σ i,j ((g i -g j ) T W(g i -g j ))/n 2 multiplied by the effective number of independent SNPs in the tested set [14].

Strategies for combined testing of common and rare variants
By default, the kernel score test, Eq. (2), is performed with a kernel matrix K all computed on all dosages with a weighting of common and rare SNPs. In contrast, Ionita-Laza et al. [7] recently suggested computing the kernel separately for rare SNPs (K rare ) and for common SNPs (K common ), respectively, in a region of interest. Analogous to Eq. (2), this yields matrices M rare , M common , test statistics Q rare , Q common , and p values p rare , p common . The null model, P o and R were always the same. The weighted sum test (WS) on common and rare variants has test statistic [7].
Weight φ = (tr(M rare •M rare )/(tr(M rare •M rare ) + tr (M common •M common ))) 1/2 may be chosen such that (1 − φ)• Q rare and φ•Q common have the same variance. P values are obtained by Davies' exact method from sample estimates Q WS and all eigenvalues of matrix ((1 − φ)•M rare + φ•M common ). Alternatively, Fishers p value pooling can be applied.

Results and discussion
Our test extension to families holds the nominal significance level and correctly adjusts for a polygenic familial variance component (as demonstrated in [6]). Table 1 lists the p values obtained for association testing of AGTR1 on real SBP, considering common SNPs (MAF >5 %) and rare SNPs (MAF ≤5 %) as well as 3 joint tests (default test K all , WS, Fisher). Beta-weights (not shown) performed between equal weights and 1/ν-weights. The 1/ν-weight lowered p values particularly on common SNPs. AGTR1 association is suggested by common as well as rare SNPs. Joint testing of rare and common SNPs was beneficial. In particular, WS and Fisher test p values were often smaller (and otherwise close to) the smallest p value of the separate rare and common SNP tests. When using ad hoc definitions of the AGTR1 flanking region, Fisher and WS p values remained relatively stable and were also smaller compared to the default test K all . However, on the AGTR1 containing LD-block all joint tests performed highly similar, p values were the smallest and also relatively stable regardless of SNP weights and SNP density.