Fast genome-wide pedigree quantitative trait loci analysis using MENDEL

The linkage era left a rich legacy of pedigree samples that can be used for modern genome-wide association sequencing (GWAS) or next-generation sequencing (NGS) studies. Family designs are naturally equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Unfortunately, pedigree likelihoods are notoriously hard to compute, and current software for association mapping in pedigrees is prohibitively slow in processing dense marker maps. In a recent release of the comprehensive genetic analysis software MENDEL, we implemented an ultra-fast score test for association mapping with pedigree-based GWAS or NGS study data. Our implementation (a) works for random sample data, pedigree data, or a mix of both;(b) allows for covariate adjustment, including correction for population stratification;(c) accommodates both univariate and multivariate quantitative traits; and (d) allows missing values in multivariate traits. In this paper, we assess the capabilities of MENDEL on the Genetic Analysis Workshop 18 sequencing data. For instance, when jointly testing the 4 longitudinally measured diastolic blood pressure traits, it takes MENDEL less than 51 minutes on a standard laptop computer to read, quality check, and analyze a data set with 959 individuals and 8.3 million single-nucleotide polymorphisms (SNPs). Our analysis reveals association of one SNP in the q32.2 region of chromosome 1. MENDEL is freely available on http://www.genetics.ucla.edu/software.


Background
Pedigree data are attractive in modern association studies because they permit control of population substructure and study of parent-of-origin effects [1]. Related affecteds are also more likely to share the same disease-predisposing gene than unrelated affecteds. The classical variance component model has been a powerful tool for mapping quantitative trait loci in pedigrees [2].Polygenic effects are effectively captured by the kinship coefficient matrix as a variance component. In genome-wide association sequencing (GWAS), two alleles of a single nucleotide polymorphism (SNP) shift trait means and can be tested as a fixed effect. However, fitting a variance component model with pedigrees is computationally challenging, especially when it has to be done for a huge number of markers.We reexamine the computational bottlenecks and implement an ultra-fast score test when pedigree structure is explicitly given. Score tests require no additional iteration under the alternative model.All that is needed is evaluation of a quadratic form combining the score vector and the expected information matrix at the maximum likelihood estimates under the null model. Fast pedigree GWAS is now implemented in our software package MENDEL [3] for easy use by the genetics community. In this paper, we demonstrate the capabilities of MENDEL on the Genetic Analysis Workshop 18 (GAW18) sequencing data.

Methods
Quantitative trait locus (QTL) association mapping typically invokes the multivariate Gaussian distribution to model the observed trait values y = (y i ) over a pedigree. The standard model (2, Chapter 8) collects the corresponding means into a vector ν and the corresponding covariances into a matrix and represents the loglikelihood of a pedigree as L = − where the covariance matrix is typically parameterized as Here the variance component is the global kinship coefficient matrix capturing additive polygenic effects, and 7 is a condensed identity coefficient matrix capturing dominance genetic effects. The household effect matrix H has entries h ij = 1 if individuals i and j are in the same household and 0 otherwise. Individual environmental contributions and trait measurement errors are incorporated via the identity matrix I. When one tests multiple traits, the covariance matrix has to be properly augmented by matrix Kronecker products. QTL fixed effects are captured through the mean component v = Aβ for some predictor matrix A and vector of regression coefficients β.
To implement likelihood ratio testing, iterative maximum likelihood estimation must be undertaken for each and every SNP under the alternative hypothesis. This unfortunate requirement is the major stumbling block retarding pedigree analysis. Score tests serve as convenient substitutes for likelihood ratio tests. A careful analysis shows that the basic elements of the score statistic can be quickly assembled. In MENDEL [3], SNPs with the most impressive score test p-values (top 50 by default) are further tested by the more accurate likelihood ratio method, thus achieving a good compromise of speed and power for large-scale QTL analysis.

Data description
Our analysis is based on the genotype calls for 959 individuals (464 directly sequenced and the rest imputed) provided in the chrX-geno.csv.gz files. Simulated traits in all 200 replicates (SIMPHEN.1-200) were used for size and power studies in the first example. The second example presents results from a pedigree GWAS performed on chromosome 3 using the traits in the first simulation replicate (SIMPHEN.1). A whole genome QTL analysis for the real phenotype diastolic blood pressure (DBP) is presented in the final example.

Adjustment for environmental effects
Both the traits (blood pressures) and some environmental factors are measured (or simulated) on study individuals at 3 or 4 visits. To adjust for the environmental effects of Age, BPMed, Smoke, and Sex, we model the systolic blood pressure (SBP) by a linear mixed model (LMM): where i indexes individuals, t indexes 3 time points, β s are the fixed effects, μ i is an individual level random intercept assumed to be normal with covariance cov μ i , μ j = 2ϕ ij , and ε i,t are independent standard normal errors. If we stack the traits SBP i,t into a column, this corresponds to a variance component model with a genetic component 2σ 2 g (1 3 where is the kinship coefficient matrix, and an environmental component σ 2 e I 3n . LMM is fitted by maximum likelihood (ML).
The estimated fixed effects for traits in simulation replicate 1 are summarized in Table 1. Estimates under the linear model (LM) are listed for comparison. Results from LMM imply significant additive genetic effects. The estimated heritability is 0.65 for SBP, 0.55 for DBP, and 0.63 for Q1. Residuals from LMM will be used as the multiple traits in QTL association mapping. Two types of residuals can be used. Residuals r , whereμ i are the best linear unbiased estimate (BLUE) of the random intercept μ i, are decorrelated from the polygenic effects. QTL mapping can be performed on r (1) i,t without the additive and dominant genetic components in (1). However, this strategy ignores the correlation between the longitudinal traits. Residuals r (2) , whereμ is the estimate for the grand intercept, yield the adjusted traits still containing the polygenic effects. QTL mapping using r (2) i,t needs to keep the genetic components to properly capture the correlation between traits. In the following, we refer to the former as the decorrelated residuals (method 1) and to the latter as the correlated residuals (method 2).
Size and power study (using SIMPHEN.  Powers for detecting the 6 functional variants in the MAP4 gene on chromosome 3 are evaluated based on the provided 200 simulation replicates. Figure 1 displays the box plots of the 200 −log 10 (p-values) for each variant using either the decorrelated (method1) or the correlated residuals (method2). Type I errors are evaluated based on the provided Q1 trait which is not genetically influenced. In general, we found that the decorrelated residuals (method1) lead to higher power but also inflated type I error. The test using the correlated residuals (method2) has well-controlled type I error, high power (0.78 ∼ 0.90) for detecting the common variants 47957996 and 48040283 but low power for the rare variants 47913455 and 47957741. For comparison, we also list the power and the size of likelihood ratio test (LRT) using correlated residuals. LRT edges out the score test in a few cases, but the difference is not significant.LRT is practically infeasible for a large number of SNPs. In the following two pedigree GWAS examples, we present only the results of the score test using correlated residuals (method 2).

Pedigree Genetic Analysis Workshopon chromosome 3 (using SIMPHEN.1)
We performed pedigree GWAS on all available sequence variants on chromosome 3 using the correlated residuals from the traits in SIMPHEN.1.A total of 1,213,657 SNPs passed the filtering and were subject to testing. Figure 2 displays the run times and the Manhattan and quartile-quartile(QQ) plots for jointly testing the multivariate traits SPB. No variants passed the genome-wide significance level (horizontal line). For the null trait Q1, 5.29% of SNPs have p-values less than 0.05, corroborating the correct size of the score test. Results for trait SBP are similar and not displayed.

Analysis of real phenotypes diastolic blood pressure
The phenotypes (SBP and DBP measured at 4 time points) are available for 1389 members from 20 extended families. The largest family contains 107 individuals; the smallest, 27. Genotypes at 8,348,674 SNPs were available on 959 of the individuals. For brevity, we only present results for the multivariate DBP trait here. We adopted the strategy discussed earlierto adjust the multivariate traits for the environmental factors. The table in Figure 3 summarizes the effects of environmental effects estimated by LM and LMM (2). The estimated heritability of the DBP traits is 0.2564. We analyzed all SNPs and pedigrees together for the multivariate traits (DBP 1 , DBP 2 , DBP 3 , DBP 4 ). To read in all the data and run standard QC procedures took 10 minutes and 14 seconds. This QC excluded 10,603 SNPs and 124 individuals based on genotyping success rates below 98%. The subsequent ped-GWAS analysis ran in 40 minutes and 55 seconds and included all of the results plotted in Figures 3. The complete run never used more than 3.2 GB of RAM.
The most significant p-value found by whole genome analysis was 1 × 10 −10.5 on chromosome 1 q32.2 region at 210,338,112 base pairs. No other SNPs reached genome-wide significance.

Conclusions
By supplying a comprehensive, fast, and easy-to-use package for GWAS on quantitative traits in general pedigrees, we hope to encourage exploitation of family-based data sets for gene mapping. A gene mapping study should collect as large a sample as possible consistent with economic constraints and consistent trait phenotyping.If the sample includes pedigrees, all the better. Here we have argued that score tests can efficiently handle unrelated individuals, pedigrees, or a mixture. For human studies, in whichcontrolling breeding is forbidden, nature has provided pedigrees segregating every conceivable genetic trait. Many of these pedigrees are known from previous linkage studies and should be treasured as valuable resources.