Volume 8 Supplement 1

## Genetic Analysis Workshop 18: Human sequence data in extended pedigrees

# Fast genome-wide pedigree quantitative trait loci analysis using MENDEL

- Hua Zhou
^{1}Email author, - Jin Zhou
^{2}, - Eric M Sobel
^{3}and - Kenneth Lange
^{3, 4, 5}

**8(Suppl 1)**:S93

**DOI: **10.1186/1753-6561-8-S1-S93

© Zhou et al.; licensee BioMed Central Ltd. 2014

**Published: **17 June 2014

## Abstract

The linkage era left a rich legacy of pedigree samples that can be used for modern genome-wide association sequencing (GWAS) or next-generation sequencing (NGS) studies. Family designs are naturally equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Unfortunately, pedigree likelihoods are notoriously hard to compute, and current software for association mapping in pedigrees is prohibitively slow in processing dense marker maps. In a recent release of the comprehensive genetic analysis software MENDEL, we implemented an ultra-fast score test for association mapping with pedigree-based GWAS or NGS study data. Our implementation (a) works for random sample data, pedigree data, or a mix of both;(b) allows for covariate adjustment, including correction for population stratification;(c) accommodates both univariate and multivariate quantitative traits; and (d) allows missing values in multivariate traits. In this paper, we assess the capabilities of MENDEL on the Genetic Analysis Workshop 18 sequencing data. For instance, when jointly testing the 4 longitudinally measured diastolic blood pressure traits, it takes MENDEL less than 51 minutes on a standard laptop computer to read, quality check, and analyze a data set with 959 individuals and 8.3 million single-nucleotide polymorphisms (SNPs). Our analysis reveals association of one SNP in the q32.2 region of chromosome 1. MENDEL is freely available on http://www.genetics.ucla.edu/software.

## Background

Pedigree data are attractive in modern association studies because they permit control of population substructure and study of parent-of-origin effects [1]. Related affecteds are also more likely to share the same disease-predisposing gene than unrelated affecteds. The classical variance component model has been a powerful tool for mapping quantitative trait loci in pedigrees [2].Polygenic effects are effectively captured by the kinship coefficient matrix as a variance component. In genome-wide association sequencing (GWAS), two alleles of a single nucleotide polymorphism (SNP) shift trait means and can be tested as a fixed effect. However, fitting a variance component model with pedigrees is computationally challenging, especially when it has to be done for a huge number of markers.We reexamine the computational bottlenecks and implement an ultra-fast score test when pedigree structure is explicitly given. Score tests require no additional iteration under the alternative model.All that is needed is evaluation of a quadratic form combining the score vector and the expected information matrix at the maximum likelihood estimates under the null model. Fast pedigree GWAS is now implemented in our software package MENDEL [3] for easy use by the genetics community. In this paper, we demonstrate the capabilities of MENDEL on the Genetic Analysis Workshop 18 (GAW18) sequencing data.

## Methods

Quantitative trait locus (QTL) association mapping typically invokes the multivariate Gaussian distribution to model the observed trait values $y=\left({y}_{i}\right)$ over a pedigree. The standard model (2, Chapter 8) collects the corresponding means into a vector $\nu $ and the corresponding covariances into a matrix $\Sigma $ and represents the loglikelihood of a pedigree as

$L=-\frac{1}{2}\phantom{\rule{2.77695pt}{0ex}}\text{lndet}\phantom{\rule{2.77695pt}{0ex}}\Sigma -\frac{1}{2}{\left(y-\nu \right)}^{t}{\Sigma}^{-1}\left(y-\nu \right),$

Here the variance component $\Phi $ is the global kinship coefficient matrix capturing additive polygenic effects, and ${\Delta}_{7}$ is a condensed identity coefficient matrix capturing dominance genetic effects. The household effect matrix $H$ has entries ${h}_{ij}=1$ if individuals $i$ and $j$ are in the same household and 0 otherwise. Individual environmental contributions and trait measurement errors are incorporated via the identity matrix $i$. When one tests multiple traits, the covariance matrix has to be properly augmented by matrix Kronecker products. QTL fixed effects are captured through the mean component $v=A\beta $ for some predictor matrix $A$ and vector of regression coefficients $\beta $.

To implement likelihood ratio testing, iterative maximum likelihood estimation must be undertaken for each and every SNP under the alternative hypothesis. This unfortunate requirement is the major stumbling block retarding pedigree analysis. Score tests serve as convenient substitutes for likelihood ratio tests. A careful analysis shows that the basic elements of the score statistic can be quickly assembled. In MENDEL [3], SNPs with the most impressive score test *p*-values (top 50 by default) are further tested by the more accurate likelihood ratio method, thus achieving a good compromise of speed and power for large-scale QTL analysis.

## Results

### Data description

Our analysis is based on the genotype calls for 959 individuals (464 directly sequenced and the rest imputed) provided in the chrX-geno.csv.gz files. Simulated traits in all 200 replicates (SIMPHEN.1-200) were used for size and power studies in the first example. The second example presents results from a pedigree GWAS performed on chromosome 3 using the traits in the first simulation replicate (SIMPHEN.1). A whole genome QTL analysis for the real phenotype diastolic blood pressure (DBP) is presented in the final example.

### Adjustment for environmental effects

where $i$ indexes individuals, $t$ indexes 3 time points, $\beta $ s are the fixed effects, ${\mu}_{i}$ is an individual level random intercept assumed to be normal with covariance $cov\left({\mu}_{i},{\mu}_{j}\right)=2{\phi}_{ij}$, and ${\epsilon}_{i,t}$ are independent standard normal errors. If we stack the traits $SB{P}_{i,t}$ into a column, this corresponds to a variance component model with a genetic component $2{\sigma}_{g}^{2}\left({1}_{3}{1}_{3}^{t}\otimes \Phi \right)$, where $\Phi $ is the kinship coefficient matrix, and an environmental component ${\sigma}_{e}^{2}{I}_{3n}$. LMM is fitted by maximum likelihood (ML).

*decorrelated*from the polygenic effects. QTL mapping can be performed on ${r}_{i,t}^{\left(1\right)}$ without the additive and dominant genetic components in (1). However, this strategy ignores the correlation between the longitudinal traits. Residuals ${r}_{i,t}^{\left(2\right)}=SB{P}_{i,t}-\left(\widehat{\mu}+{x}_{i,t}^{t}\widehat{\beta}\right)$, where $\widehat{\mu}$ is the estimate for the grand intercept, yield the adjusted traits still containing the polygenic effects. QTL mapping using ${r}_{i,t}^{\left(2\right)}$ needs to keep the genetic components to properly capture the correlation between traits. In the following, we refer to the former as the decorrelated residuals (method 1) and to the latter as the correlated residuals (method 2).

Summary of environmental effects for traits systolic blood pressure(top), diastolic blood pressure(middle) and Q1 (bottom) in simulation replicate SIMPHEN.1

| $\mu $ | ${\beta}_{Age}$ | ${\beta}_{BPMed}$ | ${\beta}_{Smoke}$ | ${\beta}_{Sex}$ | ${\beta}_{Age\times Sex}$ | ${\sigma}_{g}^{2}$ | ${\sigma}_{e}^{2}$ | ${R}^{2}$ |
---|---|---|---|---|---|---|---|---|---|

LM LMM | 119.360 (0) 119.739 (0) | 0.135 (2 × 10 0.168 (1 × 10 | 13.088 (7 × 10 6.981 (0) | 0.284 (6 × 10 0.556 (4 × 10 | −19.547 (1 × 10 −20.985 (0) | 0.387 (4 × 10 0.418 (0) | -- 112.58 | 139.558 58.128 | 42.4% 74.46% |

| $\mu $ | ${\beta}_{Age}$ | ${\beta}_{BPMed}$ | ${\beta}_{Smoke}$ | ${\beta}_{Sex}$ | ${\beta}_{Age\times Sex}$ | ${\sigma}_{g}^{2}$ | ${\sigma}_{e}^{2}$ | ${R}^{2}$ |

LM LMM | 75.781 (0) 75.382 (0) | −0.052 (7 × 10 −0.032 (8 × 10 | 1.893 (7 × 10 −0.751 (1 × 10 | −0.109 (8 × 10 −0.087 (9 × 10 | −8.201 (2 × 10 −8.305 (7 × 10 | 0.124 (5 × 10 0.131 (3 × 10 | -- 49.848 | 81.632 40.395 | 4.8% 54.8% |

| $\mu $ | ${\beta}_{Age}$ | ${\beta}_{BPMed}$ | ${\beta}_{Smoke}$ | ${\beta}_{Sex}$ | ${\beta}_{Age\times Sex}$ | ${\sigma}_{g}^{2}$ | ${\sigma}_{e}^{2}$ | ${R}^{2}$ |

LM LMM | 38.642 (0) 39.115 (0) | −0.087 (2 × 10 −0.079 (1 × 10 | −2.508 (3 × 10 −2.211 (3 × 10 | 0.270 (7 × 10 0.239 (7 × 10 | 8.904 (4 × 10 8.809 (1 × 10 | 0.005 (9 × 10 0.000 (9 × 10 | -- 53.373 | 85.260 31.615 | 21.9% 76.8% |

### Size and power study (using SIMPHEN.1-200)

*MAP4*gene on chromosome 3 are evaluated based on the provided 200 simulation replicates. Figure 1 displays the box plots of the 200 $-lo{g}_{10}$(

*p*-values) for each variant using either the decorrelated (method1) or the correlated residuals (method2). Type I errors are evaluated based on the provided Q1 trait which is not genetically influenced. In general, we found that the decorrelated residuals (method1) lead to higher power but also inflated type I error. The test using the correlated residuals (method2) has well-controlled type I error, high power (0.78 $~$ 0.90) for detecting the common variants 47957996 and 48040283 but low power for the rare variants 47913455 and 47957741. For comparison, we also list the power and the size of likelihood ratio test (LRT) using correlated residuals. LRT edges out the score test in a few cases, but the difference is not significant. LRT is practically infeasible for a large number of SNPs. In the following two pedigree GWAS examples, we present only the results of the score test using correlated residuals (method 2).

### Pedigree Genetic Analysis Workshopon chromosome 3 (using SIMPHEN.1)

*p*-values less than 0.05, corroborating the correct size of the score test. Results for trait SBP are similar and not displayed.

### Analysis of real phenotypes diastolic blood pressure

The phenotypes (SBP and DBP measured at 4 time points) are available for 1389 members from 20 extended families. The largest family contains 107 individuals; the smallest, 27. Genotypes at 8,348,674 SNPs were available on 959 of the individuals. For brevity, we only present results for the multivariate DBP trait here.

The most significant *p*-value found by whole genome analysis was 1 × 10^{−10.5} on chromosome 1 q32.2 region at 210,338,112 base pairs. No other SNPs reached genome-wide significance.

## Conclusions

By supplying a comprehensive, fast, and easy-to-use package for GWAS on quantitative traits in general pedigrees, we hope to encourage exploitation of family-based data sets for gene mapping. A gene mapping study should collect as large a sample as possible consistent with economic constraints and consistent trait phenotyping.If the sample includes pedigrees, all the better. Here we have argued that score tests can efficiently handle unrelated individuals, pedigrees, or a mixture. For human studies, in whichcontrolling breeding is forbidden, nature has provided pedigrees segregating every conceivable genetic trait. Many of these pedigrees are known from previous linkage studies and should be treasured as valuable resources.

## Declarations

### Acknowledgements

The authors gratefully acknowledge the National Institutes of Health (NIH) grants GM053275 (EMS and KL) and HG006139 (HZ, EMS, and KL) and National Science Foundation (NSF) grant DMS-1310319 (HZ).The GAW18 WGS data were provided by the T2D-GENES (Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples)Consortium, which is supported by National Institutes of Health (NIH) grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.

This article has been published as part of *BMC Proceedings* Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

## Authors’ Affiliations

## References

- Ott J, Kamatani Y, Lathrop M: Family-based designs for genome-wide association studies. Nat Rev Genet. 2011, 12: 465-474.View ArticlePubMedGoogle Scholar
- Lange K: Mathematical and Statistical Methods for Genetic Analysis Statistics for Biology and Health. 2002, New York, Springer-Verlag, 2View ArticleGoogle Scholar
- Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel ES: Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics. 2013, 29: 1568-1570. 10.1093/bioinformatics/btt187.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.