Volume 5 Supplement 9

## Genetic Analysis Workshop 17: Unraveling Human Exome Data

# Distance-based phenotypic association analysis of DNA sequence data

- Doyoung Chung
^{1}Email author, - Qunyuan Zhang
^{1}, - Aldi T Kraja
^{1}, - Ingrid B Borecki
^{1}and - Michael A Province
^{1}

**5(Suppl 9)**:S54

**DOI: **10.1186/1753-6561-5-S9-S54

© Chung et al; licensee BioMed Central Ltd. 2011

**Published: **29 November 2011

## Abstract

As the cost of sequencing decreases, the demand for association tests that use exhaustive DNA sequence information increases. One such association test is multivariate distance matrix regression (MDMR). We explore some of the features of MDMR using Genetic Analysis Workshop 17 simulated data in search of potential improvements in distance measures. We used genotype data from 697 unrelated individuals, in 200 replications, to test the power of MDMR to detect 13 trait Q2 causative genes based on the Euclidean distance metric. We also estimated the false-positive rate of MDMR using 508 control genes. In addition, we compared MDMR with Mantel’s test and collapsing analysis for rare variants. MDMR performed comparably well even with the Euclidean distance measure.

## Background

High-throughput sequencing technology allows identification of new rare alleles in a human population, but the sparseness of these alleles in samples becomes an important obstacle to detecting the true effects of rare variants under a single-variant-based paradigm. The increasing number of identified genetic variants also requires more statistical tests, thereby aggravating the issue of low statistical power. Several methods, based on collapsing or grouping rare variants, have been developed to alleviate this problem. However, these methods arbitrarily define common and rare variants and treat them differently for analysis. This artificial classification may fail to capture important biological reality. For example, it has been shown that a common variant can act as a modifier of a rare variant’s effect [1, 2]. It is quite possible that collapsing multiple rare variants may dilute the true genetic effect.

Multivariate distance matrix regression (MDMR) can provide a flexible platform for a phenotypic association test of DNA sequence [3]. In this method, for a region of DNA sequence, genotype dissimilarities between individuals are associated with phenotype dissimilarities. MDMR does not test a single variant but instead aggregates each genetic variant’s information to build a distance matrix. For this reason, once an appropriate distance matrix is assumed, the issues of multiple testing and sparseness of data become less critical for MDMR. The most critical problem in MDMR is how to define a distance metric for summarizing genetic differences between individuals in relation to their influence on trait(s) or disease(s). Different distance measures can be used; one of them is the Euclidean distance. Similarity-based association tests such as MDMR can be as powerful as some traditional tests of association when common variations are involved [4]. However, the utility of these methods for rare variant analysis is not well understood and needs to be interrogated [5].

## Methods

### Multivariate distance matrix regression

**Y**is an

*n*×

*p*data matrix, in which

*n*is the number of subjects and

*p*is the number of unknown biologically relevant genetic variables; and

**X**is an

*n*×

*m*model matrix, in which

*m*is the number of predictor variables.

*β*is an

*m*×

*p matrix of beta coefficients and ε*is an

*n*×

*p*residual matrix. Because genotype is regressed using a combination of traits and covariates under this model,

*m*regressors can be phenotypes (e.g., Q1, Q2, and Q4 in the Genetic Analysis Workshop 17 [GAW17] data) and covariates (e.g., Sex, Age, and Smoke in the GAW17 data). A pseudo-

*F*statistic can be constructed to test the null hypothesis of

*β*= 0:

and the **G** matrix is Gower’s centered matrix, which can be calculated from an *n* × *n* distance matrix **D**. The **G** matrix replaces **YY′** and can be calculated from any symmetric distance matrix allowing for nonmetric dissimilarity measures. The **I** matrix is the *n* × *n* identity matrix, and tr stands for the trace of matrix. To assess statistical significance of the pseudo-*F* statistic, one can use permutation tests.

### MDMR as a gene-based association test

where the Euclidean distance is defined as the L2 norm between two individual genotype vectors **a** and **b**. Genotypes were coded as the number of minor alleles with no weighting of single-nucleotide polymorphisms (SNPs) was applied. For each gene and each Q2 simulation, we constructed a 697 × 697 genotypic distance matrix **D** and a 697 × 1 phenotype matrix **X**, which consists of the individual Q2 trait values, and used them to calculate a pseudo-*F* statistic under the regression model that includes the Q2 trait as the sole independent variable. Each of the 13 × 200 tests underwent 1,000 permutations in which the rows and columns of its raw genotype matrix (i.e., the individual-by-SNP matrix) were shuffled at random. The empirical *p*-value was determined as the frequency of observing more extreme pseudo-*F* statistics in permutations than in the actual gene case. MDMRs were performed either using all variants within a gene or using only rare variants with minor allele frequency (MAF) less than 0.01. Similarly, we selected 508 noncausative (i.e., control) genes for Q2 and tested them using all 200 replications. We omitted a subset containing 125 genes from these 508 control genes for the rare-variant-only analyses because they contained no rare variants.

### Mantel test

The Mantel test measures the correlation between two distance matrices [8]. In our application, we calculated a phenotypic distance matrix and a genotypic distance matrix based on the Euclidean distance measure. The two distance matrices were then tested for correlation [9]. The genotypic distance matrix for the Mantel test was identical with that of the MDMR, whereas a 697 × 697 distance matrix was calculated for each Q2 simulated replicate. Mantel tests were performed for the 13 Q2 risk genes using either all variants or only rare variants. Similarly, 508 control genes were tested for association using all variants, among which 383 genes continued to be tested using only rare variants. *P-values* were empirically determined using 1,000 permutations. We estimated the power and false-positive rates on the basis of the significance threshold value of 0.05 and compared them with the values from MDMR and collapsing analysis.

### Collapsing analysis

Collapsing analysis is a simple regression analysis that uses a collapsed variable [10] into which rare variants are collapsed in a binary manner based on the presence of any rare variant. Because our collapsing analysis excluded all “common” variants (defined by MAF > 0.01), we also removed common variants in the other analyses to facilitate comparison. This allowed 12 Q2 risk genes to be compared, because one risk gene had no rare variants. Similarly, we tested 380 selected genes, simulated under the null hypothesis for Q2, for association with Q2 using all three methods. No correction for population structure or hidden relatedness was applied throughout this study.

## Results

*VNN1*was significantly (

*p*< 0.05) associated with Q2 in 94% of the total replicates when MDMR using all variants was used. The Mantel test, however, discovered association in only 25% of

*VNN1*replicates and failed to find any risk gene with a detection rate greater than 50% regardless of the MAF-based SNP filtering.

True positive rates of five different strategies for the 13 Q2 risk genes

Gene | Setting | MDMR using all variants | Mantel test using all variants | MDMR using only rare variants | Mantel test using only rare variants | Collapsing analysis using only rare variants |
---|---|---|---|---|---|---|

| 1c + 28r (13s) | 0.045 | 0.170 | 0.320 | 0.310 | 0.455 |

| 1c (1s) | 0.405 | 0.150 | NA | NA | NA |

| 1c + 4r (3s) | 0.090 | 0.020 | 0.040 | 0.040 | 0.035 |

| 5c (1s) + 15r (2s) | 0.045 | 0.135 | 0.060 | 0.125 | 0.040 |

| 5c + 6r (4s) | 0.065 | 0.035 | 0.685 | 0.290 | 0.745 |

| 4c + 25r (8s) | 0.035 | 0.030 | 0.055 | 0.040 | 0.110 |

| 2c + 9r (2s) | 0.105 | 0.145 | 0.410 | 0.115 | 0.155 |

| 1c + 23r (9s) | 0.365 | 0.285 | 0.605 | 0.320 | 0.330 |

| 3c + 21r (10s) | 0.030 | 0.110 | 0.380 | 0.205 | 0.690 |

| 4c + 23r (8s) | 0.055 | 0.065 | 0.140 | 0.140 | 0.140 |

| 1c (1s) + 6r (1s) | 0.940 | 0.250 | 0.200 | 0.085 | 0.050 |

| 6c (3s) + 9r (4s) | 0.190 | 0.175 | 0.025 | 0.055 | 0.030 |

| 2c + 6r (2s) | 0.180 | 0.080 | 0.285 | 0.080 | 0.190 |

Mean | 0.196 | 0.127 | 0.267 | 0.150 | 0.248 |

When variants with MAF > 0.01 were removed, MDMR identified two genes, *PDGFD* and *SIRT1*, with power greater than 50%. *VNN1* did not survive the cutoff this time, presumably because one of its causal SNPs was common and thus removed from the analysis. The common variants of *PDGFD* and *SIRT1* were all noncausal, and removal of these variants may have enhanced performance of MDMR on these genes. *PDGFD* was also found by collapsing analysis using the cutoff value of 50%, along with *SREBF1*. The estimated power for the 12 Q2 risk genes was comparable between the Euclidean MDMR and collapsing analysis, whereas the Mantel test appeared to be less sensitive than the other methods.

## Discussion

MDMR is a statistical method to test the aggregate effect of genetic variants based on a pairwise genotypic distance matrix. We applied the Euclidean MDMR to the GAW17 simulated data to compute the power of the method for detecting causative Q2 genes. We estimated the power and false-positive rate of MDMR using 200 simulated replicates of the Q2 trait for 13 Q2 causative genes and 508 control genes, respectively. MDMR was compared with collapsing analysis and the Mantel test for its performance on rare variants. We observed that the Euclidean MDMR performs comparably to collapsing analysis. The Mantel test, another distance-based method, seemed to behave more conservatively than the other methods. Our study also suggests that MDMR will perform better than the Mantel test and collapsing analysis in some genetic settings, presumably depending on the number of causal and noncausal variants under interrogation, their respective effect sizes, and any dependencies among them. To dissect the effect of each individual factor on performance, we need to examine various rare variant analysis methods in more contrived, simulated settings that confine confounding variables.

Originally designed for a region of genome, MDMR anticipates multiple causal SNPs with joint actions on the phenotype(s). Thus the unit of test can be easily reduced to a functional domain or a small set of adjacent SNPs [5] if we expect multiple causal variants in that unit. Our results imply that variant selection can affect the performance of MDMR. Along with determining biologically relevant variants to analyze, calculating distances using selected variants is crucial to any distance-based method. The performance of MDMR would be improved significantly if one could invent a measure of genetic distance that closely reflects phenotypic dissimilarity. MDMR may be inappropriate for large studies because it is computationally intensive. This is the reason we sampled a subset of control genes to estimate false-positive rates. Our MDMR program was implemented in the R programming language. Thus the analysis could be expedited by using a faster language, such as C or Java.

## Conclusions

The Euclidean MDMR performed comparably to collapsing analysis to detect the Q2 causal genes. The Mantel test was less sensitive than these methods with a slightly reduced type I error rate. Potential progress can be made because the distance matrix appreciates genotypic dissimilarities relevant only to phenotypic dissimilarities.

## Declarations

### Acknowledgments

The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575 from the National Institute of General Medical Sciences.

This article has been published as part of *BMC Proceedings* Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.

## Authors’ Affiliations

## References

- Thein SL, Menzel S: Discovering the genetics underlying foetal haemoglobin production in adults. Br J Haematol. 2009, 145: 455-467. 10.1111/j.1365-2141.2009.07650.x.View ArticlePubMed
- Zaghloul NA, Liu Y, Gerdes JM, Gascue C, Oh EC, Leitch CC, Bromberg Y, Binkley J, Leibel RL, Sidow A, et al: Functional analyses of variants reveal a significant role for dominant negative and common alleles in oligogenic Bardet-Biedl syndrome. Proc Natl Acad Sci USA. 2010, 107: 10602-10607. 10.1073/pnas.1000219107.PubMed CentralView ArticlePubMed
- Schork NJ, Wessel J, Malo N: DNA sequence-based phenotypic association analysis. Adv Genet. 2008, 60: 195-217.View ArticlePubMed
- Lin WJ, Schaid DJ: Power comparisons between similarity-based multilocus association methods, logistic regression, and score tests for haplotypes. Genet Epidemiol. 2009, 33: 183-197. 10.1002/gepi.20364.PubMed CentralView ArticlePubMed
- Bansal V, Libiger O, Torkamani A, Schork NJ: An application and empirical comparison of statistical analysis methods for associating rare variants to a complex phenotype. Pac Symp Biocomput. 2011, 76-87.
- McArdle BH, Anderson MJ: Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology. 2001, 82: 290-297. 10.1890/0012-9658(2001)082[0290:FMMTCD]2.0.CO;2.View Article
- Schaid DJ: Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered. 2010, 70: 109-131. 10.1159/000312641.View ArticlePubMed
- Mantel N: The detection of disease clustering and a generalized regression approach. Cancer Res. 1967, 27: 209-220.PubMed
- Dray S, Dufour AB: The ade4 package: implementing the duality diagram for ecologists. J Stat Softw. 2007, 22: 1-20.View Article
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMed

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.