Volume 8 Supplement 1

## Genetic Analysis Workshop 18

# Entropy-based method for assessing the influence of genetic markers and covariates on hypertension: application to Genetic Analysis Workshop 18 data

- Jun Liu^{1}
- Joseph Beyene^{1}

**8(Suppl 1)**:S97

https://doi.org/10.1186/1753-6561-8-S1-S97

© Liu and Beyene; licensee BioMed Central Ltd. 2014

**Published:** 17 June 2014

## Abstract

Many complex diseases are related to genetics, and it is of great interest to evaluate the association between single-nucleotide polymorphisms (SNPs) and disease outcome. The association of genetics with outcome can be modified by covariates such as age, sex, smoking status, and membership in the same pedigree. In this paper, we propose a block entropy method to separate two classes of SNPs: those whose association with hypertension is sensitive to the covariates and those whose association is insensitive. We also propose a consistency entropy method to further reduce the number of SNPs that might be associated with the outcome. Based on the data provided by the organizers of Genetic Analysis Workshop 18, we calculated the block entropies for six different blocking strategies. Using block entropy and consistency entropy, we identified 230 SNPs on chromosome 9 that are most likely to be associated with the outcome and whose associations with hypertension are sensitive to the covariates.

## Introduction

Hypertension is the leading cause of cardiovascular disease worldwide. In the period 1999 to 2002, 28.6% of the U.S. population had hypertension, and several million people in the world currently have this condition [1]. This disease is often called the "silent killer" because it may not cause symptoms until the patient has sustained serious damage to the arteries, brain, and kidney. With advances in technology, more genetic information is generated, and association studies of genetic variants with disease are routinely investigated. Here we focus on the combined role of genetics and covariates [2] in the development of hypertension.

In this paper, we propose a method that separates two classes of single-nucleotide polymorphisms (SNPs): those whose associations with hypertension are sensitive to the covariates and those whose associations are insensitive. The separation is based on the block entropy, or block entropy gain, of each SNP, calculated after the study participants are split into blocks according to their covariates. We also propose an entropy-based method to check whether the proportions of the outcome, grouped by the main genotype, are consistent across the blocks; this helps us identify the SNPs that are probably associated with the disease outcome. By combining the results of the block entropy and the consistency entropy, we can find the SNPs that are most probably associated with the outcome.

We believe that our method is useful for detecting important associations between outcome and genetic markers as well as covariates. For example, if hypertension is associated with a covariate, the proportion of hypertension will be different for different covariate values. In that case, we say that hypertension is sensitive to the covariate. Similarly, if hypertension is associated with a SNP, the proportion of hypertension in different genotypes will be different. Therefore, we say that hypertension is sensitive to the SNP.

## Materials and methods

### Genotype, covariates, and phenotype data

Our methods for investigating the relationship between SNPs and hypertension are based on data provided by the organizers of Genetic Analysis Workshop 18 (GAW18). Genotype data are provided for 11 odd-numbered chromosomes. We use all the SNPs on chromosomes 3 and 9 from the genome-wide association studies (GWAS) files and include all people with genotype data. A total of 20 pedigrees are provided, with information on father, mother, and sex. The phenotype (hypertension) is measured at up to 4 time points. In this paper, we use the baseline hypertension status. We include the following covariates at baseline: Sex, Smoke, and Age.

### Block entropy and block entropy gain

Shannon entropy [3] is used as a measure of genetic diversity [4]. For a binary random variable $E$ with probability $p=\mathrm{Prob}\{E=1\}$, the Shannon entropy term is $-p\log_{2}p$. We define the entropy of $E$, which equals entropy(*p*, 1 − *p*) in information theory [5], as $G(E)=-p\log_{2}p-(1-p)\log_{2}(1-p)$.
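
As a quick numerical check of the definition of $G(E)$ above, the following minimal Python sketch evaluates the binary entropy. The function name `binary_entropy` is ours, and the convention that $0\log_{2}0=0$ is an assumption the text does not state explicitly.

```python
import math


def binary_entropy(p: float) -> float:
    """G(E) = -p*log2(p) - (1-p)*log2(1-p) for a binary outcome.

    Assumption: the term 0*log2(0) is taken to be 0, so G = 0 at p = 0 or 1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


if __name__ == "__main__":
    # Entropy is largest (1 bit) when both outcomes are equally likely.
    for p in (0.0, 0.1, 0.25, 0.5):
        print(p, round(binary_entropy(p), 4))
```

The entropy is symmetric in *p* and 1 − *p*, so a block in which 25% of people have the outcome carries the same entropy as one in which 75% do.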

Here *p* is estimated by the proportion of the outcome in a block of the sample. Suppose we have *m* blocks; *n* observations are split into these *m* blocks $B_{1},B_{2},\dots,B_{m}$ with block sizes $n_{1},n_{2},\dots,n_{m}$, respectively. Block $B_{i}$ (*i* = 1, 2, ..., *m*) has $q_{i}$ observations with the outcome. The observations within a block $B_{i}$ (*i* = 1, 2, ..., *m*) are assumed to be similar, where "similarity" might be defined by covariates and relatedness, such as age, sex, smoking status, and membership in the same pedigree. Because each SNP has 3 genotype classes, block $B_{i}$ can be divided into 3 groups of observations, $B_{i0},B_{i1},$ and $B_{i2}$, with sizes $n_{i0},n_{i1},$ and $n_{i2}$, respectively. All of the observations in $B_{ij}$ have genotype *j*, for *j* = 0, 1, 2. The group $B_{ij}$ has $q_{ij}$ observations with the outcome. For block $B_{i}$, the entropy is

The block entropy gain (I) is defined as $G_{a}=G^{*}-G$,

where $n_{.j}$ (*j* = 0, 1, 2) is the number of observations with genotype *j* and $q_{.j}$ (*j* = 0, 1, 2) is the number of observations with genotype *j* that have the outcome.
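
The displayed formulas for the block entropy and for the block entropy gain (II) do not appear in this text. One reading that is consistent with the surrounding definitions is the usual size-weighted average of binary entropies. Everything in the block below is an assumption made for illustration, not the authors' stated formulas: the helper $H$, the symbols $G^{**}$ and $G_{b}$, and the weighting by $n_{ij}/n_{i}$ and $n_{i}/n$.

```latex
% Assumed helper: binary entropy, with H(0) = H(1) = 0
H(p) = -p\log_{2}p - (1-p)\log_{2}(1-p)

% Assumed entropy of block B_i, averaged over its three genotype groups
G(B_i) = \sum_{j=0}^{2} \frac{n_{ij}}{n_i}\, H\!\left(\frac{q_{ij}}{n_{ij}}\right)

% Assumed block entropy G and covariate-only entropy G^*
G = \sum_{i=1}^{m} \frac{n_i}{n}\, G(B_i), \qquad
G^{*} = \sum_{i=1}^{m} \frac{n_i}{n}\, H\!\left(\frac{q_i}{n_i}\right)

% Assumed form of gain (II): genotype-only entropy minus block entropy
G^{**} = \sum_{j=0}^{2} \frac{n_{.j}}{n}\, H\!\left(\frac{q_{.j}}{n_{.j}}\right), \qquad
G_{b} = G^{**} - G
```

Under this reading, $G_{a}=G^{*}-G$ is non-negative because $H$ is concave, which agrees with the statement that a larger gain marks a more informative SNP.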

Suppose that the study population is split according to one attribute, for example, Age. We calculate $G^{*}$, the entropy for this attribute without considering SNPs. We then calculate $G^{*}$ for another attribute, say Smoke. If $G^{*}$ is smaller for Age, then Age has a stronger association with hypertension than Smoke does. We then calculate the block entropy, *G*, for the blocks split by Age at each locus. A lower value of *G* indicates that adjusting for Age is informative for the SNP. The block entropy gains (I) and (II) run in the opposite direction: the higher the block entropy gain, the more informative the SNP.
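
The comparison of $G^{*}$ and *G* described above can be sketched numerically. The Python example below is a minimal illustration that assumes the block entropy is the size-weighted average of binary entropies over the block-by-genotype cells, and that $G^{*}$ is the same average taken over whole blocks; these weighted forms, the function names, and the toy counts are our assumptions.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits, taking 0*log2(0) as 0 (assumed convention)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def block_entropy(n, q):
    """Assumed form of G: cells (block i, genotype j) weighted by n_ij / n."""
    total = sum(sum(row) for row in n)
    g = 0.0
    for n_row, q_row in zip(n, q):
        for n_ij, q_ij in zip(n_row, q_row):
            if n_ij > 0:  # empty genotype groups contribute nothing
                g += n_ij / total * binary_entropy(q_ij / n_ij)
    return g


def covariate_entropy(n, q):
    """Assumed form of G*: blocks weighted by n_i / n, genotype ignored."""
    total = sum(sum(row) for row in n)
    g_star = 0.0
    for n_row, q_row in zip(n, q):
        n_i, q_i = sum(n_row), sum(q_row)
        if n_i > 0:
            g_star += n_i / total * binary_entropy(q_i / n_i)
    return g_star


def block_entropy_gain(n, q):
    """Gain (I): G_a = G* - G."""
    return covariate_entropy(n, q) - block_entropy(n, q)


if __name__ == "__main__":
    # Toy counts: rows are 2 blocks, columns are genotypes 0, 1, 2.
    n = [[10, 10, 0], [5, 5, 10]]   # n_ij: people per cell
    q = [[5, 5, 0], [0, 5, 5]]      # q_ij: people with the outcome per cell
    print(covariate_entropy(n, q), block_entropy(n, q), block_entropy_gain(n, q))
```

In the toy data the genotype carries no information in the first block but separates the outcome perfectly for two genotype groups in the second, which lowers *G* below $G^{*}$ and yields a positive gain.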

### Associations of genotypes with covariates

In block $B_{i}$, the proportion of observations with genotype *j* that have the outcome is $p_{ij}=\frac{q_{ij}}{n_{ij}}$, and the proportion of the outcome in the block is $p_{i}=\frac{q_{i}}{n_{i}}$. For all the *m* blocks and a genotype *j*, there exist $\epsilon_{1j},\epsilon_{2j},\dots,\epsilon_{mj}$, with $\epsilon_{kj}=0$ for $k\in\left\{l:p_{l}=\max_{i\in\{1,2,\dots,m\}}\{p_{i}\}\right\}$, such that the following proportion equation is satisfied: $p_{1}:p_{2}:\dots:p_{m}=\left(p_{1j}+\epsilon_{1j}\right):\left(p_{2j}+\epsilon_{2j}\right):\dots:\left(p_{mj}+\epsilon_{mj}\right)$. Therefore, $\frac{\max_{k\in\{1,2,\dots,m\}}\{p_{k}\}}{p_{1}}\left(p_{1j}+\epsilon_{1j}\right)=\frac{\max_{k\in\{1,2,\dots,m\}}\{p_{k}\}}{p_{2}}\left(p_{2j}+\epsilon_{2j}\right)=\dots=\frac{\max_{k\in\{1,2,\dots,m\}}\{p_{k}\}}{p_{m}}\left(p_{mj}+\epsilon_{mj}\right).$ It is clear that if the genotype *j* is fully associated with the covariates and the outcome is fully associated with the covariates, then $\epsilon_{1j}=\epsilon_{2j}=\dots=\epsilon_{mj}=0$. We say that the genotype *j* in this extreme condition is consistent. If there is no relationship between the genotype and the covariates, $\epsilon_{ij}$ can take any value. Let the block size be large enough, say $n_{0}$, for the blocking effects to be present. We define the standardized proportion of the associations of the genotype *j* with the outcome in block $B_{i}$ as

where ${n}_{0}$ is a constant based on the overall block sizes.

The standardized proportion of the total associations of a genotype *j* with the outcome at a locus is defined as $F\left(j\right)=\sum _{i=1}^{m}\frac{{n}_{i}}{n}F\left(i,j\right).$
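
The offsets $\epsilon_{ij}$ introduced in this subsection can be made concrete. Because $\epsilon_{kj}=0$ for the block *k* with the largest $p_{i}$, the proportion equation fixes every other offset as $\epsilon_{ij}=p_{i}\,p_{kj}/p_{k}-p_{ij}$. The Python sketch below computes these offsets; the function name is ours, ties for the maximum are resolved by taking the first such block (an assumption), and the standardized proportion $F(i,j)$ itself is not computed because its formula is not given here.

```python
def consistency_offsets(p_block, p_geno):
    """Offsets eps_ij that make (p_ij + eps_ij) proportional to p_i.

    p_block[i] is p_i (outcome proportion in block i); p_geno[i] is p_ij
    (outcome proportion among genotype-j carriers in block i).
    Assumption: when several blocks share the maximum p_i, the first is used
    as the reference block k with eps_kj = 0.
    """
    if len(p_block) != len(p_geno):
        raise ValueError("one p_i and one p_ij are needed per block")
    k = max(range(len(p_block)), key=lambda i: p_block[i])
    if p_block[k] <= 0.0:
        raise ValueError("the largest block proportion must be positive")
    scale = p_geno[k] / p_block[k]  # common ratio (p_ij + eps_ij) / p_i
    return [p_i * scale - p_ij for p_i, p_ij in zip(p_block, p_geno)]


if __name__ == "__main__":
    # Genotype proportions track block proportions: all offsets are zero.
    print(consistency_offsets([0.2, 0.4], [0.1, 0.2]))
    # Genotype proportions reverse the block ordering: a nonzero offset.
    print(consistency_offsets([0.2, 0.4], [0.3, 0.2]))
```

Offsets that are all zero correspond to the consistent extreme case described above, whereas large offsets signal that the genotype's outcome proportions do not follow the block-level pattern.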

### The consistency of the main genotype

Because a block includes people with all 3 genotypes of a SNP, the association of a genotype *j* with the outcome may have a higher standardized proportion in one block but a lower one in another block. Such a genotype is not consistent across the 2 blocks. Even if the entropy is low for this SNP, which means that the associations of the SNP and covariates with the outcome are high, the inconsistency of the genotype across the blocks weakens the conclusion that this informative SNP is related to the covariates.

The standardized proportion of the total associations of a genotype *j* with the outcome is $F(j)$. We choose the main genotype *j* satisfying $F(j)=\max_{k\in\{0,1,2\}}\{F(k)\}$. We define an entropy to check the consistency of the standardized proportion of the associations of this main genotype *j*.

The greater the consistency entropy of the main genotype *j*, the stronger the consistency of genotype *j* across the blocks.

### The block entropy or entropy gain algorithm

We first identify observations with complete genotype data and obtain the values of the covariates Sex, Smoke, and Age, as well as the pedigree classes and hypertension status at baseline, $\text{HTN}_{1}$. Then we choose a blocking strategy, for example, the blocking strategy based on the attribute Age.

We next calculate the block entropy or block entropy gain at each locus according to the chosen blocking strategy and sort the loci in ascending order of block entropy. The upper *s*% (e.g., 15%) of SNPs in this ordering are those whose associations with the outcome are related to the covariates. The lower *i*% (e.g., 15%) of loci are the SNPs whose effects are insensitive to the covariates; many of these loci include rare genotypes.

For the upper *s*% part of loci, we find the main genotype *j* and calculate the consistency entropy for the standardized proportion of the associations of the genotype *j* with the outcome. We remove the SNPs with the lower consistency entropies; the SNPs associated with the outcome remain. We can combine some block entropies and consistency entropies for these SNPs to find the important SNPs associated with the outcome.
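
The ranking and filtering steps above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes that the "upper" part of the ascending ordering means the SNPs with the lowest block entropies, that the per-SNP block entropies and consistency entropies have already been computed, and that the function names and the `drop_fraction` parameter are ours.

```python
def split_by_block_entropy(block_entropy, sensitive_pct=15.0, insensitive_pct=15.0):
    """Return (sensitive, insensitive) SNP lists from a {snp: entropy} dict.

    Assumption: the upper s% of the ascending ordering (lowest entropies) are
    the covariate-sensitive SNPs; the lower i% (highest entropies) are the
    covariate-insensitive SNPs.
    """
    ranked = sorted(block_entropy, key=block_entropy.get)  # ascending entropy
    n_sens = int(len(ranked) * sensitive_pct / 100)
    n_insens = int(len(ranked) * insensitive_pct / 100)
    return ranked[:n_sens], ranked[len(ranked) - n_insens:]


def drop_low_consistency(snps, consistency_entropy, drop_fraction=0.2):
    """Remove the SNPs with the lowest consistency entropies, keeping order."""
    ranked = sorted(snps, key=lambda s: consistency_entropy[s])
    dropped = set(ranked[:int(len(ranked) * drop_fraction)])
    return [s for s in snps if s not in dropped]


if __name__ == "__main__":
    # Hypothetical entropies for 20 SNPs, for illustration only.
    be = {"rs%d" % i: 0.50 + 0.01 * i for i in range(20)}
    sensitive, insensitive = split_by_block_entropy(be)
    ce = {"rs0": 0.99, "rs1": 0.30, "rs2": 0.95}
    print(sensitive, insensitive)
    print(drop_low_consistency(sensitive, ce, drop_fraction=0.5))
```

When a block entropy gain is used instead of the block entropy, the ordering would be reversed, since a higher gain marks a more informative SNP.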

## Results

### Application to real data

Our first blocking strategy takes all observations as a block. As a crude estimation, the block entropy or entropy gain can classify the SNPs at loci as either associated with the outcome or not associated with the outcome. We next choose the blocking strategy based on a single attribute (Sex, Smoke, Age, and Pedigree) and calculate the block entropy and block entropy gain for all the SNPs, respectively. For the attribute Sex, the people are blocked based on Sex = 1 and Sex = 2. For the attribute Smoke, the people are blocked based on Smoke = 0 and Smoke = 1. For the attribute Age, the people are blocked based on Age < 60 and Age ≥ 60. And for the attribute Pedigree, there are 20 pedigrees splitting the entire sample. Therefore, the people are grouped in 20 blocks to calculate the block entropy and block entropy gain for the attribute Pedigree.

Finally, we choose the blocking strategy based on all the attributes such that all observations are split into the blocks for all the values of these attributes. For example, the people with Sex = 1, Smoke = 0, Age < 60, and Pedigree = 2 are in one block. The people with the attributes Sex = 1, Smoke = 0, Age ≥ 60, and Pedigree = 3 form another block.
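
The blocking strategies described in this subsection amount to grouping people by a key built from the chosen attributes. The Python sketch below illustrates this; the record layout (dictionaries with `ID`, `Sex`, `Smoke`, `Age`, and `Pedigree` fields) and the function names are hypothetical, while the Age cut point of 60 comes from the text.

```python
from collections import defaultdict


def block_key(person, attributes):
    """Build the block label for one person from the chosen attributes."""
    key = []
    for attr in attributes:
        if attr == "Age":
            # Age is dichotomized at 60, as in the text.
            key.append("<60" if person["Age"] < 60 else ">=60")
        else:
            key.append(person[attr])
    return tuple(key)


def make_blocks(people, attributes):
    """Group person IDs into blocks; an empty attribute list gives one block."""
    blocks = defaultdict(list)
    for person in people:
        blocks[block_key(person, attributes)].append(person["ID"])
    return dict(blocks)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    people = [
        {"ID": "a", "Sex": 1, "Smoke": 0, "Age": 45, "Pedigree": 2},
        {"ID": "b", "Sex": 1, "Smoke": 0, "Age": 70, "Pedigree": 3},
        {"ID": "c", "Sex": 1, "Smoke": 0, "Age": 59, "Pedigree": 2},
    ]
    print(make_blocks(people, []))                                    # No-Attr
    print(make_blocks(people, ["Age"]))                               # Age only
    print(make_blocks(people, ["Sex", "Smoke", "Age", "Pedigree"]))   # All-Attr
```

With all four attributes the number of possible blocks grows quickly, so many blocks may be small or empty in practice.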

Using the block entropy approach, we classify the SNPs related to the disease outcome as either sensitive or insensitive to the covariates. Here we focus on the SNPs sensitive to the covariates. We also calculate the entropies for checking the consistency of the main genotype among the blocks. The rules are $\text{BE}<V_{BE}$ and $\text{CE}>V_{CE}$, where BE is a block entropy, CE is a consistency entropy, and $V_{BE}$ and $V_{CE}$ are limit values of BE and CE, respectively. For each of the consistency entropies, we remove around 20% of inconsistent SNPs. For CE_Ped, we keep around 40% of SNPs. The limit value for the block entropy is $V_{BE}=BE_{min}+k\times\left(BE_{max}-BE_{min}\right)$, where $BE_{min}$ is the minimum block entropy based on the covariate and $BE_{max}$ is the maximum one. We use *k* = 0.45 for chromosome 9 and *k* between 0.5 and 0.7 for chromosome 3.
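
The selection rules $\text{BE}<V_{BE}$ and $\text{CE}>V_{CE}$ can be written as a short filter. The Python sketch below applies them for a single blocking strategy; the function names, the dictionary inputs, and the example values are hypothetical, and the choice of $V_{CE}$ is left to the caller because the text specifies it only through the fraction of SNPs removed.

```python
def be_limit(block_entropies, k):
    """V_BE = BE_min + k * (BE_max - BE_min) over all SNPs for one covariate."""
    values = list(block_entropies)
    lowest, highest = min(values), max(values)
    return lowest + k * (highest - lowest)


def select_snps(be, ce, k, v_ce):
    """Keep SNPs with BE < V_BE and CE > V_CE.

    be, ce: {snp: value} dicts for one blocking strategy (hypothetical layout).
    """
    v_be = be_limit(be.values(), k)
    return [snp for snp in be if be[snp] < v_be and ce[snp] > v_ce]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    be = {"snp_a": 0.20, "snp_b": 0.40, "snp_c": 0.60}
    ce = {"snp_a": 0.99, "snp_b": 0.99, "snp_c": 0.50}
    print(be_limit(be.values(), k=0.45))
    print(select_snps(be, ce, k=0.45, v_ce=0.90))
```

Because $V_{BE}$ is interpolated between the smallest and largest block entropy, a smaller *k* gives a stricter cut, which matches the use of *k* = 0.45 on chromosome 9 and larger values on chromosome 3.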

Table 1. Single-nucleotide polymorphisms on chromosome 9 that are sensitive to the covariates

ID | No-Attr BE | Sex BE | Sex CE | Smoke BE | Smoke CE | Age BE | Age CE | Ped BE | Ped CE | All-Attr BE |
---|---|---|---|---|---|---|---|---|---|---|
rs7030214 | 0.667 | 0.663 | 0.982 | 0.659 | 0.991 | 0.567 | 0.943 | 0.587 | 0.295 | 0.320 |
rs4962043 | 0.668 | 0.665 | 0.999 | 0.662 | 0.999 | 0.580 | 0.987 | 0.593 | 0.295 | 0.328 |
rs6833 | 0.670 | 0.667 | 0.999 | 0.662 | 0.991 | 0.579 | 0.994 | 0.597 | 0.308 | 0.341 |
rs489504 | 0.666 | 0.657 | 0.964 | 0.659 | 0.958 | 0.577 | 0.987 | 0.597 | 0.300 | 0.334 |
rs3118667 | 0.665 | 0.661 | 0.994 | 0.659 | 0.997 | 0.579 | 0.993 | 0.599 | 0.295 | 0.353 |
rs3094375 | 0.669 | 0.667 | 0.999 | 0.662 | 0.990 | 0.581 | 0.996 | 0.600 | 0.294 | 0.348 |
rs1752337 | 0.669 | 0.663 | 0.996 | 0.663 | 0.992 | 0.575 | 0.979 | 0.601 | 0.291 | 0.308 |
rs652600 | 0.663 | 0.661 | 0.999 | 0.658 | 0.996 | 0.578 | 0.999 | 0.604 | 0.292 | 0.354 |
rs3124768 | 0.668 | 0.665 | 0.996 | 0.662 | 0.999 | 0.581 | 0.996 | 0.604 | 0.296 | 0.358 |
rs7867300 | 0.666 | 0.663 | 0.999 | 0.658 | 0.999 | 0.581 | 0.999 | 0.605 | 0.292 | 0.343 |
rs4877972 | 0.668 | 0.665 | 0.995 | 0.661 | 0.998 | 0.578 | 0.997 | 0.605 | 0.302 | 0.341 |
rs10781268 | 0.669 | 0.664 | 0.983 | 0.659 | 0.984 | 0.579 | 0.997 | 0.606 | 0.301 | 0.342 |
rs10811664 | 0.666 | 0.662 | 0.999 | 0.657 | 0.985 | 0.573 | 0.979 | 0.609 | 0.309 | 0.337 |
rs774227 | 0.668 | 0.663 | 0.960 | 0.661 | 0.999 | 0.574 | 0.999 | 0.619 | 0.293 | 0.359 |
rs557749 | 0.669 | 0.662 | 0.999 | 0.658 | 0.983 | 0.576 | 0.995 | 0.619 | 0.306 | 0.337 |
rs2422493 | 0.669 | 0.665 | 0.997 | 0.661 | 0.999 | 0.579 | 0.998 | 0.611 | 0.300 | 0.339 |
rs2184026 | 0.669 | 0.664 | 0.997 | 0.661 | 0.952 | 0.574 | 0.998 | 0.611 | 0.295 | 0.346 |

Table 2. Single-nucleotide polymorphisms on chromosome 3 that are sensitive to the covariates

ID | No-Attr BE | Sex BE | Sex CE | Smoke BE | Smoke CE | Age BE | Age CE | Ped BE | Ped CE | All-Attr BE |
---|---|---|---|---|---|---|---|---|---|---|
rs6804033 | 0.669 | 0.663 | 0.997 | 0.659 | 0.992 | 0.577 | 0.979 | 0.590 | 0.291 | 0.341 |
rs399703 | 0.669 | 0.665 | 0.999 | 0.659 | 0.978 | 0.576 | 0.997 | 0.594 | 0.293 | 0.338 |
rs2844347 | 0.670 | 0.665 | 0.999 | 0.662 | 0.991 | 0.577 | 0.996 | 0.592 | 0.295 | 0.339 |
rs2555239 | 0.666 | 0.662 | 0.999 | 0.659 | 0.999 | 0.578 | 0.999 | 0.595 | 0.290 | 0.330 |
rs1918026 | 0.663 | 0.657 | 0.997 | 0.654 | 0.983 | 0.575 | 0.996 | 0.602 | 0.305 | 0.349 |
rs9813198 | 0.674 | 0.666 | 0.988 | 0.667 | 0.998 | 0.577 | 0.997 | 0.603 | 0.292 | 0.345 |

### Application to simulated phenotype data

Table 3. Counts of "functional" single-nucleotide polymorphisms in the top 5% of single-nucleotide polymorphisms on chromosome 9 for the 200 simulations

ID | No-Attr BE | Sex BE | Age BE | Ped BE | All-Attr BE | All-Attr BE+CE | SSA BE | SSA BEG |
---|---|---|---|---|---|---|---|---|
rs2776859 | 5 | 4 | 4 | 44 | 12 | 18 | 6 | 15 |
rs2182870 | 3 | 3 | 7 | 12 | 17 | 34 | 23 | 24 |
rs1197774 | 8 | 5 | 5 | 37 | 32 | 32 | 26 | 31 |
rs11791740 | 14 | 16 | 21 | 0 | 0 | 0 | 5 | 3 |
rs2230287 | 64 | 63 | 56 | 1 | 0 | 0 | 34 | 8 |

Table 4. Counts of "functional" single-nucleotide polymorphisms in the top 5% of single-nucleotide polymorphisms on chromosome 3 for the 200 simulations

ID | No-Attr BE | Sex BE | Age BE | Ped BE | All-Attr BE | All-Attr BE+CE | SSA BE | SSA BEG |
---|---|---|---|---|---|---|---|---|
rs6442089 | 74 | 72 | 74 | 66 | 35 | 55 | 57 | 24 |
rs1060407 | 76 | 66 | 58 | 50 | 18 | 41 | 46 | 13 |
rs3772219 | 2 | 2 | 6 | 20 | 52 | 44 | 9 | 13 |
rs1131356 | 20 | 28 | 13 | 25 | 60 | 67 | 41 | 29 |
rs4679394 | 22 | 21 | 6 | 4 | 0 | 0 | 6 | 2 |
rs4683602 | 10 | 7 | 28 | 7 | 0 | 0 | 16 | 16 |
rs16851435 | 34 | 34 | 44 | 0 | 0 | 0 | 19 | 14 |
rs304079 | 4 | 6 | 8 | 16 | 38 | 32 | 19 | 21 |
rs373572 | 7 | 4 | 14 | 17 | 16 | 24 | 19 | 28 |
rs2322142 | 21 | 24 | 33 | 55 | 70 | 34 | 36 | 33 |
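
The counts in the simulation tables record how often a "functional" SNP falls in the top 5% of SNPs across the 200 simulated phenotypes. The Python sketch below shows one way such a tally could be made. It assumes one score per SNP per replicate, with lower scores ranking higher (as for a block entropy); the function name and data layout are hypothetical, and the handling of ties and of very small SNP sets is our choice.

```python
def count_top_hits(replicates, functional, top_pct=5.0):
    """Count, per functional SNP, the replicates in which it ranks in the top.

    replicates: list of {snp: score} dicts, one per simulated phenotype
                (hypothetical layout); lower scores rank higher.
    functional: SNP IDs to track.
    """
    counts = {snp: 0 for snp in functional}
    for scores in replicates:
        ranked = sorted(scores, key=scores.get)            # ascending score
        n_top = max(1, int(len(ranked) * top_pct / 100))   # at least one SNP
        top = set(ranked[:n_top])
        for snp in functional:
            if snp in top:
                counts[snp] += 1
    return counts


if __name__ == "__main__":
    # Two toy replicates with 20 SNPs each; "f" is the tracked SNP.
    rep1 = {"s%d" % i: 0.5 + 0.01 * i for i in range(19)}
    rep1["f"] = 0.10   # best score in replicate 1
    rep2 = {"s%d" % i: 0.5 + 0.01 * i for i in range(19)}
    rep2["f"] = 0.90   # worst score in replicate 2
    print(count_top_hits([rep1, rep2], ["f"]))
```

For a score where higher is better, such as a block entropy gain, the scores can be negated before calling the function.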

## Discussion

We use the block entropy to classify SNPs whose associations with hypertension are either sensitive or insensitive to covariates. From the calculations based on the data provided by GAW18, we show that entropy-based methods might be useful in separating these two classes of SNPs. The entropy calculated without considering the SNPs shows that different covariates have different associations with hypertension. The block entropy reflects the associations of the SNPs and the covariates with hypertension. Comparing the ordered block entropies obtained with all observations as a single block with those obtained with blocks defined by sex, we see that blocking by the covariate Sex reduces the block entropy. This means that the attribute Sex, together with the SNPs, adds information about their associations with hypertension. The attribute Age is very important because the block entropies based on Age are much lower than those obtained with all observations in a single block. Pedigree is the most informative and complex attribute: it has a strong impact on the outcome, and different pedigrees affect the outcome differently. Because of the complexity of the pedigrees, the consistency entropy among the blocks defined by pedigree is small compared with that for the other covariates.

Results for the simulated data also show the effectiveness of the block entropy, the block entropy gain (II), and the consistency entropy. These methods can be used independently or in combination.

Following a suggestion by a referee and as proof of principle, we used a permutation framework to assess statistical significance. The process is computationally intensive: it took 12 days to summarize the results of 100 permutations for only 28 of the 200 simulated phenotype data sets. The results from this small-scale permutation showed that the number of true positives detected in Tables 3 and 4 is noteworthy and that the false-positive rate is unlikely to be high.
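
The text does not describe the permutation scheme in detail. A generic version, given here only as a minimal Python sketch, shuffles the outcome labels, recomputes the statistic, and reports an empirical *p* value. The function name, the plain label shuffle (which ignores pedigree structure), and the convention that smaller statistics are more extreme are all assumptions rather than the procedure actually used.

```python
import random


def permutation_p_value(statistic, outcome, n_perm=100, seed=0):
    """Empirical p value for a statistic where smaller means more extreme.

    statistic: function mapping a list of 0/1 outcomes to a number, for
               example a block entropy at one locus (hypothetical interface).
    Assumption: labels are shuffled freely across all people.
    """
    rng = random.Random(seed)
    observed = statistic(outcome)
    shuffled = list(outcome)          # copy, so the input is left unchanged
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    genotype = [0, 0, 0, 0, 1, 1, 1, 1]
    outcome = [0, 0, 0, 0, 1, 1, 1, 1]

    def mismatches(y):
        # Toy statistic: how many outcomes disagree with the genotype.
        return sum(1 for g, v in zip(genotype, y) if g != v)

    print(permutation_p_value(mismatches, outcome, n_perm=100))
```

Each permutation requires recomputing the statistic at every locus, which is why the run time grows quickly with the number of SNPs and permutations.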

## Declarations

### Acknowledgements

JB would like to acknowledge Discovery Grant funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) (grant number 293295-2009) and Canadian Institutes of Health Research (CIHR) (grant number 84392). JB holds the John D. Cameron Endowed Chair in the Genetic Determinants of Chronic Diseases, Department of Clinical Epidemiology and Biostatistics, McMaster University. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by National Institutes of Health (NIH) grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. We would like to thank two anonymous reviewers and the editor for insightful comments that improved the presentation and clarity of our manuscript.

This article has been published as part of *BMC Proceedings* Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

## References

- Hajjar I, Kotchen JM, Kotchen TA: Hypertension: trends in prevalence, incidence, and control. Annu Rev Public Health. 2006, 27: 465-490. 10.1146/annurev.publhealth.27.021405.102132.
- Ruiz-Marín M, Matilla-García M, Cordoba JA, Susillo-González JL, Romo-Astorga A, González-Pérez A, Ruiz A, Gayán J: An entropy test for single-locus genetic association analysis. BMC Genet. 2010, 11: 19.
- Shannon CE: A mathematical theory of communication. Bell Systems Tech J. 1948, 27: 379-423.
- Manzour A, Saraee M: Entropy-based epistasy search in SNP case-control studies [abstract]. Fuzzy Systems and Knowledge Discovery (FSKD), Haikou, China. 2007, 3: 21-26.
- Witten IH, Frank E, Hall MA: Data Mining: Practical Machine Learning Tools and Techniques. The Morgan Kaufmann Series in Data Management Systems. 2011, 3rd ed.
- Vasan RS, Larson MG, Aragam J, Wang TJ, Mitchell GF, Kathiresan S, Newton-Cheh C, Vita JA, Keyes MJ, O'Donnell CJ, Levy D, Benjamin EJ: Genome-wide association of echocardiographic dimensions, brachial artery endothelial function and treadmill exercise responses in the Framingham Heart Study. BMC Med Genet. 2007, 8 (suppl): S2.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.