Entropy-based method for assessing the influence of genetic markers and covariates on hypertension: application to Genetic Analysis Workshop 18 data

Many complex diseases are related to genetics, and it is of great interest to evaluate the association between single-nucleotide polymorphisms (SNPs) and disease outcome. The association of genetics with outcome can be modified by covariates such as age, sex, smoking status, and membership in the same pedigree. In this paper, we propose a block entropy method to separate two classes of SNPs: those whose association with hypertension is sensitive to the covariates and those whose association is insensitive. We also propose a consistency entropy method to further reduce the number of SNPs that might be associated with the outcome. Based on the data provided by the organizers of Genetic Analysis Workshop 18, we calculated the block entropies for six different blocking strategies. Using block entropy and consistency entropy, we identified 230 SNPs on chromosome 9 that are most likely to be associated with the outcome and whose associations with hypertension are sensitive to the covariates.


Introduction
Hypertension is the leading cause of cardiovascular disease worldwide. In the period 1999 to 2002, 28.6% of the U.S. population had hypertension, and several million people in the world currently have this condition [1]. This disease is often called the "silent killer" because it may not cause symptoms until the patient has sustained serious damage to the arteries, brain, and kidney. With advances in technology, more genetic information is generated, and association studies of genetic variants with disease are routinely investigated. Here we focus on the combined role of genetics and covariates [2] in the development of hypertension.
In this paper, we propose a method that separates single-nucleotide polymorphisms (SNPs) into two classes: those whose associations with hypertension are sensitive to the covariates and those whose associations are insensitive. The method splits people into blocks according to their covariates and calculates the block entropy, or block entropy gain, of each SNP. We also propose an entropy-based method that checks, across the blocks, the consistency of the proportions of the outcome grouped by the main genotype; this helps us identify the SNPs that are probably associated with the disease outcome. By combining the results of the block entropy and the consistency entropy, we can find the SNPs that are most probably associated with the outcome.
We believe that our method is useful for detecting important associations between outcome and genetic markers as well as covariates. For example, if hypertension is associated with a covariate, the proportion of hypertension will be different for different covariate values. In that case, we say that hypertension is sensitive to the covariate. Similarly, if hypertension is associated with a SNP, the proportion of hypertension in different genotypes will be different. Therefore, we say that hypertension is sensitive to the SNP.

Materials and methods
Genotype, covariates, and phenotype data
Our methods for investigating the relationship between SNPs and hypertension are based on data provided by the organizers of Genetic Analysis Workshop 18 (GAW18). Genotype data are provided for 11 odd-numbered chromosomes. We use all the SNPs on chromosomes 3 and 9 from the genome-wide association studies (GWAS) files, including all people with genotype data. A total of 20 pedigrees are provided with information on father, mother, and sex. The phenotype (hypertension) is measured at a maximum of 4 time points. In this paper, we use the baseline hypertension status. We include the following covariates at baseline: Sex, Smoke, and Age.

Block entropy and block entropy gain
Shannon entropy [3] is used as a measure of genetic diversity [4]. For a binary random variable $E$ with probability $p = \mathrm{Prob}\{E = 1\}$, Shannon entropy is defined as $-p \log_2 p$. We define the entropy of $E$, which equals entropy$(p, 1-p)$ in information theory [5], as

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$

The variable $E$ represents an event in a subset of the complete observations, such as hypertension status in a group of people, and $p$ is the proportion of the outcome in that block of the sample.

Suppose we have $m$ blocks; the $n$ observations are split into these $m$ blocks $B_1, B_2, \ldots, B_m$ with block sizes $n_1, n_2, \ldots, n_m$, respectively. Block $B_i$ ($i = 1, 2, \ldots, m$) has $q_i$ observations with the outcome. The observations within a block $B_i$ are assumed to share something in common, where "similarity" might be defined based on covariates and relatedness, such as age, sex, smoking status, and membership in the same pedigree. Because each SNP has 3 genotype classes, block $B_i$ can be divided into 3 groups of observations, $B_{i0}$, $B_{i1}$, and $B_{i2}$, with sizes $n_{i0}$, $n_{i1}$, and $n_{i2}$, respectively. All of the observations in $B_{ij}$ have the genotype $j$, for $j = 0, 1, 2$. The group $B_{ij}$ has $q_{ij}$ observations with the outcome. For block $B_i$, the entropy is

$$G_i = \sum_{j=0}^{2} \frac{n_{ij}}{n_i} \, H\!\left(\frac{q_{ij}}{n_{ij}}\right).$$

Therefore, we can get the block entropy to represent the entropy with SNPs:

$$G = \sum_{i=1}^{m} \frac{n_i}{n} \, G_i.$$

We can also get the entropy without considering SNPs:

$$G^* = \sum_{i=1}^{m} \frac{n_i}{n} \, H\!\left(\frac{q_i}{n_i}\right).$$

The block entropy gain (I) is defined as $G_a = G^* - G$.

If we want to show the gain of the block entropy from the information of SNPs without considering the covariates, we can get the block entropy gain (II), where $n_{\cdot j}$ ($j = 0, 1, 2$) is the number of observations with genotype $j$ and $q_{\cdot j}$ ($j = 0, 1, 2$) is the number of observations with genotype $j$ that have the outcome.
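As an illustration, the quantities $G$, $G^*$, and the block entropy gain (I) can be computed for one SNP as in the following minimal Python sketch. It assumes that the entropy of a block is the size-weighted average of the binary entropies of its genotype groups, and that $G$ and $G^*$ are size-weighted averages over blocks. The function names and the toy data are hypothetical and are not taken from the GAW18 analysis.

```python
import math
from collections import defaultdict


def binary_entropy(p):
    """Entropy in bits of a binary variable with P(E = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # convention: 0 * log2(0) = 0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def weighted_entropy(groups, outcome):
    """Size-weighted mean of binary entropies over groups of observations.

    groups  : one hashable group label per observation
    outcome : one 0/1 outcome value per observation
    """
    size = defaultdict(int)
    cases = defaultdict(int)
    for label, y in zip(groups, outcome):
        size[label] += 1
        cases[label] += y
    n = len(outcome)
    return sum(size[g] / n * binary_entropy(cases[g] / size[g]) for g in size)


def block_entropies(block, genotype, outcome):
    """Return (G, G_star, G_a) for one SNP (assumed weighted-average form)."""
    # G: groups are the (block, genotype) cells B_ij, each weighted by n_ij / n.
    g = weighted_entropy(list(zip(block, genotype)), outcome)
    # G*: groups are the blocks B_i only, each weighted by n_i / n.
    g_star = weighted_entropy(block, outcome)
    return g, g_star, g_star - g


# Toy example: two age blocks, genotypes coded 0/1/2, outcome coded 0/1.
block = ["young"] * 4 + ["old"] * 4
genotype = [0, 0, 1, 1, 0, 0, 1, 1]
outcome = [0, 0, 1, 1, 0, 1, 0, 1]
G, G_star, G_a = block_entropies(block, genotype, outcome)
```

In this toy example the genotype fully determines the outcome in the "young" block but carries no information in the "old" block, so the SNP lowers the entropy in only one of the two blocks.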
Consider that the study population is split according to one attribute, for example, Age. We calculate $G^*$, the entropy for this attribute without considering SNPs. We then calculate $G^*$ for another attribute, say Smoke. If $G^*$ is smaller for Age, then Age has a stronger association with hypertension than does Smoke. We then calculate the block entropy, $G$, for the blocks split by Age at each locus. A lower value of $G$ shows that adjusting for Age is informative for the SNP. The block entropy gain (I) or (II) goes the opposite way: the higher the block entropy gain is, the more informative the SNP is.

Associations of genotypes with covariates
It is clear that if the genotype $j$ is fully associated with the covariates and the outcome is fully associated with the covariates, then $\varepsilon_{1j}, \varepsilon_{2j}, \ldots, \varepsilon_{mj} = 0$. We can say that the genotype $j$ in this extreme condition is consistent. If there is no relationship between the genotype and the covariates, $\varepsilon_{ij}$ can be any value. Let the size of the block be large enough to show the blocking effects, say $n_0$. We define the standardized proportion of the associations of the genotype $j$ with the outcome in block $B_i$ with respect to $n_0$, where $n_0$ is a constant based on the overall block sizes.
The standardized proportion of the total associations of a genotype $j$ with the outcome at a locus is denoted $F(j)$.

The consistency of the main genotype
Because a block includes people with 3 genotypes of a SNP, the associations of a genotype $j$ with the outcome may have a higher standardized proportion in one block but a lower one in another block. Such a genotype is not consistent across the 2 blocks. Although the entropy is low for this SNP, which means that the associations of the SNP and covariates with the outcome are high, the inconsistency of the genotype across the blocks leads only to a weak conclusion that this informative SNP is related to the covariates. Suppose that, for a SNP at a locus, the standardized proportion of the total associations of a genotype $j$ with the outcome is $F(j)$. We choose the main genotype $j$ satisfying $F(j) = \max_{k \in \{0,1,2\}} F(k)$. We define a consistency entropy to check the consistency, across the blocks, of the standardized proportion of the associations of this main genotype $j$.
The greater the consistency entropy of the main genotype $j$ is, the stronger the consistency of the genotype $j$.

The block entropy or entropy gain algorithm
We first identify observations with complete genotype data and obtain the values of the covariates Sex, Smoke, and Age, as well as the pedigree classes and the hypertension status at baseline (HTN 1). Then we choose a blocking strategy, for example, the blocking strategy based on the attribute Age.
We next calculate the block entropy or block entropy gain according to this blocking strategy at each locus and sort the loci in ascending order of block entropy. The upper s% (e.g., 15%) of SNPs in this ordering, that is, those with the lowest block entropies, are those whose associations with the outcome are related to the covariates. The lower i% (e.g., 15%) of loci, those with the highest block entropies, define the SNPs whose effects are insensitive to the covariates; many of these loci include rare genotypes.
For the upper s% of loci, we find the main genotype j and calculate the consistency entropy for the standardized proportion of the associations of the genotype j with the outcome. We remove the SNPs with the lower consistency entropies; the SNPs associated with the outcome remain. We can combine several block entropies and consistency entropies for these SNPs to find the important SNPs associated with the outcome.
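The ranking and filtering steps of this algorithm can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes that the "upper" part of the ascending ordering means the SNPs with the lowest block entropies, and that block entropies and consistency entropies have already been computed and stored in dictionaries keyed by SNP identifier. The function names, SNP identifiers, and cutoff are hypothetical.

```python
def split_by_block_entropy(block_entropy, s_percent=15.0, i_percent=15.0):
    """Split SNPs into covariate-sensitive and covariate-insensitive sets.

    block_entropy : dict mapping SNP id -> block entropy at that locus
    Returns (sensitive, insensitive) as lists of SNP ids.
    """
    ordered = sorted(block_entropy, key=block_entropy.get)  # ascending entropy
    n = len(ordered)
    n_upper = int(n * s_percent / 100.0)
    n_lower = int(n * i_percent / 100.0)
    sensitive = ordered[:n_upper]  # lowest block entropies
    insensitive = ordered[n - n_lower:] if n_lower else []  # highest
    return sensitive, insensitive


def keep_consistent(snps, consistency_entropy, cutoff):
    """Keep only the SNPs whose consistency entropy exceeds the cutoff."""
    return [snp for snp in snps if consistency_entropy[snp] > cutoff]


# Toy example with ten hypothetical SNPs and precomputed entropies.
be = {f"snp{k}": 0.1 * k for k in range(10)}
sensitive, insensitive = split_by_block_entropy(be, s_percent=20, i_percent=20)
ce = {"snp0": 0.95, "snp1": 0.50}
selected = keep_consistent(sensitive, ce, cutoff=0.9)
```

Sorting once and slicing both ends keeps the two classes disjoint as long as s% and i% together do not exceed 100%.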

Application to real data
Our first blocking strategy takes all observations as a single block. As a crude estimate, the block entropy or entropy gain can classify the SNPs as either associated or not associated with the outcome. We next choose blocking strategies based on a single attribute (Sex, Smoke, Age, or Pedigree) and calculate the block entropy and block entropy gain for all the SNPs under each strategy. Using the block entropy approach, we classify the SNPs related to the disease outcome as either sensitive or insensitive to the covariates. Here we focus on the SNPs sensitive to the covariates. We also calculate the entropies for checking the consistency of the main genotype among the blocks. The rules are $BE < V_{BE}$ and $CE > V_{CE}$, where $BE$ is a block entropy, $CE$ is a consistency entropy, and $V_{BE}$ and $V_{CE}$ are the limit values of $BE$ and $CE$, respectively. For each consistency entropy, we remove around 20% of the SNPs as inconsistent; for CE_Ped, we keep around 40% of the SNPs. The limit value for the block entropy is $V_{BE} = BE_{\min} + k \, (BE_{\max} - BE_{\min})$, where $BE_{\min}$ is the minimum block entropy based on the covariate and $BE_{\max}$ is the maximum. We use $k = 0.45$ for chromosome 9 and $k$ between 0.5 and 0.7 for chromosome 3.
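The limit value $V_{BE} = BE_{\min} + k \, (BE_{\max} - BE_{\min})$ and the two selection rules can be illustrated with a short Python sketch. The function names and the toy entropies below are hypothetical; only the formula and the value k = 0.45 come from the text.

```python
def be_limit(block_entropies, k):
    """V_BE = BE_min + k * (BE_max - BE_min) for one blocking strategy."""
    be_min = min(block_entropies)
    be_max = max(block_entropies)
    return be_min + k * (be_max - be_min)


def select_snps(be, ce, k, v_ce):
    """Apply the rules BE < V_BE and CE > V_CE.

    be, ce : dicts mapping SNP id -> block entropy / consistency entropy
    k      : fraction of the block entropy range used to set V_BE
    v_ce   : limit value for the consistency entropy
    """
    v_be = be_limit(list(be.values()), k)
    return [snp for snp in be if be[snp] < v_be and ce[snp] > v_ce]


# Toy example with three hypothetical SNPs.
be = {"a": 0.2, "b": 0.5, "c": 1.0}
ce = {"a": 0.95, "b": 0.50, "c": 0.99}
chosen = select_snps(be, ce, k=0.45, v_ce=0.9)
```

Here the limit is 0.2 + 0.45 × 0.8 = 0.56, so SNPs "a" and "b" pass the block entropy rule, and only "a" also passes the consistency rule.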
To select SNPs with low block entropy and high consistency entropy, we use the rules BE_All-Attr < 0.36 and BE_No-Attr < 0.67 and CE_Ped > 0.29 and CE_Sex > 0.9 and CE_Smoke > 0.9 and CE_Age > 0.9 (note: BE_Age, for example, is the block entropy for the blocking strategy based on Age). We choose 230 of the 42,178 SNPs in the chromosome 9 GWAS file as the SNPs sensitive to the covariates. Table 1 shows 17 SNPs among the 230 loci whose associations with hypertension were also found by other researchers in the literature, for example, rs774227 [6]; rs6833 is related to hypertension in pregnancy. From Table 1, the block entropy of rs6833 for All-Attr is 0.341. This is a small value, but not very small compared with those of the other 42,177 SNPs. We select this SNP because its consistency entropies are high as well.
For chromosome 3, we use the rules BE_All-Attr < 0.36 and BE_Pedigree < 0.605 and BE_Age < 0.579 and BE_Sex < 0.667 and BE_No-Attr < 0.677 and CE_Ped > 0.29 and CE_Sex > 0.9 and CE_Smoke > 0.9 and CE_Age > 0.9. We choose 156 of the 65,519 SNPs in the chromosome 3 GWAS file as the SNPs sensitive to the covariates. Table 2 shows 6 SNPs among the 156 loci whose associations with hypertension were also found by other researchers in the literature.

Application to simulated phenotype data
Two hundred simulated phenotype data files are provided by GAW18. For each simulated phenotype data set, we calculate the block entropy, block entropy gain (II), and consistency entropy for the SNPs under 8 situations. The first takes all observations as a single block to calculate the block entropy. The next 3 situations use blocking strategies based on Sex, Age, and Pedigree, respectively. The fifth calculates the block entropy with blocks split according to all the values of Sex, Smoke, Age, and Pedigree. The sixth uses the blocking strategy of the fifth plus a consistency entropy for pedigree: the consistency entropy of a selected SNP must be larger than the average of all the consistency entropies. The seventh uses a blocking strategy that splits the people based on all the values of Sex, Smoke, and Age. The last uses the seventh blocking strategy but calculates the block entropy gain (II). For the SNPs of the "functional" genes in the GAW18 phenotype simulations, we take the values under these 8 situations and check whether they fall in the top 5% of the SNPs in the simulated data. For each of these SNPs, we count the number of simulations in which it falls in the top 5%. The results are summarized in Tables 3 and 4.

Table 1 Single-nucleotide polymorphisms in chromosome 9 that are sensitive to the covariates
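The counting of simulations in which a functional SNP reaches the top 5% can be sketched as below. This is a minimal illustration under stated assumptions: scores are already computed per simulation, "top" means the lowest values for a block entropy and the highest values for an entropy gain, and at least one SNP is always counted as top. The function name and the toy data are hypothetical.

```python
def count_top_hits(scores_per_simulation, target_snps, fraction=0.05,
                   lower_is_better=True):
    """Count, for each target SNP, the simulations placing it in the top fraction.

    scores_per_simulation : list of dicts (SNP id -> score), one per simulation
    lower_is_better       : True for block entropy, False for entropy gain
    """
    hits = {snp: 0 for snp in target_snps}
    for scores in scores_per_simulation:
        ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
        n_top = max(1, int(len(ordered) * fraction))  # keep at least one SNP
        top = set(ordered[:n_top])
        for snp in target_snps:
            if snp in top:
                hits[snp] += 1
    return hits


# Toy example: 3 simulations over 20 hypothetical SNPs.
sim_a = {f"s{k}": k for k in range(20)}       # s0 has the lowest score
sim_b = {f"s{k}": 19 - k for k in range(20)}  # s19 has the lowest score
hits = count_top_hits([sim_a, sim_b, sim_a], ["s0", "s19"])
```

With 20 SNPs, the top 5% is a single SNP per simulation, so "s0" is counted in two simulations and "s19" in one.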

Discussion
We use the block entropy to classify SNPs whose associations with hypertension are either sensitive or insensitive to covariates. From the calculations based on the data provided by GAW18, we show that entropy-based methods might be useful in separating these two classes of SNPs. The entropy without considering the SNPs can show that different covariates have different associations with hypertension. The associations of the SNPs and the covariates with hypertension are shown in the block entropy. Comparing the ordered block entropies for all observations taken as one block with those for blocks split by Sex, we can see that the blocking strategy based on the covariate Sex reduces the block entropy. This means that the attribute Sex together with the SNPs enriches the information on their associations with hypertension. The attribute Age is very important because the block entropies based on Age are much lower than those obtained when all observations are kept in a single, unsplit block. Clearly, Pedigree is the most informative and complex attribute. It has a strong impact on the outcome, and different pedigrees affect the outcome differently. Because of the complexity of the pedigrees, the consistency entropy among the blocks of people split by pedigree is small compared with that for the other covariates. Results for the simulated data also show the effectiveness of the block entropy, the block entropy gain (II), and the consistency entropy. These methods can be used independently or in combination.
Following a suggestion by a referee and as a proof-of-principle, we used a permutation framework to assess statistical significance. The process is computer intensive: it took 12 days to summarize the results of 100 permutations for only 28 of the 200 simulated phenotype data sets. The results from this small-scale permutation showed that the number of true positives detected in Tables 3 and 4 is noteworthy and that the false-positive rate is also unlikely to be high.

Table 3 Counts of "functional" single-nucleotide polymorphisms in the top 5% of single-nucleotide polymorphisms in chromosome 9 for the 200 simulations
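The permutation framework mentioned in the Discussion is not described in detail, so the following Python sketch shows only a generic version of the idea: the outcome labels are shuffled, a statistic is recomputed on each shuffle, and an empirical p-value is obtained. It assumes that smaller values of the statistic are more extreme (as for a block entropy) and that outcome labels are exchangeable, which may not hold within pedigrees. The function name, the toy statistic, and the data are hypothetical.

```python
import random


def permutation_p_value(statistic, outcome, n_permutations=100, seed=0):
    """Empirical p-value for a statistic where smaller values are more extreme.

    statistic : function taking a list of 0/1 outcomes and returning a number
    outcome   : observed 0/1 outcomes, one per observation
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    observed = statistic(outcome)
    shuffled = list(outcome)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any outcome-genotype association
        if statistic(shuffled) <= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_extreme + 1) / (n_permutations + 1)


# Toy statistic: number of mismatches between outcome and a binary genotype.
genotype = [0] * 10 + [1] * 10


def mismatches(y):
    return sum(a != b for a, b in zip(y, genotype))


p = permutation_p_value(mismatches, list(genotype), n_permutations=100)
```

Because the statistic must be recomputed for every permutation, every SNP, and every simulated data set, the cost grows quickly, which is in line with the long running time reported above.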