### Data

Data for Problem 1 of GAW 15 come from 14 three-generation CEPH (Centre d'Etude du Polymorphisme Humain) Utah families. All 194 members of these families were genotyped at 2882 autosomal and X-linked SNPs. Expression levels in lymphoblastoid cells were measured for 8500 genes. 3554 transcripts were selected by Morley and colleagues [5] because of greater variation among individuals than between replicated measurements. For all analyses, we used log-transformed expression levels.

### Mutual information

The concept of MI is based on the entropy of a random variable. Shannon [6] defined entropy as a measure of uncertainty in predicting a future value of a random variable *X*. In other words, the entropy can be viewed as a quantity representing the information associated with an outcome of *X*. In case of a discrete random variable *X* with outcomes *x* ∈ \mathcal{X}, the entropy is given by

H(X)=-{\displaystyle \sum _{x\in \mathcal{X}\phantom{\rule{0.5em}{0ex}}}{p}_{X}}(x){\mathrm{log}}_{2}{p}_{X}(x)={\displaystyle \sum _{x\in \mathcal{X}\phantom{\rule{0.5em}{0ex}}}{p}_{X}}(x){\mathrm{log}}_{2}\frac{1}{{p}_{X}(x)},

where *p*_{
X
}(*x*) = *P*_{
X
}(*X* = *x*), and \mathcal{X} denotes the sample space of *X*. The entropy is non-negative, disappears in case of a degenerate random variable and is maximal for a uniformly distributed *X*. For estimation, we use empirical frequencies {\widehat{p}}_{X}(x=i)={\scriptscriptstyle \frac{{n}_{i}}{n}}, where *n*_{
i
}is the number of observations with *X* = *i*, and *n* is the total sample size. A natural estimator of *H*(*X*) is thus given by \stackrel{\_}{H}(X)=-{\displaystyle {\sum}_{i=1}^{\left|\mathcal{X}\right|}{\scriptscriptstyle \frac{{n}_{i}}{n}}}{\mathrm{log}}_{2}{\scriptscriptstyle \frac{{n}_{i}}{n}}, where |\mathcal{X}| is the cardinality of \mathcal{X}.

MI extends the latter concept to random variables *X* and *Y*, and measures their mutual dependence. In the discrete case, *MI*(*X*, *Y*)

is defined as

MI(X,Y)={\displaystyle \sum _{x\in \mathcal{X}}{\displaystyle \sum _{y\in \mathcal{Y}}{p}_{XY}}}(x,y){\mathrm{log}}_{2}\frac{{p}_{XY}(x,y)}{{p}_{X}(x){p}_{Y}(y)},

where *p*_{
XY
}(*x*, *y*) denotes the joint probability of *X* and *Y*, and *p*_{
X
}(*x*) and *p*_{
Y
}(*y*) are the corresponding marginal probabilities. *MI*(*X*, *Y*) quantifies the amount of information that one random variable contains about the other. Equivalently, *MI*(*X*, *Y*) measures the reduction in uncertainty of *X* if *Y* is known, and vice versa. Thus, for two independent variables *X* and *Y*, observing one of them does not lead to an information gain concerning the other one. In this case, *p*_{
XY
}(*x*, *y*) = *p*_{
X
}(*x*)·*p*_{
Y
}(*y*), so that the ratio equals one and, subsequently, *MI*(*X*, *Y*) = 0. The other extreme case occurs if *X* and *Y* are completely linearly related. Then, all information conveyed by *X* can be found within *Y*. *MI*(*X*,*Y*) can then be calculated using the entropy of one of both variables.

Moreover, *MI*(*X*, *Y*) is symmetric, non-negative and is related to the entropy via

*MI*(*X*, *Y*) = *H*(*X*) - *H*(*X* | *Y*) = *H*(*X*) + *H*(*Y*) - *H*(*X*, *Y*),

where *H*(*X* | *Y*) denotes the conditional entropy, and *H*(*X*, *Y*) is the joint entropy of *X* and *Y*. Because *H*(*X* | *X*) = 0, we obtain *H*(*X*) = *MI*(*X*, *X*), and *H*(*X* | *Y*) ≥ 0 yields *MI*(*X*, *X*) ≥ *MI*(*X*, *Y*). In other words, no other variable can contain more information about *X* than *X* itself.

For discrete random variables, the distribution of *MI*(*X*, *Y*) can be approximated by a second-order Taylor series. In case of stochastically independent random variables *X* and *Y*, the resulting expression is related to a *χ*^{2}-test of independence and follows a gamma distribution [7]. As a consequence, optimal asymptotic statistical properties of the *χ*^{2}-test also hold for MI [2]. However, the required regularity conditions are not always met. Thus, the approximation to the *χ*^{2} distribution is imprecise if sample sizes are small, if the numbers per class differ substantially, or if the number of classes is small. In these cases, the use of empirical *p*-values as obtained from a permutation procedure is preferable.

If one of the variables, say *Y*, is continuous with observations *y* ∈ ℝ, a QMIS can be calculated [4]. To this end, a threshold *q* is determined, leading to a dichotomous variable *Y*_{
q
}such that *Y*_{
q
}= *I*(*Y* <*q*), where *I*(·) denotes the indicator function. *QMIS*(*X*, *Y*) is then defined as the maximum possible MI:

*QMIS*(*X*, *Y*) = max_{q∈ℝ }*MI*(*X*, *Y*_{
q
}).

Analogously to the entropy, relative frequencies can be used to determine \stackrel{\_}{MI}(X,{Y}_{q}). For a given threshold *q* and using the right hand side of eq. (1), an estimator is given by

\stackrel{\_}{MI}(X,{Y}_{q})={\displaystyle \sum _{i=1}^{\left|\mathcal{X}\right|}{\displaystyle \sum _{j=1}^{2}\frac{{n}_{ij}}{n}}}{\mathrm{log}}_{2}\frac{{n}_{ij}}{n}-{\displaystyle \sum _{i=1}^{\left|\mathcal{X}\right|}\frac{{n}_{i.}}{n}}{\mathrm{log}}_{2}\frac{{n}_{i.}}{n}-{\displaystyle \sum _{j=1}^{2}\frac{{n}_{.j}}{n}}{\mathrm{log}}_{2}\frac{{n}_{.j}}{n},

(3)

where {n}_{i.}={\displaystyle {\sum}_{j=1}^{2}{n}_{ij}} and {n}_{.j}={\displaystyle {\sum}_{i=1}^{\left|\mathcal{X}\right|}{n}_{ij}} and *n*_{
ij
}is the number of observations in group *j* defined by *Y*_{
q
}with *X* = *i*.

The estimator given in Eq. (3) is solely based on frequencies and does not involve any continuous measurements such as squared deviations from a mean value. As a consequence, the QMIS approach is robust against outliers and does not require any specific shape of distribution.

### Application 1: comparison of QMIS with ANOVA and KW

In the first application, we compared the permutation-based QMIS with the standard parametric ANOVA and the nonparametric KW for detecting associations between a single gene expression level and a single SNP. We systematically investigated the causes for differences in statistical test results. We calculated QMIS according to Eq. (2) using estimates obtained from Eq. (3). To specify a final threshold for each gene, the range of expression values was divided into 10 equidistant intervals, yielding 11 potential thresholds within the maximization process. The number of intervals was limited to reduce computation time.

Empirical *p*-values were computed using a permutation procedure. In doing so, for each of the *N* = 10^{4} permutations, genotypes were fixed, while gene expressions were permuted. In cases where no permuted test statistic was at least as extreme as the test statistic using the original data, calculations were repeated with *N=*10^{8} permutations. To allow for heteroscedastic errors, the Welch modification of the parametric ANOVA was used. Finally, we denoted an association of gene *i* and transcript *j* to be significant if the *p*-value is less than *α*_{local} = 10^{-4}.

To avoid within-family dependencies, we only used founder data for model comparisons. In addition, SNPs with less than 5% of samples for each genotype and more than 5% of missing genotypes were removed, yielding 3554 genes and 1089 SNPs.

Differences in *p*-values between the different approaches were visualized using hexagon binning [8]. All computations were done using the computer program R including special parts implemented in C.

### Application 2: feature selection in classification

In the second application, we aimed at classifying the molecular phenotype data using a small set of features.

Within the first step, we created clusters of individuals with similar gene expression profiles. The data were modeled as a finite mixture of Gaussian distributions, and the most likely cluster partition was determined by an expectation-maximization (EM) algorithm using a 10-fold cross-validation procedure to determine the optimal number of clusters. After the clustering step we assigned a label to each individual, which was used to perform a supervised learning task on the basis of the genotype information.

In what followed, we considered SNPs and clusters as random variables and computed the MI between them. We evaluated the significance of MI in two ways. First, we focused on a parametric approach and used the relation between MI and the *χ*^{2}-test of independence. More precisely, we considered a *χ*^{2}-test in which the degrees of freedom (d.f.) are equal to (number of genotypes - 1)·(number of classes - 1) at a significance level of *α* = 0.001. Second, we applied a novel nonparametric approach based on a permutation method inspired by the significance analysis of microarrays (SAM) technique [9]. This approach allowed us to evaluate the expected false discovery rate (FDR) on the empirical "null" distribution of MI values by choosing a threshold for the MI. Conversely, by defining a desired FDR, we could choose a suitable corresponding threshold value for MI (Figure 1).

The step following feature selection was classification. We estimated the goodness-of-fit of the model using a 10-fold cross-validation. We determined the accuracy of the classification procedure as the proportion of correctly classified individuals. A person was classified correctly if the classification step led to that class that was specified in the previous cluster analysis. In general, a cross-validation resolves the problem of obtaining a test set, but it is affected by a selection bias that is often ignored. Here, the selection bias may result from performing feature selection on the entire data set before running a cross-validation. In this way, all the cross-validation learning steps are affected by the bias provided by feature selection, because the entire data set was already used for selecting the most interesting features. To overcome this problem, we included the feature selection step in the cross-validation procedure, as recommended by Simon et al. [10], to select the most informative SNPs on each of the 10 training sets and to use them only on the corresponding *j*^{th} test set.

Because feature selection was made in each iteration step of the cross-validation, a different SNP list was generated in each run of the procedure. To derive a consensus SNP list, we extracted those SNPs that were selected in each run of the cross-validation process. Therefore, this set contained the most important SNPs regardless of small changes in the training set.

The WEKA software version 3.4 [11] was employed for both clustering with 10-fold cross-validation and classification using the learning machines naïve Bayes, support vector machine (SVM) and *k*-nearest-neighbor (KNN). For the number of neighbors in KNN, we chose *k* = 1, 5, 10, and for all learning machines we used the default settings of the package.