The GAW15 Problem 1 data set, from Morley et al. [6], includes single-nucleotide polymorphism (SNP) genotypes at 2882 loci across the genome and 3554 gene expression phenotypes in lymphoblastoid cells, for 14 three-generation CEPH (Centre d'Etude du Polymorphisme Humain) Utah families.
To select phenotypes for ANN analysis, we carried out Haseman-Elston (H-E) regression [7] using the Statistical Analysis for Genetic Epidemiology package (S.A.G.E., Release 5.2: http://genepi.cwru.edu/) and variance-components (VC) linkage analysis using Merlin [8]. The former analysis is the same as that used by Morley et al. [6].
We used two different genetic maps. We compiled Map 1 from the SNP Consortium map data http://snp.cshl.org/linkage_maps/[9]. Map 2 was provided by Ellen Wijsman's group http://faculty.washington.edu/wijsman/gaw15.shtml[10]. Comparing the two maps, we found that Map 2 included a greater number of the GAW15 markers. Early ANN analyses used Map 1 and were then updated using Map 2; we report the latter results, and describe some contrasts with Map 1 results in the Discussion. Identity-by-descent (IBD) sharing was estimated across the genome for 378 sib pairs using Merlin. IBD sharing at 10-cM gridpoints was presented as input data to the ANNs, coded as a continuous variable between 0 and 1. Phenotype data was used as the output values at two output nodes: the first for the squared trait difference, and the second for the mean-corrected trait sum. This coding thus contains the key phenotypic information also used by the "new" H-E method [11].
We used the Stuttgart Neural Network Simulator (SNNS), Version 4.2 http://www-ra.informatik.uni-tuebingen.de/SNNS/[12]. ANN training used standard back-propagation with weight decay. Initial runs used all sib pairs for both training and validation, similar to the method of Lucek and colleagues [1, 3]; after a fixed number of cycles, the trained net having lowest sum-of-squares-error (SSE) between network outputs and data-specified target outputs underwent further analysis of its weights. Subsequently, we used five-fold cross-validated training to avoid overtraining and improve the generalization ability of the ANN models; we also hoped that cross-validated models would lead to more stable linkage results. An important feature of our previous ANN work was our use of cross-validation to select models before the ANN weights were analyzed for linkage evidence [2]. We hypothesized that cross-validation would also help address concerns about unstable ANN linkage results such as reported in [13].
We used a network architecture with two hidden layers. We found that two hidden layers performed better than a single layer, attaining lower error and more appropriate output values. Note that the phenotype coding allows the squared trait sums and differences to range freely. Thus, the activation function at the output nodes was the identity function rather than the usual logistic sigmoid function, since the latter would restrict outputs to range between 0 and 1. We believe the second hidden layer is therefore performing some rescaling needed to obtain appropriate output values. Thus, the primary ANNs contained 362 input nodes, two hidden layers of 50 nodes each, and two output nodes.
To identify input loci that are important in determining phenotypic status, we used an algorithm similar to those in our previous work [2]. Consider a trained network with I input nodes, two hidden layers with H nodes each, and O output nodes, such that each node in a given layer is connected to every node in the next layer. Suppose the ith input x
i
is connected to the jth hidden node via a weight u
ij
, the jth hidden node is in turn connected to the kth hidden node in the next layer via a weight v
jk
, and this kth hidden node is connected to the lth output node via w
kl
. Then we calculate an "importance measure" for the ith input: . To combine results across the five separate ANNs generated by cross-validated analyses, we standardized the importance measure by subtracting the mean and dividing by the standard deviation, then averaged across the five ANNs to rank the signals at each input.
For our novel analysis of the interaction between a given pair of loci, we considered the weights for each pair of connections leading from the two loci to the same node in the next, hidden layer. We calculated the Pearson correlation for the resulting data set of 50 pairs of weights. A strong positive correlation is interpreted as cumulative or synergistic action of the loci, and a strong negative correlation may be interpreted as either antagonistic or complementary action. The rationale for this correlation analysis relies on the following details of an ANN's operation. For the generic network described above, a linear combination of the I inputs x
i
forms the argument for the activation function f
j
of the jth node of the next layer, with weights u
ij
as the coefficients, as follows: , where a
j
is an intercept term. We hypothesized that if two inputs have positively correlated weights leading to the next layer, similar input values will have cumulative effect on the arguments of the activation functions; for two inputs with negatively correlated weights, similar input values will have antagonistic effect. A simplified example, in which the weights from one input node are a fixed multiple of those from a second, unlinked input node and thus are perfectly correlated with those latter weights, demonstrates this most clearly. Letting c be the scaling factor for the weights and indexing these two inputs with 1 and 2, then u2j= cu1jfor all j, and the jth activation function value is: . In the extreme case in which |c| = 1, we see that when c is positive, similar input values will have cumulative effects across all activation functions in the hidden layer, and when c is negative, subtraction will cancel out the effects of two similar input values.
To further evaluate potential interactions detected by our correlation analysis, we used both VC (SOLAR with the "-epistasis" option [14]) and a simple categorical "bin" analysis. For the latter, we subdivided the sibpairs into a three-by-three table of low (0 ≤ IBD < 1/3) medium (1/3 ≤ IBD < 2/3) and high (2/3 ≤ IBD ≤ 1) IBD categories at each locus in the pair. In each cell the proportion of "trait dissimilar" sib pairs was calculated, where pairs were labeled dissimilar if their squared trait difference was greater than the overall average. This descriptive analysis helped us assess our hypotheses about the effects of positively versus negatively correlated weights.