Neural network analysis (NNA)
NNA provides powerful tools for modeling pre-specified responses to complex, multidimensional input stimuli. A specific advantage of NNA is that no causal relationship between stimuli and response is required. NNA connects the "neurons" of the input and output layers via one or more "hidden" layers. Each output is computed by applying a sigmoid threshold to the scalar product of the corresponding weight and input vectors, and every output of stage "s" is connected to each input of stage "s + 1". The most popular learning strategy, the back-propagation algorithm, searches the weight space for the minimum of the error function (goodness of fit) using the method of gradient descent. The basic algorithm is:
(i) output: y_i observed (i = 1, 2, ..., N_i)
(j) hidden layer: (j = 1, 2, ..., N_j)
(k) input: s_k = x_k, x_k observed (k = 1, 2, ..., N_k)
improvements: (ν = 1, 2, ..., p)
where x_k denotes observed stimuli, y_i denotes observed responses, σ denotes the activation function of sigmoid type: R → (0, 1), α denotes the learning rate, and p is the number of probes (genotyped subjects). This algorithm can easily be adapted to genetic models.
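The back-propagation scheme described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes a single hidden layer, a squared-error function, a constant bias input at each stage, and repeated passes over the probes. The function names and the default values of n_hidden, alpha, and epochs are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Propagate one input vector x through the hidden and output layers."""
    xb = x + [1.0]  # assumption: constant bias input appended to the stimuli
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = h + [1.0]  # assumption: bias input for the output stage as well
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return xb, hb, o


def train(samples, n_hidden=3, alpha=0.5, epochs=5000, seed=0):
    """Fit the weights by gradient descent on the squared error."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, y in samples:  # one improvement step per probe
            xb, hb, o = forward(x, w_hidden, w_out)
            # Output deltas: error times the sigmoid derivative o * (1 - o).
            d_out = [(o[i] - y[i]) * o[i] * (1 - o[i]) for i in range(n_out)]
            # Hidden deltas: output deltas propagated back through w_out.
            d_hid = [
                sum(d_out[i] * w_out[i][j] for i in range(n_out)) * hb[j] * (1 - hb[j])
                for j in range(n_hidden)
            ]
            # Gradient-descent updates with learning rate alpha.
            for i in range(n_out):
                for j in range(n_hidden + 1):
                    w_out[i][j] -= alpha * d_out[i] * hb[j]
            for j in range(n_hidden):
                for k in range(n_in + 1):
                    w_hidden[j][k] -= alpha * d_hid[j] * xb[k]
    return w_hidden, w_out


def total_error(samples, w_hidden, w_out):
    """Sum of squared errors over all samples (the error function)."""
    return sum(
        (o - t) ** 2
        for x, y in samples
        for o, t in zip(forward(x, w_hidden, w_out)[2], y)
    )
```

Each sample is a pair of lists (stimuli, responses), with responses scaled into (0, 1) to match the range of the sigmoid. Comparing total_error before and after training gives a simple check that the descent is reducing the error.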
k-Fold cross-validation
Results derived through the standard NNA approach, which uses 80% of the samples for training and the remaining 20% for testing, tend to be overoptimistic, in particular if genotype errors and missing data are present. In k-fold cross-validation, therefore, the data are split into k roughly equal parts; k - 1 partitions are used for training, while the remaining partition is used for testing. The process is repeated until each partition has served as the testing set, so that k estimates of the prediction error are generated. The choice of k is crucial in this approach, because the resulting prediction error is approximately unbiased for the "true" error only for sufficiently large k (k ≈ 10 is a typical value in practice).
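The partitioning step can be sketched in a few lines. This is an illustration under stated assumptions: samples are addressed by index, the shuffle seed is arbitrary, and fit and error are hypothetical callables standing in for model training and prediction-error evaluation.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold sizes differ by at most one when k does not divide n_samples.
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def cross_validate(n_samples, k, fit, error):
    """Return the k per-fold prediction errors and their mean."""
    errors = []
    for train, test in k_fold_splits(n_samples, k):
        model = fit(train)                 # hypothetical: train on k - 1 partitions
        errors.append(error(model, test))  # hypothetical: evaluate on the held-out one
    return errors, sum(errors) / k
```

With k = 10, each model is trained on roughly 90% of the samples, and the mean of the ten per-fold errors serves as the overall estimate.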
Genetic vector spaces
Once a function is defined that quantifies the genetic distance between any two subjects with n-dimensional genotype patterns at n loci, the Householder-Torgerson formula gives a routine method for computing, directly from the inter-individual genetic distances d_jk, a matrix (b_jk) of scalar products between points with origin at the centroid of all of the points. The matrix is then factored by any of the usual factoring procedures to obtain the projections of the points onto r orthogonal axes of a vector space. In this metric vector space, individuals are characterized as distinct "points" in such a way that individuals with similar genotype patterns form compact clouds, while genetically dissimilar individuals are located in more distant regions. Accordingly, one expects the groupings associated with different IgM classes to be well separated in a "genetic" vector space constructed from those genomic loci that influence IgM levels.
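The construction can be illustrated with a short sketch. It assumes the standard double-centering form of the formula, b_jk = -1/2 (d_jk^2 - mean_j - mean_k + mean_all), where the means are taken over the rows, columns, and all entries of the squared-distance matrix. As a stand-in for "any of the usual factoring procedures", it uses power iteration to recover only the first axis. The function names are hypothetical.

```python
import math


def double_center(dist):
    """Turn an n x n distance matrix into centroid-origin scalar products."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[j][k] for j in range(n)) / n for k in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[j][k] - row_mean[j] - col_mean[k] + grand_mean) for k in range(n)]
        for j in range(n)
    ]


def leading_axis(b, iters=200):
    """Project the points onto the first axis using power iteration."""
    n = len(b)
    # Assumption: this start vector is not orthogonal to the leading eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(b[j][k] * v[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue belonging to v.
    eigenvalue = sum(v[j] * sum(b[j][k] * v[k] for k in range(n)) for j in range(n))
    # Coordinates on the axis are the eigenvector scaled by sqrt(eigenvalue).
    return [math.sqrt(max(eigenvalue, 0.0)) * x for x in v]
```

For r > 1 axes, a full symmetric eigendecomposition of the scalar-product matrix would replace the power iteration; the sign of each axis is arbitrary.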
Learning to recognize: three-stage adaptive strategy
Although the detection of causal genotype-phenotype relationships (in the strict sense) was not the primary goal of our analysis, we took special precautions to ensure that a biologically meaningful solution was established. Specifically, we used a three-stage strategy: 1) nonparametric linkage (NPL) analysis across three independently ascertained family samples was applied for initial signal detection; 2) the initial configuration was then modified by iteratively adding or removing genomic loci to increase genotype-phenotype correlations; 3) subsequent NNA was used to weight genomic loci and their interactions optimally. Nonetheless, these steps do not by themselves establish biological meaningfulness; they merely identify genomic regions likely to harbor functional DNA polymorphisms that are causally related to the trait of interest. Accordingly, our approach to establishing biological significance also involves a large-scale SNP analysis using 5728 selected SNPs. Because there are complex patterns of linkage disequilibrium and haplotype block structure across the whole genome with strong nonlinearities, special techniques are necessary to narrow down candidate regions successfully [6]. Results derived from this SNP analysis are not presented.
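Stage 2 of this strategy can be pictured as a greedy search. The sketch below is an assumption about how such an add-or-remove loop might look, not the procedure used in the study: score is a hypothetical callable that returns the genotype-phenotype correlation for a set of loci, and the stopping rule (no single change improves the score) is likewise illustrative.

```python
def refine_loci(initial, candidates, score, max_rounds=100):
    """Greedily add or remove single loci while the score keeps improving."""
    current = set(initial)
    best = score(current)
    for _ in range(max_rounds):
        # Candidate moves: add one unused locus or drop one used locus.
        moves = [current | {c} for c in candidates if c not in current]
        if len(current) > 1:
            moves += [current - {c} for c in sorted(current)]
        if not moves:
            break
        top = max(moves, key=score)
        top_score = score(top)
        if top_score <= best:
            break  # no single change improves the score
        current, best = top, top_score
    return current, best
```

A search of this kind can stop at a local optimum, which is one reason to start from an informed initial configuration such as the loci flagged by the NPL analysis.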
Our sample comprised 599 nuclear families (NARAC screen 1: 256; NARAC screen 2: 255; France: 88) with 1868 genotyped subjects (718 + 717 + 433), who were genotyped for either 396 (NARAC) or 1083 (France) microsatellites. An integrated genetic map was constructed on the basis of deCODE and NCBI-36 data, so that the three populations could be compared through NPL analyses. On the phenotype level, the quantitative clinical measure rheumatoid factor IgM was available for the NARAC screens, whereas the French data included only a dichotomous affected/unaffected measure. The NPL analyses, which were carried out independently for the three populations under investigation, relied on the dichotomous measures under the assumption of a sufficiently close association between RA and the measured antibody IgM. The optimization procedures (NNA, genetic vector space method), in contrast, evaluated the quantitative measures of NARAC screens 1 and 2, for which we defined three subject classes using IgM levels: 1) normal: 0 ≤ IgM < 13.5, 2) low: 13.5 ≤ IgM < 50, and 3) elevated: 50 ≤ IgM. Because of incomplete data, only 926 subjects could be included.
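The three IgM classes translate directly into a small classification function. The thresholds are those given above; the function name and the string labels are illustrative.

```python
def igm_class(igm):
    """Map an IgM level to one of the three subject classes."""
    if igm < 0:
        raise ValueError("IgM level must be non-negative")
    if igm < 13.5:
        return "normal"    # 0 <= IgM < 13.5
    if igm < 50:
        return "low"       # 13.5 <= IgM < 50
    return "elevated"      # 50 <= IgM
```

The lower bound of each class is inclusive, so a level of exactly 13.5 falls into "low" and a level of exactly 50 into "elevated".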