Recent genetic dissection of common diseases has proceeded largely through linkage and association studies of discrete or continuous traits, including intermediate phenotypes such as gene expression data from microarray experiments. Expression data can involve thousands of genes, and annotating their roles in biological pathways and their relation to DNA polymorphisms poses immense challenges and has sparked great interest [1]. These challenges include the development of methods appropriate for a much richer structure than classic clustering [2], the discovery of interactions between genes, and the inference of causal relationships.
A key challenge in the analysis of gene expression data is the reconstruction of regulatory networks. Several approaches directly extend classical techniques such as cluster analysis to infer the relationships among multiple variables. A novel but apparently little-used application of cluster analysis is to extract the pattern information formally and use it in standard linkage and association analyses. More importantly, cluster analysis can be followed by Gaussian graphical modelling [2, 3] and multivariate analysis, in which a partial correlation coefficient (instead of a correlation coefficient) is used to measure the direct interaction between variables. In graphical modelling, the relationships among multiple variables are represented as an independence graph G = (V, E), whose vertices V denote variables and whose edges E denote the conditional dependence structure. Other approaches include regularization and moderation to obtain suitable estimates of the covariance matrix and its inverse, by a full Bayesian or an empirical Bayes approach, followed by heuristic searches for an optimal graphical model (http://www.strimmerlab.org/notes/ggm.html). Bayesian networks are notable because they provide a natural way to model regulatory networks. As has been argued elsewhere [4], if the expression level of a given gene is regulated by certain proteins, then it should be a function of the active levels of these proteins. Because of biological variability and measurement error, this function would be stochastic rather than deterministic. Bayesian networks offer a generic analytic approach for identifying robust predictors of among-individual variation in expression levels, intermediate phenotypes, or disease end points. The approach has been successfully applied to APOE gene variation and plasma lipid levels [5]. Mathematical details on Bayesian networks are available [6], as is a comprehensive survey of genomic approaches to biological pathways [7].
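To make the distinction between a correlation coefficient and a partial correlation coefficient concrete, the following is a minimal sketch in Python with NumPy. It is an illustration only and not the analysis used in this paper. It relies on the standard identity that the partial correlation between variables i and j, given all the others, equals the negated (i, j) entry of the inverse covariance matrix divided by the square root of the product of its i-th and j-th diagonal entries. The function name, the simulated three-variable chain, and the sample size are assumptions chosen for illustration, and the plain matrix inverse assumes many more samples than variables, which typically does not hold for microarray data without the regularization mentioned above.

```python
import numpy as np


def partial_correlations(data):
    """Return the matrix of partial correlations for the columns of `data`.

    Entry (i, j) is the correlation between variables i and j conditional
    on all remaining variables, computed from the inverse covariance
    (precision) matrix. Hypothetical helper, for illustration only.
    """
    cov = np.cov(data, rowvar=False)
    # Plain inversion: only sensible when samples greatly outnumber variables.
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Simulated chain x -> y -> z (an assumption, not real expression data):
# x and z are correlated only through y.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
data = np.column_stack([x, y, z])

corr = np.corrcoef(data, rowvar=False)   # marginal correlations
pcor = partial_correlations(data)        # direct (conditional) associations

print("correlation(x, z):         %.3f" % corr[0, 2])
print("partial correlation(x, z): %.3f" % pcor[0, 2])
```

In this simulated chain the correlation between x and z is clearly nonzero, whereas their partial correlation is close to zero, so an independence graph built from partial correlations would retain the edges x-y and y-z and omit x-z.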
The Problem 1 data from Genetic Analysis Workshop 15 (GAW15) offer an excellent opportunity for investigating the utility of Bayesian networks. An earlier report [8] showed evidence of substantial variation in expression levels between individuals and of association with single-nucleotide polymorphisms (SNPs), as well as a cluster of 25 of 31 target genes in two master regulatory regions. Here, as a further step in the analysis, we performed Bayesian network modelling to gain insight into these findings.