Data preparation
In this study, we used the Problem 1 data from Genetic Analysis Workshop 15 (GAW15), which provided expression levels of 3554 genes in lymphoblastoid cells from fourteen three-generation Centre d'Etude du Polymorphisme Humain (CEPH) Utah families. Genotypes of 2882 autosomal and X-linked SNPs were also included [3]. For the sib-pair linkage analysis, allele frequencies at each SNP locus were estimated with the maximum-likelihood method implemented in the FREQ program of S.A.G.E. (Version 5.3) [4], and the identity-by-descent (IBD) data were produced by the GENIBD program in S.A.G.E. [4]. Because multipoint IBD computation is problematic for very dense SNPs, we performed only single-point IBD computations.
Sib-pair linkage analysis
We used the SIBPAL program in S.A.G.E. for sib-pair linkage analysis [4]. This is a model-free linkage analysis program based on the Haseman-Elston regression test that models trait data from full-sib pairs as functions of marker allele sharing IBD. Denote the jth sib-pair with the subscript ii', and define the mean transcriptional expression for a gene as:
where the summands were the gene expression values measured on N sib pairs. Then the dependent variable for the jth pair was
The basic regression model we fitted was of the form:
where α was the intercept, $\beta_h$ and $\hat{\pi}_h$ were the total genetic variance due to the hth SNP marker and the estimated IBD at the marker, respectively, and ε was the residual error. This regression model was used to establish the gene-SNP linkages.
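The regression step can be illustrated with a minimal sketch. It assumes the mean-corrected cross-product form of Haseman-Elston regression, in which the dependent variable is $y_j = (x_{ij} - \bar{x})(x_{i'j} - \bar{x})$ with $\bar{x}$ the mean over all 2N sib values, and the model is $y_j = \alpha + \beta_h \hat{\pi}_{jh} + \varepsilon_j$. This form, the symbols, and the function name `haseman_elston_fit` are assumptions made for illustration; SIBPAL offers several dependent-variable definitions and also performs the significance test, which is omitted here.

```python
def haseman_elston_fit(x_sib1, x_sib2, ibd_share):
    """Least-squares fit of y_j = alpha + beta * pi_hat_j for one marker.

    x_sib1, x_sib2: expression values of the two sibs in each of N pairs.
    ibd_share: estimated proportion of alleles shared IBD for each pair.
    The cross-product dependent variable is an assumption of this sketch.
    """
    n_pairs = len(x_sib1)

    # Mean expression over all 2N sib values.
    x_bar = (sum(x_sib1) + sum(x_sib2)) / (2 * n_pairs)

    # Dependent variable: mean-corrected cross product for each pair.
    y = [(a - x_bar) * (b - x_bar) for a, b in zip(x_sib1, x_sib2)]

    # Closed-form simple linear regression of y on the IBD estimates.
    pi_mean = sum(ibd_share) / n_pairs
    y_mean = sum(y) / n_pairs
    sxy = sum((p - pi_mean) * (v - y_mean) for p, v in zip(ibd_share, y))
    sxx = sum((p - pi_mean) ** 2 for p in ibd_share)
    beta = sxy / sxx          # slope: genetic variance attributed to the marker
    alpha = y_mean - beta * pi_mean
    return alpha, beta
```

Under this form, a positive slope indicates that pairs sharing more alleles IBD at the marker have more similar expression values, which is the evidence for linkage.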
Constructing a gene-SNP intermixed network
A gene-SNP network was constructed from the identified relationships between genes and SNPs, with genes and SNPs as the two types of nodes. Two nodes were connected by an edge if a relationship between them was present (Fig. 1a). Nodes with markedly more connections than the rest were defined as "hub nodes", comprising "hub genes" and "hub SNPs". Because hub genes and hub SNPs may have more important functions or key roles in some biological processes or pathways, they were subjected to further investigation.
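As a rough sketch of this construction, the network can be stored as an adjacency map over typed nodes, with hubs selected by degree. The function names, the SNP identifiers in the usage note, and the `min_degree` cutoff are hypothetical; the text does not state a numeric definition of "markedly more connections".

```python
from collections import defaultdict

def build_gene_snp_network(links):
    """Build an undirected bipartite network from (gene, snp) linkage pairs."""
    neighbors = defaultdict(set)
    for gene, snp in links:
        # Nodes are tagged by type so a gene and a SNP can never collide.
        neighbors[("gene", gene)].add(("snp", snp))
        neighbors[("snp", snp)].add(("gene", gene))
    return neighbors

def hub_nodes(neighbors, min_degree):
    """Return {node: degree} for nodes with at least min_degree connections.

    min_degree is a hypothetical cutoff chosen by the caller.
    """
    return {node: len(adj) for node, adj in neighbors.items()
            if len(adj) >= min_degree}
```

For example, with links `[("G1", "rs1"), ("G2", "rs1"), ("G3", "rs1"), ("G1", "rs2")]` and `min_degree=3`, only the SNP node `rs1` qualifies as a hub.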
Exploring the expression pattern between genes linked to a common SNP locus
Genes that were linked to a common SNP might have correlated mRNA expression profiles because they share a genetic factor. Let $G_i$ be a group of n genes that were linked to a common SNP (identified via the previous sib-pair linkage analysis). The hypothesis tested (Ht) was that the genes in $G_i$ were significantly co-expressed, relative to the null model (H0) of random expression patterns among groups of genes of the same size as $G_i$. The averaged Pearson's correlation coefficient ($\bar{r}$) was used as the metric to measure the transcription consensus:
$$\bar{r} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} r_{ij}$$
where $r_{ij}$ was the Pearson's correlation coefficient between the ith and jth genes in the group. A gene could be clustered into different functional groups if it was linked to multiple SNP co-linkers. A permutation approach was used to determine the consensus threshold for $G_i$. We randomly sampled n genes from the original whole gene pool and calculated the averaged correlation coefficient among the n genes. This process was repeated 100,000 times, and the 95% quantile was defined as the empirical threshold corresponding to a type I error of 0.05 [5].
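The permutation procedure can be sketched as follows, assuming the consensus metric is the plain average of Pearson correlations over all gene pairs in a group. The function names, the `seed` argument, and the smaller default `n_perm` are choices made for this illustration (the study used 100,000 repetitions), and the sketch assumes no expression profile is constant.

```python
import math
import random
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def mean_pairwise_r(profiles):
    """Average Pearson correlation over all pairs of profiles in a group."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(u, v) for u, v in pairs) / len(pairs)

def permutation_threshold(expr, group_size, n_perm=1000, quantile=0.95, seed=0):
    """Empirical quantile of the metric for randomly drawn gene groups.

    expr maps gene name -> expression profile (list of floats).
    """
    rng = random.Random(seed)
    genes = list(expr)
    null = sorted(
        mean_pairwise_r([expr[g] for g in rng.sample(genes, group_size)])
        for _ in range(n_perm)
    )
    # Index of the requested quantile in the sorted null distribution.
    return null[min(n_perm - 1, math.ceil(quantile * n_perm) - 1)]
```

A group of n co-linked genes would then be called co-expressed when its own `mean_pairwise_r` value exceeds the threshold returned for `group_size=n`.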
Extracting a gene-gene network
Gene expression is often influenced by multiple factors (e.g., transcription factors, reporters, enhancers, modifiers, and so on). The more factors two genes share, the closer their relationship tends to be. However, strong linkage disequilibrium (LD) among the SNP co-linkers might influence the result. To investigate this issue, we assessed LD for all SNP pairs with the likelihood ratio test implemented in the software Arlequin [6], for which the empirical distribution of the test statistic was obtained by a permutation procedure [7]. We found that LD had a small influence on the result: only 4.8% of gene pairs were co-linked with a SNP pair or block in significant LD (p < 0.05). As shown in Figure 1b, we defined two genes to be connected if they shared at least one linked SNP, and the number of co-linked SNPs not in significant LD was used to describe the intensity of their relationship (a SNP pair in significant LD was counted as a single SNP). A gene-gene network was then constructed from the newly identified relationships, with the intensities as edge weights.
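The edge-weighting rule can be sketched as below. Two assumptions are made here: SNPs in significant LD are merged transitively into blocks with a union-find structure, and an edge weight is the number of distinct blocks among the SNPs that two genes share. The function names and the input layout (`gene_to_snps`, `ld_pairs`) are hypothetical.

```python
from itertools import combinations

def ld_blocks(snps, ld_pairs):
    """Map each SNP to a block representative, merging SNPs in significant LD."""
    parent = {s: s for s in snps}

    def find(s):
        # Follow parent links to the block representative, halving the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in ld_pairs:
        parent[find(a)] = find(b)
    return {s: find(s) for s in snps}

def gene_gene_network(gene_to_snps, ld_pairs):
    """Return {(gene1, gene2): weight} for gene pairs sharing a linked SNP."""
    all_snps = set().union(*gene_to_snps.values())
    all_snps.update(s for pair in ld_pairs for s in pair)
    block_of = ld_blocks(all_snps, ld_pairs)

    edges = {}
    for g1, g2 in combinations(sorted(gene_to_snps), 2):
        shared = set(gene_to_snps[g1]) & set(gene_to_snps[g2])
        if shared:
            # SNPs in the same LD block count once toward the weight.
            edges[(g1, g2)] = len({block_of[s] for s in shared})
    return edges
```

With this rule, two genes that share three SNPs, two of which are in significant LD with each other, receive an edge of weight 2.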