Population and phenotypes
We used the Genetic Analysis Workshop 15 (GAW 15) Problem 1 microarray gene expression profiles for the analyses. Data were available for 14 three-generation CEPH Utah families. Expression levels of genes were obtained from lymphoblastoid cells using the Affymetrix Human Focus Arrays [2]. We were provided with data on 3554 transcripts that showed greater variation between individuals than within the same individual.
Family members were genotyped for 2882 autosomal and X-linked single-nucleotide polymorphisms (SNP) generated by the SNP Consortium http://snp.cshl.org/. Genetic map positions were obtained using the SNP Mapping web application developed by the University College Dublin Conway Institute of Biomolecular and Biomedical Research http://actin.ucd.ie/software.html, which uses the Rutgers Combined Linkage-Physical Map of the Human Genome and data taken from the NCBI dbSNP Build 123 (in Kosambi centimorgans). This information was used to calculate multipoint identity by descent matrices (MIBDs) with Merlin and Minx [3], after removal of Mendelian inconsistencies and double recombinants with Preswalk (based on Simwalk mistyping probabilities) [4]. MIBDs were used for linkage analyses.
Transcript distributions were normalized using an inverse normalization transformation of z-scores of individual transcripts regressed on the mean individual transcript level. We further adjusted for the effects of age, age2, sex, age × sex and age2 × sex interaction using predictive linear regression models in SAS 9.1 (Cary, NC). We generated these residuals as part of our processing of the transcripts for linkage analyses.
Heritability estimation and linkage analysis
Heritability was estimated using maximum likelihood variance decomposition methods in SOLAR [5, 6]. Genome scans were performed using multipoint variance-components models that test for linkage between traits and genetic variants by partitioning the variance of the expression level into its additive genetic and environmental variance components [7]. For transcripts with co-localized QTLs, we performed bivariate linkage analysis to identify shared genetic effects. The bivariate polygenic model estimates correlations caused by residual additive genetic effects (ρG) and correlations caused by random environmental effects (ρE) [8]. To test for additive genetic correlation among pairs of traits, the log likelihood of a model in which ρG is constrained to 0 (null hypothesis, no correlation) or ρG = 1 (null hypothesis, complete shared genetic effect) is compared to that of a model in which ρG is estimated for the traits. Significant differences among the models (ρG ≠ 0) suggest that some of the same genes influence both traits. We also performed linkage analysis using the factors obtained from the PCA in a sample of transcripts with co-localized QTLs.
Principal-component analysis
PCA was used to reduce the number of expression profiles into statistically meaningful groups while retaining the original total variance using all the expression profiles [9, 10]. We selected two different chromosomal regions of a length of 10 to 12 MB in which the QTLs of at least 10 transcripts were co-localized. Only transcripts of genes that were not located in these selected chromosomal regions were included in the analyses (trans-regulatory genes). Because of the small number of individuals in the study and concerns of overfitting the model, a maximum sample of 50 transcript values were considered at one time [11]. The number of factors was determined using the eigenvalue-one requirement [11]. Factors are interpreted by examining the varimax-rotated factor loadings, which are the correlations between each phenotype and the factor in question. Factor loadings greater than or equal to 0.40 in absolute value were used to interpret factors and to characterize the factor structures; this criterion ensures that the individual factor variables share at least 15% of their variance with the given factor [9]. The principle components were obtained by calculating the eigenvalues of the sample covariance matrix, which represent the amount of variance contributed by each factor. Only factors with eigenvalues higher than 1 were considered for linkage analysis.
Integrating data from linkage analysis for gene co-expression
Linkage signals of individual transcript expressions were recorded and the location of QTLs was compared to the location of the transcript gene in order to identify trans-regulatory sites. In addition, the location and LOD scores of QTLs identified in single individual transcript analysis (univariate analysis) were compared with the location of the QTLs identified using bivariate analysis or factors of the PCA. This allowed a determination of whether the bivariate analysis or PCA data reduction analysis improved our evidence for linkage, and if so, a more in-depth examination of the transcripts included in the principal components needs to be examined for biologic interactions on complex disorders.