
Extracting disease risk profiles from expression data for linkage analysis: application to prostate cancer


The genetic factors underlying many complex traits are not well understood. The Genetic Analysis Workshop 15 Problem 1 data present the opportunity to explore whether gene expression data from microarrays can be utilized to define useful phenotypes for linkage analysis in complex diseases. We utilize expression profiles for multiple genes that have been associated with a disease to develop a composite 'risk profile' that can be used to map other loci involved in the same disease process. Using prostate cancer as our disease of interest, we identified 26 genes whose expression levels had previously been associated with prostate cancer and defined three phenotypes: high, neutral, or low risk profiles, based on individual expression levels. Linkage analyses using MCLINK, a Markov-chain Monte Carlo method, and MERLIN were performed for all three phenotypes. Both methods were in very close agreement. Genome-wide suggestive linkage evidence was observed on chromosomes 6 and 4. The linkage signals did not appear to be strongly influenced by the location of the original 26 genes used in the phenotype definition, indicating that composite measures may have potential to locate additional genes in the same process. In this example, however, extreme caution is necessary in any extrapolation of the identified loci to prostate cancer because of the lack of data regarding the behavior of these genes' expression levels in lymphoblastoid cells. Our results do indicate that there is potential to augment our current knowledge about the relationships among genes associated with complex diseases using expression data.


Recent advances in biotechnology have resulted in an explosion of genotypic and phenotypic data. Millions of single-nucleotide polymorphisms (SNPs) can quickly and accurately be genotyped, and microarray technology has made it possible to simultaneously assess the expression levels for many thousands of genes. The question becomes: what knowledge can we extract from these extensive data sources with respect to disease susceptibility, and how? The Genetic Analysis Workshop 15 (GAW15) Problem 1 data present a unique opportunity to explore whether gene expression data from microarrays can be used to define useful phenotypes for linkage analysis to better understand disease susceptibility. The expression data provided for Problem 1 include 3554 genes that were previously established to have greater variation between individuals than within individuals. These expression levels are reasonable candidates for use as phenotypes in linkage analysis [1].

For the majority of complex traits the underlying genetic factors are not fully understood, but for many, certain genes and/or genetic pathways have been implicated or related to the trait through expression experiments. The expression levels of a gene may be controlled by regulatory genes elsewhere in the genome, and the expression of multiple genes can be regulated by a common transcription factor [2]. Hence, linkage analysis of gene expression levels could conceivably identify regulatory loci associated with that gene. Further, and more related to a disease end-point, if several genes are known to be related to a given trait, it is also conceivable that their expression levels could be combined to create a phenotype to be used in linkage analysis to identify loci that are involved in disease susceptibility, perhaps through membership in the pathway or interaction (epistasis) with the known genes.

In this study we explore whether gene expression profiles for genes that have been associated with a disease can be used to map other genes that are involved in the disease process or highlight genes within the pathways that are key factors. Here we specifically examine the approach for prostate cancer.

Research has consistently shown that genetics plays a critical role in prostate cancer development, but the identification of specific genes has proven to be very difficult. Hereditary prostate cancer is a complex disease involving numerous genes and variable phenotypic expression [3]. Recent research has demonstrated great potential for the use of proteomic profiling and other biomarkers for prostate cancer diagnostics [4]. One such study was able to discriminate between benign and cancerous prostates with perfect sensitivity in men with elevated prostate specific antigen (PSA) levels using serum proteomic profiling [5]. The GAW15 Problem 1 data provide an opportunity to explore whether gene expression levels from lymphoblastoid cells can be used to develop a prostate cancer profile phenotype for use in linkage analysis. Using expression data from 26 genes whose expression levels had previously been reported to be associated with prostate cancer [6], we defined individuals as having high, neutral, or low risk profiles based on their individual expression levels. Here we present the results of linkage analyses based on those phenotypes.


Ashida et al. identified 21 genes that are commonly up-regulated and 63 genes that are commonly down-regulated in the transition from normal epithelium to cancer and/or prostatic intraepithelial neoplasia (PIN) [6]. Of these 84 genes, 26 were included in the data for Problem 1. These 26 genes are listed in Table 1. Using the expression data for the 194 individuals in the Problem 1 data, we scaled the expression levels for each of these 26 genes to fit a standard normal distribution with mean of 0 and variance of 1. Two statistics, A and B, were then computed for each individual. A represented the number of genes for which the expression level was more than 1 standard deviation from the mean in the direction associated with prostate cancer. B represented the number of genes for which the expression level was more than 1 standard deviation from the mean in the opposite direction. One standard deviation was selected arbitrarily as a threshold to ensure that the expression values were distant from the center of the distribution, while allowing for a sufficient number of informative subjects in the subsequent linkage analysis. An individual was considered to be in the "high-risk profile" group if A ≥ 4 and A - B ≥ 2. Individuals were classified in the "low-risk profile" group if B ≥ 4 and B - A ≥ 2. All other subjects were classified as "neutral" and were considered as "unknown" in all linkage analyses. This classification system was devised to distribute the influence of the 26 genes on the assigned risk profiles and to prevent outlying expression levels of individual genes from having undue influence. As shown in Figure 1, 53 subjects (25 male and 28 female) were classified with high-risk profiles, 57 (32 male and 25 female) with low-risk profiles, and 84 (42 male and 42 female) as neutral (unknown). While women are not susceptible to prostate cancer, they may still carry the susceptibility genes; hence both males and females are included in our analyses.
Figure 1 shows a scatter plot of the values of A and B for each individual and the categorization to the high-risk, low-risk, and neutral groups.
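The classification rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scaling is taken to be a simple z-score using the population standard deviation (the text does not say exactly how the scaling was done), and the function names, the `directions` encoding (+1 for genes up-regulated in prostate cancer, -1 for down-regulated), and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale one gene's expression levels to mean 0 and variance 1 (z-scores).

    Assumption: a plain z-score with the population standard deviation.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def classify(z_scores, directions, threshold=1.0):
    """Return (A, B, profile) for one individual.

    z_scores   -- standardized expression level of each gene for this person
    directions -- +1 if the gene is up-regulated in prostate cancer,
                  -1 if it is down-regulated (hypothetical encoding)
    """
    # Orient each z-score so that positive means "cancer-associated direction".
    oriented = [z * d for z, d in zip(z_scores, directions)]
    a = sum(1 for x in oriented if x > threshold)    # statistic A
    b = sum(1 for x in oriented if x < -threshold)   # statistic B
    if a >= 4 and a - b >= 2:
        profile = "high"
    elif b >= 4 and b - a >= 2:
        profile = "low"
    else:
        profile = "neutral"
    return a, b, profile


# Toy example with 6 genes instead of 26 (values are made up).
print(classify([1.5, 2.0, -1.2, -1.8, 0.3, 1.1], [1, 1, -1, -1, 1, -1]))
```

Because the two conditions require A - B ≥ 2 and B - A ≥ 2 respectively, no individual can satisfy both, so the order of the checks does not matter.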

Table 1 Genes used to create phenotype definition
Figure 1

Phenotype distribution. A, the number of genes expressed more than 1 SD in the direction associated with prostate cancer for an individual; B, the number of genes expressed greater than 1 SD in the opposite direction; 0, neutral risk status; 1, low-risk profile; 2, high-risk profile.

Three phenotype models were considered. The first model ("FULL") included the high-risk profile individuals as "affected" and the low-risk profile individuals as "unaffected"; neutrals were "unknown". The second model ("HIGH") included the high-risk profile individuals as "affected" and all others as "unknown". The third model ("LOW") included the low-risk profile individuals as "affected" and all others as "unknown". This final phenotype model is akin to an analysis searching for protective genes. For the FULL and HIGH phenotype models, 10 of the 14 CEPH (Centre d'Etude du Polymorphisme Humain) pedigrees were informative for linkage, with between two and eight affected subjects per pedigree. Thirteen pedigrees were informative in the LOW analysis, with up to nine affected subjects.
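The three phenotype models amount to a lookup from risk profile to affection status. The sketch below is one way to write that mapping; the numeric codes follow the common LINKAGE-style pedigree convention (2 = affected, 1 = unaffected, 0 = unknown), which is an assumption here and is unrelated to the 0/1/2 labels used in Figure 1.

```python
# Affection-status codes (assumed LINKAGE-style convention, not from the text).
AFFECTED, UNAFFECTED, UNKNOWN = 2, 1, 0

# Mapping from risk profile to affection status under each phenotype model.
MODELS = {
    "FULL": {"high": AFFECTED, "low": UNAFFECTED, "neutral": UNKNOWN},
    "HIGH": {"high": AFFECTED, "low": UNKNOWN, "neutral": UNKNOWN},
    "LOW": {"high": UNKNOWN, "low": AFFECTED, "neutral": UNKNOWN},
}


def affection_status(profile, model):
    """Return the affection code for a risk profile under a phenotype model."""
    return MODELS[model][profile]


for model in MODELS:
    print(model, [affection_status(p, model) for p in ("high", "low", "neutral")])
```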

Dominant and recessive parametric linkage analyses were performed using MCLINK, which uses Markov-chain Monte Carlo simulation methods to sample haplotype configurations to estimate the LOD statistic [7]. The inheritance model for the analysis was based on the "Smith" model used to map the HPC1 locus, but without the specificity to males [8], and assumes a population frequency of 0.003 for the mutant allele. Genotypes for a genome-wide panel of 2882 SNP markers were provided by GAW15. The genetic map used in the analysis was based on the Rutgers genetic map; for SNPs without an available genetic map position, positions were interpolated from flanking markers based on physical location [9]. Any SNP located less than 0.001 cM from the preceding SNP was eliminated from the initial analysis. After completing the initial analyses, the best linkage peaks were identified and those regions were reanalyzed using a reduced marker map, with a minimum spacing of 0.3 cM between SNPs [10]. This was done to control for the possible effects of linkage disequilibrium (LD), which may inflate LOD scores. The linkage statistics for these chromosomes were then confirmed by performing both parametric and model-free analyses with MERLIN [11]. Linked pedigrees (LOD > 0.588, which represents a nominal, uncorrected p < 0.05 for an individual pedigree) were identified in the regions with HLOD > 1.9 (genome-wide suggestive evidence for linkage [12]), and gene expression profiles within those pedigrees were inspected to ensure that the linkage evidence was not correlated with the expression levels of any specific genes.
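Two of the numerical steps in this paragraph are easy to illustrate. The sketch below shows (a) a greedy marker-thinning pass that enforces a minimum spacing and (b) the standard conversion from a LOD score to a one-sided nominal p-value, which reproduces the LOD 0.588 and p = 0.05 correspondence. Both function names are hypothetical, and the greedy rule (measuring spacing from the last retained marker) is an assumption; the text does not state exactly how markers were selected for the reduced map.

```python
import math


def thin_markers(positions_cm, min_spacing_cm):
    """Keep markers lying at least min_spacing_cm past the last kept marker.

    positions_cm must be sorted in map order for a single chromosome.
    Assumption: spacing is measured from the last *retained* marker.
    """
    kept = []
    for pos in positions_cm:
        if not kept or pos - kept[-1] >= min_spacing_cm:
            kept.append(pos)
    return kept


def lod_to_p(lod):
    """One-sided nominal p-value for a LOD score.

    Uses the usual asymptotic result: 2 * ln(10) * LOD follows a chi-square
    distribution with 1 degree of freedom, and the p-value is halved because
    the test is one-sided.
    """
    chi2 = 2.0 * math.log(10) * lod
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))


# Toy map positions in cM (made up), thinned to a 0.3 cM minimum spacing.
print(thin_markers([0.0, 0.1, 0.5, 0.6, 1.0], 0.3))
# A LOD of 0.588 corresponds to a nominal p-value just under 0.05.
print(round(lod_to_p(0.588), 4))
```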


The genome-wide scan results showing the HLOD statistic for all models are shown in Figure 2. Significant linkage evidence was observed on chromosome 6q (HLOD = 3.51). Other peaks over HLOD = 1.9 were observed on chromosomes 3, 4, and 7. Only the peaks on chromosomes 4 and 6 retained at least suggestive linkage evidence with the reduced marker set without LD.
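For readers unfamiliar with the HLOD, it is conventionally defined under the admixture model as the maximum, over the proportion α of linked pedigrees, of the sum of log10(α·10^LOD_i + 1 - α) across pedigrees at a given map position. The sketch below illustrates that definition with a simple grid search; the function name and the per-pedigree LOD values are hypothetical, and this is not the internal routine of MCLINK or MERLIN.

```python
import math


def hlod(pedigree_lods, steps=1000):
    """Maximize the admixture LOD over alpha, the proportion of linked pedigrees.

    pedigree_lods -- per-pedigree LOD scores at one map position
    Returns (hlod, alpha_hat). A grid search keeps the sketch dependency-free.
    """
    best_value, best_alpha = 0.0, 0.0  # alpha = 0 always yields a LOD of 0
    for i in range(1, steps + 1):
        alpha = i / steps
        value = sum(
            math.log10(alpha * 10.0 ** z + 1.0 - alpha) for z in pedigree_lods
        )
        if value > best_value:
            best_value, best_alpha = value, alpha
    return best_value, best_alpha


# Made-up example: one pedigree supports linkage, one argues against it.
value, alpha = hlod([2.0, -1.0])
print(round(value, 2), alpha)
```

When some pedigrees have negative LOD scores, the HLOD exceeds the plain sum of LODs because the model allows those pedigrees to be unlinked.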

Figure 2

Genome-wide HLOD statistics. A, FULL analysis model; B, HIGH model; C, LOW (protective) model. The solid line represents the dominant inheritance model and the broken line represents the recessive model in each figure. HLOD values are shown on the vertical axis, and chromosome number is shown on the horizontal axis.

The strongest linkage signal observed in the FULL analysis, and the best result overall, was HLOD = 3.51 at marker rs1491074 under the dominant model on chromosome 6q. As shown in Figure 3, two of the 26 genes used in creating the phenotype (SERPINB1 and IER3) are located on chromosome 6; however, they are not situated near the linkage peak. Chromosome 6 was reanalyzed using a map with increased marker spacing (which reduced the number of SNPs used from 101 to 70 and excluded SNP rs1491074) and the maximum HLOD fell to 2.82, suggesting the possible influence of LD in the initial result. This result was confirmed using MERLIN. The model-based HLOD statistic from MERLIN was very similar to results from MCLINK for both the full and reduced marker sets, although the model-free Kong and Cox LOD score did not perform well.

Figure 3

Analyses with increased marker spacing. Detail of chromosome 6 from the FULL phenotype model and chromosome 4 from the HIGH phenotype model using a minimum marker spacing of 0.3 cM. The solid blue line in each panel represents the dominant HLOD statistic as calculated by MCLINK, the broken red line shows the dominant HLOD from MERLIN, and the black dotted line shows the model-free Kong and Cox LOD score from MERLIN. The locations of genes included in the phenotype definition are indicated at the top of each frame.

The best result in the HIGH analysis was HLOD = 2.75 at marker rs885103 under the dominant model on chromosome 4q. Three pedigrees were linked to the locus with individual LOD scores > 0.588. None of the genes used to determine the phenotype were located on chromosome 4. Linkage results were unchanged when the peak was reanalyzed with the reduced marker map, as shown in Figure 3. MERLIN analysis confirmed the parametric linkage result from MCLINK.


One concern with a study based on expression levels of known genes is that a linkage analysis may simply map back to the genes used to construct the phenotype. This did not appear to be the case for this study. None of the genes were located near our best results on chromosomes 6 and 4. Our phenotype definition was simplistic, but was designed to limit the influence of individual genes on the phenotype, and thereby enhance the likelihood of identifying a locus related to the entire set. It is interesting to note that the regions we identified on chromosomes 6 [13, 14] and 4q [15, 16] have each been implicated in previous linkage analyses for prostate cancer. However, it is premature to consider these as replications, because without data indicating that the expression levels seen in tumors [6] are also representative in lymphoblastoid cells, there is no evidence that the risk profiles we created are actually related to prostate cancer. This is a major weakness of our particular example, and perhaps illustrates the weakness of such approaches in general: much of the experimental data is still missing and will be expensive to generate.

Because the true locations of any genes that interact with or modify the 26 we studied are not known, the statistical power of this approach cannot be properly evaluated. However, with the 14 CEPH pedigrees, we were able to generate linkage peaks that appeared distinct from background noise. Further, the linkage evidence observed did not depend on the linkage analysis method chosen, as both MCLINK and MERLIN produced almost identical results. Recognizing the limitations of the data available, we present these results as proof of concept that the expression levels of several related genes can be combined to create a phenotype that can reasonably be used in linkage analysis. Such an approach could identify loci that regulate or contribute to disease pathways. More work is needed to refine and test the methodology, and more experimental data is needed to correlate tissue and lymphoblastoid expression levels, but the approach appears to have the potential to augment our current knowledge about the genetic basis of complex diseases.


  1. Morley M, Molony CM, Weber T, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.

  2. Pennacchio L, Rubin E: Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet. 2001, 2: 100-109. 10.1038/35052548.

  3. Schaid D: The complex genetic epidemiology of prostate cancer. Hum Mol Genet. 2004, 13: R103-R121. 10.1093/hmg/ddh072.

  4. Banez LL, Prasanna P, Sun L, Ali A, Zou Z, Adam BL, McLeod DG, Moul JW, Srivastava S: Diagnostic potential of serum proteomic patterns in prostate cancer. J Urol. 2003, 170: 442-446. 10.1097/01.ju.0000069431.95404.56.

  5. Ornstein DK, Rayford W, Fusaro VA, Conrads TP, Ross SJ, Hitt BA, Wiggins WW, Veenstra TD, Liotta LA, Petricoin EF: Serum proteomic profiling can discriminate prostate cancer from benign prostates in men with total prostate specific antigen levels between 2.5 and 15.0 ng/ml. J Urol. 2004, 172: 1302-1305. 10.1097/01.ju.0000139572.88463.39.

  6. Ashida S, Nakagawa H, Katagiri T, Furihata M, Iiizumi M, Anazawa Y, Tsunoda T, Takata R, Kasahara K, Miki T, Fujioka T, Shuin T, Nakamura Y: Molecular features of the transition from prostatic intraepithelial neoplasia (PIN) to prostate cancer: genome-wide gene-expression profiles of prostate cancers and PINs. Cancer Res. 2004, 64: 5963-5972. 10.1158/0008-5472.CAN-04-0020.

  7. Thomas A, Gutin A, Abkevich V, Bansal A: Multipoint linkage analysis by blocked Gibbs sampling. Stat Comput. 2000, 10: 259-269. 10.1023/A:1008947712763.

  8. Smith JR, Freije D, Carpten JD, Gronberg H, Xu J, Isaacs SD, Brownstein MJ, Bova GS, Guo H, Bujnovszky P, Nusskern DR, Damber JE, Bergh A, Emanuelsson M, Kallioniemi OP, Walker-Daniels J, Bailey-Wilson JE, Beaty TH, Meyers DA, Walsh PC, Collins FS, Trent JM, Isaacs WB: Major susceptibility locus for prostate cancer on chromosome 1 suggested by a genome-wide search. Science. 1996, 274: 1371-1374. 10.1126/science.274.5291.1371.

  9. Sung YJ, Di Y, Fu AQ, Rothstein JH, Sieh W, Tong L, Thompson EA, Wijsman EM: Comparison of multipoint linkage analyses for quantitative trait in the CEPH data: parametric LOD scores, variance components LOD scores and Bayes factors. BMC Proc. 2007, 1 (Suppl 1): S93-

  10. Sellick GS, Webb EL, Allinson R, Matutes E, Dyer M, Jonsson V, Langerak AW, Mauro FR, Fuller S, Wiley J, Wiley J, Lyttelton M, Callea V, Yuille M, Catovsky D, Houlston RS: A high-density snp genomewide linkage scan for chronic lymphocytic leukemia susceptibility loci. Am J Hum Genet. 2005, 77: 420-429. 10.1086/444472.

  11. Abecasis G, Cherny S, Cookson W, Cardon L: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.

  12. Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet. 1995, 11: 241-247. 10.1038/ng1195-241.

  13. Slager S, Zarfas KE, Brown WM, Lange E, McDonnell S, Wojno KJ, Cooney KA: Genome-wide linkage scan for prostate cancer aggressiveness loci using families from the University of Michigan Prostate Cancer Genetics Project. Prostate. 2006, 66: 173-179. 10.1002/pros.20332.

  14. Stanford J, McDonnell S, Friedrichsen D, Carlson E, Kolb S, Deutsch K, Janer M, Hood L, Ostrander E, Schaid D: Prostate cancer and genetic susceptibility: a genome scan incorporating disease aggressiveness. Prostate. 2006, 66: 317-325. 10.1002/pros.20349.

  15. Suarez BK, Lin J, Burmester JK, Broman KW, Weber JL, Banerjee TK, Goddard KA, Witte JS, Elston RC, Catalona WJ: A genome screen of multiplex sibships with prostate cancer. Am J Hum Genet. 2000, 66: 933-944. 10.1086/302818.

  16. Xu J, Gillanders EM, Isaacs SD, Chang BL, Wiley KE, Zheng SL, Jones M, Gildea D, Riedesel E, Albertus J, Freas-Lutz D, Markey C, Meyers DA, Walsh PC, Trent JM, Isaacs WB: Genome-wide scan for prostate cancer susceptibility genes in the Johns Hopkins hereditary prostate cancer families. Prostate. 2003, 57: 320-325. 10.1002/pros.10306.


The computational resources for this project have been provided by the National Institutes of Health (grant NCRR 1 S10 RR17214-01) on the Arches Metacluster, administered by the University of Utah Center for High Performance Computing. GBC was funded via NIH National Library of Medicine training grant T15 LM07124. In addition, the work was funded in part by NIH grant CA098364-01 (to NJC).

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at

Author information


Corresponding author

Correspondence to G Bryce Christensen.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Christensen, G.B., Cannon-Albright, L.A., Thomas, A. et al. Extracting disease risk profiles from expression data for linkage analysis: application to prostate cancer. BMC Proc 1 (Suppl 1), S82 (2007).
