
Data for Genetic Analysis Workshop (GAW) 15, Problem 1: genetics of gene expression variation in humans


Here we describe the data provided for Problem 1 of Genetic Analysis Workshop 15. The data provided for Problem 1 were unusual in two ways. First, the phenotype was the level of gene expression for each gene, not a conventional phenotype like height or disease, and second, there were more than 3500 such phenotypes. Natural variation in gene expression was a new idea in 2004 when these data were collected and published. Because the phenotypes were measured in members of 14 Centre d'Etude du Polymorphisme Humain (CEPH) families, there was an opportunity for linkage mapping on a very large scale. For this purpose, 2882 single-nucleotide polymorphism genotypes were also provided for each family member.


There is extensive individual variation in the expression level of many genes in organisms from yeast to humans. In humans, the differences are smaller in monozygotic twins than among individuals of other relationships, suggesting a genetic contribution to the variation [1]. The data for Problem 1 came from studies of the genetic basis of variation in human gene expression [2, 3].


Study subjects

The data provided to Genetic Analysis Workshop 15 (GAW15) were of several sorts. The core collection consisted of data from 14 large three-generation Centre d'Etude du Polymorphisme Humain (CEPH) Utah families (approximately 8 offspring per sibship and 14 individuals per family). The CEPH Utah families are the most uniform of the three-generation CEPH families: parents and grandparents are available, and cells are available for all four grandparents. In addition, gene expression data were provided from 30 "HapMap" trios. These are "grandparent-parent" trios: some are included among the 14 families, and approximately 12 are additional grandparent-parent trios of CEPH Utah individuals. The 30 trios are also part of the International HapMap Project. The data included pedigree files with information on the structure of each family.


Expression levels of genes expressed in lymphoblastoid cells (EBV-transformed B-cells) were obtained for the above subjects using Affymetrix Human Focus Arrays, which contain probes for 8500 transcripts. For approximately 85 of the study subjects, array hybridizations were performed in duplicate. Data were provided for the 3554 genes for which we found greater variation among individuals than between replicates for the same individual. The cel files (raw image files) were provided, as well as lightly processed data (normalized using the Affymetrix MAS software) for all array hybridizations.
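The selection criterion above (greater variation among individuals than between replicates) can be illustrated with a minimal sketch. The exact statistic used to choose the 3554 genes is not given here, so the simple variance ratio below, the threshold of 1.0, and the names `variance_ratio` and `select_variable_genes` are all assumptions made for illustration.

```python
from statistics import mean, variance


def variance_ratio(replicates_by_person):
    """Ratio of among-individual to within-individual variance for one gene.

    replicates_by_person maps a (hypothetical) subject ID to the list of
    replicate expression values for that subject; each subject needs at
    least two replicates.
    """
    # Variation among individuals: sample variance of the per-subject means.
    person_means = [mean(reps) for reps in replicates_by_person.values()]
    among = variance(person_means)
    # Variation between replicates: average of the per-subject variances.
    within = mean([variance(reps) for reps in replicates_by_person.values()])
    return among / within if within > 0 else float("inf")


def select_variable_genes(expression, threshold=1.0):
    """Keep genes whose variance ratio exceeds the (assumed) threshold."""
    return [
        gene
        for gene, by_person in expression.items()
        if variance_ratio(by_person) > threshold
    ]
```

In this sketch a gene is retained only when individuals differ from one another more than duplicate hybridizations of the same individual do. A formal analysis of variance would be a natural alternative.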


Genotypes of 2882 autosomal and X-linked SNPs were provided for members of the 14 CEPH Utah families described above. The genotypes were generated by The SNP Consortium [4]. These marker genotypes were used for the mapping of gene expression phenotypes described in Morley et al. [2]. To make it possible to compare linkage results obtained in GAW15, the SNP genotype files that Morley et al. [2] used to generate their linkage results were also provided. These files included the physical locations of the SNP markers.

The genotypes for the HapMap subjects are freely available from the HapMap website. Because those data sets continue to be updated, they were not provided separately for GAW15.

Discussion and possible analyses

The distinctive feature of these data is that genome scan data are provided for more than 3500 quantitative traits at one time. So instead of asking about properties of analysis methods for one phenotype at a time, one can ask about "operating characteristics" over a whole range of traits. Furthermore, because the phenotypes are in some sense all of one type (level of gene expression), it is natural to look for relationships among them, in a way that would not normally be appropriate for a collection of dissimilar traits such as blood pressure, IQ, height, and serum glucose. The data offer an opportunity to look for genes whose expression appears to be co-regulated, and to try to find where the determinants of these expression phenotypes map.
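One simple way to start looking for apparently co-regulated genes is to screen for pairs of expression phenotypes that are strongly correlated across individuals. The sketch below is only an illustration: the function names, the input layout (one list of expression values per gene, in the same subject order), and the 0.9 cutoff are assumptions. It also treats subjects as independent, which family members are not, so the correlations should be read as descriptive.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of expression values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpressed_pairs(expression, min_abs_r=0.9):
    """Return (gene1, gene2, r) for gene pairs with |r| >= min_abs_r.

    expression maps a gene name to its expression values, one per subject,
    with subjects in the same order for every gene.
    """
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= min_abs_r:
            pairs.append((g1, g2, r))
    return pairs
```

With about 3500 phenotypes this loop visits roughly six million pairs, so a vectorized correlation matrix would be the more practical choice at full scale.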

In addition, one might:

  1. Test methods for transforming expression data in a context where some answers are essentially "known."

  2. Explore problems of multiple testing.

  3. Explore replication by subsetting data, and/or by cross-validation and other post-hoc methods. Develop and use permutation tests appropriate for such a collection of data.

  4. Look for interactions of chromosomal regions, or particular genes, etc.
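To make the multiple-testing problem concrete: with thousands of expression phenotypes each scanned against the genome, some correction is needed before declaring any result significant. The sketch below shows the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. This is one standard choice, not a method prescribed for these data, and it assumes independent or positively dependent tests, which correlated expression traits may violate. The function name and the default `alpha` are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Implements the Benjamini-Hochberg step-up rule: find the largest rank k
    such that the k-th smallest p-value is <= alpha * k / m, then reject the
    k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

For example, `benjamini_hochberg([0.001, 0.045, 0.025, 0.5, 0.012])` rejects the first, third, and fifth hypotheses at `alpha=0.05`. A permutation-based threshold would be an alternative that respects the correlation among traits.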


The data provided for Problem 1 offer a unique opportunity to analyze the genetics of variation in expression levels of a large number of genes. The data lend themselves to a wide variety of analyses, including mapping of determinants for individual expression phenotypes, testing for effects of multiple determinants, and technical aspects of microarray interpretation. The participants in GAW15 applied a remarkable variety of approaches, with ingenious new ideas for learning from the data.


  1. Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS: Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003, 33: 422-425. 10.1038/ng1094.

  2. Morley M, Molony CM, Weber T, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.

  3. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.

  4. Matise TC, Sachidanandam R, Clark AG, Kruglyak L, Wijsman E, Kakol J, Buyske S, Chui B, Cohen P, de Toma C, Ehm M, Glanowski S, He C, Heil J, Markianos K, McMullen I, Pericak-Vance MA, Silbergleit A, Stein L, Wagner M, Wilson AF, Winick JD, Winn-Deen ES, Yamashiro CT, Cann HM, Lai E, Holden AL: A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set. Am J Hum Genet. 2003, 73: 271-284. 10.1086/377137.



This work was supported by U.S. National Institutes of Health grants (to RSS and VGC) and by the W.W. Smith Chair (VGC).

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at

Author information


Corresponding author

Correspondence to Richard S Spielman.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Cheung, V.G., Spielman, R.S. Data for Genetic Analysis Workshop (GAW) 15, Problem 1: genetics of gene expression variation in humans. BMC Proc 1 (Suppl 1), S2 (2007).
