Comparison of tagging single-nucleotide polymorphism methods in association analyses


Several methods to identify tagging single-nucleotide polymorphisms (SNPs) are in common use for genetic epidemiologic studies; however, there may be loss of information when using only a subset of SNPs. We sought to compare the ability of commonly used pairwise, multimarker, and haplotype-based tagging SNP selection methods to detect known associations with quantitative expression phenotypes. Using data from HapMap release 21 on unrelated Utah residents with ancestors from northern and western Europe (CEPH-Utah, CEU), we selected tagging SNPs in five chromosomal regions using ldSelect, Tagger, and TagSNPs. We found that SNP subsets did not substantially overlap, and that the use of trio data did not greatly impact SNP selection. We then tested associations between HapMap genotypes and expression phenotypes on 28 CEU individuals as part of Genetic Analysis Workshop 15. Relative to the use of all SNPs (n = 210 SNPs across all regions), most subset methods were able to detect single-SNP and haplotype associations. Generally, pairwise selection approaches worked extremely well, relative to use of all SNPs, with marked reductions in the number of SNPs required. Haplotype-based approaches, which had identified smaller SNP subsets, missed associations in some regions. We conclude that the optimal tagging SNP method depends on the true model of the genetic association (i.e., whether a SNP or haplotype is responsible); unfortunately, this is often unknown at the time of SNP selection. Additional evaluations using empirical and simulated data are needed.


The development and application of methods that use linkage disequilibrium (LD) for single-nucleotide polymorphism (SNP) selection have empowered genetic epidemiologic studies. Tagging SNP selection methods capitalize on the high levels of LD in much of the genome and aim to capture all of the common variation. Reducing SNP redundancy allows for improved information and coverage within the constraints of a fixed budget. Three classes of tagging SNP methods have the following aims: 1) correlate each SNP of interest with a genotyped SNP (pairwise methods), 2) correlate each SNP of interest with a genotyped SNP or a combination of genotyped SNPs (multimarker methods), or 3) explain each haplotype of interest using a set of genotyped SNPs (haplotype-based methods). Investigators commonly select tagging SNPs using data from public projects [1] or a subset of study participants, then genotype only the SNP subset in the larger study population [2, 3].
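As an illustration of the pairwise correlation that these methods rely on, the following is a minimal sketch of the LD measure r² between two biallelic SNPs. It assumes phased haplotypes coded 0/1 are already available; the function name `r_squared` is hypothetical, and the tagging programs discussed here may estimate LD differently (for example, from unphased genotypes).

```python
def r_squared(haplotypes_a, haplotypes_b):
    """Pairwise LD r^2 between two biallelic SNPs.

    Each argument is a list of 0/1 alleles, one entry per chromosome,
    given in the same chromosome order (assumption: phase is known).
    """
    n = len(haplotypes_a)
    if n == 0 or n != len(haplotypes_b):
        raise ValueError("need two equal-length, non-empty allele lists")

    p_a = sum(haplotypes_a) / n          # frequency of allele 1 at SNP A
    p_b = sum(haplotypes_b) / n          # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(haplotypes_a, haplotypes_b)) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP has undefined r^2; treated as 0 here

    d = p_ab - p_a * p_b                 # LD coefficient D
    return d * d / denom
```

Two SNPs whose alleles always travel together give r² = 1, whereas statistically independent SNPs give r² = 0; a pairwise method would let either SNP of the first pair stand in for the other.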

Tagging SNP selection is implemented in commonly used, publicly available software packages that assess data from unrelated individuals (founders) or small families (trios). ldSelect [4] performs pairwise selection using a binning algorithm, Tagger [5] selects SNPs using pairwise and multimarker methods and allows for inclusion of trio data to reduce phase uncertainty, and TagSNPs v. 2.0-beta [6] implements pairwise, multimarker, and haplotype methods allowing for the inclusion of trio data.

We used these tagging SNP selection methods in genomic regions known to harbor associations with quantitative phenotypes [7]. We sought to assess whether (and to what degree) associations would have been detected if SNP subsets, rather than all SNPs, had been used. Previous simulated [8, 9] and family-based [10, 11] analyses suggest that empirical tagging SNP assessment in the context of association testing is needed. Here, we examine associations from analysis of >770,000 HapMap Phase I genotypes and ~1,000 expression phenotypes in 57 unrelated Utah residents with ancestors from northern and western Europe (CEU) [7]. We conducted a pilot study using a subset of samples with HapMap Phase II genotypes and contributed expression phenotypes as part of Genetic Analysis Workshop 15 (GAW15) [12].


Selection of regions to study was based on genetic associations with lymphocyte expression values reported by Cheung et al. [7]. Using linear regression and limiting the data to 28 individuals with both HapMap and GAW15 data (described in more detail below), excluding rs535088 (genotypes not available) and PSPHL (not uniquely mapped), we reassessed the ten most statistically significant genotype-phenotype pairs reported. Regions containing the five strongest associations (Table 1) were defined as 5 kb surrounding the previously reported SNPs and the nearby (cis) gene of interest.

Table 1. Chromosomal regions

Tagging SNP selection within these regions utilized HapMap release 21 CEU genotype data (60 founders or 30 trios) with MAF (or haplotype frequency) ≥ 0.05 and no quality control exclusions [13]. These parameters were chosen on the basis of common use in genetic association studies. From starting sets of "All SNPs", pairwise methods used a threshold of r² ≥ 0.8 between unassayed and assayed SNPs among founders ("ldSelect", "TagSNPs-Rspair") or trios ("TagSNPs-Rspair-trios", "Tagger-pairwise"); multimarker methods used R_s² ≥ 0.8 (or LOD > 3.0) between unassayed SNPs and combinations of up to three assayed SNPs among founders ("TagSNPs-Rs") or trios ("TagSNPs-Rs-trios", "Tagger-multimarker"); haplotype-based methods used R_h² ≥ 0.8 between haplotypes and assayed SNPs among founders ("TagSNPs-Rh") or trios ("TagSNPs-Rh-trios").
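To make the pairwise threshold concrete, the following is a simplified sketch of greedy tag selection at r² ≥ 0.8. It is not the implementation used by ldSelect, Tagger, or TagSNPs; the function name `greedy_pairwise_tags`, the input format (a precomputed symmetric r² matrix), and the tie-breaking rule (lowest index wins) are all assumptions made for illustration.

```python
def greedy_pairwise_tags(snps, r2, threshold=0.8):
    """Greedy pairwise tag SNP selection (illustrative only).

    snps: list of SNP identifiers.
    r2:   symmetric matrix (list of lists); r2[i][j] is the pairwise
          r^2 between snps[i] and snps[j].
    Returns a list of (tag_snp, sorted_snps_captured_by_tag) bins.
    """
    untagged = set(range(len(snps)))
    bins = []

    while untagged:
        def captured(i):
            # Untagged SNPs that SNP i would capture, including itself.
            return {j for j in untagged if j == i or r2[i][j] >= threshold}

        # Choose the SNP capturing the most untagged SNPs.
        # Ties go to the lowest index, an arbitrary choice.
        best = max(sorted(untagged), key=lambda i: len(captured(i)))
        members = captured(best)
        bins.append((snps[best], sorted(snps[j] for j in members)))
        untagged -= members

    return bins


# Toy example: A is in strong LD with B and C; D is independent.
example_snps = ["A", "B", "C", "D"]
example_r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.50, 0.05],
    [0.85, 0.50, 1.00, 0.02],
    [0.10, 0.05, 0.02, 1.00],
]
example_bins = greedy_pairwise_tags(example_snps, example_r2)
```

In the toy example, A tags the bin {A, B, C} and D is selected as a singleton. Because the choice among equally good candidates is arbitrary, two tools applying the same threshold can legitimately return different tag sets.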

Association testing was performed on 28 unrelated CEU individuals included in both HapMap and GAW15 datasets (IDs available upon request) [1, 13]. We used genotypes from HapMap release 21 (coded as 0, 1, and 2) and phenotypes from GAW15 (log2-transformed Affymetrix global-normalized lymphocyte expression values [14]). Single-SNP association testing used linear regression [7]. Haplotype association testing used the Splus library HaploStat [15] excluding haplotypes with estimated n < 5. Haplotypes were defined across each region (haplo.score) as well as by sliding three-SNP windows (haplo.score.slide) [15].
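The single-SNP test described above can be sketched as an ordinary least-squares regression of log2 expression on additive genotype (0, 1, or 2). This is a minimal illustration in plain Python, not the code used in the analysis; the function name `single_snp_regression` and the toy data are hypothetical. It returns the slope, intercept, and t-statistic; converting the t-statistic to a p-value requires a t distribution with n − 2 degrees of freedom, which is omitted here.

```python
import math


def single_snp_regression(genotypes, expression):
    """Least-squares fit of expression on additive genotype (0/1/2).

    Returns (slope, intercept, t_statistic) for the genotype effect.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need at least three paired observations")

    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")

    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g

    # Residual sum of squares and standard error of the slope.
    rss = sum((y - intercept - slope * g) ** 2
              for g, y in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / se if se > 0 else math.inf
    return slope, intercept, t_stat


# Hypothetical toy data: six individuals, expression rising with genotype.
toy_genotypes = [0, 1, 2, 0, 1, 2]
toy_expression = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
toy_slope, toy_intercept, toy_t = single_snp_regression(
    toy_genotypes, toy_expression)
```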


We examined five regions known to harbor genetic associations in a small, well-characterized sample [7]. SNPs in these regions on chromosomes 5, 6, 20, and 21 were associated with lymphocyte expression levels of proteins (LRAP, HLA-DRB2, CPNE1, AA827892, and CSTB) encoded by nearby genes (Table 1). The HapMap project genotyped a total of 210 SNPs (MAF ≥ 0.05 in 60 CEU samples) across these regions (Figures 1–5). The LRAP region included the most HapMap SNPs (n = 72, Table 1) and had strong linkage disequilibrium (LD); the HLA-DRB2 region had a large number of SNPs and low LD; the AA827892 region included only 16 SNPs in strong LD; and the CPNE1 and CSTB regions were of intermediate size with modest or variable LD. Single-SNP association testing in 28 phenotyped individuals yielded p-values < 10⁻⁶ in each region (Figures 1–5). Across regions of strong LD, consistent associations were seen (i.e., nearly identical -log₁₀(p-values)); independent SNPs yielded unique results (Figure 4).

Figure 1. SNPs, single-SNP associations, and LD for LRAP. Underline, original association; Haploview 3.32 plotted r² (white, 0; black, 1) in 60 CEU samples.

Figure 2. SNPs, single-SNP associations, and LD for HLA-DRB2. Underline, original association; Haploview 3.32 plotted r² (white, 0; black, 1) in 60 CEU samples.

Figure 3. SNPs, single-SNP associations, and LD for CPNE1. Underline, original association; Haploview 3.32 plotted r² (white, 0; black, 1) in 60 CEU samples.

Figure 4. SNPs, single-SNP associations, and LD for AA827892. Underline, original association; Haploview 3.32 plotted r² (white, 0; black, 1) in 60 CEU samples.

Figure 5. SNPs, single-SNP associations, and LD for CSTB. Underline, original association; Haploview 3.32 plotted r² (white, 0; black, 1) in 60 CEU samples.

Nine subsets of tagging SNPs were identified within each region (Figures 1–5). In regions with lower LD (HLA-DRB2 and CSTB), more markers were generally required and selected SNPs were less consistent across methods. This may be because there are many possible haplotypes, so haplotype-based methods may estimate varying numbers and frequencies of the haplotypes to tag. In regions with high LD, there was also a lack of consistency across methods. For example, in the AA827892 region, SNPs 10 and 14 are independent and were selected by all methods, yet SNPs 1–9 and 11–13 are in high LD and methods varied in which of them they selected (Figure 4). There were surprising discrepancies in SNP selection across methods that used an identical algorithm (e.g., ldSelect and TagSNPs-Rspair); we attribute this to differences in rounding LD measures. Generally, SNP subsets overlapped among pairwise methods (HLA-DRB2, Figure 2), among haplotype-based methods (CPNE1, Figure 3), among TagSNPs methods with trios and founders (LRAP, Figure 1), and among Tagger pairwise and multimarker methods (CSTB, Figure 5).

We then assessed whether subsets of tagging SNPs detected the strong association signals observed when all SNPs were studied (Table 1). The minimum single-SNP association p-values identified by each subset within each region are provided in Table 2. Single-SNP results in each region were strongest using "All SNPs", but were comparable in SNP subsets that included the strongest SNP or a SNP in strong LD with the strongest SNP (e.g., SNPs 10, 13, and 18 in the CPNE1 region; Figures 1–5). Although all methods identified HLA-DRB2 associations, there was great variation in p-values, most likely due to one particularly strong SNP association (SNP 41) and low LD (except with SNP 44). Multimarker SNP selection methods implemented in TagSNPs (but not Tagger) failed to detect associations with CPNE1 or AA827892 (p > 0.01; Figures 1–5; Table 2), because the selected SNPs (e.g., AA827892 SNP 16) were not in LD with associated SNPs.

Table 2. Single-SNP association results

Although regions were initially chosen on the basis of observed single-SNP associations, we also assessed haplotype associations. Results considering all SNPs in each set (global p-value), and sliding windows of three-SNP haplotypes (minimum global p-value) are shown in Table 3. In all regions using "All SNPs", at least one three-SNP haplotype was associated at p < 0.01; but only the LRAP, CPNE1, and CSTB regions yielded global results significant at this level (Table 3). Comparing across subsets, note that set-haplotype analyses are comparable in terms of number of tests, while three-SNP haplotype analyses are comparable in terms of degrees of freedom. There was general consistency in results across methods for LRAP and AA827892 (regions with strongest LD); however, no subsets detected the strongest three-marker haplotype association for AA827892. There was also consistency in haplotype association results in the HLA-DRB2 region (with low LD); global p-values oscillated around 0.01. Haplotype-based SNP selection methods (TagSNPs-Rh-trios), which selected only two tagging SNPs, failed to detect the CPNE1 haplotype association observed by other methods (Table 3). Multimarker SNP selection methods implemented in TagSNPs (but not Tagger) failed to detect CSTB haplotype associations.
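The sliding-window summary reported above (the minimum global p-value over consecutive three-SNP haplotype windows) can be sketched as follows. This is only an illustration of the windowing step: the haplotype score test itself is not implemented, and `window_test` is a hypothetical caller-supplied function that returns a global p-value for one window of SNPs.

```python
def sliding_windows(snps, size=3):
    """Return all consecutive windows of `size` SNPs, in map order."""
    return [tuple(snps[i:i + size]) for i in range(len(snps) - size + 1)]


def min_window_p(snps, window_test, size=3):
    """Smallest global p-value over all sliding windows.

    window_test: hypothetical callable taking a tuple of SNP ids and
                 returning a global haplotype p-value for that window.
    Returns (p_value, window), or None if there are fewer than `size` SNPs.
    """
    results = [(window_test(w), w) for w in sliding_windows(snps, size)]
    return min(results) if results else None
```

A subset of k tagging SNPs yields only k − 2 three-SNP windows, so the minimum is taken over far fewer tests than with all SNPs; the minimum p-value returned here carries no adjustment for the number of windows examined.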

Table 3. Haplotype association results

Figure 6 summarizes relative signals for associations across SNP subsets as the ratio of [-log(minimum p-value using subset)] to [-log(minimum p-value using all SNPs)]. Generally, haplotype-based selection methods and methods in TagSNPs "missed" more single-SNP and haplotype associations than other methods (Figure 6).
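The relative signal described above is a simple ratio of log-transformed minimum p-values. The sketch below shows the arithmetic; the function name `relative_signal` and the example p-values are hypothetical and are not taken from Tables 2 or 3.

```python
import math


def relative_signal(min_p_subset, min_p_all):
    """Ratio [-log(min p, subset)] / [-log(min p, all SNPs)].

    The base of the logarithm cancels in the ratio. A value near 1 means
    the subset recovered essentially the full signal; a value near 0
    means the association was largely missed.
    """
    if not (0 < min_p_subset <= 1 and 0 < min_p_all < 1):
        raise ValueError("p-values must lie in (0, 1], with min_p_all < 1")
    return math.log10(min_p_subset) / math.log10(min_p_all)


# Hypothetical example: a subset reaching p = 1e-3 where all SNPs reach 1e-6.
half_signal = relative_signal(1e-3, 1e-6)
```

In this hypothetical case the subset retains about half of the signal on the log scale.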

Figure 6. Relative signal strength. [-log(min-p-Subset)]/[-log(min-p-All-SNPs)]; solid line, single-SNP; dashed line, 3-SNP haplotype.


Our ability to combine HapMap genotype data with GAW15 phenotype data provided a unique opportunity to assess chromosomal regions harboring known genetic associations in CEU samples. Although this was only a small pilot study, we explored whether these associations would have been detected if genotyping had been limited to tagging SNPs. The current analysis has advantages over other reported comparisons in that we focused on association testing, on commonly used statistical tools, and on the use of HapMap data.

We make several observations. There was lack of consistency across selected SNP sets whether or not LD was present. Inclusion of trio data did not generally impact SNP selection. For the majority of regions, pairwise approaches worked well, relative to use of all SNPs, with marked reductions in the number of SNPs required. Methods reducing the number of SNPs over pairwise methods (e.g., multimarker methods) may lead to more missed signals, particularly in haplotype association testing. The program TagSNPs did not offer particular advantages over ldSelect or Tagger in terms of number of SNPs chosen or associations detected. Regardless of the method used, typing additional markers in areas of signal may improve signal strength and localization.

The current work suggests that empirical assessment of a larger data set and simulated data addressing a range of genetic models would allow for more precise comparison of approaches. Consideration of coverage, rather than signal strength, and examination of our assumption that signals detected in each region were due to a common underlying genetic cause could further inform comparisons. Additional issues include cost efficiency, transferability of tagging SNPs, and the role of bioinformatics.


The optimal tagging SNP method will depend on the true genetic model of the association. Pairwise or multimarker methods are optimal if the discovery SNP set contains the causal SNP (or a SNP in strong LD with the causal SNP), while haplotype-based methods are optimal if the discovery SNP set defines a haplotype carrying the causal allele. Unfortunately, it is seldom known during the SNP selection phase of a study whether a SNP or a haplotype defines an association. Thus, critical assessment of the utility of available SNP selection methods under a variety of conditions is essential.


  1. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.

  2. Benusiglio PR, Pharoah PD, Smith PL, Lesueur F, Conroy D, Luben RN, Dew G, Jordan C, Dunning A, Easton DF, Ponder BAJ: HapMap-based study of the 17q21 ERBB2 amplicon in susceptibility to breast cancer. Br J Cancer. 2006, 95: 1689-1695. 10.1038/sj.bjc.6603473.

  3. Lu X, Zhao W, Huang J, Li H, Yang W, Wang L, Huang W, Chen S, Gu D: Common variation in KLKB1 and essential hypertension risk: tagging-SNP haplotype analysis in a case-control study. Hum Genet. 2007, 121: 327-335. 10.1007/s00439-007-0340-4.

  4. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74: 106-120. 10.1086/381000.

  5. de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D: Efficiency and power in genetic association studies. Nat Genet. 2005, 37: 1217-1223. 10.1038/ng1669.

  6. Stram DO: Tag SNP selection for association studies. Genet Epidemiol. 2004, 27: 365-374. 10.1002/gepi.20028.

  7. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.

  8. Huang Q, Fu YX, Boerwinkle E: Comparison of strategies for selecting single nucleotide polymorphisms for case/control association studies. Hum Genet. 2003, 113: 253-257. 10.1007/s00439-003-0965-x.

  9. Burkett KM, Ghadessi M, McNeney B, Graham J, Daley D: A comparison of five methods for selecting tagging single-nucleotide polymorphisms. BMC Genet. 2005, 6 (Suppl 1): S71. 10.1186/1471-2156-6-S1-S71.

  10. Duggal P, Gillanders E, Mathias R, Ibay G, Klein K, Baffoe-Bonnie A, Ou L, Dusenberry I, Tsai Y-Y, Chines P, Doan B, Bailey-Wilson J: Identification of tag single-nucleotide polymorphisms in regions with varying linkage disequilibrium. BMC Genet. 2005, 6 (Suppl 1): S73. 10.1186/1471-2156-6-S1-S73.

  11. Chi PB, Duggal P, Kao WH, Mathias RA, Grant AV, Stockton ML, Garcia JG, Ingersoll RG, Scott AF, Beaty TH, Barnes KC, Fallin MD: Comparison of SNP tagging methods using empirical data: association study of 713 SNPs on chromosome 12q14.3-12q24.21 for asthma and total serum IgE in an African Caribbean population. Genet Epidemiol. 2006, 30: 609-619. 10.1002/gepi.20172.

  12. Cordell HJ, de Andrade M, Babron M-C, Bartlett CW, Beyene J, Bickeböller H, Culverhouse R, Cupples LA, Daw EW, Dupuis J, Falk CT, Ghosh S, Goddard KA, Goode EL, Hauser ER, Martin LJ, Martinez M, North KE, Saccone NL, Schmidt S, Tapper W, Thomas D, Tritchler D, Vieland VJ, Wijsman EM, Wilcox MW, Witte JS, Yang Q, Ziegler A, Almasy L, MacCluer JW: Genetic Analysis Workshop 15: gene expression analysis and approaches to detecting multiple functional loci. BMC Proc. 2007, 1 (Suppl 1): S1.

  13. International HapMap Project. Build 35; August 10, 2006.

  14. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.

  15. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.



We acknowledge funding from R01 CA94919, R01 CA104667, and R01 H167406.

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at

Author information

Corresponding author

Correspondence to Ellen L Goode.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Goode, E.L., Fridley, B.L., Sun, Z. et al. Comparison of tagging single-nucleotide polymorphism methods in association analyses. BMC Proc 1 (Suppl 1), S6 (2007).