Volume 3 Supplement 7
Region-based analysis in genome-wide association study of Framingham Heart Study blood lipid phenotypes
© Asimit et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Due to the high-dimensionality of single-nucleotide polymorphism (SNP) data, region-based methods are an attractive approach to the identification of genetic variation associated with a certain phenotype. A common approach to defining regions is to identify the most significant SNPs from a single-SNP association analysis, and then use a gene database to obtain a list of genes proximal to the identified SNPs. Alternatively, regions may be defined statistically, via a scan statistic. After categorizing SNPs as significant or not (based on the single-SNP association p-values), a scan statistic is useful to identify regions that contain more significant SNPs than expected by chance. Important features of this method are that regions are defined statistically, so that there is no dependence on a gene database, and both gene and inter-gene regions can be detected. In the analysis of blood-lipid phenotypes from the Framingham Heart Study (FHS), we compared statistically defined regions with those formed from the top single SNP tests. Although we missed a number of single SNPs, we also identified many additional regions not found as SNP-database regions and avoided issues related to region definition. In addition, analyses of candidate genes for high-density lipoprotein, low-density lipoprotein, and triglyceride levels suggested that associations detected with region-based statistics are also found using the scan statistic approach.
Definition of an appropriate unit of gene function has been identified as a fundamental issue in genetic association analysis using high-dimensional single-nucleotide polymorphism (SNP) data. On one hand, the use of SNPs selected to capture variation across the whole genome may lend itself to treating a single SNP as the unit of analysis for false-positive error control. On the other hand, allocating SNPs into regions and treating the region as the unit of analysis can substantially reduce the dimensionality problem at the genome level, and is natural when the region corresponds to a candidate gene. Neale and Sham put forth an eloquent argument for such a gene-based approach. Given that a set of SNPs deemed to be relevant to a particular candidate region can be identified, the issue of how to evaluate genetic association for the candidate gene/region remains. Application of test statistics for multiple SNP markers within a chromosomal region may help address the problem of multiple testing by increasing the power to detect associations and/or reducing the number of tests conducted.
Scan statistics based on single-SNP tests have been proposed to identify genomic regions associated with disease [3, 4], whereas others consider a class of test statistics with small degrees of freedom (df) that combine information across a set of SNP markers within an identified region. A multi-locus regression-based test statistic that simultaneously tests for main effects of all the SNP loci within a region, ignoring haplotype phase, can be more powerful than haplotype analysis because it allows for association across multiple markers but does not "spend" df on rare haplotypes. At the other extreme, the results of multiple single df tests of SNPs within a candidate region require adjustment for multiple testing. A number of authors compared various test statistics, mainly in the case-control setting, finding that relative performance depends on the density and the correlation structure of the SNPs within a region, the selection criteria and the number of SNP markers, the placement and the number of liability/causal SNPs within a region, as well as on allele frequencies and the presence of allelic heterogeneity.
In this contribution, we apply two region-based approaches to a genome-wide association study (GWAS) analysis of blood lipid measures taken in members of the Offspring Cohort and the Generation 3 Cohort of the Framingham Heart Study (FHS). Initially, we tested each of the 550 k SNPs from the Affymetrix array datasets, one at a time. In an alternate approach, we applied scan statistics based on the single-SNP p-values to identify and test genomic regions simultaneously. Taking a more conventional approach, we also used external information from the UCSC gene database to define gene and inter-gene regions corresponding to single SNPs with small p-values. Within the defined genomic regions, we then applied region-based test statistics using multiple linear regressions of sets of SNPs. We compare the two analytic strategies in GWAS with respect to the SNPs and the regions detected, and also compare the association test results in a set of regions defined by candidate lipid genes.
We analyzed the Genetic Analysis Workshop 16 FHS Offspring Cohort (n = 2584) and Generation 3 Cohort (n = 3811) using the SNP genotypes from the GeneChip Human Mapping 500 k Array and the 50 k Human Gene Focused Panel, together with the blood lipid phenotypes. All family members within these cohorts who had been genotyped and phenotyped were included in the analysis.
Definition of phenotypes
Fasting total cholesterol, high-density lipoprotein (HDL) cholesterol and triglycerides (TG) were measured at up to four exams for the Offspring Cohort and at one exam for the Generation 3 Cohort. Low-density lipoprotein (LDL) cholesterol was calculated using the Friedewald formula (Total = HDL + LDL + TG/5) for each measurement. For patients on lipid-lowering medication, the actual total cholesterol and TG values were imputed following the method of Kathiresan et al. Imputation models were obtained separately by sex, and the sequential imputation process was performed separately within age-sex subgroups (10-year groups). TG values were log-transformed. The phenotype values were averaged over the multiple exams, as were the corresponding covariate values. We adjusted the mean HDL, mean LDL, and mean TG values for the averaged covariates using linear regression and treated the residuals as the phenotype values for the genotype-phenotype analysis. Two covariate models were used for the adjustment of phenotypes, separately by sex: Model 1: age and age², and Model 2: age, age², body mass index, alcohol intake, and cigarette smoking.
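As a minimal sketch of the phenotype construction described above, the following Python code derives LDL from the Friedewald relation and averages the per-exam values for one participant. The function names and the dictionary layout are hypothetical, concentrations are assumed to be in mg/dL, and TG is assumed to be log-transformed before averaging; imputation for treated participants and the covariate adjustment are not shown.

```python
import math

def friedewald_ldl(total_chol, hdl, tg):
    """LDL cholesterol from the Friedewald relation Total = HDL + LDL + TG/5.

    Assumes all inputs are fasting concentrations in mg/dL.
    """
    return total_chol - hdl - tg / 5.0

def average_phenotypes(exams):
    """Average HDL, derived LDL, and log(TG) over a participant's exams.

    `exams` is a hypothetical list of dicts with keys "total", "hdl", "tg".
    Assumption: TG is log-transformed per exam and then averaged.
    """
    n = len(exams)
    ldl = [friedewald_ldl(e["total"], e["hdl"], e["tg"]) for e in exams]
    return {
        "mean_hdl": sum(e["hdl"] for e in exams) / n,
        "mean_ldl": sum(ldl) / n,
        "mean_log_tg": sum(math.log(e["tg"]) for e in exams) / n,
    }

if __name__ == "__main__":
    exams = [
        {"total": 200.0, "hdl": 50.0, "tg": 150.0},
        {"total": 220.0, "hdl": 60.0, "tg": 100.0},
    ]
    print(average_phenotypes(exams))
```

The averaged values would then be regressed on the averaged covariates, and the residuals carried forward as phenotypes.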
Quality control of SNP genotype data
Quality control was completed using the computer programs PLINK and Eigenstrat. SNPs were excluded if they had a minor allele frequency <1%, a Hardy-Weinberg equilibrium test p-value <10⁻¹⁰, or a call rate <90%. Samples with a call rate <90% were also excluded. There were no outliers for exclusion, as determined using Eigenstrat.
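The SNP-level filters can be written as a single predicate. This is only an illustrative sketch: the function name and argument names are hypothetical, the Hardy-Weinberg criterion is assumed to refer to the test p-value, and the filtering in the study itself was done with PLINK rather than with code like this.

```python
def passes_snp_qc(maf, hwe_p, call_rate,
                  min_maf=0.01, min_hwe_p=1e-10, min_call_rate=0.90):
    """Return True if a SNP survives the three filters described in the text.

    maf        -- minor allele frequency (0 to 0.5)
    hwe_p      -- p-value of the Hardy-Weinberg equilibrium test (assumed meaning)
    call_rate  -- fraction of samples with a non-missing genotype
    """
    return maf >= min_maf and hwe_p >= min_hwe_p and call_rate >= min_call_rate

if __name__ == "__main__":
    print(passes_snp_qc(maf=0.05, hwe_p=0.30, call_rate=0.99))   # common, well-called SNP
    print(passes_snp_qc(maf=0.005, hwe_p=0.30, call_rate=0.99))  # too rare
```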
Individual level single-SNP association analysis
Linear regression of each of the residual phenotypes (Mean-HDL, Mean-LDL, Mean-TG) was performed using PLINK for each of the 550 k SNPs that passed filtering, based on a simple regression of additive SNP coding, including all individuals and ignoring familial correlation. Departures from the expected asymptotic distributions were assessed via quantile-quantile (Q-Q) plots for each of the phenotypes.
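To make the single-SNP test concrete, here is a small sketch of a simple linear regression of a residual phenotype on an additively coded genotype (0, 1, or 2 copies of the minor allele). It is not PLINK's implementation: the function name is hypothetical, and the two-sided p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples. As in the text, familial correlation is ignored.

```python
import math

def additive_snp_test(genotypes, phenotype):
    """Simple regression of phenotype on additive genotype coding.

    Returns (beta, standard error, two-sided p-value). The p-value uses a
    normal approximation (assumption), adequate for large sample sizes.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_g
    # Residual sum of squares around the fitted line
    rss = sum((y - alpha - beta * g) ** 2 for g, y in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value

if __name__ == "__main__":
    g = [0, 1, 2, 0, 1, 2]
    y = [0.1, 1.0, 2.1, -0.1, 1.0, 1.9]
    print(additive_snp_test(g, y))
```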
Region identification and testing via scan statistics
The scan statistic approach identifies regions of significant SNPs and tests for regional significance. It requires the SNP position and the p-value for association at that position. A group of SNPs tends to be identified as a region if there is statistical evidence of clustering of positions and of small p-values. The locations of SNPs along a chromosome are assumed to follow a Poisson process. To detect regions of association, the original Poisson process is partitioned into two independent Poisson processes, according to a chosen p-value threshold level. The resulting sets of SNP locations are both Poisson processes, with rates proportional to the original process. When the assumption of independent processes is violated, some regions may be detected solely because of their marker correlation structure, so to reduce the correlation among SNPs, we pruned the data by choosing tagSNPs with pairwise linkage disequilibrium (LD) R² < 0.5.
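The partition step relies on the thinning property of Poisson processes. As a sketch in notation that is not the authors' own (here λ denotes the rate of SNP locations per unit distance and π the probability that a SNP falls below the chosen p-value threshold, both introduced for illustration), and assuming each SNP is labelled independently of its position and of the other SNPs:

```latex
N(t) \sim \mathrm{Poisson}(\lambda t)
\;\Longrightarrow\;
\begin{cases}
N_{\mathrm{sig}}(t) \sim \mathrm{Poisson}(\pi \lambda t), \\
N_{\mathrm{non}}(t) \sim \mathrm{Poisson}\bigl((1-\pi)\lambda t\bigr),
\end{cases}
\qquad
N_{\mathrm{sig}} \text{ and } N_{\mathrm{non}} \text{ independent.}
```

Under this assumption the distances between consecutive significant SNPs are independent exponential variables with rate πλ. LD between neighbouring SNPs breaks the independent-labelling assumption, which is why pruning is applied first.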
Using the statistical package R, we identified regions of association by evaluating windows along the chromosome including varying numbers of SNPs, and tested for region-level significance. The regional p-value is the probability of observing the same number of significant markers over a distance as short as or shorter than observed. The scan statistic is simply the distance spanned by the group of markers of interest, i.e., the sum of inter-marker distances. Under Poisson process assumptions of independent and identically distributed exponential inter-SNP distances, the scan statistic follows a gamma distribution, so that the probability of a high association cluster is a gamma cumulative distribution function. If this observed regional probability is smaller than a pre-specified significance criterion, then the group of markers is identified as a cluster of significant associations not likely to occur simply by chance. Genome-wide regional p-values were calculated empirically, using 10,000 permutations of the tag-SNP p-values across positions. In each permutation we kept the top n regions, where n is the number of identified regions in the original analysis.
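The regional p-value calculation can be sketched as follows. This Python code is an illustration under stated assumptions, not the R implementation used in the study: the function names are hypothetical, the distance spanned by k significant SNPs is treated as the sum of k − 1 exponential gaps (a gamma distribution with integer shape k − 1), and the rate of the significant-SNP process is estimated as the number of significant SNPs divided by the span of all positions. The permutation step for genome-wide p-values is not shown.

```python
import math

def erlang_cdf(distance, shape, rate):
    """P(X <= distance) for X ~ Gamma(shape, rate) with integer shape >= 1."""
    x = rate * distance
    term = 1.0
    total = 1.0
    for i in range(1, shape):
        term *= x / i          # x^i / i!
        total += term
    return 1.0 - math.exp(-x) * total

def scan_regions(positions, pvalues, snp_threshold=0.01,
                 regional_threshold=0.001, max_snps=5):
    """Find windows of 2..max_snps consecutive significant SNPs that are
    closer together than expected under a Poisson process.

    Returns a list of (start, end, number of SNPs, regional p-value).
    """
    sig = sorted(pos for pos, p in zip(positions, pvalues) if p < snp_threshold)
    span = max(positions) - min(positions)
    rate = len(sig) / span     # assumed estimator of the significant-SNP rate
    hits = []
    for k in range(2, max_snps + 1):
        for i in range(len(sig) - k + 1):
            d = sig[i + k - 1] - sig[i]   # sum of k - 1 inter-marker distances
            p_region = erlang_cdf(d, k - 1, rate)
            if p_region < regional_threshold:
                hits.append((sig[i], sig[i + k - 1], k, p_region))
    return hits

if __name__ == "__main__":
    positions = [i * 10_000 for i in range(100)] + [500_100, 500_200]
    pvalues = [0.5] * 100 + [0.001, 0.001]
    pvalues[50] = 0.001        # SNP at position 500,000
    for region in scan_regions(positions, pvalues):
        print(region)
```

The default thresholds mirror the SNP-level (0.01) and regional (0.001) values reported in the results; `max_snps` is an arbitrary cap chosen for the example.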
Region identification and testing via database-defined regions
Using the UCSC database, a list of regions meeting genome-wide criteria for significance (p < 10⁻⁴) was formed from the single-SNP tests. If a SNP was within ± 5 kb of a gene, then the assigned gene region was the gene endpoints ± 5 kb. Otherwise, the SNP position ± 5 kb was classified as an inter-gene region. In each of the gene and inter-gene regions thus defined, we performed region-based analyses using multi-variable regression of k SNPs within the defined region using the generalized estimating equations (GEE) robust variance to account for familial correlation, and the linear regression model: $E(\text{residual lipid phenotype}) = \alpha + \beta_1 x_{G_1} + \beta_2 x_{G_2} + \cdots + \beta_k x_{G_k}$. For test statistics, we calculated the global k df test (Hotelling's test), the Schaid test (1 df linear combination of SNP-specific test statistics), and the James min P test (correlation-adjusted minimum p-value). To address SNP collinearity and reduce dimensionality, we repeated these analyses using principal components constructed from within-region SNPs.
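The region-definition rule can be sketched in a few lines of Python. The function name, the tuple layout, and the gene name in the example are hypothetical, gene coordinates are assumed to be on the same chromosome as the SNP, and when several genes lie within 5 kb the first match is returned, a simplification the text does not address.

```python
def assign_region(snp_pos, genes, flank=5_000):
    """Map a significant SNP to a gene or inter-gene region.

    genes -- list of (name, start, end) tuples on the SNP's chromosome
    Returns (region type, gene name or None, region start, region end).
    """
    for name, start, end in genes:
        if start - flank <= snp_pos <= end + flank:
            # SNP lies within the gene or its 5 kb flanks: use gene endpoints +/- flank
            return ("gene", name, start - flank, end + flank)
    # No gene nearby: the region is the SNP position +/- flank
    return ("inter-gene", None, snp_pos - flank, snp_pos + flank)

if __name__ == "__main__":
    genes = [("GENE_A", 100_000, 120_000)]   # placeholder gene, not from the study
    print(assign_region(97_000, genes))
    print(assign_region(50_000, genes))
```

All SNPs falling inside the returned interval would then enter the multi-SNP regression for that region.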
Results and discussion
Markers from the 500 k chip, pruned for LD (R² < 0.5), were used as input to the scan statistic analysis. The proportion of markers retained per chromosome ranged from 36 to 52%, with a mean of 40%. We specified a SNP p-value threshold of 0.01 and a regional threshold of 0.001. We categorized a scan statistic region as a gene region if it overlapped with a defined gene region (± 5 kb), and called the remaining regions non-gene regions. For HDL, 135 gene and 105 non-gene regions were detected genome-wide, with similar proportions for LDL and TG (133/110 and 100/104 for gene/non-gene, respectively).
Comparison of scan statistic regions with single-SNP tests for HDL having p-values < 10⁻⁴
Scan statistic regions
SNPs missed by scan statistic regions
Total no. SNPs
Comparison of scan statistic regions with SNP-database regions defined from single-SNP tests for HDL having p-values < 10⁻⁴
Regions detected only by scan statistic
Total no. regions
Non-gene scan statistic
Gene scan statistic
Region-based tests of candidate genes for lipid phenotypes
Gene-based analysis (p-values)a
Scan statistic analysis
No. SNPs (No. PCs)
Global LR test
James min P test
Empirical GW p-valueb
7.96 × 10⁻²⁸
3.32 × 10⁻²⁰
3.81 × 10⁻¹⁶
4.72 × 10⁻¹⁷
<1.0 × 10⁻⁵
7.54 × 10⁻⁷
8.95 × 10⁻⁷
8.52 × 10⁻⁶
1.06 × 10⁻⁸
9.42 × 10⁻⁴
1.67 × 10⁻⁶
1.12 × 10⁻³
2.51 × 10⁻⁸
1.50 × 10⁻³
4.72 × 10⁻¹⁷
<1.0 × 10⁻⁵
4.27 × 10⁻⁴
1.87 × 10⁻⁴
6.15 × 10⁻⁴
7.81 × 10⁻²⁶
<1.0 × 10⁻⁵
7.81 × 10⁻²⁶
<1.0 × 10⁻⁵
2.43 × 10⁻²⁵
2.43 × 10⁻²⁵
1.21 × 10⁻²⁵
4.20 × 10⁻⁶
2.67 × 10⁻⁵
3.80 × 10⁻⁵
9.91 × 10⁻⁶
1.82 × 10⁻⁸
1.10 × 10⁻³
2.33 × 10⁻¹¹
5.41 × 10⁻¹¹
2.06 × 10⁻⁹
9.40 × 10⁻¹⁰
2.22 × 10⁻⁴
5.52 × 10⁻⁴
1.09 × 10⁻⁴
1.38 × 10⁻³
6.09 × 10⁻¹¹
4.69 × 10⁻⁵
8.38 × 10⁻¹⁴
2.78 × 10⁻¹⁴
6.81 × 10⁻¹²
4.64 × 10⁻¹⁰
4.75 × 10⁻⁵
3.23 × 10⁻¹¹
1.70 × 10⁻¹¹
1.84 × 10⁻⁹
1.27 × 10⁻¹⁶
<1.0 × 10⁻⁵
8.98 × 10⁻¹³
8.17 × 10⁻¹⁰
2.46 × 10⁻¹¹
5.51 × 10⁻⁶
We consider chromosomal regions as the unit of analysis, rather than SNPs, so that the dimensionality problem is reduced at the genome-level. However, when using the scan statistic, the issue of criteria for genome-wide significance is difficult to address because the dimension of the problem is not well defined with testing of many possible overlapping regions consisting of different window sizes. Here we used positional permutation of p-values to obtain genome-wide regional p-values.
In using the statistically defined regions without referring to the top SNPs, it appears that although we missed a number of significant single SNPs, we also identified many additional regions not found as SNP-database regions. The scan-statistic approach could also be used as a first stage in GWAS analysis, followed by within-region fine-mapping and/or direct sequencing. Once a region is detected, both approaches require follow-up with additional analyses to assess specific SNP variation within a region.
List of abbreviations used
FHS: Framingham Heart Study
GEE: Generalized estimating equations
GWAS: Genome-wide association study
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This research was supported by research grants from the Canadian Institutes of Health Research (CIHR MOP-84287) and the Network of Centres of Excellence in Mathematics. JLA was supported by a post-doctoral fellowship from the Canadian Breast Cancer Foundation.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
1. Clark AG, Boerwinkle E, Hixson J, Sing CF: Determinants of the success of whole-genome association testing. Genome Res. 2005, 15: 1463-1467. 10.1101/gr.4244005.
2. Neale BM, Sham PC: The future of association studies: gene-based analysis and replication. Am J Hum Genet. 2004, 75: 353-362. 10.1086/423901.
3. Sun YV, Levin AM, Boerwinkle E, Robertson H, Kardia SL: A scan statistic for identifying chromosomal patterns of SNP association. Genet Epidemiol. 2006, 30: 627-635. 10.1002/gepi.20173.
4. Sun YV, Jacobsen DM, Turner ST, Boerwinkle E, Kardia SLR: Fast implementation of a scan statistic for identifying chromosomal patterns of genome-wide association studies. Comput Stat Data Anal. 2009, 53: 1794-1801. 10.1016/j.csda.2008.04.013.
5. Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN: Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet. 2005, 76: 780-793. 10.1086/429838.
6. Clayton D, Chapman J, Cooper J: Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol. 2004, 27: 415-428. 10.1002/gepi.20032.
7. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006. [http://genome.ucsc.edu/cite.html]
8. Kathiresan S, Manning AK, Demissie S, D'Agostino RB, Surti A, Guiducci C, Gianniny L, Burtt NP, Melander O, Orho-Melander M, Arnett DK, Peloso GM, Ordovas JM, Cupples LA: A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med Genet. 2007, 8 (suppl 1): S17. 10.1186/1471-2350-8-S1-S17.
9. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
10. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
11. James S: Approximate multinormal probabilities applied to correlated multiple endpoints in clinical trials. Stat Med. 1991, 10: 1123-1135. 10.1002/sim.4780100712.
12. Gauderman WJ, Murcray C, Gilliland F, Conti DV: Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007, 31: 383-395. 10.1002/gepi.20219.
13. Sandhu MS, Waterworth DM, Debenham SL, Wheeler E, Papadakis K, Zhao JH, Song K, Yuan X, Johnson T, Ashford S, Inouye M, Luben R, Sims M, Hadley D, McArdle W, Barter P, Kesäniemi YA, Mahley RW, McPherson R, Grundy SM, Wellcome Trust Case Control Consortium, Bingham SA, Khaw KT, Loos RJ, Waeber G, Barroso I, Strachan DP, Deloukas P, Vollenweider P, Wareham NJ, Mooser V: LDL-cholesterol concentrations: a genome-wide association study. Lancet. 2008, 371: 483-491. 10.1016/S0140-6736(08)60208-1.
14. BROAD Institute. [http://www.broad.mit.edu/diabetes/scandinavs/metatraits.html]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.