Volume 5 Supplement 9
Detecting disease rare alleles using single SNPs in families and haplotyping in unrelated subjects from the Genetic Analysis Workshop 17 data
© Kraja et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
We present an evaluation of discovery power for two association tests that work well with common alleles but are applied to the Genetic Analysis Workshop 17 simulations with rare causative single-nucleotide polymorphisms (SNPs) (minor allele frequency [MAF] < 1%). The methods used were genome-wide single-SNP association tests based on a linear mixed-effects model for discovery and applied to the familial sample and sliding windows haplotype association tests for replication, implemented within causative genes in the unrelated individuals sample. Both methods are evaluated with respect to the simulated trait Q2. The linear mixed-effects model and haplotype association tests failed to detect the rare alleles of the simulated associations. In contrast, the linear mixed-effects model and haplotype association tests detected effects for the most important simulated SNPs with MAF > 1%. We conclude that these findings reflect inadequate statistical power (the result of small simulated samples) for the complex genetic model that underlies these data.
Based on evolutionary history, it is expected that human genomes will be arranged into local regions with high haplotype similarity intermingled with local regions with haplotype diversity. In the diverse haplotype regions, rare alleles may play an important role in creating the foundation for individual subjects’ susceptibility or resistance to a particular disease. It is a common practice in genetic research to identify causative disease regions in the framework of association using single single-nucleotide polymorphism (SNP) tests or by grouping neighboring regions under identifiable haplotypes and to test their association with disease and quantitative traits. Genetic Analysis Workshop 17 (GAW17) provides 200 replicates of simulated data of a family-based cohort with eight large families and 200 replicates of simulated data for unrelated individuals. This simulated problem is challenging, because both sets of data are relatively small (n = 697). We investigate the two problems independently to determine whether the family-based association tests have power to detect rare allele effects and whether rare allele effects in the simulated genes can also be detected by haplotype analysis of the unrelated individuals sample.
The GAW17 data represent 200 replicates of simulated phenotypes for a sample of 697 subjects organized in 8 large families (referred in this article as the familial sample) and 200 replicates of simulated phenotypes for a separate sample of data of 697 unrelated individuals (referred to here as the unrelated sample). Genotypes from the 1000 Genomes Project were used as the genotype sample for the unrelated sample. The GAW17 simulation authors  used the family data set, by means of the program CHRSIM , to drop the phased founder genotypes throughout the rest of the pedigree by considering a single obligate crossover event occurring on each chromosome. The same two genotype sets were used for all 200 phenotypic simulation replicates for the familial or unrelated sample.
We analyzed the unrelated sample genotypes for linkage disequilibrium using HaploView software (version 4.2), with the purpose of identifying tag SNPs . The options we used in a batch mode run of HaploView for identifying tag SNPs were –pairwiseTagging and –tagrsqcutoff 0.8. We used the number of discovered tag SNPs as a denominator for extrapolating the Bonferroni genome-wide significance threshold for a single-SNP association test (see Results section).
where β1g measures the change in Y as a result of the additive change in the genotype g∈(−1, 0, +1) where −1 is the code for the homozygote genotype for the minor allele, 0 is the code for the heterozygote, and +1 is the code for the homozygote for the major allele. We implemented association tests using a statistical mixed model on the familial sample via the mixed procedure (PROC MIXED) of SAS (version 9.2, Linux OS). The “repeated” statement was used with sub=pedid, which defines the structure of the R matrix, and the covariance structure was selected as UN. We tested all 24,487 SNPs included in the simulation, although we had prior knowledge of the GAW17 simulation answers. With such prior knowledge we focused on trait Q2.
Q2 was simulated as a quantitative trait, influenced primarily by 72 SNPs in 13 genes, with 1–15 functional variants per gene and with minor allele frequencies (MAFs) ranging from 0.07% to 17.07%. The residual heritability of Q2 was simulated to be 29%. Most of the genes affecting the Q2 trait were selected to be related to cardiovascular disease risk and inflammation, and they are located on chromosomes 2, 3, 6–12, and 17. Before the LME association tests, we performed a stepwise regression for Q2 within Sex to remove the effects of Age and Age2. As a result, we produced a Q2 residual, which we then used as the dependent variable Y in our analyses. In the statistical analyses, the variable Sex was included as a covariate (β2cov).
with a chi-square distribution and degrees of freedom equal to the rank of V β .
Finally, we use two definitions to summarize the LME results: sensitivity and specificity. Sensitivity is the proportion of true-positive causative SNPs that have a significant test result in 200 replications; and specificity is the proportion of true-negative SNPs that have a negative test result in 200 replications. Both sensitivity and specificity are expressed as percentages.
Detecting association of very rare alleles (MAF < 1%) with disease or quantitative traits in familial data of small sample size is challenging. One would have guessed that in the GAW17 data set, because of its simulation as a transmission of selected alleles from the unrelated sample into the familial sample, one or more rare alleles would have been transmitted as common alleles in the familial sample. In the simulated data such conversion from rare alleles into common ones with the existing small sample sizes (from unrelated founders into familial data) was not present. Therefore in both sets of genotypic data, simulations were performed mostly on the very rare alleles in 96% of the 72 selected SNPs for Q2. The average −log10p-value of genome-wide association tests on the simulated data was less than the significance threshold. This illustrates the necessity of finding other methods that manage association tests for rare alleles.
With the massive data dumps coming from new sequencing technologies, rare alleles will be captured more often in the data and will be the aim of many studies. Moreover, it is believed that rare alleles are the gold dust that confer the rest of disease heritability. It is possible that, in contrast to this simulation, sample sizes in real data will become larger as the costs of sequencing decrease. But in the existing case we needed to find ways to increase our statistical power to identify the rare causative variants in association with disease or quantitative traits. Therefore we used a two-step analysis design: the LME model for discovering genome-wide causative SNPs or genes in the familial sample and haplotype association tests for replicating primary findings, but now in the unrelated sample.
The haplotype association test has more power to identify the SNPs’ effect on disease and quantitative traits. The haplotype association tests produced significant results for several common haplotypes, including those with a frequency close to 1%. However, this method also failed to identify any other significant haplotypes in association with Q2 from the very rare simulated SNPs (MAF < 1%). Based on our results for 200 replications, we conclude that failing to detect the very rare causative SNPs was not a failure of the methods but more a reflection of the low statistical power available with the small sample size of the simulated data.
The genotypic data analyzed from the 1000 Genomes Project (sampled real data) based on unrelated individuals show that for the selected gene regions there is an important saving of about 53% when tag SNPs are used rather than the full sequence. The other dimension of the data, the sample size of 697 individuals used in this simulation, was small, and thus although sequence data were available, the power of association analysis was low. Therefore the additive single-SNP association tests based on LME models in small samples of data failed to detect the simulated causative very rare polymorphisms. The association tests based on haplotype analyses improved the p-value strength by replicating discovered causative polymorphisms, but they still failed with very rare alleles, reflecting the inadequate statistical power of small sample sizes in the simulated data.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Speer MC, Terwilliger JD, Ott J: Data simulation for GAW9 problems 1 and 2. Genet Epidemiol. 1995, 12: 561-564. 10.1002/gepi.1370120606.View ArticlePubMedGoogle Scholar
- Barrett JC, Fry B, Maller J, Daly MJ: HaploView: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.View ArticlePubMedGoogle Scholar
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.