Copy number variation in the Framingham Heart Study

Shtir, Corina; Pique-Regi, Roger; Siegmund, Kim; Morrison, John; Schumacher, Fredrick; Marjoram, Paul

doi:10.1186/1753-6561-3-S7-S133

Volume 3 Supplement 7

Genetic Analysis Workshop 16

Proceedings
Open access
Published: 15 December 2009

Copy number variation in the Framingham Heart Study

Corina Shtir¹,
Roger Pique-Regi^2,3,
Kim Siegmund¹,
John Morrison¹,
Fredrick Schumacher¹ &
…
Paul Marjoram¹

BMC Proceedings volume 3, Article number: S133 (2009) Cite this article

752 Accesses
3 Citations
Metrics details

Abstract

In this paper we test for association between copy number variation and diabetes in a subset of individuals from the Framingham Heart Study. We used the 500 k SNP data and called copy number variation using two algorithms: the genome alteration detection algorithm of Pique-Regi et al. and the software Golden Helix. We then tested for association between copy number and diabetes using a gene-based analysis. Our results show little evidence of association between copy number and diabetes status. Furthermore, our results indicate a relatively poor level of agreement between copy number calls resulting from the two programs. We then examined potential causes for this difference in results and the implications for future studies.

Background

One of the most important challenges facing the field of biology today is making sense of genetic variation. One form of variation that is gaining an increasing level of attention is copy number variation (CNV). The term CNV is used to encompass a variety of polymorphisms that occur at the sequence level and that affect the copy number of regions of genetic material. Examples include insertions, deletions, rearrangements, and duplications. While humans have CNV, the degree of that CNV is only now becoming appreciated [1, 2], and its influences on phenotypic variation is not yet well understood. However, while a number of associations between single-nucleotide polymorphisms (SNPs) and disease have been verified, it remains the case that the effect sizes of these polymorphisms appear relatively small, with reported odds ratios typically being below 1.5 (see Estivill and Armengol [3] for a recent review). While the reasons for this remain unclear, it is plausible that SNP data are not capturing all the relevant associations. Consequently, in this paper we implement population-based tests for association between CNV and phenotypic variation within the Framingham Heart Study samples provided as part of Genetic Analysis Workshop 16 Problem 2.

Methods

Copy number inference

CNV inference algorithms start by analyzing 'intensities' for each probe at each SNP location. Intensities are generated from .cel files via the Affymetrix Power Tools software [4]. Our major focus is then on two CNV-calling methods: 1) the genome alteration detection algorithm (GADA) of Pique-Regi et al. [5], and 2) the HelixTree copy-number analysis module (CNAM) [6]. Both methods begin by normalizing intensities for SNPs across all chips. Such normalization is designed to correct for problems such as batch effects (the intensities for a sample may depend globally upon the batch in which the sample was analyzed) or local defects on the genotyping hardware. The CNAM analysis uses a proprietary normalization scheme for which details are not available. The GADA analysis employs a median normalization step [5] followed by the widely used normalization scheme within the Affymetrix Power Tools suite [4].

GADA copy number analysis employs three main steps: 1) it uses a compact linear algebra representation for the genome copy number from normalized probe intensities, 2) it applies a sparse Bayesian learning technique, and 3) it uses a backward-elimination procedure that ranks the inferred points from the previous step and also efficiently adjusts the accuracy trade-off between sensitivity and false-discovery rate. Pique-Regi et al. [6] tested their method on a variety of actual and simulated data sets and concluded that it achieved the highest accuracy, lowest false-discovery rate, and was, by several orders of magnitude, faster than competing methods.

CNAM employs an optimal segmenting algorithm that searches regions of markers in which log₂ ratios vary significantly from region to region and therefore confer variation in copy number. Log₂ ratios are created by normalizing raw intensity data against a reference sample. The CNAM segmenting process is optimized through 1) subdivision of the chromosomal region of markers into a moving window of sub-regions and 2) a permutation algorithm that validates the found cut-points. We used a moving window of 10,000 markers. One thousand permutations were used to assess significance.

Association between variation in copy number and phenotype

One of the central motivations for better understanding patterns of CNV is to then relate those patterns to phenotypic variation. Such applications are in their infancy, and in this paper we propose a relatively straightforward technique for addressing this issue. The output from the CNV calling algorithms we employ here can be regarded as a piece-wise constant function that indicates the estimated number of copies of each allele for each individual over the entire length of the genome. Changes in state of this function correspond to changes in (estimated) copy number. The points at which these changes in copy number occur are likely to be correlated across individuals, due to shared ancestry, but will certainly still vary across individuals. Thus, when summarizing the output of such an algorithm across an entire sample of interest, we break the genome up into a set of intervals, I= {I_j}, j= 1, ..., n, such that the copy number does not change for any individual in the sample within any interval. For convenience, we define intervals to start and end at locations corresponding to the midpoint between neighboring SNPs. Clearly, it is most sensible to use a set I that contains the minimal number of intervals, which means that the end of an interval will correspond to a point at which estimated copy number changes for at least one member of the sample.

Our analysis method focuses directly on genes. For a given gene we calculate the mean copy number across all segments contained within that gene for each individual. Because estimated copy number is not obviously normally distributed, we then test for a difference in the rank of estimated copy number between cases and controls using a Wilcoxon rank-sum test. This results in a p-value for each gene.

Results

Our analysis currently focuses on a subset of Offspring Cohort individuals with diabetes. In order to protect against undiagnosed diabetes, we insisted that cases had a fasting plasma glucose measure of at least 126 mg/dl, while controls had a fasting measure less than 110 mg/dl. The controls were also frequency-matched to cases by 5-year age intervals, sex, and ever smoked status (age and smoking history taken at baseline). Some individuals are related, but the majority are not (at least by the pedigree information provided). This resulted in a final sample size of 194 cases and 213 controls, for which we analyzed the 500 k SNP data. We then called copy number. Space prohibits a display of estimated copy number here, but, in summary, we observed that previously known sites of copy number variations [2] were detected as such in this data set, and that, for each method, there is clear correlation between the locations at which copy number changes across individuals (despite the fact that the algorithm call copy number independently for each individual). As expected, the number of changes of copy number in a gene was directly proportional to its length (correlation coefficient = 0.89).

Table 1 presents a summary of the results for the GADA and CNAM analyses for each chromosome. No gene attains genome-wide significance when tested for association between copy number and diabetes. However, some interesting features are observed. For example, under the null hypothesis of no association, p-values should be uniformly distributed, with each gene having a probability of 0.05 of being reported as 'significant' on this basis. Thus, for each chromosome, the number of genes showing association at the level p < 0.05 will be binomially distributed. We used this to test whether there was an over- or under-representation of such associated genes on any chromosome. After a multiple comparison correction, several chromosomes showed evidence of excess, or lack, of associated genes. Furthermore, there is a clear excess of small p-values for the CNAM analysis. This can be seen in Figure 1, where we show a quartile-quartile (Q-Q) plot of observed vs. expected p-values genome-wide (we plot -log₁₀ of the p-values). The Q-Q plots reflect the relative over- or under-abundance of small p-values resulting from the CNAM or GADA analysis. In Figure 2 we show how the distribution of the genes with the smallest p-values when tested for association with diabetes varies along the genome for the GADA and CNAM analyses. There appears to be little agreement between the two methods. This raises the question of why the results of the methods show such poor agreement among these genes. Recall that the first step for both methods is the normalization of SNP intensities, but that the two methods employ different normalization techniques. We therefore examined the degree of agreement between the SNP intensities after normalization. We show this in Figure 3 (left side), a scatter plot of the intensities resulting from the two schemes. The results are striking: there is very poor correlation between the normalized intensities. Thus, even if the methods were using identical routines to call copy number, which they are not, we would expect to obtain widely different calls of copy number, with consequent large differences between p-values resulting from subsequent tests of association between copy number and phenotype. We show the poor agreement in copy number in Figure 3 (right side). It is important to note that both methods use somewhat arbitrary definitions of upper and lower cut-points (CU and CL, respectively), defined such that if the normalized intensity for a SNP is below CL it is determined to have a copy number less than two, whereas if normalized intensity is above CU the copy number is called as greater than two; otherwise the copy number is called as two. These arbitrary cut-points are different for the two methods and, combined with the differing results from normalization, result in wildly different calls of copy number. Most often both methods call copy number as two, but it is seldom the case that the methods simultaneously determine copy number to be other than two for any given SNP. This, clearly, is a concern.

Table 1 Summary of gene-based CNV associations for GADA based segmentation (Wilcoxon rank-sum test)

Full size table

Conclusion

The importance of CNV has only recently become appreciated. The relationship of CNV to phenotypic variation is even less well developed. While we hope the analysis presented in this paper is a useful step forward in this area, it merely scratches the surface of what is likely to be an extremely complex challenge. CNV lacks some of the 'neatness' of SNP data. It does not occur at well defined positions (i.e., the points at which CNV changes is often different across individuals). Furthermore, for a variety of reasons, the 500 k chip platforms are not ideal for detecting CNV when compared with more recent platforms. It is also the case that many functional mutations occur outside genes (in promoter regions, for example). As such, many regions of CNV may not be detected. It is also likely to be challenging to detect small CNVs using these technologies.

An approach in which we break the genome into regions of maximal length, such that copy number remains constant for each individual within each region, results in an extremely large number of regions (around 82,000 regions for the samples analyzed in this paper). While each region can be treated as if it were a (multi-allelic) locus, and marginal tests can then be performed, such tests are likely to be far from optimal.

The principal reason for this is the correlation between intervals. This is directly analogous to the situation with SNP analysis, but may well be even more complex in this new setting. An alternative approach might be to attempt to relate SNP variation to CNV, and adopt an approach akin to the tag-SNP idea.

In this paper we attempted to move past the perils of a strategy based upon marginal tests by conducting a gene-at-a-time analysis, relating mean copy number within a gene to phenotype. Such an approach seems reasonable, succeeds in reducing the analysis to a reasonable number of tests, and thereby avoids the worst excesses of multiple comparison corrections. However, it should be noted that the very poor agreement in results from the two methods explored in this paper indicates that substantial work remains to be done. It is this lack of agreement that represents the principal lesson to be drawn from the current study. Normalization is regarded as an important, but somewhat routine step in analyses such as these. However, our paper demonstrates that the particular method of normalization chosen can have a key influence on the results obtained. In our case, the two normalization methods are both widely used, and appear inherently sensible, but result in normalized intensities that are very poorly correlated across methods. Consequently, subsequent analyses will produce wildly different results. As the phrase "garbage in, garbage out" reminds us, it is important to ensure that such normalization routines are adding signal, rather than noise to the data. As such, there is an urgent need for a widespread comparison of normalization methods in order to better assess which of them is most effective.

Finally, it should also be noted that, while we look at all genes in the present study, there is little reason a priori to expect candidate genes chosen on the basis of a SNP study to also have a function due to CNV. It is entirely possible that genes that affect phenotype through CNV will be distinct from those that have an effect due to SNP polymorphism. In a recent study, Stranger et al. examined data from Phase 1 of HapMap and noted that SNPs and CNVs captured 83.6% and 17.7%, respectively, of the total detected genetic variation in the expression of around 15,000 genes, but that "the signals from the two types of variation had little overlap" [7]. Other studies take a different view (e.g., McCarroll et al. [8]).

Abbreviations

CNAM:: Copy-number analysis module
CNV:: Copy-number variation
GADA:: Genome alteration detection algorithm
Q-Q:: Quantile-quartile
SNP:: Single-nucleotide polymorphism.

References

Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M: Large-scale copy number polymorphism in the human genome. Science. 2004, 305: 525-528. 10.1126/science.1098918.
Article CAS PubMed Google Scholar
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
Article PubMed Central CAS PubMed Google Scholar
Estivill X, Armengol L: Copy number variants and common disorders: filling the gaps and exploring complexity in genome-wide association studies. PLoS Genet. 2007, 3: 1787-1799. 10.1371/journal.pgen.0030190.
Article CAS PubMed Google Scholar
Affymetrix Power Tools. [http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx]
Pique-Regi R, Monso-Varona J, Ortega A, Seeger R: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008, 24: 309-318. 10.1093/bioinformatics/btm601.
Article PubMed Central CAS PubMed Google Scholar
CNAM. [http://goldenhelix.com/SNP_Variation/CNAM/index.html]
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007, 315: 848-853. 10.1126/science.1136678.
Article PubMed Central CAS PubMed Google Scholar
McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, Bakker PIWD, Maller JB, Kirby A: Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008, 40: 1166-1174. 10.1038/ng.238.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors gratefully acknowledge funding from NIH grants GM069890, MH084678, HG002790, EY03040, P30 ES07048, CA097346 and NEI grant U10-EY-11753. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.

This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.

Author information

Authors and Affiliations

Department of Preventive Medicine, Keck School of Medicine, University of Southern California, 1540 Alcazar Street, CHP 220, Los Angeles, California, 90089, USA
Corina Shtir, Kim Siegmund, John Morrison, Fredrick Schumacher & Paul Marjoram
Department of Electrical Engineering, Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
Roger Pique-Regi
Department of Pediatrics, Children's Hospital Los Angeles, Keck School of Medicine, University of Southern California, 4650 Sunset Boulevard, Los Angeles, California, 90027, USA
Roger Pique-Regi

Authors

Corina Shtir
View author publications
You can also search for this author in PubMed Google Scholar
Roger Pique-Regi
View author publications
You can also search for this author in PubMed Google Scholar
Kim Siegmund
View author publications
You can also search for this author in PubMed Google Scholar
John Morrison
View author publications
You can also search for this author in PubMed Google Scholar
Fredrick Schumacher
View author publications
You can also search for this author in PubMed Google Scholar
Paul Marjoram
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Paul Marjoram.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CS, RPR, KS, RJ and FS developed the methodology and performed the analysis. PM conceived the study. PM and CS wrote the paper.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Shtir, C., Pique-Regi, R., Siegmund, K. et al. Copy number variation in the Framingham Heart Study. BMC Proc 3 (Suppl 7), S133 (2009). https://doi.org/10.1186/1753-6561-3-S7-S133

Download citation

Published: 15 December 2009
DOI: https://doi.org/10.1186/1753-6561-3-S7-S133

Genetic Analysis Workshop 16

Copy number variation in the Framingham Heart Study

Abstract

Background