Case-control studies with affected sibships

Köhler, Karola; Sohns, Melanie; Bickeböller, Heike

doi:10.1186/1753-6561-1-S1-S29

Volume 1 Supplement 1

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

Proceedings
Open access
Published: 18 December 2007


BMC Proceedings volume 1, Article number: S29 (2007)


Abstract

Related cases may be included in case-control association studies if correlations between related individuals due to identity-by-descent (IBD) sharing are taken into account. We derived a framework to test for association in a case-control design including affected sibships and unrelated controls. First, a corrected variance for the allele frequency difference between cases and controls was directly calculated or estimated in two ways on the basis of the fixation index F_ST and the inbreeding coefficient. Then the correlation-corrected association test including controls and affected sibs was carried out. We applied the three strategies to 20 candidate genes on the Genetic Analysis Workshop 15 rheumatoid arthritis data and to 9187 single-nucleotide polymorphisms of replicate one of the Genetic Analysis Workshop 15 simulated data with knowledge of the "answers". The three strategies used to correct for correlation give only minor differences in the variance estimates and yield an almost correct type I error rate for the association tests. Thus, all strategies considered to correct the variance performed quite well.

Background

It is desirable to include related cases in case-control studies because pedigrees of multiple affected individuals have a higher expected frequency of susceptibility allele(s), leading to increased power [1]. Several methods have been proposed to test for association in case-control designs that take correlations due to IBD sharing into account [1–4]. Most of these determine correlations of related individuals based on prior kinship coefficients assuming no linkage under the hypothesis of no association. Only Slager and Schaid [4] incorporate individual identity-by-descent (IBD) estimates from previous linkage analyses. A comparison of the two strategies with respect to their power has been presented by Bourgain [5]. To integrate both strategies in one model we derive a unified framework to test for association including affected sibships and unrelated controls and apply the introduced test statistics to the candidate gene data set of Plenge et al. [6] as well as a replicate of the simulated single-nucleotide polymorphism (SNP) genome data.

Methods

Notation and assumptions

The study sample contains n₁ cases and n₀ controls (n₁ + n₀ = n) with corresponding allele frequencies p₁ and p₀ and common frequency p under the null hypothesis of no association. There are m cases from sibships with at least two sibs and n₁ - m independent cases. At the candidate locus, each individual has two alleles, X_i1 and X_i2 (i = 1,...,n), coded as 0/1. Usually only the genotype X_i. = X_i1 + X_i2 is known. For all individuals the affection status y_i = 0/1 is given. The cases from families comprise k = 1,...,K sibships of size m_k, and z_i denotes the sibship of individual i. For the cases, the X_ij values have a Bernoulli(p₁) distribution. Cases from different sibships are assumed to be independent, whereas cases from the same sibship are not. To describe the correlation structure between sibs we use a model from population genetics that considers a population consisting of different subpopulations, based on the fixation index F_ST and the inbreeding coefficient F_IT. Sibships are regarded as small subpopulations, and F_ST denotes the correlation between two randomly chosen alleles of two individuals from the same sibship. Under the assumption of no population structure, correlations within sibships only arise from IBD sharing between sibs, and F_ST equals the expected kinship coefficient between two siblings. F_IT measures the correlation of the two alleles within an individual and equals 0 under the assumption of random mating and no further population structure.

The test statistic

Based on the correlations F_ST and F_IT, the true variance of the numerator of the allelic χ²-test statistic can be calculated. One component is the sum of all alleles from cases of sibships, $S = \sum_{i : y_{i} = 1,\, z_{i} \in \{1, \ldots, K\}} X_{i\cdot}$. Its true variance can be calculated as

$$\mathrm{Var}(S) = p_{1} (1 - p_{1}) \, 2m \left[ 1 + F_{IT} + 2 F_{ST} \left( \frac{\sum_{k} m_{k}^{2}}{m} - 1 \right) \right],$$

where the term in square brackets, in the following denoted by γ, is the variance inflation in comparison to the variance of the sum of alleles from independent cases. If the data set only consists of affected sib pairs, the inflation factor simplifies to γ = 1 + F_IT + 2F_ST. The total numerator can be expressed as the estimated allele frequency difference between cases and controls
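As a minimal sketch of the variance formula for S, the inflation factor γ can be computed from a list of sibship sizes. The function names and the example sibship sizes are illustrative assumptions and are not taken from the authors' R implementation.

```python
def inflation_gamma(sibship_sizes, f_st=0.25, f_it=0.0):
    """Bracketed term gamma in Var(S) for affected sibships of sizes m_k."""
    m = sum(sibship_sizes)                        # total number of cases from sibships
    sum_sq = sum(mk * mk for mk in sibship_sizes)  # sum over k of m_k^2
    return 1.0 + f_it + 2.0 * f_st * (sum_sq / m - 1.0)


def var_allele_sum(p1, sibship_sizes, f_st=0.25, f_it=0.0):
    """Var(S) = p1 (1 - p1) * 2m * gamma."""
    m = sum(sibship_sizes)
    return p1 * (1.0 - p1) * 2.0 * m * inflation_gamma(sibship_sizes, f_st, f_it)


# Hypothetical designs: three sib pairs, and one pair plus one trio.
gamma_pairs = inflation_gamma([2, 2, 2])   # reduces to 1 + F_IT + 2 F_ST
gamma_mixed = inflation_gamma([2, 3])
```

For sib pairs only, the sketch reproduces the simplified form γ = 1 + F_IT + 2F_ST; larger sibships give a larger γ for the same F_ST.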

$$T = \frac{1}{2 n_{1}} \sum_{i : y_{i} = 1} X_{i\cdot} - \frac{1}{2 n_{0}} \sum_{i : y_{i} = 0} X_{i\cdot}.$$

Under the null hypothesis of no association, its variance can be derived by dividing the sum of alleles within cases into two parts: one for affected sib pairs and one for independent cases, leading to

$$\mathrm{Var}_{\gamma}(T) = p (1 - p) \left( \frac{m \gamma + n_{1} - m}{2 n_{1}^{2}} + \frac{1}{2 n_{0}} \right).$$
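The following sketch shows how T and its corrected variance might be combined into a test. It assumes that p is estimated by the pooled allele frequency and that T² divided by the variance is referred to a χ² distribution with one degree of freedom; the text does not spell out either detail, and the function name and toy genotypes are hypothetical.

```python
import math


def corrected_allelic_test(case_genotypes, control_genotypes, m, gamma):
    """Allele frequency difference T, its variance under H0, chi-square, p-value.

    Genotypes are coded 0/1/2 (X_i. = X_i1 + X_i2); m is the number of cases
    from sibships and gamma the inflation factor (gamma = 1 ignores relatedness).
    """
    n1, n0 = len(case_genotypes), len(control_genotypes)
    case_sum, control_sum = sum(case_genotypes), sum(control_genotypes)
    t = case_sum / (2 * n1) - control_sum / (2 * n0)
    # Assumption: p under H0 is estimated by the pooled allele frequency.
    p_hat = (case_sum + control_sum) / (2 * (n1 + n0))
    if p_hat in (0.0, 1.0):
        raise ValueError("monomorphic marker: variance is zero")
    var_t = p_hat * (1 - p_hat) * ((m * gamma + n1 - m) / (2 * n1 ** 2) + 1 / (2 * n0))
    chi2 = t * t / var_t
    # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(chi2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return t, var_t, chi2, p_value


# Toy data: six cases (the first four form two sib pairs) and six controls.
cases = [2, 2, 1, 1, 1, 0]
controls = [1, 0, 0, 1, 0, 0]
naive = corrected_allelic_test(cases, controls, m=4, gamma=1.0)
corrected = corrected_allelic_test(cases, controls, m=4, gamma=1.5)
```

With γ > 1 the variance grows, so the corrected statistic is smaller and its p-value larger than for the uncorrected allelic test on the same data.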

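The variance inflation of the allelic χ²-test, which uses the uncorrected variance $\mathrm{Var}_{\gamma=1}(T)$, is defined as $\lambda = \mathrm{Var}_{\gamma}(T) / \mathrm{Var}_{\gamma=1}(T)$.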
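Because p(1 - p) cancels in the ratio, λ depends only on the design (n₁, n₀, m) and on γ. The sketch below illustrates this for the design of the simulated data (1500 affected sib pairs, i.e. 3000 cases all from sibships, and 2000 controls); the helper names are assumptions made for this example.

```python
def variance_factor(n1, n0, m, gamma):
    """Var_gamma(T) divided by p (1 - p)."""
    return (m * gamma + n1 - m) / (2 * n1 ** 2) + 1 / (2 * n0)


def lambda_inflation(n1, n0, m, gamma):
    """lambda = Var_gamma(T) / Var_{gamma=1}(T)."""
    return variance_factor(n1, n0, m, gamma) / variance_factor(n1, n0, m, 1.0)


def gamma_sib_pairs(f_st, f_it=0.0):
    """Inflation factor when all related cases are affected sib pairs."""
    return 1.0 + f_it + 2.0 * f_st


# Simulated-data design: 3000 cases from 1500 sib pairs, 2000 controls.
lam_025 = lambda_inflation(3000, 2000, 3000, gamma_sib_pairs(0.25))
lam_030 = lambda_inflation(3000, 2000, 3000, gamma_sib_pairs(0.30))
```

For this design the two values come out at about 1.20 and 1.24, which matches the inflation factors reported for the simulated data in the Results.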

Strategies to determine the correlations F_ST and F_IT

To estimate Var(T), different strategies for determining F_ST and F_IT were investigated. In strategy I ("no linkage"), F_ST is calculated directly under the assumptions of no linkage and F_IT = 0. Here F_ST corresponds to the prior kinship coefficient of a sib pair, F_ST = 0.25, since 2F_ST is the probability that two alleles from the same parent of a sib pair are IBD. In the two other strategies, F_ST is estimated to account for regions of linkage where the true F_ST is larger than 0.25.

In strategy II ("ANOVA"), F_ST and F_IT are estimated by analysis of variance based on the marker data of the affected sibships at the candidate locus [7]. This strategy has no further assumptions and is based on a partitioning of the total sum of squares into three sums of squares: within individuals, within sibships, and between sibships. Each of them describes the additional variance compared to the lower level in the given order. Because F_ST and F_IT can be expressed as ratios of variance components, estimates for F_ST and F_IT can be derived as functions of the sums of squares.

Strategy III ("MULTI") uses a multipoint F_ST estimate assuming F_IT = 0, requiring genotype information at adjacent markers, e.g., for cases previously analyzed for linkage with these markers. F_ST can be directly estimated from the estimated mean number Y of alleles IBD within the affected sib pairs. The expectation of Y can be expressed as E(Y) = 2N·2F_ST, where $2 N = \sum_{k = 1}^{K} m_{k} (m_{k} - 1)$ is the total number of allelic pairs considered and 2F_ST is the probability that such an allele pair is IBD. The estimated number Y of alleles IBD has to be calculated from individual IBD estimates. If there are only affected sib pairs in the data (N = K), Y can be derived from the nonparametric linkage score (NPL- or Z-score), which is then equivalent to the classical mean test statistic $Z = (Y - K) / \sqrt{K / 2}$. Here the same IBD measure is used as in linkage analysis.
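Solving E(Y) = 2N·2F_ST for F_ST suggests a simple plug-in estimate, F_ST ≈ Y/(4N), and for sib pairs Y can be recovered from the Z-score as Y = K + Z·√(K/2). The sketch below follows this reading of the formulas; it is not the authors' implementation, the function names are hypothetical, and the Z-score value is made up for illustration.

```python
import math


def fst_from_ibd_count(y, sibship_sizes):
    """Plug-in F_ST from the estimated number Y of alleles IBD (E(Y) = 2N * 2 F_ST)."""
    two_n = sum(mk * (mk - 1) for mk in sibship_sizes)  # 2N, allelic pairs considered
    return y / (2 * two_n)


def ibd_count_from_npl(z, k):
    """Invert Z = (Y - K) / sqrt(K / 2) for K affected sib pairs."""
    return k + z * math.sqrt(k / 2)


def fst_from_npl(z, k):
    """F_ST estimate for K affected sib pairs with Z-score z (N = K)."""
    return ibd_count_from_npl(z, k) / (4 * k)


fst_null = fst_from_npl(0.0, 1500)     # no excess IBD sharing
fst_linked = fst_from_npl(3.0, 1500)   # hypothetical Z-score in a linked region
```

A Z-score of zero returns the prior kinship coefficient 0.25, and positive Z-scores, as expected in regions of linkage, give estimates above 0.25.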

To evaluate the strategies we implemented the test statistic in the statistical software R. For strategy I, F_ST = 0.25 was used; for strategy II, F_ST was estimated in the ANOVA framework implemented in R; and for strategy III, we calculated NPL-scores with Merlin.

Application to data from a candidate gene study for rheumatoid arthritis

The proposed methods were applied to case-control data from 20 candidate genes for rheumatoid arthritis previously analyzed by Plenge et al. [6]. The 839 cases were from the North American Rheumatoid Arthritis Consortium (NARAC) and include 717 cases from affected sibships and 122 unrelated cases. The 855 unrelated controls were selected from healthy individuals who were enrolled in the New York Cancer Project (NYCP). Because strategy III requires additional data, we only investigated the introduced test statistics based on strategies I and II, together with the traditional allelic χ²-test based on allele frequencies ignoring familial correlations. We compared our results to those of Plenge et al. [6], who analyzed the same sample with only a few additional individuals.

Application to the simulated data

Additionally, the SNP genome data from Replicate 1 of the simulated Genetic Analysis Workshop 15 data were analyzed knowing the solutions. The data contain 1500 families of two parents and an affected sib pair and 2000 controls. We calculated our test statistics based on strategies I-III for all 9187 SNPs of the genome scan comparing 3000 cases to 2000 controls. Subsequently, in order to remove true associations, we excluded SNPs in a region around ±3 cM of simulated disease loci to analyze data simulated under the null hypothesis of no association but allowing for linkage. For the remaining SNPs we verified the type I error rate of the test statistics. We also analyzed chromosome 6 containing the major disease locus to concentrate the analysis on a region of known linkage.

Results

Results for the candidate gene study for rheumatoid arthritis

Table 1 contains the candidate genes that show a significant association based on the traditional allelic χ²-test. It shows whether these associations remain significant after accounting for the IBD sharing of the cases. In the ANOVA model F_ST is slightly underestimated, being below 0.25. Thus in this example the p-values for the "no linkage" strategy are slightly more conservative than for ANOVA. The variance inflation λ of the allelic χ²-test is estimated at around 1.20–1.25. The exact value depends on the strategy of estimating F_ST and the number of missing values. At a significance level of 0.05, all test statistics remain significant with the correct variance estimate. If a Bonferroni-corrected significance level of 0.0025 is used, PTPN22, CTLA, and SUMO4/rs237025 (unexpected direction) are significant for the two-sided allelic χ²-test. For the test statistics accounting for familial correlations, only PTPN22 clearly remains significant, CTLA is no longer significant, and the p-value for SUMO4 is very close to the significance level.

Table 1 Results for selected candidate genes


Results for the simulated data

Figure 1 shows the estimated F_ST values for chromosome 6. As expected, the multipoint F_ST estimation is more stable than the single-point ANOVA method. However, even with the single-point method, the F_ST estimate is in most cases larger than 0.25, thus accounting for linkage correctly. For the simulated data an F_ST value of 0.25 leads to an inflation factor of λ = 1.2, whereas F_ST = 0.3 corresponds to λ = 1.24. Because of this small difference between the inflation factors, the method to determine F_ST is expected to have only a minor impact on the test statistic. After excluding regions of true associations, 9055 SNPs remained, including 627 out of 674 SNPs on chromosome 6. Figure 2 shows the observed type I error rate for the different test statistics. The results for the entire genome indicate that the allelic χ²-test is far too liberal. In contrast, the observed type I error rates for the test statistics accounting for familial correlations are all very close to each other and within the expected range for all significance levels up to 0.1. The separate analysis of chromosome 6 confirms that even in a region of known linkage there is only a minor difference between the three strategies, with the "no linkage" strategy being the most liberal.
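The observed type I error rate at a level α is the fraction of null SNPs with a p-value at or below α. The sketch below computes this fraction and one possible "expected range", a normal-approximation binomial interval that assumes independent tests. The text does not state how the expected range was obtained, so this interval is an assumption, and the p-values used are an artificial evenly spaced grid, not the workshop data.

```python
import math


def empirical_type1_rate(p_values, alpha):
    """Fraction of tests rejected at level alpha among SNPs simulated under H0."""
    return sum(p <= alpha for p in p_values) / len(p_values)


def expected_range(alpha, n_tests, z=1.96):
    """Approximate 95% range for the observed rate, assuming independent tests."""
    se = math.sqrt(alpha * (1 - alpha) / n_tests)
    return alpha - z * se, alpha + z * se


# Artificial, perfectly uniform p-values standing in for a well-calibrated test.
p_values = [(i + 0.5) / 1000 for i in range(1000)]
rate = empirical_type1_rate(p_values, 0.05)

# Range for the 9055 SNPs that remained after excluding associated regions.
low, high = expected_range(0.05, 9055)
```

Because neighbouring SNPs are correlated, an interval based on independent tests is likely somewhat too narrow and should be read as a rough guide only.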

Conclusion

If related cases are included in a case-control study, the allelic χ²-test can lead to an increased rate of false positives, as indicated by the simulations and the real data analysis. All strategies to correct the variance perform quite well and lead to an almost correct type I error rate on the entire genome. In the presence of linkage, test statistics based on estimating the correlations from data are somewhat superior, but a single-point strategy based on the candidate gene data seems sufficient. Moreover, our conclusions are consistent with the simulation results of Bourgain [5], who observed only a minor difference in power between the association test of Slager and Schaid [4] based on IBD estimates and the test of Bourgain et al. [2] based on prior kinship coefficients.

References

1. Risch N, Teng J: The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. I. DNA pooling. Genome Res. 1998, 8: 1273-1288.
2. Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS: Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet. 2003, 73: 612-626. 10.1086/378208.
3. Browning SR, Briley JD, Briley LP, Chandra G, Charnecki JH, Ehm MG, Johansson KA, Jones BJ, Karter AJ, Yarnall DP, Wagner MJ: Case-control single-marker and haplotypic association analysis of pedigree data. Genet Epidemiol. 2005, 28: 110-122. 10.1002/gepi.20051.
4. Slager SL, Schaid DJ: Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001, 68: 1457-1462. 10.1086/320608.
5. Bourgain C: Comparing strategies for association mapping in samples with related individuals. BMC Genet. 2005, 6 (Suppl 1): S98. 10.1186/1471-2156-6-S1-S98.
6. Plenge R, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, Gregersen PK, Klareskog L, Rioux JD: Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet. 2005, 77: 1044-1060. 10.1086/498651.
7. Excoffier L: Analysis of population subdivision. Handbook of Statistical Genetics. Edited by: Balding DJ, Bishop M, Cannings C. 2000, Chichester: Wiley, 271-307.


Acknowledgements

This work was supported in part by the Federal Ministry of Education and Research BMBF – German National Genome Research Network NGFN (01GR0462, 01GS0422).

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Department of Genetic Epidemiology, Georg-August-University Goettingen, Medical School, Humboldtallee 32, D-37073, Goettingen, Germany
Karola Köhler, Melanie Sohns & Heike Bickeböller


Corresponding author

Correspondence to Heike Bickeböller.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Köhler, K., Sohns, M. & Bickeböller, H. Case-control studies with affected sibships. BMC Proc 1 (Suppl 1), S29 (2007). https://doi.org/10.1186/1753-6561-1-S1-S29
