An integrated genome-wide association analysis on rheumatoid arthritis data
BMC Proceedings volume 1, Article number: S35 (2007)
We propose a nonparametric association analysis that combines family and unrelated case-control genotype data. Under the assumption of Hardy-Weinberg equilibrium, we formed a group of affected individuals to compare with a group of unaffected individuals.
Comparison with the traditional case-control chi-square test and the transmission-disequilibrium test shows that this new approach has noticeably improved power. All analyses were based on the simulated rheumatoid arthritis data provided by Genetic Analysis Workshop 15. For the situation of population stratification, we also suggest an approach to adjust the genotype data using principal components. However, the Genetic Analysis Workshop 15 data were simulated without population stratification. All analyses were done without knowledge of the answers.
Traditional linkage analysis has achieved great success in the genetic dissection of Mendelian diseases caused by a single gene with large effect. However, it is well known that association analysis has more power than linkage analysis for complex diseases such as rheumatoid arthritis (RA). Genome-wide association studies are now widely planned and carried out, owing to biotechnical improvements and decreasing experimental costs. Traditional association study designs are either family-based or based on unrelated case-control subjects. Here we demonstrate an integrated association analysis using both family and unrelated simulation data on RA from Genetic Analysis Workshop 15 (GAW15).
Simulated data without population stratification
The RA data set was simulated according to familial patterns and other environmental effects. Each of the 100 replicates has 1500 nuclear families, each consisting of one affected sibling pair (ASP) and their parents, and 2000 unrelated unaffected individuals as controls. Markers include 730 microsatellite markers, 9187 evenly distributed SNPs on the 22 autosomes, and 17,820 dense SNPs on chromosome 6.
In the analysis, we used the first 200 families and the first 200 of the 2000 controls. To include unrelated cases, we randomly picked one of the two affected siblings from each of the next 200 families. Our final data set includes 200 families, 200 unrelated cases, and 200 controls. Among the 200 selected families, there were 56 families with a single affected parent and two families with both parents affected.
In the most general setting, we form one group of all affected individuals (affected siblings, affected parents, and unrelated cases) and compare it with a group of all unaffected individuals (unaffected siblings, unaffected parents, and unrelated controls). Depending on the number of affected parents, there are three possible groupings for a family with r affected siblings with genotypes x_1, ..., x_r; s unaffected siblings with genotypes y_1, ..., y_s; and parents with genotypes x_m and x_f. Here, genotypes x and y denote the number of copies of a particular allele whose frequency is p. Suppose the data contain l families with both parents unaffected, m families with one affected parent (say the mother), n families with both parents affected, and additionally u unrelated cases w_i, i = 1, ..., u, and v controls z_i, i = 1, ..., v. The allele frequencies of the two groups are given by:
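Written out from the group definitions above, each estimator is the total allele count in a group divided by twice the number of individuals in that group. The form below is a reconstruction from those definitions rather than a quotation of the authors' displayed equations. The family index i, the second subscript on sibling genotypes, the family sets L, M, and N (both parents unaffected, mother affected, both parents affected), and the group sizes n_a and n_u are notation assumed here for illustration. Both parents are assumed to be genotyped.

```latex
\begin{aligned}
n_a &= (l+m+n)\,r + m + 2n + u, \qquad
n_u = (l+m+n)\,s + 2l + m + v, \\[4pt]
p_a &= \frac{1}{2 n_a}\left[
  \sum_{i \in \mathcal{L}\cup\mathcal{M}\cup\mathcal{N}} \sum_{j=1}^{r} x_{ij}
  + \sum_{i \in \mathcal{M}} x_{i,m}
  + \sum_{i \in \mathcal{N}} \left(x_{i,m} + x_{i,f}\right)
  + \sum_{i=1}^{u} w_i \right], \\[4pt]
p_u &= \frac{1}{2 n_u}\left[
  \sum_{i \in \mathcal{L}\cup\mathcal{M}\cup\mathcal{N}} \sum_{j=1}^{s} y_{ij}
  + \sum_{i \in \mathcal{L}} \left(x_{i,m} + x_{i,f}\right)
  + \sum_{i \in \mathcal{M}} x_{i,f}
  + \sum_{i=1}^{v} z_i \right].
\end{aligned}
```

With r = 2, s = 0, l = 140, m = 56, n = 2, and u = v = 200, these definitions give n_a = 656 affected and n_u = 536 unaffected individuals.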
We then use a normal test statistic z = (p_a - p_u) / sqrt(Var(p_a - p_u)), which is a generalization of Risch and Teng's result. In particular, Var(p_a - p_u) = Var(p_a) + Var(p_u) - 2Cov(p_a, p_u). Assuming Hardy-Weinberg equilibrium, each term is given below:
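As a worked derivation, and not necessarily in the authors' exact notation, the three terms can be obtained under the null hypothesis of no association with Hardy-Weinberg equilibrium and random mating. The assumptions used here are the standard ones: a single genotype has variance 2p(1 - p), two siblings or a parent and child have genotype covariance p(1 - p), and the two parents and all unrelated individuals are uncorrelated. The symbols q, n_a, and n_u are shorthand introduced for this sketch, and all parents are assumed to be genotyped.

```latex
\begin{aligned}
q &= p(1-p), \qquad
n_a = (l+m+n)\,r + m + 2n + u, \qquad
n_u = (l+m+n)\,s + 2l + m + v, \\[4pt]
\operatorname{Var}(p_a) &= \frac{q}{4 n_a^{2}}
  \left[\, l\,r(r+1) + m\,(r+1)(r+2) + n\,(r+1)(r+4) + 2u \,\right], \\[4pt]
\operatorname{Var}(p_u) &= \frac{q}{4 n_u^{2}}
  \left[\, l\,(s+1)(s+4) + m\,(s+1)(s+2) + n\,s(s+1) + 2v \,\right], \\[4pt]
\operatorname{Cov}(p_a, p_u) &= \frac{q}{4 n_a n_u}
  \left[\, l\,r(s+2) + m\,(rs + r + s) + n\,s(r+2) \,\right].
\end{aligned}
```

As a check, with no families (l = m = n = 0) the covariance vanishes and the variance reduces to p(1 - p)/(2u) + p(1 - p)/(2v), the usual variance of an allele frequency difference between unrelated cases and controls.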
Here p is the estimated average allele frequency across all subjects in the data. For our final data, r = 2; s = 0; l = 140; m = 56; n = 2, and u = v = 200.
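The following is a minimal sketch of how such a combined statistic might be computed from group-level allele counts. The variance and covariance expressions in the code are derived here under the null hypothesis with Hardy-Weinberg equilibrium and random mating (genotype variance 2p(1 - p), sibling and parent-child covariance p(1 - p)), so they are an assumption of this sketch and may differ from the authors' exact expressions. The function names `group_sizes` and `combined_z` are hypothetical, and all parents are assumed to be genotyped.

```python
import math


def group_sizes(l, m, n, r, s, u, v):
    """Number of affected and unaffected individuals implied by the design.

    l, m, n: families with 0, 1, 2 affected parents; r, s: affected and
    unaffected siblings per family; u, v: unrelated cases and controls.
    """
    n_a = (l + m + n) * r + m + 2 * n + u
    n_u = (l + m + n) * s + 2 * l + m + v
    return n_a, n_u


def combined_z(sum_aff, sum_unaff, l, m, n, r, s, u, v):
    """Normal statistic comparing allele frequencies of the two groups.

    sum_aff / sum_unaff: total copies of the tested allele among all
    affected / unaffected individuals (families and unrelated subjects).
    Variance terms are an assumption of this sketch (see lead-in).
    """
    n_a, n_u = group_sizes(l, m, n, r, s, u, v)
    p_a = sum_aff / (2 * n_a)
    p_u = sum_unaff / (2 * n_u)
    # Pooled allele frequency over all subjects.
    p = (sum_aff + sum_unaff) / (2 * (n_a + n_u))
    q = p * (1 - p)
    var_a = q * (l * r * (r + 1) + m * (r + 1) * (r + 2)
                 + n * (r + 1) * (r + 4) + 2 * u) / (4 * n_a ** 2)
    var_u = q * (l * (s + 1) * (s + 4) + m * (s + 1) * (s + 2)
                 + n * s * (s + 1) + 2 * v) / (4 * n_u ** 2)
    # Relatives appear in both groups, so the two estimates are correlated.
    cov = q * (l * r * (s + 2) + m * (r * s + r + s)
               + n * s * (r + 2)) / (4 * n_a * n_u)
    return (p_a - p_u) / math.sqrt(var_a + var_u - 2 * cov)


if __name__ == "__main__":
    # Design reported in the text: r=2, s=0, l=140, m=56, n=2, u=v=200.
    print(group_sizes(140, 56, 2, 2, 0, 200, 200))
```

The positive covariance term reduces the variance of the difference, which is why treating relatives in the two groups as independent would be conservative in this setting.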
In the presence of population stratification
In the situation of population stratification, we suggest adjusting the genotype data using principal components before the above procedures are applied. Unfortunately, the RA data were simulated without a population stratification effect, so we give only a brief idea of the method here. The rationale is that allele frequency differences should show a consistent pattern across the genome, and that pattern is summarized by principal components to which many markers contribute. We sketch the procedure below; details may be found in Price et al.
First, pick the founders from each family and all unrelated cases and controls. Denote the genotype at the ith locus for the jth individual by g_ij, i = 1, ..., M and j = 1, ..., N. Let u_i be the sample mean for the ith locus and X = (x_ij) the matrix normalized by subtracting u_i from each row and dividing by a locus-specific scale factor, as in Price et al.
Second, compute the estimated covariance matrix of all markers, and list the k largest eigenvalues λ_1, ..., λ_k with corresponding eigenvectors v_1, ..., v_k. The lth eigenvector v_l = (v_l1, ..., v_lM) defines the lth principal component.
Finally, regress the genotypes on the markers, with a regression coefficient for the lth marker and the jth individual.
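The steps above can be sketched with NumPy as follows. This is an illustration in the spirit of Price et al., not the authors' implementation. Assumptions made here are: the scale factor is sqrt(p_i(1 - p_i)) with p_i = (1 + sum_j g_ij) / (2 + 2N), the normalization used by Price et al.; principal component scores are obtained by projecting the normalized matrix onto the leading marker eigenvectors; and the adjusted genotypes are the residuals from regressing each marker on those scores. The function name `pc_adjust` is hypothetical.

```python
import numpy as np


def pc_adjust(genotypes, k):
    """Return (scores, residuals) for an M x N genotype matrix.

    genotypes: allele counts (0, 1, 2), rows = markers, columns = individuals.
    k: number of leading principal components to remove.
    """
    G = np.asarray(genotypes, dtype=float)
    n_ind = G.shape[1]
    # Center each marker by its sample mean.
    mu = G.mean(axis=1, keepdims=True)
    # Smoothed allele frequency (assumed normalization, after Price et al.);
    # it lies strictly between 0 and 1, so the divisor is never zero.
    p = (1.0 + G.sum(axis=1, keepdims=True)) / (2.0 + 2.0 * n_ind)
    X = (G - mu) / np.sqrt(p * (1.0 - p))
    # Left singular vectors of X are eigenvectors of the marker covariance.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    # Score of individual j on component l: sum_i v_li * x_ij.
    scores = U[:, :k].T @ X                      # shape (k, N)
    # Regress every marker on the k component scores (rows are mean zero,
    # so no intercept is needed) and keep the residuals.
    coef, *_ = np.linalg.lstsq(scores.T, X.T, rcond=None)   # shape (k, M)
    residuals = X - coef.T @ scores
    return scores, residuals


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = rng.integers(0, 3, size=(50, 30))      # toy data, not GAW15 data
    pcs, adjusted = pc_adjust(toy, k=2)
    print(pcs.shape, adjusted.shape)
```

The residual matrix is uncorrelated with the removed components by construction, so an association test applied to it is no longer driven by the ancestry pattern those components capture.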
Because population stratification was not simulated in GAW15, we did not adjust the genotype data using the principal component procedure. We applied the test directly to the 9187 SNPs and identified four SNPs whose p-values are far below the Bonferroni-corrected threshold 0.05/9187 = 5.44 × 10^-6. We used the software Haploview to examine the linkage disequilibrium (LD) pattern among them. The D' values among SNP6-152, SNP6-153, and SNP6-154 are above 0.93, suggesting strong LD, and the D' between SNP6-155 and each of the others is less than 0.38. Next, we applied a case-control chi-square test to the 200 unrelated cases and 200 controls, and a family-based test (the transmission-disequilibrium test, or TDT) to the family data. As a comparison, we also applied our test to the family data only (z_fam). All the test results were consistent and are summarized in Table 1. The squared values of the new test statistic z are strictly larger than the sum of the corresponding chi-square and TDT statistics. For the family data, the value of our statistic z_fam is also larger than the value of the TDT statistic. These results suggest that the proposed combined test has improved power. Also, as expected, the values of z are much larger than those of z_fam, which is restricted to families, because z uses the additional information from the unrelated case-control sample.
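For reference, the two comparison tests can be written in their textbook forms as below. The text does not state which variant of the case-control chi-square test was used, so the allelic 2 × 2 version (without continuity correction) is an assumption of this sketch. The function names and the toy counts are hypothetical and are not taken from Table 1.

```python
def tdt_statistic(transmitted, not_transmitted):
    """TDT statistic from heterozygous parents.

    transmitted / not_transmitted: number of times heterozygous parents
    did / did not transmit the tested allele to an affected child.
    Approximately chi-square with 1 degree of freedom under the null.
    """
    b, c = transmitted, not_transmitted
    return (b - c) ** 2 / (b + c)


def allelic_chi_square(case_allele, case_total, control_allele, control_total):
    """Pearson chi-square on the 2 x 2 table of allele counts.

    case_total and control_total are allele totals, i.e. twice the number
    of individuals in each group.
    """
    a = case_allele
    b = case_total - case_allele
    c = control_allele
    d = control_total - control_allele
    total = a + b + c + d
    numerator = total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


if __name__ == "__main__":
    # Toy counts for illustration only.
    print(tdt_statistic(60, 40))
    print(allelic_chi_square(120, 200, 100, 200))
```

Both statistics are on the chi-square scale with one degree of freedom, which is why they are compared with the square of the normal statistic z rather than with z itself.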
The type I error rates of the proposed test are reasonable and comparable to those of the other two tests, as listed in Table 2. At the significance level α = 0.05, we observed 483 SNPs with p-values less than 0.05, giving a slightly elevated type I error rate of 0.0525, which might be caused by correlation with disease loci. We therefore excluded all 674 SNPs on chromosome 6 and then observed 433 SNPs with p-values less than 0.05, a corresponding type I error rate of 0.0508 (Table 2). Next, we applied our test to the dense map of chromosome 6 and obtained 56 significant SNPs whose p-values are less than the Bonferroni-corrected threshold 0.05/(17820 + 9187) = 1.85 × 10^-6. In particular, markers 3439, 3442, 3437, 3436, 3440, 3430, and 3426 have the largest test values. Together with the LD patterns from Haploview, we conclude that the most likely interval for a major gene is between 49.4262 cM and 49.5184 cM on chromosome 6.
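The thresholds and empirical error rates quoted above follow from simple ratios, sketched here with the counts given in the text. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def empirical_rate(n_significant, n_tests):
    """Fraction of tests with a p-value below the nominal level."""
    return n_significant / n_tests


if __name__ == "__main__":
    # Genome-wide SNP scan: 9187 tests.
    print(f"{bonferroni_threshold(0.05, 9187):.3g}")
    # Dense chromosome 6 map plus the genome-wide SNPs.
    print(f"{bonferroni_threshold(0.05, 17820 + 9187):.3g}")
    # All 9187 SNPs, then with the 674 chromosome 6 SNPs removed.
    print(f"{empirical_rate(483, 9187):.4f}")
    print(f"{empirical_rate(433, 9187 - 674):.4f}")
```

Dividing by the combined number of dense and genome-wide SNPs is the more conservative choice, since it corrects for every test carried out across both scans.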
Under the assumption of Hardy-Weinberg equilibrium, the proposed approach has improved power because it combines families of different structures with unrelated subjects, and it also gives a potential way to resolve the issue of population stratification. Compared with the traditional TDT, the proposed test can combine all the available families and may have better power because the TDT excludes a certain proportion of families. Under the assumption of no population stratification and low disease prevalence in parents, a simpler test described by Risch and Teng regards all parents from families as unaffected, with the remainder of the test being the same as ours. However, when we carried out this test on the RA data, it led to an inflated type I error rate: at the significance level α = 0.05, the type I error rate reached 0.055. On the other hand, our proposed test might lose power when the random mating assumption does not hold.
Recently, Epstein et al. described a likelihood-based approach for combining triads and unrelated subjects, but it requires further work to combine families of different structures. Li et al. also published a likelihood-based approach using a hidden Markov model of affected sibling pairs. However, these approaches cannot deal with the issue of population stratification. We proposed a principal-component-based approach to resolve this, and we will test the performance of the population stratification adjustment procedure elsewhere.
Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 1516-1517. 10.1126/science.273.5281.1516.
Risch N, Teng J: The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA Pooling. Genome Res. 1998, 8: 1273-1288.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.
Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.
Epstein MP, Veal CD, Trembath RC, Barker JN, Li C, Satten GA: Genetic association analysis using data from triads and unrelated subjects. Am J Hum Genet. 2005, 76: 592-608. 10.1086/429225.
Li M, Boehnke M, Abecasis G: Efficient study for test of genetic association analysis using sibship data and unrelated cases and controls. Am J Hum Genet. 2006, 78: 778-792. 10.1086/503711.
The authors are very grateful to the reviewers for their numerous suggestions for improving the format and content of this paper. This work was supported by a grant from National Human Genome Research Institute (R01 HG003054) to XZ.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
The author(s) declare that they have no competing interests.
Zhang, J., Zhu, X. & Cooper, R.S. An integrated genome-wide association analysis on rheumatoid arthritis data. BMC Proc 1 (Suppl 1), S35 (2007). https://doi.org/10.1186/1753-6561-1-S1-S35