- Open Access
Novel approach for genome scan meta-analysis of rheumatoid arthritis: a kernel-based estimation procedure
© Briollais et al; licensee BioMed Central Ltd. 2007
- Published: 18 December 2007
Genome scan meta-analysis (GSMA) can prove very useful in detecting genetic effects too small to be detected in an individual linkage study and can also lead to more consistent results. In this paper, we propose a new kernel-based estimation procedure for GSMA. Instead of estimating identity by descent between markers, as performed in interval mapping approaches, we estimated directly the nonparametric linkage score between markers using a kernel procedure. The GSMA is then extended to take into account the kernel estimate of the nonparametric linkage score and its variance at a given chromosomal position. The method is applied to the rheumatoid arthritis genome scan data (Genetic Analysis Workshop 15 Problem 2).
- Interval Mapping
- Kernel Estimator
- Kernel Regression
- North American Rheumatoid Arthritis Consortium
- Sharing Probability
Rheumatoid arthritis (RA) is a chronic inflammatory disease that primarily affects the synovial tissues of multiple joints in the body. The etiology of the disease remains unknown, but it appears to have a complex genetic component. Several genome scans for RA have been performed to identify susceptibility loci, but most of the results have not been replicated [1]. These inconsistencies could arise from the small sample sizes, low statistical power, and clinical or genetic heterogeneity of these studies. A genome scan meta-analysis (GSMA), which combines the results from several linkage studies, can have greater statistical power to detect small genetic effects and can lead to more consistent results. A general difficulty in GSMA is the heterogeneity across studies due to different marker maps, marker informativeness, sample sizes, sampling plans, and linkage tests. Loesgen et al. [2] proposed a meta-analytic test that computes a weighted average of score statistics. Recently, Etzel et al. [3] used this method in a genome-wide meta-analysis of RA. Because the marker maps differed across studies, they aligned the maps by performing interval mapping and then combining the nonparametric linkage (NPL) scores obtained from GeneHunter2 for markers in a pre-specified interval. Their method requires the estimation of identity-by-descent (IBD) sharing probabilities throughout the interval between two markers [4, 5], which can be somewhat inaccurate and imprecise. The variability of the IBD estimate is difficult to measure and is often not reflected in the GSMA. In this paper, we propose an alternative approach that estimates the NPL score between markers directly using a kernel-based estimation procedure. The GSMA is then extended to take into account the kernel estimate of the NPL score and its variance at a given chromosomal position.
Table: Summary of studies included in the meta-analysis. Columns: No. of families; No. of microsatellite markers.
In the GSMA test statistic $Z_{MA}$, $Z_{ij}$ is the NPL score from the $i$th study at the $j$th position, $k$ is the number of studies, and $w_{ij}$ is the weight given to the $i$th study at the $j$th position. To perform the GSMA, we used three strategies that differ in how the NPL score and the information content (IC) are estimated between markers and in how the weight is defined.
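The displayed equation for the statistic does not appear in this text. The sketch below therefore assumes the weighted form $Z_{MA,j} = \sum_{i=1}^{k} w_{ij} Z_{ij} / \sqrt{\sum_{i=1}^{k} w_{ij}^2}$, which is a common way of pooling approximately standard normal scores and is consistent with the definitions given here. The function name and the numbers are hypothetical.

```python
import math


def gsma_statistic(z_scores, weights):
    """Pooled statistic at one position (assumed form, see lead-in).

    z_scores: NPL scores Z_ij from the k studies at position j.
    weights:  weights w_ij for the same studies at position j.
    """
    if len(z_scores) != len(weights):
        raise ValueError("need one weight per study")
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("all weights are zero")
    # Dividing by sqrt(sum w^2) keeps unit variance when the Z_ij are
    # independent with unit variance.
    return sum(w * z for w, z in zip(weights, z_scores)) / norm


# Hypothetical scores and weights for three studies at a single position.
z_ma = gsma_statistic([1.2, 2.5, 0.8], [120.0, 300.0, 90.0])
print(round(z_ma, 3))
```

Under this form, multiplying all weights by the same constant leaves the statistic unchanged, so only the relative weights of the studies matter.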
Following Etzel et al. [3], the first method (Method 1) aligns the marker maps. After 2-cM interval mapping was completed with MERLIN, the NPL scores that were within 1 cM of each other were combined and the statistic $Z_{MA}$ was computed in each interval. The weight $w_{ij}$ is the product of the number of sib-pair equivalents (SPE) from the $i$th study and the IC estimated by MERLIN for the $i$th study at the $j$th interval:
$w_{ij} = \mathrm{SPE}_i \times \mathrm{IC}_{ij}$.
The second method (Method 2) is not based on marker alignment. Instead of using an interval mapping estimate of the NPL score and IC between markers, we used a kernel regression method. The statistic $Z_{MA}$ is then computed at all marker positions available after merging the three data sets (i.e., if a marker is present in one study but missing in the other two, its NPL score and IC in those two studies are estimated by kernel regression). The weight is identical to that of Method 1 except that the IC is replaced by its kernel estimate (ICK):
$w_{ij} = \mathrm{SPE}_i \times \mathrm{ICK}_{ij}$.
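The merging step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the toy maps, and the single shared bandwidth `h` are hypothetical (in the article the bandwidth depends on the study), and a marker is treated as shared only when its position matches exactly.

```python
import math


def gaussian_smooth(t, positions, values, h):
    """Gaussian-kernel weighted average of `values` at location t."""
    k = [math.exp(-0.5 * ((t - p) / h) ** 2) for p in positions]
    return sum(ki * v for ki, v in zip(k, values)) / sum(k)


def fill_on_merged_map(studies, h):
    """studies maps a study name to {marker position (cM): score}.

    Returns the merged, sorted list of positions and, for each study,
    a score at every merged position: the observed score where the
    study has that marker, a kernel estimate otherwise.
    """
    grid = sorted(set().union(*(obs.keys() for obs in studies.values())))
    filled = {}
    for name, obs in studies.items():
        pos = sorted(obs)
        vals = [obs[p] for p in pos]
        filled[name] = [
            obs[t] if t in obs else gaussian_smooth(t, pos, vals, h)
            for t in grid
        ]
    return grid, filled


# Hypothetical toy maps for two studies on one chromosome.
studies = {
    "study_A": {0.0: 0.4, 10.0: 1.8, 20.0: 0.9},
    "study_B": {0.0: 0.2, 12.0: 1.1, 20.0: 0.7},
}
grid, filled = fill_on_merged_map(studies, h=5.0)
print(grid)
print(filled["study_A"])
```

The same routine could be applied to the information content to obtain ICK at the merged positions.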
The third method (Method 3) is identical to Method 2, but the weight now also takes into account the precision of the kernel estimator, more precisely the inverse of the standard deviation of the NPL kernel estimator (SDnpl):
$w_{ij} = \mathrm{SPE}_i \times \mathrm{ICK}_{ij} \times 1/\mathrm{SDnpl}_{ij}$.
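The three weight definitions can be placed side by side in a short sketch. The function names and all numerical values are hypothetical and serve only to show how the precision term in Method 3 changes the relative weight of two otherwise similar studies.

```python
def weight_method1(spe, ic):
    """Method 1: sib-pair equivalents times interval-mapping IC."""
    return spe * ic


def weight_method2(spe, ick):
    """Method 2: sib-pair equivalents times the kernel estimate of IC."""
    return spe * ick


def weight_method3(spe, ick, sd_npl):
    """Method 3: Method 2 weight divided by the SD of the NPL kernel estimate."""
    if sd_npl <= 0.0:
        raise ValueError("SDnpl must be positive")
    return spe * ick / sd_npl


# Two hypothetical studies with the same SPE and ICK at one position,
# but a less precise NPL kernel estimate in the second study.
w_precise = weight_method3(spe=100.0, ick=0.8, sd_npl=0.5)
w_noisy = weight_method3(spe=100.0, ick=0.8, sd_npl=2.0)
print(w_precise > w_noisy)
```

Methods 1 and 2 would give these two studies equal weight; Method 3 gives the study with the smaller SDnpl four times the weight of the other in this toy case.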
In Methods 2 and 3, the relationship between the NPL score (or the information content) (Y) and the marker location (T) is modelled using a nonparametric model given by:
$Y_i = m(T_i) + \varepsilon_i$, for $i = 1, \ldots, n$,
In the kernel estimator of $m$, $n$ is the sample size, $h$ is the bandwidth (the smoothing parameter) to be determined, and $K$ is a continuous, fixed kernel function with finite variance, generally satisfying $K > 0$, $K(-t) = K(t)$, and $\int K(t)\,dt = 1$. Here we considered the Gaussian kernel. This raises the question of how to determine the bandwidth. The estimator is wiggly when $h$ is small and very flat when $h$ is large. Different procedures have been proposed to determine $h$, for example cross-validation.
The problem in our application is that we need to estimate not only the NPL score function at different marker locations but also its variance. Both the kernel estimator and its variance depend on the same smoothing parameter, and a choice that is optimal for the kernel estimator might not be optimal for the variance. To our knowledge, there is no optimal procedure for this problem. For that reason, we did not apply the classical cross-validation procedure and instead chose the bandwidth empirically. The bandwidth was chosen inversely proportional to the number of markers of each individual study on each chromosome. Therefore, $h$ was not constant in our study but depended on the study and the chromosome. More exactly, the bandwidth was set from $M_i$, the number of markers of each individual linkage study on one particular chromosome, with a constant fixed to 4.0. Intuitively, a study with fewer markers yields more variable results. Applying a larger $h$ leads to a smoother function and thus to decreased variability. This choice of $h$ provided a good estimate of both the NPL score function and its variance.
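The estimator formula itself is not reproduced in this text. The sketch below assumes the Nadaraya-Watson form, $\hat{m}_h(t) = \sum_i K((t - T_i)/h)\, Y_i / \sum_i K((t - T_i)/h)$, suggested by the works of Nadaraya and Watson in the reference list, with a Gaussian kernel. The bandwidth is passed in explicitly because its exact expression and units are not given here; the marker positions and scores are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: positive, symmetric, integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_regression(t, positions, values, h):
    """Nadaraya-Watson estimate of m(t) (assumed form, see lead-in)."""
    k = [gaussian_kernel((t - ti) / h) for ti in positions]
    total = sum(k)
    if total == 0.0:
        raise ValueError("no kernel mass at t; bandwidth too small")
    return sum(ki * yi for ki, yi in zip(k, values)) / total


# Hypothetical marker positions (cM) and NPL scores for one study.
positions = [0.0, 10.0, 20.0, 30.0]
npl = [0.5, 2.0, 1.0, 0.2]

# A small bandwidth follows the data closely; a large one flattens the
# curve toward the overall mean of the scores.
for h in (0.5, 5.0, 1e6):
    print(h, round(kernel_regression(10.0, positions, npl, h), 3))
```

With a very small bandwidth the estimate at a marker is essentially the observed score at that marker, and with a very large bandwidth it approaches the average score, which mirrors the wiggly versus flat behaviour described in the text.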
and $\hat{\sigma}^2(t)$ is the estimator of $\sigma^2(t)$.
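In Method 3 above, we took $\mathrm{SDnpl}_{ij}$ to be the square root of the estimated variance of the NPL kernel estimator for the $i$th study at the $j$th position. All our computations were performed with the computer program R for Linux.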
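Because the variance formula is not reproduced in this text, the following sketch relies on an assumption: the standard asymptotic approximation for a kernel regression estimator, $\widehat{\mathrm{Var}}\{\hat{m}_h(t)\} \approx \hat{\sigma}^2(t) \int K^2(u)\,du \,/\, \{n h \hat{f}_h(t)\}$, where $\hat{f}_h(t)$ is the kernel density estimate of the marker locations and $\hat{\sigma}^2(t)$ is a kernel-weighted residual variance. The article's own expression may differ in detail. Names and data are hypothetical.

```python
import math

# Integral of K(u)^2 for the Gaussian kernel.
GAUSSIAN_ROUGHNESS = 1.0 / (2.0 * math.sqrt(math.pi))


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_fit_with_sd(t, positions, values, h):
    """Return (m_hat, sd_hat) at t under the assumed variance formula."""
    k = [gaussian_kernel((t - ti) / h) for ti in positions]
    k_sum = sum(k)
    if k_sum == 0.0:
        raise ValueError("no kernel mass at t; bandwidth too small")
    m_hat = sum(ki * yi for ki, yi in zip(k, values)) / k_sum
    # Kernel-weighted residual variance around the fitted value.
    sigma2_hat = sum(ki * (yi - m_hat) ** 2 for ki, yi in zip(k, values)) / k_sum
    n = len(positions)
    f_hat = k_sum / (n * h)  # kernel density estimate of marker locations
    var_hat = sigma2_hat * GAUSSIAN_ROUGHNESS / (n * h * f_hat)
    return m_hat, math.sqrt(var_hat)


# Hypothetical peak at 10 cM seen with a sparse and with a denser map.
sparse = ([0.0, 10.0, 20.0], [1.0, 2.0, 1.0])
dense = ([0.0, 5.0, 10.0, 15.0, 20.0], [1.0, 1.5, 2.0, 1.5, 1.0])
_, sd_sparse = kernel_fit_with_sd(10.0, sparse[0], sparse[1], h=5.0)
_, sd_dense = kernel_fit_with_sd(10.0, dense[0], dense[1], h=5.0)
print(sd_dense < sd_sparse)
```

In this toy case the denser map gives the smaller standard deviation at the peak, so it would receive the larger Method 3 weight, all else being equal.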
However, the variance of the NPL score at this location, as estimated by the kernel regression procedure, was higher for ECRAF than for the two other studies (Fig. 2D). Method 3, unlike Methods 1 and 2, weights each study by the inverse of the corresponding standard deviation and therefore led to a lower $Z_{MA}$ test statistic. Moreover, the linkage peak in ECRAF is relatively narrow, which could be associated with a larger variance of the kernel estimator at this location (Fig. 2B). This is because the variance of the kernel estimator is inversely proportional to the density estimate of the NPL score at one particular location (see variance formula above). In general, denser marker regions and wider peak regions could both contribute to a lower variance of the kernel estimator and hence to a larger GSMA statistic.
The use of kernel-based regression methods allows us to estimate the NPL score function at various locations along the genome and thus makes possible the meta-analysis of several linkage studies with different genetic maps. To our knowledge, this is the first kernel-based approach for GSMA studies. Previous GSMAs have performed map alignment, which requires estimating the IBD sharing probabilities between markers using interval mapping. However, the variability of this estimate is not reflected in the GSMA statistic. An important advantage of our approach is that it is completely nonparametric and provides a measure of the variability of the NPL score estimate along the genome. Incorporating this variability into the GSMA statistic (Method 3) might improve the consistency of linkage results by giving more weight to studies with a more precise estimate of the NPL score function. Higher precision could reflect, for example, a higher marker density: a larger weight will be given to a study that finds a linkage peak with many markers than to a study that finds the same peak with fewer markers. Therefore, the information about NPL score variability is very useful for weighting each individual study. Our procedure can also down-weight narrow peaks. Further simulation studies are needed to better understand its properties, in particular its ability to detect true linkage peaks.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
1. Choi SJ, Rho YH, Ji JD, Song GG, Lee YH: Genome scan meta-analysis of rheumatoid arthritis. Rheumatology. 2006, 45: 166-170. 10.1093/rheumatology/kei128.
2. Loesgen S, Dempfle A, Golla A, Bickeböller H: Weighting schemes in pooled linkage analysis. Genet Epidemiol. 2001, 21 (Suppl 1): S142-S147.
3. Etzel CJ, Chen WV, Shepard N, Jawaheer D, Cornelis F, Seldin MF, Gregersen PK, Amos CI for the North American Rheumatoid Arthritis Consortium: Genome-wide meta-analysis for rheumatoid arthritis. Hum Genet. 2006, 119: 634-641. 10.1007/s00439-006-0171-8.
4. Fulker DW, Cardon LR: A sib-pair approach to interval mapping of quantitative trait loci. Am J Hum Genet. 1994, 54: 1092-1103.
5. Olson JM: Multipoint linkage analysis using sib pairs: an interval mapping approach for dichotomous outcomes. Am J Hum Genet. 1995, 56: 788-798.
6. Abecasis GR, Cherny SS, Cookson WOC, Cardon LR: MERLIN-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
7. Nadaraya EA: On estimating regression. Theory Probability Appl. 1964, 10: 186-190. 10.1137/1110024.
8. Watson GS: Smooth regression analysis. Sankhyā Ser A. 1964, 26: 359-372.
9. Härdle W: Applied Nonparametric Regression. 1990, Cambridge: Cambridge University Press.
10. Wand MP, Jones MC: Kernel Smoothing. 1995, London: Chapman and Hall.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.