Joint modeling of linkage and association using affected sib-pair data.

There has been a growing interest in developing strategies for identifying single-nucleotide polymorphisms (SNPs) that explain a linkage signal by joint modeling of linkage and association. We compare several existing methods and propose a new method called the homozygote sharing transmission-disequilibrium test (HSTDT) to detect linkage and association or to identify SNPs explaining the linkage signal on chromosome 6 for rheumatoid arthritis using 100 replicates of the Genetic Analysis Workshop (GAW) 15 simulated affected sib-pair data. Existing methods considered included the family-based tests of association implemented in FBAT, a transmission-disequilibrium test, a conditional logistic regression approach, a likelihood-based approach implemented in LAMP, and the homozygote sharing test (HST). We compared the type I error rates and power for tests classified into three categories according to their null hypotheses: 1) no association in the presence of linkage (i.e., a SNP explains none of the linkage evidence), 2) no linkage adjusting for the association (i.e., a SNP explains all linkage evidence), and 3) no linkage and no association. For testing association in the presence of linkage, we found similar power among all tests except for the homozygote sharing test that had lower power. When testing linkage adjusting for association, similar power was observed between LAMP and HST, but lower power for the conditional logistic regression method. When testing linkage or association, the conditional logistic regression method was more powerful than FBAT.


Background
The availability of high-throughput single-nucleotide polymorphism (SNP) genotyping technologies at more affordable costs has generated increasing enthusiasm for genome-wide association studies (GWAS) of a wide range of disorders [1]. How best to analyze dense SNP data is of great interest to the scientific community. Family-based association tests model both linkage and association; they can therefore localize the disease locus better than linkage analyses alone and avoid spurious association results due to population admixture. Recently, there has been a growing interest in developing methods for identifying SNPs that account for all the observed linkage evidence [2], a goal that can be achieved by joint modeling of linkage and association.
There are three major types of family-based association tests, categorized by their null hypotheses: 1) H 0 : no association in the presence of linkage, i.e., the tested SNP is in linkage equilibrium (LE) with all disease loci (denoted T 1 ); 2) H 0 : no linkage adjusting for the association, i.e., the tested SNP is in complete linkage disequilibrium (LD) (r 2 = 1) with all disease loci (denoted T 2 ); 3) H 0 : no linkage and no association (denoted T 3 ). It is vital to understand the differences among these hypotheses and the relative efficiencies of valid tests for each hypothesis. In what follows, we classify the tests we considered into these three categories and compare their power and type I error rates.

Methods
We used all 100 replicates of the Genetic Analysis Workshop 15 (GAW15) simulated nuclear families, each containing both parents and two affected children (Problem 3). Rheumatoid arthritis was the phenotype in all of our analyses. An initial nonparametric multipoint genome-wide linkage scan [3] using SNP markers was performed with Merlin [4]. A linkage peak with a mean logarithm of odds (LOD) score of 79 (across the 100 replicates) at about 50 cM on chromosome 6 was identified. Because this linkage region is broad, we selected 2102 dense SNPs between 40 and 60 cM under the linkage peak for assessing the power of all methods. In addition, we selected 947 dense SNPs between 130 and 140 cM on chromosome 6 for evaluating type I error rates under the null hypothesis of T 1 . Although the LOD scores in this region range from 3 to 6, these 947 SNPs are far from the disease loci (the DR, C, and D loci) and should be in LE with them. For T 3 , type I error rates were evaluated using all 16 chromosomes that do not harbor any disease loci.
Available data from controls were used to determine LD, as measured by r 2 , between each of the 2102 dense SNPs and the disease loci. For the triallelic DR locus, a generalized R 2 was calculated for each SNP and tested using the LOGISTIC procedure in SAS version 8 [5]. For the C and D loci, r 2 was estimated using Haploview [6] and tested using the method proposed by Sabatti and Risch [7]. For each SNP, the mean r 2 over the 100 replicates was used to estimate LD with the disease loci more accurately.
We briefly describe each method and categorize it by its null hypothesis. The analyses were performed with knowledge of the "answers". Rabinowitz and Laird [8] proposed the family-based association test (FBAT), which is applicable to multiple siblings, quantitative traits, and incomplete parental genotypes. FBAT is a valid test of T 3 . Lake et al. [9] extended FBAT to a valid test of T 1 by incorporating an empirical variance estimate (FBAT-e).
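As an illustration of the LD measure used to group SNPs, the following minimal Python sketch computes r 2 between two diallelic loci from haplotype and allele frequencies and averages it over replicates. It assumes the haplotype frequency is already known (Haploview estimates such frequencies from unphased genotypes); the function names `r_squared` and `mean_r_squared` are hypothetical and are not part of any software used in the study.

```python
from typing import Sequence, Tuple


def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two diallelic loci.

    p_ab: frequency of the haplotype carrying allele A (SNP) and allele B (disease locus)
    p_a:  frequency of allele A at the SNP
    p_b:  frequency of allele B at the disease locus
    """
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom <= 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


def mean_r_squared(freqs: Sequence[Tuple[float, float, float]]) -> float:
    """Average r^2 over replicates; each entry is (p_ab, p_a, p_b)."""
    if not freqs:
        raise ValueError("need at least one replicate")
    return sum(r_squared(*f) for f in freqs) / len(freqs)
```

Under this definition, r 2 = 0 corresponds to the LE condition of T 1 and r 2 = 1 to the complete-LD condition of T 2 .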

Conditional logistic regression method
Millstein et al. [10] recently proposed a pseudo-control approach for joint modeling of linkage and association. Let g 1 , g 2 , g m , g f denote the genotypes at a studied locus for the two affected offspring, the mother, and the father, respectively, and let D 1 and D 2 be the disease states of the two sibs. Conditional on the parental genotypes and the disease states, the likelihood for the children, P(g 1 , g 2 | g m , g f , D 1 , D 2 ), is modelled by conditional logistic regression with a genotype effect β and an IBD-sharing effect γ. In this model, g* represents the four possible offspring genotypes; e 12 = E[ibd 12 | g 1 , g 2 , g m , g f ] is the expected identical-by-descent (IBD) sharing between g 1 and g 2 given the observed marker genotypes, and e 1* is the expected IBD sharing between g 1 and g*. A test of β = 0 is a test of T 1 (denoted Millstein-b), a test of γ = 0 is a test of T 2 (denoted Millstein-c), and the two-degree-of-freedom likelihood ratio test (LRT) of β = 0 and γ = 0 is a test of T 3 (denoted Millstein-a).
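To make the roles of β and γ concrete, one conditional-logistic form that is consistent with the definitions above can be sketched as follows. This is an illustrative assumption, not the equation of Millstein et al. [10]: the genotype coding Z(·) is hypothetical, and the conditioning on the first sib's genotype is one possible choice.

```latex
P\left(g_2 \mid g_1, g_m, g_f, D_1 = D_2 = 1\right)
  = \frac{\exp\left\{\beta\, Z(g_2) + \gamma\, e_{12}\right\}}
         {\sum_{g^*} \exp\left\{\beta\, Z(g^*) + \gamma\, e_{1*}\right\}}
```

In this sketch the sum runs over the four possible offspring genotypes g* given the parental genotypes. Setting β = 0 removes the genotype (association) term, and setting γ = 0 removes the IBD-sharing (linkage) term.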

Likelihood-based approach - LAMP
Li et al. [2] proposed a method to identify SNPs in LD with the disease locus through estimation of the degree of LD between the tested SNP and the putative disease locus. The method is implemented in the software called LAMP. They use a likelihood function that 1) assumes a single diallelic disease locus, 2) assumes no recombination between the tested SNP and the disease locus, 3) uses disease-SNP haplotype frequencies and disease penetrances as parameters, and 4) can incorporate information from flanking markers in LE with the tested SNP. Two LRTs are proposed. The first (denoted as LAMP-LE) assesses whether the tested SNP is in LE with the disease locus, while the second (denoted as LAMP-LD) assesses whether the tested SNP is in complete LD with the disease locus. Therefore, LAMP-LE is a test of T 1 and LAMP-LD is a test of T 2 . The statistical significance of these two tests is assessed empirically by comparing the observed statistic with simulated null distributions. We exclude flanking short tandem repeat (STR) markers in LD [7] with tested SNPs and use the remaining STRs in our application of LAMP.
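Assessing significance against a simulated null distribution can be sketched in a few lines of Python. This is a generic illustration, not LAMP's implementation: the function name `empirical_p_value` is hypothetical, and the add-one correction (which keeps the estimate above zero) is an assumption, since the text does not state which estimator is used.

```python
from typing import Sequence


def empirical_p_value(observed: float, null_stats: Sequence[float]) -> float:
    """Empirical p-value of an observed test statistic.

    null_stats holds statistics computed on data simulated under the null
    hypothesis. Larger statistics are taken as stronger evidence against it.
    """
    n = len(null_stats)
    if n == 0:
        raise ValueError("need at least one simulated null statistic")
    exceed = sum(s >= observed for s in null_stats)
    # Add-one correction (an assumption of this sketch).
    return (exceed + 1) / (n + 1)
```

The smallest attainable p-value under this estimator is 1/(n + 1), so assessing a 0.1% significance level requires at least 999 simulated null statistics.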

HSTDT - Combination of HST-LE and transmission-disequilibrium test (TDT)
The original TDT, proposed by Spielman et al. [13], tests linkage and association between a marker and a disease locus using ascertained affected individuals and their parents' marker information. The TDT can also be used to test association in a linked region (Spielman and Ewens [14]) using data that consist of nuclear families with a single affected child. The major distinction between the homozygote sharing tests (HST and HSTDT) and other tests of T 1 or T 2 is that the former can be used to test whether a SNP explains the linkage peak (by using IBD information at the linkage peak). When there is no recombination between the tested SNP and the presumed disease locus at the linkage peak, testing association in the presence of linkage is equivalent to testing whether a SNP partially explains the linkage peak, while testing no linkage adjusting for association is equivalent to testing whether the tested SNP fully explains the linkage peak. However, when this assumption is violated, tests of T 1 and T 2 other than HST and HSTDT may not support the claim that the tested SNP explains the peak linkage evidence, but only that it explains the linkage evidence at the location of the tested SNP. When a linkage signal is identified in a linked region, the LOD score at the linkage peak should be of greatest interest and is the usual quantity reported. In this report, HST and HSTDT are applied to identify SNPs explaining the peak linkage evidence.
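For reference, the classical TDT statistic compares how often heterozygous parents transmit each of the two alleles to affected offspring. The Python sketch below computes the standard McNemar-type statistic and its 1-degree-of-freedom chi-square p-value. The function name `tdt` is hypothetical, and the sketch treats transmissions as independent, which is an assumption that holds for families with a single affected child rather than for sib pairs in a linked region.

```python
import math
from typing import Tuple


def tdt(b: int, c: int) -> Tuple[float, float]:
    """Classical TDT from transmission counts of heterozygous parents.

    b: number of transmissions of allele 1 to affected offspring
    c: number of transmissions of allele 2 to affected offspring
    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    if b < 0 or c < 0 or b + c == 0:
        raise ValueError("need non-negative counts and at least one informative transmission")
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For example, 60 transmissions of allele 1 against 40 of allele 2 give a statistic of (60 - 40)^2 / 100 = 4.0.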

Results
SNPs were classified into five groups according to their LD with the disease loci. In Table 1, the first group (labeled r 2 = 0) included 947 dense SNPs between 130 and 140 cM on chromosome 6 that were used to assess type I error rates for T 1 . In Table 2, the first group (labeled r 2 = 0, θ = 0.5) included 6597 SNPs from all 16 chromosomes that did not harbor any disease loci and were used to assess type I error rates for T 3 . Type I error rates were assessed at significance levels of 5%, 1%, and 0.1%.
Table 1 presents results for tests of T 1 . All methods have appropriate type I error rates (r 2 = 0), except that HSTDT has slightly inflated type I error rates. TDT is the most powerful among these six methods, followed closely by HSTDT, FBAT-e, Millstein-b, and then LAMP-LE. HST-LE appears to be less powerful.
Table 3 presents results for tests of T 2 . These tests are used to examine whether a SNP explains all the observed linkage evidence. LAMP-LD and HST-LD test whether a SNP is in complete LD with the disease loci. Equivalently, Millstein-c tests whether there is residual linkage when conditioning on the genotype covariate of the tested SNP and is essentially a binary-trait version of the test proposed by Almasy and Blangero [16]. Because none of the SNPs is in complete LD with all disease loci, type I error rates cannot be evaluated. HST-LD and LAMP-LD have similar power and are more powerful than Millstein-c.
Table 2 presents results for tests of T 3 . The type I error rates (r 2 = 0, θ = 0.5) of FBAT and Millstein-a are appropriate.
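The rates reported at each significance level are rejection proportions, which can be sketched in Python as below. The same calculation estimates the type I error rate when applied to SNPs for which the null hypothesis holds and the power when applied to SNPs in LD with the disease loci. The function names are hypothetical, and pooling p-values over SNPs and replicates is an assumption about how the tabulated rates were formed.

```python
from typing import Dict, Sequence


def rejection_rate(p_values: Sequence[float], alpha: float) -> float:
    """Proportion of tests rejected at significance level alpha."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p <= alpha for p in p_values) / len(p_values)


def rates_by_level(
    p_values: Sequence[float],
    alphas: Sequence[float] = (0.05, 0.01, 0.001),
) -> Dict[float, float]:
    """Rejection rates at the 5%, 1%, and 0.1% levels used in the text."""
    return {a: rejection_rate(p_values, a) for a in alphas}
```

A test with an appropriate type I error rate should give rejection rates close to the nominal levels on the null SNPs.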

Testing linkage or association (T 3 )
FBAT is less powerful than Millstein-a for SNPs in low LD with the disease loci. Note that Millstein-a directly uses IBD between sib pairs in the model, while FBAT uses allelic transmission from parents to children, possibly explaining the difference in their ability to pick up the linkage signal.

Conclusion
With a dense SNP map, it is natural to ask whether all disease loci under a linkage peak have been identified, or whether their contributions to the phenotypic variation are fully explained by a small subset of SNPs in association with these disease loci. Recently, there has been much interest in testing whether a SNP can partially or fully account for the observed linkage evidence.
We examined several methods for joint linkage and association analysis and for identifying SNPs that explain the linkage evidence. For testing association in the presence of linkage and for testing linkage or association, all methods have appropriate type I error rates, apart from a slight inflation for HSTDT. For testing association in the presence of linkage, TDT is most powerful (but only slightly more powerful than HSTDT, FBAT-e, Millstein-b, and LAMP). For testing whether a SNP explains all the linkage evidence, HST and LAMP have similar power and are more powerful than Millstein-c; the difference in power may be explained by a difference in type I error, which we were unable to assess in the GAW15 data set because there was no single disease locus explaining all of the linkage evidence. For testing linkage or association, we found that Millstein-a was more powerful than FBAT. These conclusions may not extend to study designs other than nuclear families with two affected children and parental genotypes available, a requirement for HST, HSTDT, and the method of Millstein et al. [10]. Furthermore, the excessively high LOD score observed in this study may explain the slightly inflated type I error rate observed for HSTDT.
In this study, there are three disease loci in the linked region and they do not contribute equally to the linkage signal, with the DR locus having a major effect on affection status. A different scenario may lead to different results for methods that use linkage peak information to identify SNPs explaining the linkage evidence. Moreover, for a complex disease, there may be multiple disease loci acting interactively, so methods that do not assume a single causal variant would be most helpful in identifying SNPs associated with disease loci. There is a great need for methods suitable for multiple disease loci.