We used all 100 replicates of the Genetic Analysis Workshop 15 (GAW15) simulated nuclear families, each containing both parents and two affected children (Problem 3). Rheumatoid arthritis is the phenotype in all of our analyses. An initial nonparametric multipoint genome-wide linkage scan [3] using SNP markers was performed with Merlin [4]. A linkage peak with a mean logarithm of odds (LOD) score of 79 (across the 100 replicates) was identified at about 50 cM on chromosome 6. Because this linkage region is broad, we selected 2102 dense SNPs between 40 and 60 cM under the linkage peak to assess the power of all methods. In addition, we selected 947 dense SNPs between 130 and 140 cM on chromosome 6 to evaluate type I error rates for the null hypothesis T1. Although the LOD scores in this region range from 3 to 6, these 947 SNPs are far from the disease loci (the DR, C, and D loci) and should be in LE with them. For T3, type I error rates were evaluated using all 16 chromosomes that do not harbor any disease loci. Available data from controls were used to determine LD, as measured by r², between each of the 2102 dense SNPs and the disease loci. For the tri-allelic DR locus, a generalized R² was calculated for each SNP and tested using the LOGISTIC procedure in SAS version 8 [5]. For the C and D loci, r² was estimated using Haploview [6] and tested using the method proposed by Sabatti and Risch [7]. For each SNP, the mean r² over the 100 replicates was used to estimate LD with the disease loci more accurately. We briefly describe and categorize each method by its hypotheses. The analyses were performed with knowledge of the "answers".
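As a rough illustration of the LD measure used above, the following Python sketch computes the standard pairwise r² for two biallelic loci from haplotype and allele frequencies, then averages it over replicates. The function names and the input frequencies are hypothetical; this is not the Haploview or SAS computation itself, only the textbook definition under the assumption that haplotype frequencies are already known.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Pairwise LD r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and allele B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


def mean_r_squared(per_replicate_values):
    """Average r^2 across replicates (hypothetical helper)."""
    values = list(per_replicate_values)
    return sum(values) / len(values)
```

With `p_ab = p_a * p_b` the function returns 0 (linkage equilibrium), and with `p_ab = p_a = p_b = 0.5` it returns 1 (complete LD).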
Family-based association test (FBAT)
Rabinowitz and Laird [8] proposed the family-based association test (FBAT) that is applicable to multiple siblings, quantitative traits, and incomplete parental genotypes. FBAT is a valid test of T3. Lake et al. [9] extended FBAT to a valid test of T1 by incorporating an empirical variance estimate (FBAT-e).
Conditional logistic regression method
Millstein et al. [10] recently proposed a pseudo-control approach for joint modeling of linkage and association. Let g1, g2, gm, gf denote the genotypes at a studied locus for two affected offspring, mother, and father, respectively. D1 and D2 are the disease states for the two sibs. Conditional on parental genotypes and their disease states, the likelihood for the children is
$$P(g_1, g_2 \mid g_m, g_f, D_1, D_2) = P(g_1 \mid g_m, g_f, D_1) \times P(g_2 \mid g_1, g_m, g_f, D_1, D_2),$$
which can be modelled as
where g* represents the four possible offspring genotypes; e12 = E[ibd12|g1, g2, gm, gf] is the expected identical-by-descent (IBD) sharing between g1 and g2 given the observed marker genotypes, and e1* is the expected IBD sharing between g1 and g*. A test of β = 0 is a test of T1 (denoted Millstein-b), a test of γ = 0 is a test of T2 (denoted Millstein-c), and the two-degree-of-freedom likelihood ratio test (LRT) of β = 0 and γ = 0 is a test of T3 (denoted Millstein-a).
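The set of possible offspring genotypes g* that the conditional likelihood sums over can be enumerated directly from the parental genotypes. The sketch below is a minimal Python illustration under Mendelian transmission at a single locus; the function name and the allele coding (1 and 2) are assumptions made for the example and are not taken from Millstein et al.

```python
from collections import Counter
from itertools import product


def offspring_genotype_distribution(mother, father):
    """Mendelian distribution of unordered offspring genotypes.

    mother, father: tuples of two alleles, e.g. (1, 2).
    Each of the four (maternal allele, paternal allele) combinations
    has probability 1/4; identical unordered genotypes are pooled.
    """
    counts = Counter(
        tuple(sorted((m, f))) for m, f in product(mother, father)
    )
    return {genotype: n / 4 for genotype, n in counts.items()}
```

For two heterozygous parents, `offspring_genotype_distribution((1, 2), (1, 2))` gives probabilities 1/4, 1/2, and 1/4 for genotypes (1, 1), (1, 2), and (2, 2).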
Likelihood-based approach – LAMP
Li et al. [2] proposed a method to identify SNPs in LD with the disease locus through estimation of the degree of LD between the tested SNP and the putative disease locus. The method is implemented in the software called LAMP. They use a likelihood function that 1) assumes a single di-allelic disease locus, 2) assumes no recombination between the tested SNP and the disease locus, 3) uses disease-SNP haplotype frequencies and disease penetrances as parameters, and 4) can incorporate information from flanking markers in LE with the tested SNP. Two LRTs are proposed. The first (denoted as LAMP-LE) assesses whether the tested SNP is in LE with the disease locus, while the second (denoted as LAMP-LD) assesses whether the tested SNP is in complete LD with the disease locus. Therefore, LAMP-LE is a test of T1 and LAMP-LD is a test of T2. The statistical significance of these two tests is assessed empirically by comparing the observed statistic with simulated null distributions. We exclude flanking short tandem repeat (STR) markers in LD [7] with tested SNPs and use the remaining STRs in our application of LAMP.
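Empirical significance of the kind described for the LAMP tests can be sketched as a comparison of the observed statistic with statistics simulated under the null. The Python function below is a generic illustration; the add-one correction and the function name are assumptions for this sketch and may differ from what the LAMP software actually does.

```python
def empirical_p_value(observed, null_statistics):
    """Empirical p-value: share of simulated null statistics at least as
    large as the observed one, with an add-one correction so that the
    estimate is never exactly zero (an assumption of this sketch).
    """
    null_statistics = list(null_statistics)
    n_extreme = sum(1 for s in null_statistics if s >= observed)
    return (n_extreme + 1) / (len(null_statistics) + 1)
```

For example, an observed statistic of 3.0 against simulated values 1, 2, 3, and 4 gives (2 + 1) / (4 + 1) = 0.6.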
Homozygote Sharing Test (HST)
The HST statistic [11, 12] is constructed using a likelihood function conditional on parental genotypes. It compares the observed IBD sharing from homozygous and heterozygous parents to determine whether a SNP partially explains the evidence for linkage. HST capitalizes on the fact that parents who are homozygous at all disease loci in a linkage region should not transmit any alleles preferentially to the affected siblings, and hence no excess IBD sharing should be observed from homozygous parents. Additionally, the IBD sharing from homozygous and heterozygous parents should be equal for SNPs in LE with all disease loci. For the intermediate case in which the tested SNP is in partial LD with disease loci, some increased sharing may be observed from homozygous parents in a linkage region. The HST statistic to identify SNPs explaining some of the linkage evidence is derived from the likelihood ratio of the following hypotheses: H0: 1/2 < α_homo = α_het vs. H1: 1/2 ≤ α_homo < α_het, where α_homo and α_het are the probabilities that an affected sib pair shares one allele IBD with respect to homozygous and heterozygous parents, respectively. The HST is defined as
where the counts entering the statistic are the numbers of sib pairs sharing j alleles IBD from homozygous and heterozygous parents, respectively (j = 0, 1). This HST statistic (denoted HST-LE) is a test of T1. Once subsets of SNPs explaining some of the linkage evidence have been identified, one can then test H0: 1/2 = α_homo < α_het vs. H1: 1/2 < α_homo < α_het with the following HST statistic:
Rejection of the null hypothesis indicates that the tested SNP does not fully explain the linkage evidence. This HST statistic (denoted HST-LD) is a test of T2. Both HST-LE and HST-LD are LRTs under independent parental transmissions, which is equivalent to assuming a multiplicative model of transmission. Under the null hypothesis of LE between the tested SNP and disease loci, both HST statistics asymptotically follow a chi-square mixture distribution, assuming independent parental transmissions.
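To illustrate how a p-value is obtained from a chi-square mixture, the Python sketch below handles a mixture of a point mass at zero and a chi-square distribution with one degree of freedom. The mixing weight is a parameter; the default of 0.5 is an assumption chosen because it is the common form for one-sided LRTs, and it is not taken from the text. The function names are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Survival function of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mixture_p_value(statistic, weight_zero=0.5):
    """P-value under weight_zero * chi2(0 df) + (1 - weight_zero) * chi2(1 df).

    weight_zero is the probability mass at zero; 0.5 is an assumed default.
    """
    if statistic <= 0:
        return 1.0
    return (1.0 - weight_zero) * chi2_sf_1df(statistic)
```

Under the assumed 50:50 weights, a statistic of about 2.71 corresponds to a p-value of about 0.05, compared with about 0.10 for an unmixed one-degree-of-freedom chi-square.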
HSTDT – Combination of HST-LE and transmission-disequilibrium test (TDT)
The original TDT, proposed by Spielman et al. [13], tests linkage and association between a marker and a disease locus using an ascertained affected individual and his or her parents' marker information. The TDT can also be used to test association in a linked region (Spielman and Ewens [14]) using data that consist of nuclear families with a single affected child. The TDT examines the allelic transmission to the affected child from his or her heterozygous parents. For families with multiple affected siblings, the transmissions are correlated among siblings if there is linkage, and the TDT is no longer a valid test of association in the presence of linkage. To solve this problem, Martin et al. [15] focused on the set of transmissions from a heterozygous parent shared by all of his or her affected children. For affected sib-pair data and a marker with two alleles M1 and M2, they showed that for a marker in LE with the disease loci, the probability that both affected siblings receive M1 and the probability that both affected siblings receive M2 from their heterozygous parent are equal. Thus, there should be no over-transmission of M1 or M2 to affected offspring. In what follows we use TDT to refer to Martin et al.'s strategy, which is a test of T1 [15]. Note that the TDT does not use information from homozygous parents, while HST-LE compares the observed allele sharing from homozygous and heterozygous parents without considering which allele is over-transmitted from heterozygous parents. To fully use all available information to identify whether a SNP explains some of the linkage evidence, we propose HSTDT, which combines HST-LE and TDT by decomposing the allele sharing from heterozygous parents (α_het) into two allele-specific IBD sharing probabilities, one for each of the alleles M1 and M2. The HSTDT statistic is defined as
Similar to HST-LE, HSTDT is an LRT under the assumption of independent parental transmissions. Under the null hypothesis of LE between the tested SNP and disease loci (T1), the HSTDT statistic asymptotically follows a chi-square mixture distribution, assuming independent parental transmissions.
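The transmission comparison underlying the TDT can be sketched as a McNemar-type statistic on two counts. In the classical TDT the counts are the numbers of heterozygous parents transmitting M1 and M2 to an affected child. For the sib-pair strategy described above, they would be the numbers of heterozygous parents transmitting M1 to both affected sibs and M2 to both affected sibs; that this exact form matches Martin et al.'s statistic is an assumption of the sketch, and the function name is hypothetical.

```python
def transmission_statistic(n_m1, n_m2):
    """McNemar-type statistic comparing two transmission counts.

    n_m1, n_m2: numbers of heterozygous parents transmitting M1 or M2
    (to one affected child, or to both affected sibs; see the lead-in).
    Under equal transmission probabilities the statistic is approximately
    chi-square with 1 degree of freedom.
    """
    total = n_m1 + n_m2
    if total == 0:
        return 0.0  # no informative transmissions
    return (n_m1 - n_m2) ** 2 / total
```

For example, 60 transmissions of M1 against 40 of M2 give (60 - 40)² / 100 = 4.0.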
The major distinction between the homozygote sharing tests (HST and HSTDT) and other tests of T1 or T2 is that the former can be used to test whether a SNP explains the linkage peak (by using IBD information at the linkage peak). Assuming no recombination between the tested SNP and the presumed disease locus at the linkage peak, testing for association in the presence of linkage is equivalent to testing whether a SNP partially explains the linkage peak, while testing for no linkage after adjusting for association is equivalent to testing whether the tested SNP fully explains the linkage peak. However, when this assumption is violated, tests of T1 and T2 other than HST and HSTDT may not support the claim that the tested SNP explains the peak linkage evidence, only that it explains the linkage evidence at the location of the tested SNP. When a linkage signal is identified in a linked region, the LOD score at the linkage peak should be of greatest interest and is the quantity usually reported. In this report, HST and HSTDT are applied to identify SNPs explaining the peak linkage evidence.