Problem 3 data set
The GAW15 Problem 3 data consisted of 100 simulated data sets of affected families with genotype and quantitative phenotype information patterned after data from rheumatoid arthritis studies [7]. Due to data corruption issues and time constraints, we analyzed simulations 1–64, 75, 77–87, and 90–95. Each simulated data set consisted of 1500 two-child nuclear families. Every individual had genotype information for 9187 autosomal single-nucleotide polymorphisms (SNPs) and several phenotype measures. Having looked at the simulation "answers," we used latent severity as the trait of interest. Values of latent severity were simulated based on independent effects of two diallelic loci, not in linkage disequilibrium (LD), on chromosome 9. A quarter of the variability in latent severity was due to each locus independently and the rest was due to individual random effects.
Problem 1 data set
The GAW15 Problem 1 data set consisted of a set of 14 CEPH (Centre d'Etude du Polymorphisme Humain) pedigrees in which individuals have been genotyped for 2819 autosomal SNPs [8]. For our analysis, we ignored grandparental data and looked only at nuclear families. We adopted a convention that a family's data was ignored for a SNP when a parental genotype was missing for that SNP. This allowed us to test transmission without confronting the issue of missing parental genotypes. The methods used here rely on asymptotic theory, so SNPs missing in more than 10% of the individuals or with a minor allele frequency less than 0.1 were removed from the data set, leaving 1509 autosomal SNPs.
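The SNP filter described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the function name, the coding of genotypes as counts of one allele (0, 1, 2), and the use of `None` for a missing genotype are assumptions, and the allele frequency is estimated naively without accounting for relatedness.

```python
def snp_passes_qc(genotypes, max_missing=0.10, min_maf=0.10):
    """Return True if a SNP survives the missingness and MAF filters.

    genotypes: per-individual allele counts (0, 1, 2); None marks a missing call.
    This coding is an assumption made for the sketch.
    """
    n = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if n == 0 or not observed:
        return False
    # Remove SNPs missing in more than max_missing of the individuals.
    missing_rate = 1 - len(observed) / n
    if missing_rate > max_missing:
        return False
    # Naive allele frequency: each observed individual carries two alleles.
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1 - freq)
    # Remove SNPs with a minor allele frequency below min_maf.
    return maf >= min_maf
```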
In order to examine the behavior of the methods under various trait models on realistic data with known causal SNPs, we simulated traits based on the Problem 1 genotypes. We used two models for our simulations: a simple additive model and an additive model with polygenic effects (hereafter referred to as the polygenic model). Simple additive traits follow the model $y_{ij} = \mu + a x_{ij} + \varepsilon_{ij}$, where $y_{ij}$ is the trait value of individual $j$ from family $i$, $\mu$ is the average trait value, $a$ is the genetic effect, $x_{ij}$ is the number of copies of allele A at the causal SNP in individual $j$ from family $i$, and $\varepsilon_{ij}$ is a $N(0, \sigma_\varepsilon^2)$ residual. For polygenic traits, we used the model $y_{ij} = \mu + a x_{ij} + z_{ij} + \varepsilon_{ij}$, where in the parents, $z_{ij}$ are normal random variables with mean 0 and variance $\sigma_z^2$. Given parental polygenic effects $z_{im}$ and $z_{if}$, the offspring polygenic values follow a $N((z_{im} + z_{if})/2, \sigma_z^2/2)$ distribution.
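The two trait models can be sketched for a single nuclear family as below. This is an illustrative sketch only: the function name, the default parameter values, and the argument names `sd_eps` and `sd_z` (standard deviations of the residual and of the parental polygenic effect) are assumptions, not values taken from the study.

```python
import math
import random

def simulate_family_traits(parent_x, child_x, mu=0.0, a=0.5,
                           sd_eps=1.0, sd_z=None, rng=None):
    """Simulate trait values for one nuclear family.

    parent_x: (x_m, x_f) allele counts at the causal SNP for the two parents.
    child_x:  list of allele counts for the children.
    sd_z=None gives the simple additive model; a number adds the polygenic term.
    All defaults are placeholders chosen for illustration.
    """
    rng = rng or random.Random()
    if sd_z is None:
        z_par = [0.0 for _ in parent_x]
        z_kid = [0.0 for _ in child_x]
    else:
        # Parental polygenic effects: N(0, sd_z^2).
        z_par = [rng.gauss(0.0, sd_z) for _ in parent_x]
        midparent = sum(z_par) / len(z_par)
        # Offspring: N(midparent value, sd_z^2 / 2).
        z_kid = [rng.gauss(midparent, sd_z / math.sqrt(2)) for _ in child_x]
    parents = [mu + a * x + z + rng.gauss(0.0, sd_eps)
               for x, z in zip(parent_x, z_par)]
    children = [mu + a * x + z + rng.gauss(0.0, sd_eps)
                for x, z in zip(child_x, z_kid)]
    return parents, children
```

Passing a seeded `random.Random` instance as `rng` makes a simulated replicate reproducible.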
Statistical tests
For a quantitative trait, the conditional means model test (CMMT) described by Lange et al. [4] examines each SNP for association using the linear mixed-effect model:
$$y_{ij} = \mu + \beta_0 E(x_{ij} \mid x_{im}, x_{if}) + z_i + \varepsilon_{ij}, \tag{1}$$
where $\beta_0$ is the genetic effect, $E(x_{ij} \mid x_{im}, x_{if})$ is the child's expected genotype given the parental genotypes, $z_i$ is a random family effect, and $\varepsilon_{ij}$ is residual error. The CMMT is the Wald statistic for testing the null hypothesis of no association, $\beta_0 = 0$, against the alternative $\beta_0 \neq 0$ [5]. Generalized estimating equations are used to fit the model and obtain p-values for each SNP.
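To make the covariate in the CMMT model concrete, the sketch below enumerates the child's genotype distribution given two genotyped parents and takes its mean. It assumes an autosomal diallelic SNP, genotypes coded as counts of allele A, and Mendelian transmission; the function names are hypothetical. Under these assumptions the expectation reduces to the mean of the two parental allele counts.

```python
from itertools import product

def child_genotype_distribution(x_m, x_f):
    """P(child genotype | parental genotypes) under Mendelian transmission.

    x_m, x_f: parental counts of allele A (0, 1, or 2).
    """
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    # t_m, t_f indicate whether each parent transmits allele A.
    for t_m, t_f in product((0, 1), repeat=2):
        p_m = x_m / 2 if t_m else 1 - x_m / 2
        p_f = x_f / 2 if t_f else 1 - x_f / 2
        dist[t_m + t_f] += p_m * p_f
    return dist

def expected_child_genotype(x_m, x_f):
    """Expected count of allele A in a child given the parental genotypes."""
    dist = child_genotype_distribution(x_m, x_f)
    return sum(g * p for g, p in dist.items())
```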
The family-based association test (FBAT) [4] is a score statistic for transmission disequilibrium based on the model:
$$y_{ij} = \mu' + \beta'_0 \left(x_{ij} - E(x_{ij} \mid x_{im}, x_{if})\right) + \varepsilon'_{ij}, \tag{2}$$
where the null hypothesis is $\beta'_0 = 0$ and the alternative is $\beta'_0 \neq 0$. This test statistic follows a $\chi^2_1$ distribution in large samples.
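A simplified score statistic of this kind can be sketched as follows. This is not the PBAT implementation. It assumes both parents are genotyped, an autosomal diallelic SNP coded as allele counts, a known trait offset `mu`, and a null hypothesis of no linkage and no association, so that siblings' transmissions are independent given the parents and the variance can be computed from Mendel's laws. The resulting statistic is referred to a chi-square distribution with 1 degree of freedom.

```python
import math

def fbat_score_test(families, mu):
    """Simplified FBAT-style score test for a quantitative trait.

    families: list of (x_m, x_f, [(x_child, y_child), ...]) tuples.
    mu: trait offset subtracted from each y (assumed known here).
    Returns (statistic, p_value).
    """
    u = 0.0  # score: sum of (y - mu) * (x - E[x | parents])
    v = 0.0  # variance of the score under the null
    for x_m, x_f, children in families:
        e_x = (x_m + x_f) / 2
        # Each heterozygous parent contributes 1/4 to Var(x | parents).
        var_x = 0.25 * ((x_m == 1) + (x_f == 1))
        for x, y in children:
            t = y - mu
            u += t * (x - e_x)
            v += t * t * var_x
    if v == 0:
        raise ValueError("no informative families")
    stat = u * u / v
    # Survival function of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Families in which neither parent is heterozygous contribute nothing to the score or its variance, which is why the function refuses to return a value when no family is informative.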
The CMMT and FBAT provide statistically independent tests corresponding to between- and within-family association analyses of a SNP and quantitative trait [2, 4, 5]. These test statistics can be examined separately in two stages [4, 5] or combined in a single-stage analysis. The two-stage method screens every SNP with the CMMT and passes the ten best-scoring SNPs to the second stage. Retaining the top ten SNPs is recommended by Van Steen et al. [5] based on estimates of power to identify multiple loci of small effects in simulations of a 10 K genome scan. In the second stage, an FBAT score statistic is computed for each of the ten retained SNPs and a Bonferroni correction factor of ten is applied. The corrected FBAT p-values are considered to be the final genome-wide p-values for those ten SNPs.
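The two-stage procedure can be sketched as below. The function name and the dictionary inputs are assumptions, and "best-scoring" is interpreted here as the smallest CMMT p-value, which the text does not state explicitly.

```python
def two_stage(cmmt_p, fbat_p, n_keep=10):
    """Screen with CMMT, then test the retained SNPs with FBAT.

    cmmt_p, fbat_p: dicts mapping SNP identifier -> p-value.
    Returns a dict of Bonferroni-corrected FBAT p-values for the retained SNPs.
    """
    # Stage 1: keep the n_keep SNPs with the smallest screening p-values.
    retained = sorted(cmmt_p, key=cmmt_p.get)[:n_keep]
    # Stage 2: correct only for the number of SNPs actually tested.
    factor = len(retained)
    return {snp: min(1.0, factor * fbat_p[snp]) for snp in retained}
```

Because the screening and testing statistics are independent, the correction factor is the number of retained SNPs rather than the number of SNPs in the scan.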
A single-stage analysis can be performed by adding the CMMT and FBAT test statistics to obtain a chi-square sum test (CSST). Because CMMT and FBAT are independent test statistics, each with an asymptotic $\chi^2_1$ distribution [5], the resulting CSST statistic has an asymptotic $\chi^2_2$ distribution.
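As a small worked example, the CSST can be computed as below, assuming each component statistic is asymptotically chi-square with 1 degree of freedom so that their sum is referred to a chi-square with 2 degrees of freedom. The function name is hypothetical. With 2 degrees of freedom the upper-tail probability has the closed form $\exp(-s/2)$, so no statistics library is needed.

```python
import math

def csst(cmmt_stat, fbat_stat):
    """Chi-square sum test from two independent 1-df chi-square statistics.

    Returns (sum statistic, p-value from a chi-square with 2 df).
    """
    s = cmmt_stat + fbat_stat
    # Survival function of a chi-square with 2 df: P(X > s) = exp(-s / 2).
    p_value = math.exp(-s / 2)
    return s, p_value
```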
Leek, Rohlfs, and Storey developed the conditional likelihood ratio test (CLRT) (personal communication), another single-stage method analogous to the two-stage method. As in Lange's two-stage method, the CLRT one-stage likelihood can be divided into association and linkage portions [2]. Note that:
$$L = L(y_{ij}, x_{ij} \mid \theta, \theta', x_{im}, x_{if}) = L(y_{ij} \mid \theta, x_{im}, x_{if})\, L(x_{ij} \mid y_{ij}, \theta', x_{im}, x_{if}), \tag{3}$$
where $\theta$ and $\theta'$ are vectors of model parameters. The screening step from the two-stage design is mirrored in $L(y_{ij} \mid \theta, x_{im}, x_{if})$ and the testing stage corresponds to $L(x_{ij} \mid y_{ij}, \theta', x_{im}, x_{if})$.
The CLRT uses the linear mixed model $y_{ij} = \mu + \beta_0 E(x_{ij} \mid x_{im}, x_{if}) + z_i + \varepsilon_{ij}$ for the $L(y_{ij} \mid \theta = (\mu, \beta_0, \sigma^2), x_{im}, x_{if})$ part of the total likelihood. For the $L(x_{ij} \mid y_{ij}, \theta' = (\mu', \beta'_0, \sigma'^2), x_{im}, x_{if})$ part of the likelihood, we used the model $\operatorname{logit}(\Pr(c_{ij} \mid y_{ij}, \theta')) = \mu' + \beta'_0 y_{ij} + \varepsilon'_{ij}$ [9]. Here, each child has one $c_{ij}$ value for each of its heterozygous parents, with $c_{ij} = 1$ if allele A was transmitted from that parent and $c_{ij} = 0$ otherwise. This is not precisely analogous to the testing stage of the original two-stage method, where a simple linear model was used.
As in the two-stage method, the two parts of our likelihood are calculated under different models, so $\beta_0$ and $\beta'_0$ measure two different quantities. The null hypothesis in this method is that both $\beta_0$ and $\beta'_0$ are zero. A likelihood-ratio test statistic of the joint likelihood follows a $\chi^2_2$ distribution.
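Because the joint likelihood factors into two parts with separate parameter vectors, the likelihood-ratio statistic can be written as a sum of two component statistics. The notation below is introduced only for illustration: $\ell_1$ and $\ell_2$ denote the log-likelihoods of the two factors summed over all children, hats denote maximum likelihood estimates, and a subscript 0 marks estimates obtained under the null constraint.

\[
\Lambda = 2\left[\ell_1(\hat{\theta}) - \ell_1(\hat{\theta}_0)\right] + 2\left[\ell_2(\hat{\theta}') - \ell_2(\hat{\theta}'_0)\right],
\]

where $\hat{\theta}_0$ is fitted with $\beta_0 = 0$ and $\hat{\theta}'_0$ with $\beta'_0 = 0$. Two parameters are constrained under the null hypothesis, so, assuming the usual regularity conditions, $\Lambda$ is compared with a chi-square distribution with 2 degrees of freedom.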
Case study
As a case study, we analyzed Replicate 77 from the GAW15 Problem 3 data using CMMT, FBAT, CSST, CLRT, and Lange's two-stage method [4, 5]. CMMT and FBAT analyses were performed using the PBAT version 3.2 software, and the CSST p-values were corrected for multiple testing with a family-wise error rate (FWER) of p < $5.0 \times 10^{-8}$ and false discovery rate (FDR) of p < 0.05.
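The two kinds of multiple-testing threshold can be sketched as below. The text does not say which FDR procedure was used; the Benjamini-Hochberg step-up procedure is assumed here purely for illustration, and both function names are hypothetical.

```python
def fwer_hits(p_values, threshold=5.0e-8):
    """Indices of SNPs whose p-value falls below a fixed FWER threshold."""
    return [i for i, p in enumerate(p_values) if p < threshold]

def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q.

    Assumed FDR procedure; the source does not specify one.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k with p_(k) <= k * q / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])
```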
Simulation analyses
To compute empirical type I error rates and power, we ran similar analyses using CSST, CLRT, and the two-stage method on 100 partially simulated and 82 simulated data sets from Problems 1 and 3.
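Empirical rates of this kind are proportions of replicates in which a test rejects. A minimal sketch, with a hypothetical function name and an illustrative significance level, is:

```python
def empirical_rejection_rate(p_values, alpha=0.05):
    """Fraction of replicate p-values that fall below alpha.

    p_values: one p-value per simulated data set for the SNP of interest.
    """
    if not p_values:
        raise ValueError("need at least one replicate")
    return sum(p < alpha for p in p_values) / len(p_values)
```

Applied to a SNP with no simulated effect, the proportion estimates the type I error rate; applied to a causal SNP, it estimates power.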