A logistic mixture model for a family-based association study.

A family-based association study design is not only able to localize causative genes more precisely than linkage analysis, but it also helps explain the genetic mechanism underlying the trait under study. Therefore, it can be used to follow up an initial linkage scan. For an association study of binary traits in general pedigrees, we propose a logistic mixture model that regresses the trait value on the genotypic values of markers under investigation and other covariates such as environmental factors. We first tested both the validity and power of the new model by simulating nuclear families inheriting a simple Mendelian trait. It is powerful when the correct disease model is specified and shows much loss of power when the dominance of a model is inversely specified, i.e., a dominant model is wrongly specified as recessive or vice versa. We then applied the new model to the Genetic Analysis Workshop (GAW) 15 simulation data to test the performance of the model when adjusting for covariates in the case of complex traits. Adjusting for the covariate that interacts with disease loci improves the power to detect association. The simplest version of the model only takes monogenic inheritance into account, but analysis of the GAW simulation data shows that even this simple model can be powerful for complex traits.


Background
Linkage analysis is a useful tool for the initial exploration of complex diseases, but is limited in its ability to localize the loci potentially segregating for disease susceptibility. Association analysis, which directly tests the association between a trait and marker alleles, can more precisely localize causative genes. For this purpose a family-based association study design can be used to follow up an initial linkage scan. Moreover, it can help explain the genetic mechanism underlying the trait because extended pedigrees provide more genetic information than a random sample consisting of the same number of individuals. Note also that the often-quoted paper by Risch and Merikangas [1], which advocated association studies for detecting genes of modest effect, employed a family-based association study design (in particular, the transmission-disequilibrium test (TDT) [2]) and showed it to be more powerful than a linkage analysis of affected sib pairs.
There have been several ways proposed to conduct an association study of a binary trait using general pedigrees [3,4], and there have also been joint linkage and association models proposed [5,6]. In this paper we propose a logistic mixture model for an association study of binary traits in general pedigrees. We first tested both the validity and power of this new model by simulating nuclear families inheriting a simple Mendelian trait of monogenic inheritance. We describe this initial study briefly before applying the new model to the Genetic Analysis Workshop 15 (GAW15) simulated complex trait data; in particular, we examined the performance of the model when adjusting for covariates in the case of a complex trait, as this could provide information on familial residual correlations due to common environmental sharing.

Methods
Denote by $y_i$ the phenotype ($y_i \in \{0, 1\}$, where 1 denotes affected and 0 denotes unaffected) and by $g_i$ the genotype of the $i$th individual in a pedigree of $n$ members. The likelihood for the pedigree data $Y = (y_i)$ is given by a mixture model: $L(Y) = \sum_{g_1} \cdots \sum_{g_n} \prod_{i=1}^{n} P(y_i \mid g_i)\, P(g_i \mid g_{f_i}, g_{m_i})$, where $P(g_i \mid g_{f_i}, g_{m_i})$ denotes the conditional probability that individual $i$ has genotype $g_i$ given the parental genotypes $g_{f_i}$ and $g_{m_i}$ if he or she is a non-founder, or the probability that individual $i$ has the genotype $g_i$ determined by population genotype frequencies if he or she is a founder; and $P(y_i \mid g_i)$, the penetrance function, denotes the probability that individual $i$ has phenotype $y_i$ given genotype $g_i$.  [7] algorithm in the context of complex segregation analysis. The significance of a marker can be tested by comparing the likelihood with and without this marker in the model logit. Because the finite sample-size null distribution of the likelihood ratio test statistic is not known, the empirical significance level of a particular observed result can be determined either by a simulation study under the null hypothesis, generating unassociated marker data for the sample at hand, or by a permutation test (e.g., [8]). We classify the current method as model-based because a penetrance function is explicitly specified.
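As a minimal sketch of how the mixture likelihood described above can be evaluated, the following Python code sums over the unobserved genotypes of a single nuclear family by brute force. Hardy-Weinberg founder frequencies, Mendelian transmission, a diallelic locus, the restriction to nuclear families, and all names (`nuclear_family_likelihood`, `alpha`, `beta`, `code`) are assumptions made for illustration; this is not the S.A.G.E. implementation, which handles general pedigrees and additional covariates.

```python
import itertools
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def founder_prob(g, p):
    """Genotype frequency for a founder; g = copies of the minor allele.
    Hardy-Weinberg proportions are an assumption of this sketch."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2][g]


def child_prob(g, gf, gm):
    """Mendelian probability of child genotype g given parental genotypes."""
    tf, tm = gf / 2.0, gm / 2.0  # chance each parent transmits the minor allele
    return [(1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm][g]


def penetrance(y, g, alpha, beta, code):
    """P(y | g) with logit P(y = 1 | g) = alpha + beta * code[g]."""
    f = expit(alpha + beta * code[g])
    return f if y == 1 else 1.0 - f


def nuclear_family_likelihood(y_parents, y_children, p, alpha, beta,
                              code=(0, 1, 2)):
    """Sum over all parental genotype pairs; children are conditionally
    independent given their parents, so each child is summed separately."""
    total = 0.0
    for gf, gm in itertools.product(range(3), repeat=2):
        term = founder_prob(gf, p) * founder_prob(gm, p)
        term *= penetrance(y_parents[0], gf, alpha, beta, code)
        term *= penetrance(y_parents[1], gm, alpha, beta, code)
        for y in y_children:
            term *= sum(child_prob(g, gf, gm) * penetrance(y, g, alpha, beta, code)
                        for g in range(3))
        total += term
    return total
```

When `beta` is zero the penetrance no longer depends on genotype, so the likelihood reduces to a product of independent Bernoulli terms; this gives a convenient sanity check for the enumeration.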

Initial simulation: simple traits
We first describe our initial simulation study, for which we simulated nuclear families consisting of two parents and four children. One diallelic marker was simulated with the minor allele frequency (MAF) $p_D = 0.3$, i.e., a common variant corresponding to the case for which association mapping is advocated [9], and the affection status of all individuals was simulated under nine disease models, covering the relative-risk spectrum from low to high, with the minor allele of this diallelic marker the same as the disease-predisposing allele (Table 1). Among the nine models, three were simulated under each of the dominant, recessive, and additive modes of inheritance. For example, a dominant model with penetrances $(f_0, f_1, f_2) = (0.01, 0.03, 0.03)$ is denoted D3: D stands for the dominant mode of inheritance, 3 stands for $f_2 = 0.03$, and $\mathrm{logit}(f_j) = \alpha + \beta g_j$, where $j$ denotes the number of copies of the disease allele and $g_j$ the corresponding genotypic value. Under each model random families were generated and those with at least two affected children were ascertained. In this way we generated 500 replicate samples with 30 families in each sample data set. The diallelic marker and pedigree structures were simulated using the program SimPed [10]. Under the alternative hypothesis of association, the affection status of individuals was simulated according to the penetrance functions of each model; under the null hypothesis of no association, the affection status was randomly simulated according to a disease prevalence given by $(1 - p_D)^2 f_0 + 2 p_D (1 - p_D) f_1 + p_D^2 f_2$.
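The simulation design above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the SimPed-based procedure used in the study: Hardy-Weinberg proportions for founders, Mendelian transmission, and the helper names `prevalence`, `logistic_params`, `simulate_family`, and `ascertain` are all hypothetical.

```python
import math
import random


def logit(f):
    return math.log(f / (1.0 - f))


def prevalence(p, pen):
    """Population prevalence implied by penetrances pen = (f0, f1, f2),
    assuming Hardy-Weinberg genotype proportions."""
    q = 1.0 - p
    return q * q * pen[0] + 2 * p * q * pen[1] + p * p * pen[2]


def logistic_params(pen, code):
    """Solve logit(f_j) = alpha + beta * code[j] from genotype 0 and the
    genotype with the largest code. Exact only when the penetrances are
    consistent with the chosen coding."""
    j0 = 0
    j1 = max(range(3), key=lambda j: code[j])
    beta = (logit(pen[j1]) - logit(pen[j0])) / (code[j1] - code[j0])
    alpha = logit(pen[j0]) - beta * code[j0]
    return alpha, beta


def simulate_family(p, pen, rng, n_children=4):
    """One nuclear family; returns (genotypes, phenotypes), parents first."""
    parents = [sum(rng.random() < p for _ in range(2)) for _ in range(2)]
    children = [sum(rng.random() < g / 2.0 for g in parents)
                for _ in range(n_children)]
    genos = parents + children
    phenos = [int(rng.random() < pen[g]) for g in genos]
    return genos, phenos


def ascertain(p, pen, n_families, rng, min_affected=2):
    """Keep generating random families until n_families of them have at
    least min_affected affected children."""
    sample = []
    while len(sample) < n_families:
        genos, phenos = simulate_family(p, pen, rng)
        if sum(phenos[2:]) >= min_affected:
            sample.append((genos, phenos))
    return sample
```

For the D3 model with $p_D = 0.3$, `prevalence(0.3, (0.01, 0.03, 0.03))` evaluates to 0.0202 under these assumptions, and `ascertain(0.3, (0.01, 0.03, 0.03), 30, random.Random(1))` would produce one sample of 30 ascertained families.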
According to an individual's genotype, in particular the number of copies of the disease-predisposing allele, a genotypic value was assigned under the dominant, recessive, or additive model, respectively. For simplicity, we assumed Mendelian transmission probabilities and known disease allele frequencies. We analyzed each data set under three genetic models, i.e., dominant, recessive, and additive models. The significance of association between the trait and the diallelic marker was tested by fitting models with and without the diallelic marker as a covariate and then performing the likelihood-ratio test (LRT). Theoretically, this LRT statistic should asymptotically follow a $\chi^2$ distribution, and we report the empirical type I error rate under the null as the percentage of the 500 replicate p-values that fall at or below the nominal level of 0.05. Because the finite sample-size distribution of the LRT statistic is not known, we report power as the percentage of the 500 replicate LRT statistics under the alternative that exceed the cut-off for the top 5% of the 500 replicate LRT statistics under the null. We termed the current method SEGREG and, as a comparison, we also analyzed the same data sets with another family-based association method [3], called ASSOC. The analyses were performed using the programs SEGREG and ASSOC in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) software suite, Version 5.2 [11].
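The two summary measures described above can be computed from replicate LRT statistics roughly as follows. This is a sketch only: the one degree of freedom for the asymptotic reference distribution is an assumption (one marker coefficient added to the model), and the function names are hypothetical.

```python
import math


def lrt_statistic(loglik_with_marker, loglik_without_marker):
    """Likelihood-ratio statistic for adding the marker to the model."""
    return 2.0 * (loglik_with_marker - loglik_without_marker)


def chi2_sf_1df(x):
    """Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


def empirical_type1_error(null_stats, level=0.05):
    """Fraction of null replicates whose asymptotic p-value is <= level."""
    pvals = [chi2_sf_1df(s) for s in null_stats]
    return sum(p <= level for p in pvals) / len(pvals)


def empirical_power(alt_stats, null_stats, level=0.05):
    """Fraction of alternative replicates exceeding the cut-off that leaves
    the top `level` share of the null replicates above it."""
    ordered = sorted(null_stats)
    n = len(ordered)
    k = int(round(level * n))      # number of null statistics in the top tail
    cutoff = ordered[n - k - 1]    # k null values lie above it (absent ties)
    return sum(s > cutoff for s in alt_stats) / len(alt_stats)
```

With 500 null replicates and `level=0.05`, the cut-off is the 475th smallest null statistic, so 25 null values lie above it. Calibrating power against the empirical null in this way keeps the comparison fair even when the asymptotic test is conservative.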

Simulation results
The two methods had very similar type I error rate and power, though SEGREG showed slightly higher power in most cases (Table 2); therefore we focus below only on the SEGREG results. The empirical type I error rate at a nominal 0.05 significance level was found to be always far below 0.05, regardless of the correctness of the model assumptions. Under the alternative, assuming a correct genetic model always results in more power than assuming a wrong model, which was anticipated. However, when analyzed under wrong assumptions, the power depends on both the underlying true model and the model assumed for the analysis. If the true disease model is dominant, analysis assuming a recessive model has little power; and if the true disease model is recessive, analysis assuming a dominant model has little power.
Analysis assuming an additive model usually has fair power. However, in the case of the low-penetrance disease models D3 and R3, wrongly assuming an additive model has power of only 0.530 and 0.450, respectively. Note, however, that even when an additive model is correctly assumed, the power is only 0.552 when the true model is A3.
When the true disease model is additive, analysis assuming a dominant model is usually much more powerful than that assuming a recessive model.
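One heuristic way to see why the additive coding is a reasonable compromise, and why dominant and recessive codings substitute poorly for each other, is to look at the correlation between the coded genotypic values in the population. The sketch below is not part of the original analysis; it assumes Hardy-Weinberg proportions and the conventional 0/1 and 0/1/2 codings, and the names `CODINGS` and `coding_correlation` are hypothetical.

```python
import math

# Genotypic value for 0, 1, or 2 copies of the minor allele.
CODINGS = {
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
    "additive": (0, 1, 2),
}


def hwe_freqs(p):
    """Genotype frequencies under Hardy-Weinberg proportions (assumed)."""
    q = 1.0 - p
    return (q * q, 2 * p * q, p * p)


def coding_correlation(name_a, name_b, p):
    """Population correlation between two codings of the same genotype."""
    w = hwe_freqs(p)
    a, b = CODINGS[name_a], CODINGS[name_b]
    mean_a = sum(wi * ai for wi, ai in zip(w, a))
    mean_b = sum(wi * bi for wi, bi in zip(w, b))
    cov = sum(wi * (ai - mean_a) * (bi - mean_b) for wi, ai, bi in zip(w, a, b))
    var_a = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
    var_b = sum(wi * (bi - mean_b) ** 2 for wi, bi in zip(w, b))
    return cov / math.sqrt(var_a * var_b)
```

At a minor allele frequency of 0.3, the dominant and recessive codings have a correlation of about 0.31, whereas the additive coding has a correlation of about 0.91 with the dominant coding and about 0.68 with the recessive coding. This ordering is in line with the pattern of power loss reported here, although correlation between codings is only a rough proxy for power.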

Application to the GAW15 simulated data: complex traits
Compared to our simulation of simple traits, the GAW15 simulation data provided an opportunity to test the new model on a complex trait. In particular, we examined the performance of the model when adjusting for covariates, using the data simulating rheumatoid arthritis (RA). Knowing from the answers to the simulated data that smoking interacts with locus B (MAF = 0.35) on chromosome 8 and locus F (MAF = 0.50) on chromosome 11 to increase susceptibility to RA, we performed an association analysis screening for the binary RA trait loci on chromosomes 8 and 11 using all 100 replicates of the simulated 10 K SNP data. The empirical power at a nominal 0.05 significance level for loci B and F was determined by comparing the likelihood ratio statistics to the distribution of statistics for the disease-unassociated markers SNP1_3 (MAF = 0.35) and SNP1_4 (MAF = 0.50), respectively. Age and sex were always included as covariates, and we compared the results with and without adjusting for smoking.
To mimic a real situation, we only chose the first 100 nuclear families in each replicate for this study. Markers on chromosome 8 and SNP1_3 were coded with the minor allele dominant and markers on chromosome 11 and SNP1_4 were coded in an additive fashion, corresponding to the fact that loci B and F were simulated under dominant and additive models, respectively.
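The marker coding described above can be written down as a small lookup. This is a sketch under assumptions: genotypes are represented as the number of copies of the minor allele, and the identifiers `MARKER_MODE`, `genotypic_value`, and `code_marker` (as well as the keys `"chr8"` and `"chr11"`) are hypothetical, not taken from the GAW15 data files or from S.A.G.E.

```python
# Coding scheme used for each group of markers in the GAW15 application.
MARKER_MODE = {
    "chr8": "dominant",    # locus B was simulated under a dominant model
    "SNP1_3": "dominant",  # null marker matched to locus B (MAF = 0.35)
    "chr11": "additive",   # locus F was simulated under an additive model
    "SNP1_4": "additive",  # null marker matched to locus F (MAF = 0.50)
}


def genotypic_value(n_minor, mode):
    """Map the number of minor-allele copies (0, 1, 2) to a covariate value."""
    if n_minor not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2 copies of the minor allele")
    if mode == "dominant":
        return int(n_minor >= 1)
    if mode == "recessive":
        return int(n_minor == 2)
    if mode == "additive":
        return n_minor
    raise ValueError("unknown mode: " + mode)


def code_marker(genotypes, marker_group):
    """Code a list of genotypes according to the group's assigned mode."""
    mode = MARKER_MODE[marker_group]
    return [genotypic_value(g, mode) for g in genotypes]
```

Coding each null marker with the same scheme as the locus it is compared against keeps the reference distribution of LRT statistics matched in both allele frequency and coding.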
Without adjusting for smoking, the power to detect loci B and F was 0.33 and 0.45, respectively, whereas the power increased to 0.35 and 0.56 after taking this covariate into consideration.

Discussion
The new method is illustrated here using the full likelihood function of a pedigree. Pedigrees are usually ascertained through the phenotypes of probands rather than by random sampling, so without appropriate ascertainment correction the parameter estimates may be biased. This does not affect the validity of the test of association as a generalized linear model, but proper parameter inference can be made only by correcting for ascertainment or by building the model on a conditional likelihood [12,13]. The LRT using large sample theory was very conservative for both SEGREG and ASSOC in the current study. We speculate that this is because of the small sample size, and we hypothesize that, as the sample size increases, the type I error rate will come closer to the nominal level. This topic awaits further investigation.
As a model-based approach, our method requires prespecifying a genetic model and explicit penetrance functions. According to the simulation study, great power loss occurs when the dominance of a model is inversely specified, i.e., a dominant model is wrongly specified as recessive or vice versa, which is similar to the situation in model-based linkage analysis [14]. This is no surprise, because both are built on the full likelihood function of a pedigree. In general, the results suggest using an additive model in practice. However, in the common-variant, low-penetrance scenario (D3, R3, A3), the dominant and recessive models analyzed under their true genetic mechanism have appreciably higher power than the additive model analyzed under its own (0.828, 0.646, and 0.552, respectively). The fundamental reason for this is that, in the case of low penetrance, the three components of the mixture distribution are not as easily distinguishable as two components. It is of interest to note that ASSOC, a model-free method in the sense that no penetrance function is specified, shows a sensitivity to the marker coding scheme similar to that of SEGREG. Because most association methods require coding SNPs in a dominant, recessive, or additive fashion, we speculate that this observation applies to most methods. Therefore, caution should be taken when coding markers regardless of the method employed to test association.
Detecting variants of modest effect remains a challenge for association studies, especially in the case of rare variants, which were not even simulated in the current study. For the GAW simulation data, the proposed method showed more power in detecting locus F than in detecting locus B. The effect of locus F was simulated via a variance-component method for the continuous trait IgM, which in turn affected the RA trait, whereas the effect of locus B was simulated under a dominant model with relative risk equal to 1.5. There is no direct way to compare their effect sizes. Because only locus F was detected with fair power, we speculate that, if measured on the same scale, the effect size of locus B would be considerably smaller. The main limitation of our method lies in its assumption of a monogenic disease mechanism without allowing for familial correlation due to polygenic and/or common environmental effects, which is unrealistic for complex diseases (though widely adopted by most methods in the literature). Model-based methods should use models that approximate the complexity of the disease being studied in order to be both robust and powerful. Analyses using models that ignore residual familial correlation can result in decreased power; how to model the familial correlations is a topic for further investigation, though we could easily incorporate covariates that resemble the action of familial correlations into the current model.