Testing for genetic association taking into account phenotypic information of relatives.

We investigated efficient case-control association analysis using family data. The outcome of interest was coronary heart disease. We employed existing and new methods that take into account the correlations among related individuals to obtain the proper type I error rates. The methods considered for autosomal single-nucleotide polymorphisms were: 1) generalized estimating equations-based methods, 2) variance-modified Cochran-Armitage (MCA) trend test incorporating kinship coefficients, and 3) genotypic modified quasi-likelihood score test. Additionally, for X-linked single-nucleotide polymorphisms we proposed a two-degrees-of-freedom test. Performance of these methods was tested using Framingham Heart Study 500 k array data.


Background
Several single-gene variants associated with coronary heart disease (CHD) were previously reported using Framingham Heart Study (FHS) 100 k array data [1]. Regression models with generalized estimating equations (GEE) [2] as well as family-based association testing using FBAT [3] were used. Neither method utilizes all available family information. The FBAT test statistic is based on offspring genotypes conditional on (informative) parental genotypes, whereas the GEE association test uses all individuals with genotype and phenotype data. The latter usually uses an exchangeable working correlation matrix to account for correlation within each sibship. Hence, available parental information is not optimally used.
Our aim is to use family information efficiently. In this paper we study the association between CHD and candidate genes using the binary outcome of CHD directly. The following methods were investigated: 1) a logistic regression model taking into account familial dependence of the observations using GEE, 2) the Cochran-Armitage (CA) trend test taking into account the correlations among related individuals when computing the variance, and 3) extensions of the modified quasi-likelihood score (MQLS) test [4]. The last methods also use phenotypic information of ungenotyped family members for an optimal weighting scheme, and can be used for sibships as well as for nuclear families. Because the first two methods are genotypic tests, we extended the allelic MQLS test to the corresponding genotypic test (gMQLS), assuming a multiplicative model [5].
Until now, little has been reported on the performance of such test statistics for association on the X chromosome [6,7]. Because the X chromosome represents 2.5% of the human genome for males and 5% for females, information coming from the X chromosome cannot be ignored.
To identify X-linked markers for susceptibility to a disease, we investigated statistics to test for association on the X chromosome in a related sample using GEE and the sex-stratified allelic MQLS test.

Study sample
We analyzed Problem 2 of Genetic Analysis Workshop 16 data, using GeneChip® Human Mapping 500 k Array Set provided by the FHS SHARe (SNP Health Association Resource) project. The large pedigrees (n = 841) were broken up into nuclear family units (n = 1,902). The data consist of 2,878 subjects in the Offspring Cohort (n = 2,555) and their parents in the Original Cohort (n = 323). A binary outcome variable was created as any event of hard CHD (n = 225). The details of data sets created and used are described in Table 1.

Single-nucleotide polymorphism (SNP) selection
We checked for inheritance errors. PLINK version 1.02 [8] was used for preprocessing of the data with the following inclusion thresholds: minor allele frequency ≥ 0.01, missing rate per person ≤ 0.1, missing rate per SNP ≤ 0.1, and Hardy-Weinberg equilibrium p ≥ 0.001. For chromosome 8, ignoring relatedness between subjects, we conducted allelic tests for the 22,207 preprocessed SNPs (of 27,362 in the FHS 500 k SNP resource) using PLINK. Then, 121 SNPs were selected using a threshold of allelic p-value < 0.005. For chromosome X, 8,020 SNPs (of 9,828) were tested, and 35 SNPs were selected using the same threshold.
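The per-SNP inclusion thresholds above can be illustrated with a minimal Python sketch. It is not PLINK's implementation: the function names are hypothetical, the Hardy-Weinberg check uses the asymptotic one-degree-of-freedom chi-squared test (PLINK may use a different test), and the per-person missing-rate filter is not covered.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  maf_min=0.01, miss_max=0.1, hwe_p_min=0.001):
    """Apply the per-SNP thresholds to genotype counts (hypothetical helper).

    n_aa, n_ab, n_bb: observed genotype counts; n_missing: missing calls.
    """
    n = n_aa + n_ab + n_bb
    total = n + n_missing
    if n == 0:
        return False
    # Missing rate per SNP must not exceed miss_max.
    if n_missing / total > miss_max:
        return False
    # Minor allele frequency must be at least maf_min.
    p = (2 * n_aa + n_ab) / (2 * n)
    if min(p, 1 - p) < maf_min:
        return False
    # Asymptotic Hardy-Weinberg test: observed versus expected genotype counts.
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2_sf_1df(chi2) >= hwe_p_min
```

A SNP with counts in exact Hardy-Weinberg proportions passes, whereas a SNP with no heterozygotes at intermediate allele frequency fails the equilibrium check.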

GEE-based and modified CA trend test
One merit of using pedigrees in a case-control study is that cases with affected relatives might have a higher expected frequency of associated alleles than cases without affected relatives. For GEE, an exchangeable working correlation matrix was used to account for correlation within each sibship and each family. However, this correlation structure is prone to misspecification, and the subsequent loss of efficiency may be substantial [9].
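To illustrate why an exchangeable structure can be misspecified for nuclear families, the following Python sketch contrasts it with the correlation implied by relatedness (2 times the kinship coefficient). The function names are hypothetical, and the sketch assumes unrelated, non-inbred parents and that the true dependence follows relatedness; it is not part of the GEE analysis itself.

```python
def exchangeable_corr(n, rho):
    """Exchangeable working correlation: the same rho for every pair."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def nuclear_family_relatedness(n_offspring):
    """Matrix of 2 x kinship for a nuclear family (assumed structure).

    Order of individuals: father, mother, then the offspring.
    Parent-offspring and full-sib kinship is 1/4, so the entry is 0.5;
    the two parents are assumed unrelated, so their entry is 0.
    """
    n = 2 + n_offspring
    m = [[0.5] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
    m[0][1] = m[1][0] = 0.0
    return m


working = exchangeable_corr(4, 0.5)
relatedness = nuclear_family_relatedness(2)
# The two matrices differ only in the father-mother entry.
differing = [(i, j) for i in range(4) for j in range(4)
             if working[i][j] != relatedness[i][j]]
```

Within a sibship the two structures coincide, which is consistent with the exchangeable choice being more natural for sibships than for nuclear families.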
Under the null hypothesis of no association between genotype and disease, the CA trend test statistic is $U^2/\mathrm{Var}(U) \sim \chi^2_1$, where U is a sum of weighted differences of genotype counts between cases and controls. When subjects are biologically related, we need to account for their correlations when computing the variance of U. Slager and Schaid [10] proposed a method in which the variance and covariance terms can be calculated based on identity-by-descent sharing probabilities. We calculated the covariance using the expected identity-by-descent sharing (2 times the kinship coefficient); we call this method the modified Cochran-Armitage (MCA) test.
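The variance modification can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' R code or the exact procedure of Slager and Schaid [10]: additive genotype scores 0, 1, 2 are assumed, U is written as the sum of centered disease status times genotype score, the genotypic variance is estimated by the sample variance, and the correlation between two individuals is taken to be 2 times their kinship coefficient. The function name and arguments are hypothetical.

```python
def ca_trend_stat(genotypes, status, kin=None):
    """Cochran-Armitage trend statistic, optionally kinship-modified.

    genotypes: list of 0/1/2 scores; status: list of 1 (case) / 0 (control).
    kin: optional matrix of kinship coefficients; None means independence.
    """
    n = len(genotypes)
    d_bar = sum(status) / n
    g_bar = sum(genotypes) / n
    w = [d - d_bar for d in status]                  # centered disease status
    u = sum(wi * g for wi, g in zip(w, genotypes))   # trend score U
    var_g = sum((g - g_bar) ** 2 for g in genotypes) / n
    if kin is None:
        quad = sum(wi * wi for wi in w)
    else:
        # w' K w with K = 1 on the diagonal and 2 x kinship off the diagonal.
        quad = sum(w[i] * w[j] * (1.0 if i == j else 2.0 * kin[i][j])
                   for i in range(n) for j in range(n))
    return u * u / (var_g * quad)


g = [2, 1, 1, 0, 1, 0]
d = [1, 1, 1, 0, 0, 0]
kin = [[0.0] * 6 for _ in range(6)]
kin[0][1] = kin[1][0] = 0.25          # the first two cases are full sibs
unmodified = ca_trend_stat(g, d)
modified = ca_trend_stat(g, d, kin)
```

In this toy example the two related individuals are both cases, so the modified variance is larger and the modified statistic is smaller than the unmodified one.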

MQLS test and its extensions
Alternatively, we considered the MQLS test proposed by Thornton and McPeek [4], which is reported to be more powerful and more widely applicable. It distinguishes between unaffected controls and controls of unknown phenotype (general population controls), and it also incorporates phenotypic data of relatives with missing genotypes.
Suppose we have n + m sampled individuals with phenotypic information. Let $Y = (Y_1, \ldots, Y_n)$ denote the genotype data of the n individuals with non-missing genotypes, so that m individuals have missing genotypes. Let $\Phi$ be the kinship matrix of the individuals with non-missing genotypes, and $\Phi_{N,M}$ the corresponding matrix between individuals with non-missing and missing genotypes. The entries of $\Phi$ are 1 on the diagonal and $2\varphi_{ij}$ off the diagonal, where $\varphi_{ij}$ is the kinship coefficient between the ith and jth individuals. $A_N$ and $A_M$ are the column vectors of phenotypes of the individuals with non-missing and missing genotypes, respectively, with $0 < k < 1$ specified to be the population prevalence of the trait. The test statistic, Eq. (1), is the MQLS statistic of Thornton and McPeek [4].
We extended the allelic MQLS test to the corresponding genotypic test, gMQLS, assuming a multiplicative model and using the genotypic mean $\mu_0$ and the corresponding variance $\sigma_0^2$.
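As an illustration of the notation above, the following Python sketch computes an allelic quasi-likelihood score statistic of the MQLS type. It is not the authors' R code and should be checked against Thornton and McPeek [4]. The following are assumptions of the sketch: phenotypes are coded 1 (affected), -k/(1-k) (unaffected), and 0 (unknown); the weight vector is $V = A_N + \Phi^{-1}\Phi_{N,M}A_M$; $Y_i$ is half the count of the reference allele; the allele frequency is estimated by the best linear unbiased estimator; and the null variance is $\frac{1}{2}\hat{p}(1-\hat{p})$. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def code_phenotype(status, k):
    """status: 1 affected, 0 unaffected, None unknown; k: assumed prevalence."""
    if status is None:
        return 0.0
    return 1.0 if status == 1 else -k / (1.0 - k)


def mqls(y, phi, a_n, phi_nm=None, a_m=None):
    """Allelic MQLS-type statistic (assumed form, see the lead-in)."""
    n = len(y)
    v = list(a_n)
    if phi_nm and a_m:
        # Add Phi^{-1} Phi_{N,M} A_M: phenotypes of ungenotyped relatives.
        rhs = [sum(row[j] * a_m[j] for j in range(len(a_m))) for row in phi_nm]
        v = [vi + ci for vi, ci in zip(v, solve(phi, rhs))]
    phi_inv_one = solve(phi, [1.0] * n)
    one_phi_one = sum(phi_inv_one)                    # 1' Phi^{-1} 1
    # BLUE of the allele frequency (Phi is symmetric).
    p_hat = sum(w * yi for w, yi in zip(phi_inv_one, y)) / one_phi_one
    sigma2 = 0.5 * p_hat * (1.0 - p_hat)              # undefined if monomorphic
    score = sum(vi * (yi - p_hat) for vi, yi in zip(v, y))
    phi_v = [sum(phi[i][j] * v[j] for j in range(n)) for i in range(n)]
    v_phi_v = sum(vi * x for vi, x in zip(v, phi_v))
    v_one = sum(v)
    return score ** 2 / (sigma2 * (v_phi_v - v_one ** 2 / one_phi_one))
```

For unrelated individuals with no ungenotyped relatives, $\Phi$ is the identity matrix and the statistic reduces to a simple comparison of allele frequencies between cases and controls.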
For the X-linked SNPs, a simple allele-based test can be constructed by counting alleles, with males contributing a single allele and females two alleles. Because the assumption that the allele frequency does not vary with sex could not be met, we stratified the analysis by sex and used the allelic MQLS test within each stratum. We then combined the two sex-specific chi-squared tests to obtain a two-degrees-of-freedom test (xMQLS).
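One way to combine the two sex-specific statistics is to sum them, which can be sketched in Python as follows. The sketch assumes that the male and female one-degree-of-freedom statistics are independent under the null hypothesis, which holds only approximately when males and females come from the same families; the function name is hypothetical.

```python
import math


def combine_sex_strata(chi2_male, chi2_female):
    """Sum two 1-df chi-squared statistics and return (statistic, p-value).

    For 2 degrees of freedom the chi-squared upper tail is exp(-x / 2).
    """
    stat = chi2_male + chi2_female
    return stat, math.exp(-stat / 2.0)
```

A call such as `combine_sex_strata(stat_males, stat_females)` returns the combined statistic together with its p-value, where the two arguments are the allelic MQLS statistics computed separately in males and in females.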
The analyses using the new methods were conducted using functions written by the authors in R [11].

Results
Association study for autosomal SNPs on chromosome 8
We compared the following methods: CA, MCA, GEE, and gMQLS. These tests were performed 1) using the Offspring Cohort and 2) using the Original and Offspring Cohorts, as described in Table 1. Note that for gMQLS, phenotypic information of ungenotyped individuals was also incorporated. The population prevalence of CHD (k in Eq. (1)) was set to 5%. To compare type I error rates, quantile-quantile plots of the 0.5 percentiles (the percentage of SNPs selected) are depicted in Figure 1. The points below the diagonal indicate that the allelic tests ignoring relatedness in PLINK overestimated the association. The results are comparable for these selected SNPs.
In Table 2, the top ten ranking SNPs detected by gMQLS using nuclear families are reported. gMQLS gave more significant results when information from the parental generation was included: for example, the p-value decreased from $9.80 \times 10^{-5}$ to $1.05 \times 10^{-5}$ for RS17094201. None of the SNPs tested reached genome-wide significance (nominal p < $5 \times 10^{-8}$).
Testing association for X-linked SNPs
We performed the analysis using GEE adjusted for sex and the two-degrees-of-freedom test, xMQLS. The top ten ranking SNPs using xMQLS are reported in Table 3. xMQLS gave more significant results than the other methods (minimum p-value = $6.05 \times 10^{-7}$).

Discussion
The behavior of the GEE-based methods sometimes deviates from that of the other methods, which may be explained by misspecification of the working correlation matrix, especially for nuclear families [9]. This can be a disadvantage of GEE-based methods in family-based genome-wide association studies.
We did not perform simulation studies of the type I error rates of the new methods. However, good performance of the allelic variants has been reported [4,12], and it is reasonable to expect similar performance from the new tests.
The extended MQLS tests can be used for different types of families and can incorporate phenotypic information of ungenotyped relatives. Therefore, better performance can be expected as the number of cases increases. For this purpose, selecting families with many cases might be more efficient.
The use of an allelic test for X-linked SNPs has been criticized because males have only half the impact of females on the analysis. Instead, Clayton [7] proposed genotype-based tests for association that treat males as homozygous females: female genotypes are coded as 0, 1, and 2, and male genotypes are coded as 0 and 2. X-chromosome-specific covariances can then be used to calculate genotypic trend tests that take family relationships into account.
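The genotype coding described above can be written as a short Python sketch. The function name and its arguments are hypothetical, and the sketch covers only the coding step, not the X-chromosome-specific covariances.

```python
def code_x_genotype(n_alleles, is_male):
    """Code an X-linked genotype so that males are treated as homozygous females.

    n_alleles: count of the reference allele (0 or 1 for males, 0-2 for females).
    Returns 0 or 2 for males and 0, 1, or 2 for females.
    """
    if is_male:
        if n_alleles not in (0, 1):
            raise ValueError("a male carries 0 or 1 copies of an X-linked allele")
        return 2 * n_alleles
    if n_alleles not in (0, 1, 2):
        raise ValueError("a female carries 0, 1, or 2 copies")
    return n_alleles
```

With this coding a hemizygous male carrier receives the same score as a homozygous female carrier, so both sexes contribute on the same scale to a genotypic trend test.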
The extended MQLS methods are promising. However, they may not be computationally feasible for family-based genome-wide association studies. We recommend that these tests be used in a two-stage approach.

Conclusion
Analyzing family data using all information available in a case-control association study may improve efficiency.
BMC Proceedings 2009, 3(Suppl 7):S123 http://www.biomedcentral.com/1753-6561/3/S7/S123
To gain efficiency, we also considered extensions of the MQLS test. These methods use most of the available family information and might therefore be more efficient than the others. Using these methods, we analyzed the real FHS data. The new methods performed well compared with the GEE-based methods.
Adding family information seemed to improve the results. Although only a small number of individuals (n = 323) was added, the proportion of cases among them (20%) was relatively large compared with that in the sibling-only data (6%). In addition, the gMQLS test might be more efficient because it incorporates all available phenotypic information, including CHD cases among ungenotyped parents.
For X-linked SNPs, similar results were obtained: the xMQLS test outperformed the GEE-based methods on these specific data. Further work should be done to evaluate the new methods.