Family-based association test using normal approximation to gene dropping null distribution

We derive the analytical mean and variance of the score test statistic in gene-dropping simulations and approximate the null distribution of the test statistic by a normal distribution. We provide insights into the gene-dropping test by decomposing the test statistic into two components: the first component provides information about linkage, and the second component provides information about fine mapping under the linkage peak. We demonstrate our theoretical findings by applying the gene-dropping test to the simulated data set from Genetic Analysis Workshop 18 and comparing its performance with existing population and family-based association tests.


Background
When testing genotype-phenotype association using individuals from extended families, one has to account for correlations in genotypes and/or phenotypes between related individuals. One simple and effective method to account for genotype correlations is to simulate the null genotype distribution by gene dropping [1], that is, by simulating founder alleles according to estimated allele frequencies and dropping these alleles down the pedigrees according to random segregation of gametes (i.e., Mendel's first law). The gene-dropping method is straightforward to implement (e.g., as implemented by Allen-Brady et al [2]) and applies to all pedigree structures, but it is computationally intensive and thus impractical when dealing with millions of single-nucleotide polymorphisms (SNPs).
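As an illustration of the gene-dropping procedure described above, the following is a minimal sketch for a single SNP. The pedigree encoding (a list of parent index pairs with `None` for founders), the function name `gene_drop`, and the example family are assumptions made for this sketch; they are not taken from any particular software package.

```python
import numpy as np

def gene_drop(parents, maf, rng):
    """Simulate genotype doses at one SNP by gene dropping.

    parents: list of (father, mother) index pairs, (None, None) for founders;
             individuals must be ordered so that parents precede children.
    maf:     minor allele frequency used to draw founder alleles.
    Returns an integer array of minor-allele counts (0, 1, or 2).
    """
    n = len(parents)
    alleles = np.zeros((n, 2), dtype=int)  # column 0 paternal, column 1 maternal
    for i, (fa, mo) in enumerate(parents):
        if fa is None:
            # Founder: draw both alleles independently from the population.
            alleles[i] = rng.random(2) < maf
        else:
            # Non-founder: each parent transmits one of its two alleles at random.
            alleles[i, 0] = alleles[fa, rng.integers(2)]
            alleles[i, 1] = alleles[mo, rng.integers(2)]
    return alleles.sum(axis=1)

# Hypothetical pedigree: two founders (0, 1) and their two children (2, 3).
parents = [(None, None), (None, None), (0, 1), (0, 1)]
rng = np.random.default_rng(0)
doses = np.array([gene_drop(parents, 0.3, rng) for _ in range(20000)])
```

Repeating the draw many times, as in the last line, gives an empirical null distribution of genotype doses; this repetition over every SNP is the source of the computational cost noted above.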
In this article, we derive the analytical mean and variance of the score test statistic under the gene-dropping setting and approximate the gene-dropping null distribution of the test statistic by a normal distribution with the analytically derived mean and variance. Using this normal approximation, the gene-dropping test becomes computationally efficient and can be easily applied to millions of SNPs.
Furthermore, we provide insights into the gene-dropping test by decomposing the test statistic into two components: the first component resembles a quantity frequently used in variance-component based linkage tests and provides information for linkage, and the second component provides information for fine mapping under the linkage peak. Rabinowitz and Laird [3], among others, have pointed out the subtle distinction between two types of null hypotheses in family-based association analysis: the null hypothesis of no linkage and no association versus the null hypothesis of no association in the presence of linkage. To test the latter, one needs to condition on the inheritance vector $S_\tau$ at the test locus [3]. Our decomposition provides an explicit separation of linkage and association information in a family-based study.
We compare the performance of the gene-dropping test (using normal approximation) to association tests using only unrelated individuals and to the family-based association test in the software program FBAT [3] by analyzing the Genetic Analysis Workshop 18 (GAW18) simulated data set.

Preprocessing of genotype data
We analyzed SNPs from chromosome 3 only. At each SNP, we performed Pearson's chi-squared test for Hardy-Weinberg equilibrium using 142 unrelated individuals. We excluded SNPs that yielded a p-value smaller than $10^{-4}$ from our analysis. In the gene-dropping test, we also excluded SNPs with estimated minor allele frequency (MAF) smaller than 0.001.
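The Hardy-Weinberg filter can be sketched as follows. This is a generic Pearson chi-squared test with 1 degree of freedom computed from genotype counts; the function name and the handling of monomorphic SNPs are assumptions of this sketch, not a description of the exact software used.

```python
import math

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Pearson chi-squared test of Hardy-Weinberg equilibrium (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is not informative (assumed convention)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared variable with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chisq / 2.0))

# A SNP would be excluded when the p-value falls below 1e-4.
keep_snp = hwe_chisq_pvalue(70, 60, 12) >= 1e-4
```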

Preprocessing of phenotype data
We focused on the analysis of the quantitative trait systolic blood pressure (SBP) in simulated data set 1. The true simulation model was known to us [4]. When testing association between genotype doses and trait values (see later discussion), we included AGE, SEX, and the AGE by SEX interaction as covariates (the $Z_k$ in equation (1)). Including BPMED as a covariate would overcompensate because BPMED is a consequence of SBP level. Instead, we estimated the effect of BPMED from a regression model fitted only to individuals with hypertension. Because BPMED was randomly assigned to individuals with hypertension, the BPMED effect estimated this way will not be biased by its correlation with SBP. We then adjusted the trait values $Y$ by subtracting the estimated BPMED effect.
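The medication adjustment described above can be sketched as follows. The text does not state which other terms were included in the regression among hypertensive individuals, so the `covariates` argument, the function name, and the variable names here are assumptions for illustration only.

```python
import numpy as np

def adjust_for_medication(sbp, bpmed, hypertensive, covariates):
    """Subtract an estimated medication effect from SBP.

    The effect is estimated by least squares among hypertensive individuals
    only; including `covariates` (n x K) in that regression is an assumption.
    Returns (adjusted_sbp, estimated_effect).
    """
    h = np.asarray(hypertensive, dtype=bool)
    design = np.column_stack([np.ones(h.sum()), bpmed[h], covariates[h]])
    coef, *_ = np.linalg.lstsq(design, sbp[h], rcond=None)
    med_effect = coef[1]                      # coefficient of BPMED
    # Remove the medication effect from everyone (zero for untreated individuals).
    return sbp - med_effect * bpmed, med_effect
```

Restricting the fit to hypertensive individuals is what avoids the bias discussed above: within that group, treatment is unrelated to the underlying SBP level.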

Score tests of genotype-phenotype association using unrelated individuals
At locus $\tau$, we consider the quantitative trait model
$$Y = \mu + \sum_{k=1}^{K} \alpha_k Z_k + \beta_\tau X_\tau + \varepsilon \qquad (1)$$
and test the null hypothesis $\beta_\tau = 0$. In equation (1), $Y$ is the vector of trait values (SBP adjusted for the BPMED effect), $\mu$ is a constant vector of baseline mean trait values, the coefficients $\alpha_k$ represent the effects of the covariates $Z_k$, $k = 1, \ldots, K$ (e.g., AGE, SEX, and AGE by SEX interaction) on trait values, $X_\tau$ is the vector of genotype doses (the number of minor alleles possessed by each individual) at locus $\tau$, the coefficient $\beta_\tau$ represents the effect size of a single allele, and $\varepsilon$ is the vector of residual errors. The fitted value of $\beta_\tau$ will reflect the collective effect of all causal SNPs that are in linkage disequilibrium (LD) with the test SNP $\tau$ [5].
Let $\hat{Y}$ and $\hat{X}_\tau$ be the vectors of fitted values after regressing $Y$ and $X_\tau$ on the measured covariates $Z_k$. The score statistic [6,7] for testing genotype-trait association is
$$u = X_\tau'(Y - \hat{Y}) = X_\tau' R,$$
where $R = Y - \hat{Y}$ is the vector of trait residuals, with estimated variance
$$v = s_{YY}\left[X_\tau' X_\tau - X_\tau' Z (Z'Z)^{-1} Z' X_\tau\right], \qquad (2)$$
where $Z = (\mathbf{1}, Z_1, \ldots, Z_K)$ and $s_{YY}$ is the sample variance of the residual trait values ($\mathbf{1}$ is a vector of ones) [6]. To test association, $u^2/v$ is compared with a $\chi^2_1$ distribution.
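A minimal sketch of this score test for unrelated individuals is given below. It computes $u = X_\tau' R$ and a variance of the form $s_{YY}$ times the residual sum of squares of $X_\tau$ after regression on $Z$. The function name and the degrees-of-freedom convention used for $s_{YY}$ are assumptions of the sketch.

```python
import math
import numpy as np

def score_test_unrelated(y, x, covariates):
    """Score test of beta_tau = 0 using unrelated individuals.

    y: trait values; x: genotype doses; covariates: n x K matrix (no intercept).
    Returns (u, v, p_value), where u**2 / v is referred to chi-squared (1 df).
    """
    n = len(y)
    Z = np.column_stack([np.ones(n), covariates])   # Z = (1, Z_1, ..., Z_K)
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # projection onto the columns of Z
    r = y - H @ y                                   # trait residuals R = Y - Y_hat
    x_res = x - H @ x                               # genotype residuals X - X_hat
    u = x @ r                                       # score statistic u = X'R
    s_yy = r @ r / (n - Z.shape[1])                 # residual variance (df choice is an assumption)
    v = s_yy * (x_res @ x_res)                      # equals s_YY [X'X - X'Z(Z'Z)^{-1}Z'X]
    p = math.erfc(math.sqrt(u * u / (2.0 * v)))     # chi-squared(1) upper tail
    return u, v, p
```

Because the residuals `r` are orthogonal to the columns of `Z`, `x @ r` and `x_res @ r` give the same value of `u`.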

Family-based association test by gene dropping
When related individuals are used to compute the score test statistic $u = X_\tau' R$, components of $X_\tau$ can be dependent, and the variance estimator (2) is no longer valid. One can account for correlations between components of $X_\tau$ by simulating the null distribution of $X_\tau$ using gene dropping. We now derive the analytical mean and variance of $u$ under the gene-dropping setting. In the score test using unrelated individuals, we treat $R$ as random, and $X_\tau$ can be viewed as either random or fixed. In a gene-dropping simulation, $R$ is held fixed, and $X_\tau$ is random.
Write the genotype dose of individual $i$ as $X_i = P_i^{(1)} + P_i^{(2)}$, where each allele indicator $P_i^{(a)}$ equals 1 if the corresponding allele is the minor allele and 0 otherwise. So $E(X_i)$ is twice the MAF $f_\tau$ at SNP $\tau$ and is the same for all individuals, and thus $E(u) = 2 f_\tau \sum_i r_i = 0$, because the residuals $r_i$ (the components of $R$) sum to zero. The allele indicators are all Bernoulli random variables with probability $f_\tau$, and any two of them are identical if the corresponding alleles are identical by descent (IBD) and are independent otherwise [8]. Let $\phi_{ij}$ be the number of IBD pairs among the four pairs of alleles $P_i P_j$ formed by taking one allele from individual $i$ and one from individual $j$. The value of $\phi_{ij}$ at locus $\tau$ is determined by the inheritance vector $S_\tau$, which summarizes whether the paternal or the maternal allele is passed from the parent to the child in each meiosis [9]. Given the inheritance vector $S_\tau$,
$$E(X_i X_j \mid S_\tau) = \phi_{ij}(S_\tau) f_\tau + \left[4 - \phi_{ij}(S_\tau)\right] f_\tau^2$$
(e.g., $E(P_i P_j) = E(P_i^2) = f_\tau$ if $P_i$ and $P_j$ correspond to IBD alleles and $E(P_i P_j) = E(P_i) E(P_j) = f_\tau^2$ if $P_i$ and $P_j$ correspond to non-IBD alleles). In a gene-dropping simulation, the inheritance vector $S_\tau$ is randomly sampled among all possible inheritance vectors. The expected number of IBD allele pairs shared between $i$ and $j$, $E(\phi_{ij}(S_\tau))$, over all possible inheritance vectors is four times the kinship coefficient $\psi_{ij}$. The kinship coefficients are determined by pedigree structures. The expected value of $X_i X_j$ in a gene-dropping simulation is thus
$$E(X_i X_j) = 4\psi_{ij} f_\tau + 4(1 - \psi_{ij}) f_\tau^2.$$
Letting $\Phi(S_\tau) = (\phi_{ij})$ be the matrix of IBD counts and $\Psi = (\psi_{ij})$ be the matrix of kinship coefficients, we can rewrite the above as:
$$E(X_\tau X_\tau' \mid S_\tau) = (f_\tau - f_\tau^2)\,\Phi(S_\tau) + 4 f_\tau^2 J, \qquad E(X_\tau X_\tau') = 4 (f_\tau - f_\tau^2)\,\Psi + 4 f_\tau^2 J,$$
where $J$ is a matrix of all ones. Because $R'JR = 0$ for residuals from a linear regression model with an intercept, the variance of $u$ under gene dropping is
$$v_{gd} = 4 R' \Psi R \,(f_\tau - f_\tau^2)$$
if unconditional on the inheritance vector, and
$$v_\tau = R' \Phi(S_\tau) R \,(f_\tau - f_\tau^2)$$
if conditional on the inheritance vector $S_\tau$ (holding $S_\tau$ fixed). We can approximate the gene-dropping null distribution of $u$ by a normal distribution with mean 0 and variance $v_{gd}$, and compute the gene-dropping p-value by comparing $t = u^2/v_{gd}$ with a $\chi^2_1$ distribution. To test association in the presence of linkage, one needs to condition on the inheritance vector $S_\tau$ at $\tau$ [3] and use $v_\tau$.
In practice, $S_\tau$ is not observable, but we estimate $v_\tau$ by drawing Markov chain Monte Carlo (MCMC) samples of $S_\tau$ based on observed genotypes in the pedigrees using MORGAN (http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml) [10].
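The unconditional test can be sketched in a few lines once the kinship matrix is available. The sketch below computes kinship coefficients with the standard recursion over a pedigree and then evaluates the variance $4 R' \Psi R (f_\tau - f_\tau^2)$ and the corresponding $\chi^2_1$ p-value. The pedigree encoding and function names are assumptions of this sketch, and founders are assumed to be unrelated and non-inbred.

```python
import math
import numpy as np

def kinship_matrix(parents):
    """Kinship coefficients psi_ij by the standard recursion.

    parents: list of (father, mother) index pairs, (None, None) for founders,
    ordered so that parents precede their children.
    """
    n = len(parents)
    psi = np.zeros((n, n))
    for i, (fa, mo) in enumerate(parents):
        if fa is None:
            psi[i, i] = 0.5                       # non-inbred founder
        else:
            for j in range(i):                    # j is never a descendant of i
                psi[i, j] = psi[j, i] = 0.5 * (psi[fa, j] + psi[mo, j])
            psi[i, i] = 0.5 * (1.0 + psi[fa, mo])
    return psi

def gene_dropping_test(x, r, psi, maf):
    """Normal-approximation gene-dropping test: returns (u, v_gd, p_value).

    r must be residuals from a regression with an intercept, so that sum(r) = 0.
    """
    u = x @ r
    v_gd = 4.0 * (r @ psi @ r) * maf * (1.0 - maf)
    p = math.erfc(math.sqrt(u * u / (2.0 * v_gd)))   # chi-squared(1) upper tail
    return u, v_gd, p
```

For several pedigrees, listing all individuals in one `parents` list yields a block-diagonal kinship matrix, so the quadratic form is additive over pedigrees. The quantity `r @ psi @ r` does not depend on the SNP and can be computed once, leaving one inner product per SNP.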

Theoretical findings
In a gene-dropping simulation, the analytical mean of the score statistic $u = X_\tau' R$ is 0. The variance of the statistic is $R' \Phi(S_\tau) R \,(f_\tau - f_\tau^2)$ if conditional on the inheritance vector (i.e., holding the inheritance vector fixed during gene-dropping simulation) and is $4 R' \Psi R \,(f_\tau - f_\tau^2)$ if unconditional on the inheritance vector, where $\Phi(S_\tau)$ is the matrix of IBD counts and $\Psi$ is the matrix of kinship coefficients. The normal approximation is justified by the central limit theorem because the test statistic is additive over pedigrees. Its performance depends on the number, sizes, and structure of pedigrees and on the MAF at the test locus. The approximation may not be accurate for extremely small p-values. However, the rankings of the p-values will not change.
We can decompose the unconditional gene-dropping test statistic into two components:
$$t = \frac{u^2}{v_{gd}} = \frac{u^2}{R' \Phi(S_\tau) R \,(f_\tau - f_\tau^2)} \times \frac{R' \Phi(S_\tau) R}{4 R' \Psi R}.$$
The first component can be used as a test statistic for detecting association in the presence of linkage (i.e., fine mapping under a linkage peak) because the denominator is the variance of $u$ conditional upon the observed IBD sharing. The second component provides information about linkage. The kinship coefficients in $\Psi$ are determined by pedigree structure, so $R' \Psi R$ is a constant in a gene-dropping simulation. $R' \Phi(S_\tau) R = \sum_{ij} r_i r_j \phi_{ij}(S_\tau)$ measures the correlation between trait value similarity ($r_i r_j$) and IBD sharing ($\phi_{ij}$) at locus $\tau$ across all pairs of individuals in a pedigree. This correlation is expected to be stronger if there is stronger linkage between $\tau$ and a true causal locus. Therefore, $R' \Phi(S_\tau) R$ can be used as a test statistic to detect linkage, with null distribution obtained by gene-dropping simulations. In a gene-dropping simulation, the inheritance vectors are simulated as if they were from a marker unlinked to any potential causal loci. $R' \Phi(S_\tau) R$ resembles similar quantities that are frequently used in linkage analysis methods such as the well-known Haseman-Elston regression [11] as well as many variance components or generalized estimating equation-based methods [12].
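The decomposition can be illustrated numerically. In the sketch below, an inheritance pattern is represented by founder-allele labels for each individual's two alleles (two alleles are IBD when they carry the same label), the IBD count matrix is built from those labels, and the statistic is split into its association and linkage components. The label representation and function names are assumptions made for this sketch; in practice the inheritance vector is not observed and would be sampled, as described earlier.

```python
import numpy as np

def ibd_count_matrix(labels):
    """IBD count matrix from an n x 2 array of founder-allele labels.

    phi_ij counts how many of the four allele pairs (one allele from i,
    one allele from j) carry the same founder label.
    """
    labels = np.asarray(labels)
    same = labels[:, None, :, None] == labels[None, :, None, :]  # n x n x 2 x 2
    return same.sum(axis=(2, 3))

def decompose(u, r, phi, psi, maf):
    """Split t = u^2 / v_gd into (association, linkage) components."""
    het = maf * (1.0 - maf)                  # f - f^2
    q_ibd = r @ phi @ r                      # R' Phi(S) R
    q_kin = r @ psi @ r                      # R' Psi R
    association = u * u / (q_ibd * het)      # u^2 / v_tau
    linkage = q_ibd / (4.0 * q_kin)          # R' Phi(S) R / (4 R' Psi R)
    return association, linkage
```

The product of the two returned components equals $u^2$ divided by $4 R' \Psi R (f_\tau - f_\tau^2)$, so the split changes only how the information is reported, not the unconditional statistic itself.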

Simulation results
We performed a genome-wide association study (GWAS) score test using 142 unrelated individuals, the family-based association test using FBAT [3], and the gene-dropping test on SNPs on chromosome 3 (FBAT and the gene-dropping test used 847 individuals from 20 pedigrees). Table 1 summarizes the p-value ranks that each test assigns to the true causal SNPs. The gene-dropping test for fine mapping (conditional on the inheritance vector) performs very similarly to the unconditional gene-dropping test, so its results are omitted. The gene-dropping tests identify a few true causal SNPs within a short list of top findings. However, if we allow more false positives by considering a greater number of the most significant SNPs, other methods start to pick up true causal SNPs and eventually have a result similar to gene dropping. Figure 1 shows the physical positions and negative log p-values of the top 500 SNPs identified by each of the three tests, as well as the negative log p-values of the linkage test based on the linkage component of the gene-dropping test statistic.
We also examined adjusting for population stratification by fitting the first two principal components of genetic variation [13] as covariates in the regression model (1). The p-values resulting from this expanded model differed negligibly from the original model. The ranks in Table 1 were essentially unchanged by this adjustment.

Discussion
Comparison between genome-wide association studies, FBAT and gene-dropping test
FBAT splits each pedigree into nuclear families. In each nuclear family, FBAT uses information from the offspring while conditioning on the parental marker genotypes. In contrast, GWAS uses information in unrelated individuals. The two methods use almost "orthogonal" sources of information. There is almost no correlation between the log p-values from these two methods (Table 2). The gene-dropping test, by comparison, applies to multigeneration pedigrees and uses information from all individuals: it extracts information from founders by resimulating founder genotypes and from offspring by resimulating inheritance vectors. It is also possible to derive the analytical mean and variance of the test statistic in the gene-dropping test where we permute the founder alleles rather than resimulate the founder alleles. FBAT is more robust to population stratification by conditioning on founder genotypes. The gene-dropping test can gain similar robustness by restricting permutations to founder alleles within each family.
Table 1 note: Rank is the raw rank, in terms of p-value significance, of each truly influential single-nucleotide polymorphism (SNP); smaller numbers are better, indicating that the method identifies a true SNP as more significant. The fractional ranks appearing in the gene-dropping column arise from ties: two SNPs being assigned exactly the same p-value. Note that it is not completely fair to compare these numbers directly because FBAT and genome-wide association studies (GWAS) produce not available (NA) results for a significant portion of the tested SNPs. Relative rank is the normalized rank of each truly influential SNP: the p-value rank divided by the total number of non-NA SNPs tested, multiplied by 100. SNP position is the base-pair position of the identified truly influential SNP.
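The rank and relative rank definitions used for Table 1 can be sketched as follows, with tied p-values sharing their average rank and NA results (represented here as NaN) excluded from the count. The function name and the NaN convention are assumptions of this sketch.

```python
import numpy as np

def pvalue_ranks(pvalues):
    """Average ranks (1 = most significant) with ties, ignoring NaN p-values.

    Returns (rank, relative_rank); both are NaN where the p-value is missing.
    relative_rank = rank / (number of non-NaN p-values) * 100.
    """
    p = np.asarray(pvalues, dtype=float)
    ok = ~np.isnan(p)
    vals = p[ok]
    order = np.argsort(vals, kind="stable")
    ranks = np.empty(len(vals))
    ranks[order] = np.arange(1, len(vals) + 1)
    for v in np.unique(vals):            # tied p-values share their mean rank
        tied = vals == v
        ranks[tied] = ranks[tied].mean()
    out = np.full(p.shape, np.nan)
    out[ok] = ranks
    return out, 100.0 * out / ok.sum()
```

The loop over unique values keeps the sketch short; for millions of SNPs a vectorized tie-handling routine would be preferable.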
It is somewhat surprising that the gene-dropping test did not outperform GWAS given that it uses more individuals. One possible interpretation is that the effect of LD is stronger when more individuals are used. As we can see in Figure 1, the signals detected by the gene-dropping test come in bigger clusters. In other words, many SNPs ranked high by the gene-dropping test might be in LD with one or more of the causal SNPs.

Separating linkage and association signals
The gene-dropping test captures both linkage and association signals. One can decompose the test statistic into a linkage component and an association component.
The association component corresponds to testing association in the presence of linkage, which requires one to condition on the true inheritance vector at the test locus. Our MCMC-based results show that conditioning on the inheritance vector makes little difference for this data set, because the variance of the test statistic with conditioning differs only slightly from the variance without conditioning. This conclusion might depend on the structure of the pedigrees.
The linkage component, however, clearly provides valuable information. The linkage signal is stronger in most regions containing causal SNPs. The linkage curve can help eliminate many of the false association signals in this study. It would be interesting to investigate how to use the linkage information more effectively in the future.