Multipoint association mapping for longitudinal family data: an application to hypertension phenotypes

It is essential to develop adequate statistical methods to fully utilize information from longitudinal family studies. We extend our previous multipoint linkage disequilibrium approach—simultaneously accounting for correlations between markers and repeat measurements within subjects, and the correlations between subjects in families—to detect loci relevant to disease through gene-based analysis. Estimates of disease loci and their genetic effects along with their 95 % confidence intervals (or significance levels) are reported. Four different phenotypes—ever having hypertension at 4 visits, incidence of hypertension, hypertension status at baseline only, and hypertension status at 4 visits—are studied using the proposed approach. The efficiency of estimates of disease locus positions (inverse of standard error) improves when using the phenotypes from 4 visits rather than using baseline only.


Background
Approaches for analyzing longitudinal family data have been categorized into 2 groups [1]: (a) first summarizing repeated measurements into 1 statistic (e.g., a mean or slope per subject) and then using the summarized statistic as a standard outcome for genetic analysis; or (b) simultaneously modeling genetic and longitudinal parameters. In general, joint modeling is appealing because (a) all parameter estimates are mutually adjusted, and (b) within- and between-individual variability at the levels of gene markers, repeated measurements, and family characteristics is correctly accounted for [1].
The semiparametric linkage disequilibrium mapping for the hybrid family design we developed previously [2] uses all markers simultaneously to localize the disease locus without making an assumption about genetic mechanism, except that only 1 disease gene lies in the region under study. The advantages of this approach are (a) it does not require the specification of an underlying genetic model, so estimating the position of a disease locus and its standard error is robust to a wide variety of genetic mechanisms; (b) it provides estimates of disease locus positions, along with a confidence interval for further fine mapping; and (c) it uses linkage disequilibrium between markers to localize the disease locus, which may not have been typed. We extended this approach to map susceptibility genes using longitudinal nuclear family data with an application to hypertension. Four different outcomes were used based on the proposed method: (I) ever having hypertension ("Ever"), (II) incidence event with status changed from unaffected to affected ("Progression"), (III) first available visit as baseline only ("Baseline"), and (IV) all available time points ("Longitudinal"). We compared the estimates of the disease locus positions, their standard errors, the genetic effect estimate at the disease loci, and their significance for the 4 phenotypes to examine the efficiency gained from using repeated longitudinal phenotypes.

Genome-wide genotypes and phenotype data
Association mapping was conducted on chromosome 3 of the genome-wide association study (GWAS) data. A total of 65,519 single-nucleotide polymorphisms (SNPs) in 1095 genes were genotyped on chromosome 3 for 959 individuals from 20 original pedigrees in Genetic Analysis Workshop 19 (GAW19). Among these individuals, the numbers of affected offspring were 178 (38 %) of 469 for phenotype (I) "Ever"; 130 (31 %) of 421 for phenotype (II) "Progression"; 64 (11 %) of 600 for phenotype (III) "Baseline"; and, for phenotype (IV) "Longitudinal," from 60 (11 %) of 565 to approximately 85 (45 %) of 189 across the 4 visits, or 87 (21.63 %) of 402 on average (Table 1). To compare phenotypes (I) and (II), only individuals with at least 2 measurements were included in the "Ever" phenotype. PedCut [3] was used to split large pedigrees with more than 20 members into nuclear pedigrees. Consequently, we analyzed a total of 138 pedigrees with 1,495 individuals (the IDs for missing parents were added to form trios). In divided pedigrees, the nuclear families contained between 3 and 25 individuals. Five SNPs were removed because they failed the test of Hardy-Weinberg equilibrium (HWE) ($P < 10^{-4}$). The HWE test was performed using PLINK 1.07 [4] based on 56 unrelated subjects. (For information on PLINK, see http://pngu.mgh.harvard.edu/purcell/plink/.) A total of 22,056 genotypes from various SNPs with genotyping errors (a genotyping error rate of about $3.51 \times 10^{-4}$) were further excluded using the MERLIN 1.1.2 computing package (see http://www.sph.umich.edu/csg/abecasis/merlin/tour/linkage.html). No covariates were adjusted for in this approach.
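As a rough illustration of the HWE filter described above, the following sketch applies a 1-degree-of-freedom chi-square goodness-of-fit test to the genotype counts of a single SNP and flags it when $P < 10^{-4}$. This is an assumption made for illustration only: the chi-square approximation is not necessarily the test that PLINK applied in this analysis, and the function names and genotype counts are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Illustrative approximation only; not necessarily the test used by PLINK here.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: no departure can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))


def fails_hwe(counts, threshold: float = 1e-4) -> bool:
    """True if the SNP would be removed at the given P-value threshold."""
    return hwe_chi2_pvalue(*counts) < threshold


# Hypothetical genotype counts for 56 unrelated subjects
print(fails_hwe((14, 28, 14)))  # counts in exact HWE proportions
print(fails_hwe((28, 0, 28)))   # no heterozygotes at all
```

With only 56 unrelated subjects, an exact test would usually be preferable to the chi-square approximation for rare alleles; the sketch is meant only to make the filtering rule concrete.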

Multipoint linkage disequilibrium mapping
Suppose $M$ markers were genotyped in the region $R$ at locations $0 \le t_1 < t_2 < \dots < t_M \le T$. We assume there are 2 alleles per marker. With $H(t)$ being the target allele at marker position $t$, and $h(t)$ being the nontarget allele, we define $D_{k_{il}}$ for the affected offspring and, correspondingly, for the unaffected offspring. Then, we define the preferential transmission statistics $X_{Tk_{il}j}$ for the paternal side and $Y_{Tk_{il}j}$ for the maternal side of an affected trio; similarly, the preferential transmission statistics $X_{Uk_{il}j}$ and $Y_{Uk_{il}j}$ are defined for an unaffected trio for both parental sides, respectively, where $j = 1, \dots, M$ indexes the markers, $k_{il} = 1, \dots, N_{1il}$ (for affected) or $k_{il} = 1, \dots, N_{2il}$ (for unaffected), $N_{1il}$ ($N_{2il}$) is the number of affected (unaffected) offspring in family $i$ at the $l$th time point, $i = 1, \dots, n$, and $l = 1, \dots, L$ ($L = 1$ or 4 in this study).
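To make the idea of a preferential transmission statistic concrete, the sketch below scores one parent-offspring transmission at one marker and sums the paternal and maternal contributions for a trio. The coding used here (+1 if only the transmitted allele is the target allele, −1 if only the nontransmitted allele is, 0 otherwise) is an assumption in the spirit of transmission/disequilibrium statistics; the exact definitions used in this approach are given in [2], and all function names are hypothetical.

```python
def transmission_score(transmitted: str, nontransmitted: str, target: str) -> int:
    """Score one parent at one marker (assumed coding, see lead-in).

    +1 if only the transmitted allele is the target allele,
    -1 if only the nontransmitted allele is the target allele,
     0 if both or neither are the target allele.
    """
    return int(transmitted == target) - int(nontransmitted == target)


def trio_statistic(paternal, maternal, target: str) -> int:
    """Sum of the paternal (X) and maternal (Y) scores for one offspring.

    paternal, maternal: (transmitted allele, nontransmitted allele) pairs.
    """
    x = transmission_score(paternal[0], paternal[1], target)
    y = transmission_score(maternal[0], maternal[1], target)
    return x + y


# Hypothetical trio: both parents are heterozygous and transmit the target allele H
print(trio_statistic(("H", "h"), ("H", "h"), target="H"))
```

Under this coding, a homozygous parent always contributes 0, so only heterozygous parents are informative, as in other transmission-based tests.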
The expectation of the statistic is expressed in terms of the following quantities: $\theta_{t_j,\tau}$ is the recombination fraction between marker position $t_j$ and disease locus position $\tau$; the recombination fraction $\theta$ is a parametric function of the parameter of primary interest ($\tau$, the physical position of the functional variant); $N$ is the number of generations since the initiation of the disease variant; $\Phi_1$ denotes the event that the offspring is affected; $\Phi_2$ represents the event that the offspring is unaffected; $\delta = (\tau, N, C, C^*)$ is the vector of parameters; and $\pi_j = \Pr[h(t_j) \mid h(\tau)]$. $\mu_{1k_{il}j}$ is the probability for an affected offspring to receive a target allele, and $-\mu_{2k_{il}j}$ is the probability for an unaffected offspring to receive a target allele. The statistics $Z_{1k_{il}j} = X_{Tk_{il}j} + Y_{Tk_{il}j}$ and $Z_{2k_{il}j} = X_{Uk_{il}j} + Y_{Uk_{il}j}$ were used to estimate the parameters. The estimating equations for the parameters $\delta$ involve $\hat{\pi}_j$, the average of nontransmitted parental alleles in the sample. The estimating equations were solved iteratively for the parameters $\tau$, $N$, $C$, and $C^*$, where $\tau$ and $C$ are the 2 parameters of interest. The variance of the disease locus position estimate was estimated to make inferences about the disease locus position ($\tau$) and its genetic effect ($C$) [2]. Theoretically, the genetic effect of $\tau$, characterized by $C$, is the transmission probability that the affected offspring will carry the disease allele, $H$, at $\tau$. Detailed derivations for case-parent trios in a cross-sectional design can be found in Chiu et al. [2,5]. We will present the details of this proposed methodology elsewhere.
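The role of the parameters above can be illustrated with a small numerical sketch. It assumes that the expected statistic at marker $j$ takes the form $C (1 - \theta_{t_j,\tau})^N \pi_j$, a form common in the multipoint linkage disequilibrium mapping literature, and that $\theta$ follows Haldane's map function. Both are assumptions made for illustration; the exact expressions used in this approach are those derived in [2,5]. The function names and numerical values are hypothetical.

```python
import math


def haldane_theta(t_cm: float, tau_cm: float) -> float:
    """Recombination fraction between positions t and tau (in cM).

    Haldane's map function is an assumption made for this sketch.
    """
    d_morgans = abs(t_cm - tau_cm) / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def expected_statistic(t_cm: float, tau_cm: float, n_gen: float,
                       c: float, pi_j: float) -> float:
    """Assumed mean of the transmission statistic at marker position t.

    c     : genetic effect at the disease locus (C, or C* for unaffected offspring)
    n_gen : number of generations since the disease variant arose (N)
    pi_j  : Pr[h(t_j) | h(tau)]
    """
    theta = haldane_theta(t_cm, tau_cm)
    return c * (1.0 - theta) ** n_gen * pi_j


# Hypothetical values: the expected signal decays as the marker moves away from tau
for t in (10.0, 10.5, 11.0, 12.0):
    print(t, expected_statistic(t, tau_cm=10.0, n_gen=50, c=0.3, pi_j=0.8))
```

Under these assumptions the expected statistic peaks at the disease locus and decays with distance, which is what allows all markers in the region to contribute jointly to the estimate of $\tau$.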
Gene-based association mapping was conducted for all SNPs on chromosome 3. This approach accounts for correlations between markers and repeated phenotypes within subjects, and correlations between subjects per family. The consistent estimates of hypertension locus position using "Ever" and "Progression" are shown in Table 2 and Fig. 1, while the consistent estimates of hypertension locus position using baseline and longitudinal data (at all 4 visits) are listed in Table 3 and Fig. 2.

Results and discussion
A total of 119 (11 %), 79 (7 %), 49 (4 %), and 42 (4 %) of 1095 genes had a significant genetic effect ($P < 4.57 \times 10^{-5}$ with Bonferroni correction) based on hypertension status for "Ever," "Progression," "Baseline," and "Longitudinal" (4 visits), respectively. Only 3 of the genes significantly associated ($P \le 0.05$) with the baseline and longitudinal phenotypes overlap with the genes significantly associated with the "Ever" and "Progression" outcomes: FETUB, IL1RAP, and C3orf21. Several hits identified here have been reported in previous linkage or GWAS studies. Table 2 shows genes with a significant genetic effect ($P < 4.57 \times 10^{-5}$). Table 3 presents the genes that are significant at a significance level of 0.05. Among them, only 1 gene, GRM7, is significant at the level of $P < 4.57 \times 10^{-5}$.

Table 2 Significant and consistent estimates of disease locus positions and their genetic effects using "Ever" and "Progression" phenotypes
$\hat{C}$, the genetic effect estimate; G, previous GWAS hits; L, previous linkage hits; $\hat{\tau}$, the disease locus position estimate in cM
*Because of space limitations, we list only the genes with consistent estimates of the disease locus position for the 2 phenotypes (the difference between the 2 $\hat{\tau}$ is less than $10^{-2}$ cM) and significant estimates of the genetic effects (both with $P < 4.57 \times 10^{-5}$, Bonferroni)

Fig. 1 Length of 95 % confidence intervals (CIs) for the estimate of the disease locus position for "Ever" and "Progression" phenotypes

Table 3 Significant and consistent estimates of disease locus positions and their genetic effects using "Baseline" and "Longitudinal" phenotypes
The gene is significant with the Bonferroni correction ($P < 4.57 \times 10^{-5}$) and its P values are $2.31 \times 10^{-6}$ and 0.00044 for "Ever" and "Progression," respectively
‡ The same genes for the "Ever" and "Progression" phenotypes had P values < 0.05 but > $4.57 \times 10^{-5}$ for the genetic effect estimate
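The gene-level significance threshold and the percentages quoted in the text can be reproduced with a few lines of arithmetic. The sketch below assumes that the Bonferroni correction divides a family-wise level of 0.05 by the 1095 genes tested, which matches the reported threshold; the variable names are illustrative.

```python
N_GENES = 1095  # genes tested on chromosome 3
ALPHA = 0.05    # family-wise significance level (assumed)

# Bonferroni-corrected per-gene threshold
threshold = ALPHA / N_GENES

# Numbers of significant genes reported for each phenotype
significant = {"Ever": 119, "Progression": 79, "Baseline": 49, "Longitudinal": 42}
percent = {name: round(100 * count / N_GENES) for name, count in significant.items()}

print(f"{threshold:.2e}")  # 4.57e-05
print(percent)  # {'Ever': 11, 'Progression': 7, 'Baseline': 4, 'Longitudinal': 4}
```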
Figures 1 and 2 display the 95 % confidence intervals for the estimate of the hypertension locus position for the 4 phenotypes, centered at the estimated disease locus position. The comparison is shown for the genes listed in Tables 2 and 3. The standard errors of the disease locus position estimates are smaller for 64 % of the genes when based on longitudinal data than when based on baseline data (Table 3). In contrast, the results from "Progression" and "Ever" are similar, because the incident cases included in "Progression" were also included in the analysis of "Ever"; only prevalent cases, a relatively small proportion, are additionally included in the analysis of "Ever."
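The comparison of precision between phenotypes can be sketched as follows. The example assumes a normal-approximation (Wald) interval, estimate ± 1.96 × SE, so that the length of the 95 % confidence interval is proportional to the standard error; it then computes the fraction of genes for which one phenotype yields a smaller standard error than another. The function names and the standard errors below are hypothetical and are not values from Tables 2 or 3.

```python
def ci95(estimate: float, se: float) -> tuple:
    """Wald-type 95 % confidence interval (normal approximation, assumed)."""
    half_width = 1.96 * se
    return (estimate - half_width, estimate + half_width)


def ci_length(se: float) -> float:
    """Length of the 95 % confidence interval implied by a standard error."""
    return 2 * 1.96 * se


def fraction_more_precise(se_a, se_b) -> float:
    """Fraction of genes whose standard error under phenotype A is smaller than under B."""
    pairs = list(zip(se_a, se_b))
    return sum(a < b for a, b in pairs) / len(pairs)


# Hypothetical standard errors (cM) of the locus position estimate for 5 genes
se_longitudinal = [0.010, 0.020, 0.015, 0.030, 0.012]
se_baseline = [0.014, 0.018, 0.021, 0.035, 0.011]

print(fraction_more_precise(se_longitudinal, se_baseline))
print([round(ci_length(se), 4) for se in se_longitudinal])
```

Because efficiency is taken here as the inverse of the standard error, a shorter interval for the same gene corresponds directly to a more efficient estimate of the disease locus position.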

Conclusions
Methods of genetic analysis rely heavily on correlations among family members' outcomes to infer genetic effects, whereas longitudinal studies allow investigators to study the effects of factors on outcomes and their changes over time [1]. To retrieve full information from longitudinal family data, appropriate statistical approaches are crucial. We proposed a multipoint linkage disequilibrium approach that accounts for multilevel correlations (between markers within a subject, between longitudinal observations within a subject, and between subjects within families), aiming to correctly localize the disease locus and assess its genetic effect. This approach has several advantages: it allows us to estimate the disease locus position, the genetic effect at the disease locus, and their 95 % confidence intervals without specifying a genetic model for the disease, while making full use of the markers and repeated measurements. In addition, this approach treats the genotype data as random conditional on the phenotype, eliminating the problem of ascertainment bias. We applied this approach to the baseline and longitudinal prevalence/incidence of hypertension events. The efficiency of parameter estimates was similar for the "Ever" and "Progression" categories, but was improved with repeated longitudinal outcomes compared to the use of "Baseline" only. This difference between analyses might largely result from the different total sample sizes and proportions of hypertensive subjects for the different phenotypes. Several genes identified on chromosome 3 for hypertension were consistent with findings from previous linkage and association studies. Despite its advantages, the proposed approach also has limitations; for example, covariate adjustment is not available.