Analysis of the progression of systolic blood pressure using imputation of missing phenotype values

We present a genome-wide association study of a quantitative trait, "progression of systolic blood pressure in time," in which 142 unrelated individuals of the Genetic Analysis Workshop 18 real genotype data were analyzed. Information on systolic blood pressure and other phenotypic covariates was missing at certain time points for a considerable part of the sample. We observed that the dropout process causing missingness is not independent of the initial systolic blood pressure; that is, the data is not missing completely at random. However, after the adjustment for age, the impact of systolic blood pressure on dropouts was no longer significant. Therefore, we decided to impute missing phenotype values by using information from individuals with complete phenotypic data. Progression of systolic blood pressure (∆SBP/∆t) was defined based on the imputed phenotypes and analyzed in a genome-wide fashion. We also conducted an exhaustive genome-wide search for interaction between single-nucleotide polymorphisms (7.14 × 1010 tests) under an allelic model. The suggested data imputation and the association analysis strategy proved to be valid in the sense that there was no evidence of genome-wide inflation or increased type I error in general. Furthermore, we detected 2 single-nucleotide polymorphisms (SNPs) that met the criterion for genome-wide significance (p≤5 × 10−8), which was also confirmed via Monte-Carlo simulation. In view of the rather small sample size, however, the results have to be followed-up in larger studies.


Background
We present a genome-wide analysis, that is, an analysis comprising all odd-numbered chromosomes, of the Genetic Analysis Workshop 18 (GAW18) real data. The structure of the GAW18 data is rather complex (longitudinal study, family structure, missing measurements), making it difficult to address all dimensions of the data in a single analysis. Therefore, we restricted our analysis to 142 unrelated individuals available. Moreover, because systolic blood pressure (SBP) was analyzed extensively during the genome-wide association analysis era by large consortia, we decided to investigate the progression of SBP instead.
We address the problem of missing phenotypes and impute missing values for the original sample of 939 individuals. Based on the imputed phenotype data, we conduct a genome-wide analysis of the quantitative trait progression of SBP in time and perform an exhaustive search for SNP-SNP interaction. We assess the validity of the approach in the genome-wide setting and confirm our top-ranking findings via Monte Carlo simulation. Finally, we compare results obtained with the imputed phenotypes to the respective results obtained with the original phenotype data.

Imputation of missing data in longitudinal studies
In a longitudinal study repeated measurements and records related to a trait under investigation are collected at different time points for a fixed group of individuals. It is often not possible to collect complete data sets because of early dropouts, changed health conditions of subjects, or other reasons. The first step in the analysis of repeated measurements with missing values should be the verification, whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) [1]. An observation is said to be MCAR if the missingness is independent of all observed and unobserved assessments. A more relaxed assumption about the missing data mechanism is MAR, where missingness is independent of all unobserved values, although it may be dependent on the observed values. A process that is neither MCAR nor MAR is called MNAR [1][2][3]. Monotone missing data refers to data from a longitudinal study, where the subjects who left the study before it was finished never returned again.
Standard methods to treat missing data, for example, complete case analysis and simple imputation methods (last-observed-carried-forward, mean, etc), are not appropriate for a valid inference without assumption that the data is MCAR, or when this assumption is violated. A proper analysis of missing data requires a knowledge or estimate of the mechanism that generated the missing data [4]. A test for MCAR versus MAR can be found in Little [5]. Unfortunately, one cannot determine whether missingness is MNAR or MAR based solely on the given data [4].
In analyzing the GAW18 data we were dealing with quantitative, non-monotone data from the longitudinal study. For available phenotypic variables we introduced the following notations: AGE i for age, SBP i for SBP, MED i for intake of antihypertensive medication, and SMK i for smoking status at the initial measurement i = 0 and up to 3 follow-ups (i = 1,2,3).
We analyzed the quantitative trait progression of SBP in time, ΔSBP/Δt, which was defined as the difference between 2 values of SBP recorded at the last and the first examinations attended by an individual, divided by the time Δt in years elapsed between these 2 examinations.
We suspected that the data was not MCAR and tried to determine whether MAR or MNAR was more proper to assume. We considered a linear regression model, SBP 0 = β 0 + β 1 x 1 + β 2 x 2 + β 3 x 3 , where a variable x i equals 1 if the measure of SBP i is given, and 0 if the measure of SBP i is missing, β i ∈ℝ, i = 0,1,2,3. Using this model to test the hypothesis β 1 = β 2 = β 3 = 0 resulted in the p-value 5.21 × 10 −7 . This led us to conclude that the data is not MCAR, given that the presence or absence of the follow-up measures is influenced by SBP 0 . However, when we included the covariate AGE 0 , the p-value became much higher (p = 0.07), which suggested that age might be a significant predictor of missingness and that the data is likely MAR.
To impute the missing values and create complete phenotype data for each subject, we applied a kind of simple imputation method [2,3], which represents our modification of this well-known technique. For our purposes, we imputed the missing values in AGE i , SBP i , MED i , SMK i variables, i = 0,1,2,3, without considering hypertension status or diastolic blood pressure. To start the imputation we extracted 187 completers, that is, individuals who had participated in all 4 examinations and for whom all measurements had been recorded. We used the information collected for completers to determine the settings of the association analysis, as well as to complete the missing phenotype values for noncompleters.
We suggest the following algorithm to impute the missing values in the GAW18 real phenotype data for the purposes of the analysis of ΔSBP/Δt: 1. Determine the average calendar years of the 4 examinations (here : 1993, 1998, 2003, 2009).
2. Impute the missing values of AGE i , i = 0,1,2,3, using the results of Step 1 and the birth year of every individual.
3. Calculate the correlation coefficients α 1 , α 2 , α 3 of ΔSBP on AGE 0 , MED 0 , SMK 0 , respectively, and β of ΔSBP on SBP 0 for completers. 4. Given a completer C and a non-completer N, define a function where wide hat denotes the normalized value of the corresponding variable for instance, and where I (·) equals 1 if the corresponding measurement for a noncompleter is given, or equals 0 otherwise.
5. For every fixed non-completer determine the "closest" completer by finding the minimal value of the function d on the list of all completers.
6. Impute the missing values of SBP i , MED i , SMK i , i = 0,1,2,3, for a given non-completer by those from the "closest" completer.
Imputation was performed for 752 non-completers. Table 1 provides an example of imputation.

Quality control (GAW18 real genotype data)
First, we filtered 142 individuals from the GAW18 data set who were indicated as being nonrelated. To check for unobserved relatedness among these individuals, we computed a pair-wise identity-by-state (IBS) matrix from the genotype data. For a pair of individuals, their IBSvalue was computed as the portion of alleles identical by state divided by the number of alleles genotyped in both individuals. The mean IBS-value over all pairs of individuals was 0.74; its SD (standard deviation) was 0.0063. Seven pairs had IBS-values higher than 0.84, which was more than 14 SD above the mean, indicating evidence of relatedness. For each of the 7 pairs, the individual with the lower genotyping rate was removed from the analysis. We also excluded 8 other individuals who had genotype missing rates exceeding 40%, leaving 127 individuals for analysis.
Finally, we removed 42,519 SNPs because of either low genotyping rate (<99%), deviations from Hardy-Weinberg equilibrium (p<1 × 10 −6 ), or low minor allele frequency (MAF<1%). That left 429,516 quality controlled SNPs for analysis, and the overall genotyping rate of the remaining data was 99.9%.
For single-marker analysis, we applied the standard linear regression test with 1 degree of freedom (DF) for quantitative traits and used the implementation in INTERSNP [6].
For 2-marker analysis, we used a test for allelic interaction. We compared the sum of standard errors (SSE) of the model M A,I = β 0 + β 1 x 1 + β 2 x 2 + β 1,2 x 1 x 2 against the model M A = β 0 + β 1 x 1 + β 2 x 2 , where β 0 is the intercept parameter x i is the number of copies of minor alleles of SNP i, i = 1,2; for a given individual, x 1 x 2 is the interaction term; and βs are the coefficients to be estimated. Let SSE A,I and SSE A be SSEs of the respective models, and let n be the number of individuals in the analysis. Then, the test statistic t = (SSE A − SSE A,I )/SSE A is F-distributed with (n − 4, 1) DF and yields a test for allelic interaction. Analysis was performed by a parallelized version of INTERSNP, which uses the OpenMP API specification for parallel programming (http:// openmp.org/wp). In the interaction analysis, we applied a stronger MAF criterion of 5% and analyzed 7.14 × 10 10 SNP pairs.
When the sample size is small, low MAF can lead to a small number of "effective" analysis observations and the application of the F-distribution might lead to inflated p-values. To overcome this issue, we validated the topranking p-values of the single-marker analysis via Monte-Carlo simulation. In each permutation replicate, the quantitative trait values were randomly distributed over the individuals of the sample. In total, 10 8 permutation replicates were conducted and for a given SNP, its pointwise Monte-Carlo p -value was computed as k/10 8 , where k is the number of permutation replicates with a test statistic t larger than that of the real data. For computation time reasons, it was not possible to validate the results of the interaction analysis via simulation, since at least 10 13 replicates would have been needed.

Results
The analysis of ΔSBP/Δt yielded a genome-wide inflation factor of λ = 0.979, indicating that, overall, our analysis can be judged as valid in the sense that we do not have evidence for inflated p-values caused either by population stratification or by our phenotype imputation method. Table 2 lists the best results of the single-marker analysis. Two SNPs, rs3093642 (p = 1.35 × 10 −9 ) and rs13157168 (p = 2.45 × 10 −8 ), met the criterion for genome-wide significance. These SNPs have rather low MAF, and the results might be artifacts. However, Monte-Carlo simulation with 10 8 permutation replicates confirmed the findings; none of the replicates had a  more extreme test statistic than the real data. Nevertheless, the top SNP rs3093642 should be treated with caution because its p-value for deviation from Hardy-Weinberg equilibrium was rather close to the predefined exclusion criterion. The second SNP rs13157168 is an intronic variant in the RASA1 gene. Association of RASA1 with variable capillary and arteriovenous malformations was reported by Boon et al [7]. Note that the p-values p r of the single-marker analysis performed on SNPs on the real phenotypes are markedly higher than the corresponding p-values p i for the imputed phenotypes. SNPs with an at least suggestive level of evidence for association, unfortunately, were not found among the top ranks. Table 3 presents the top results of the interaction analysis. 2 SNPs from the table, rs4686358 and rs7987982, were previously reported in connection with potentially related phenotypes. The variant rs4686358 is located close to rs35964523 (≈130 kilobases [kb]), a SNP that was implicated in total cholesterol levels (p = 7.9 × 10 −6 ) [8]. SNP rs7987982 is an intronic variant of the COL4A1 gene. Variants in COL4A1 are significantly associated   with brain small vessel disease, in which stiffness of blood vessels is affected [9].

Discussion
In longitudinal studies, repeated measurements related to a trait under investigation are performed subsequently within a particular time period. The attractiveness of longitudinal studies is the ability to capture the dynamics of traits. Investigating the progression of disease by using this additional dimension of time might reveal insights on underlying disease mechanisms. Given the nature and the running time of a longitudinal study, it is expected that some measurements might not be available, whereas others might be missed. Imputation of missing values helps to enrich data and to use maximum information, although it has own weaknesses and it is not self-evident that it leads to improved power.
In our setting, the imputation of missing values in the GAW18 phenotype data set helped to streamline the definition of the trait under investigation (ΔSBP/Δt). In particular, the imputation reduced the impact of outlier SBP values, and made it possible to define the values of the trait in a more uniform time period. This, in turn, helped to reduce the potentially negative impact of short time periods on the values of the trait.
The method of choice to check the validity of imputation is to carry out a sensitivity analysis to investigate the dependence of the results on the particular imputation method. In our case, the sensitivity analysis could have been carried out by the multiple imputation technique. Unfortunately, this technique is time-consuming and requires software implementation; consequently, it was not applied in our work. This is a potential limitation of our approach. An argument in favor of our imputation method is that the mean of SBP values within an examination is preserved by the imputation.

Conclusions
Our analysis strategy helped to identify some suggestive association findings. At this point, however, the results do not prove that the analysis using imputed phenotype data is superior to that without imputation. To prove this statement, it would be necessary to know whether some of the SNPs we have identified represent "true" associations. Those have to be investigated in studies with independent data. In summary, our work can be seen as a suggestion for longitudinal data analysis, which was quite reasonable in the sense that it did not show genome-wide inflation or increased type I error in general.