 Proceedings
 Open Access
 Published:
Analysis of the progression of systolic blood pressure using imputation of missing phenotype values
BMC Proceedings volume 8, Article number: S83 (2014)
Abstract
We present a genomewide association study of a quantitative trait, "progression of systolic blood pressure in time," in which 142 unrelated individuals of the Genetic Analysis Workshop 18 real genotype data were analyzed. Information on systolic blood pressure and other phenotypic covariates was missing at certain time points for a considerable part of the sample. We observed that the dropout process causing missingness is not independent of the initial systolic blood pressure; that is, the data is not missing completely at random. However, after the adjustment for age, the impact of systolic blood pressure on dropouts was no longer significant. Therefore, we decided to impute missing phenotype values by using information from individuals with complete phenotypic data. Progression of systolic blood pressure (∆SBP/∆t) was defined based on the imputed phenotypes and analyzed in a genomewide fashion. We also conducted an exhaustive genomewide search for interaction between singlenucleotide polymorphisms (7.14 × 10^{10} tests) under an allelic model.
The suggested data imputation and the association analysis strategy proved to be valid in the sense that there was no evidence of genomewide inflation or increased type I error in general. Furthermore, we detected 2 singlenucleotide polymorphisms (SNPs) that met the criterion for genomewide significance (p≤5 × 10^{−8}), which was also confirmed via MonteCarlo simulation. In view of the rather small sample size, however, the results have to be followedup in larger studies.
Background
We present a genomewide analysis, that is, an analysis comprising all oddnumbered chromosomes, of the Genetic Analysis Workshop 18 (GAW18) real data. The structure of the GAW18 data is rather complex (longitudinal study, family structure, missing measurements), making it difficult to address all dimensions of the data in a single analysis. Therefore, we restricted our analysis to 142 unrelated individuals available. Moreover, because systolic blood pressure (SBP) was analyzed extensively during the genomewide association analysis era by large consortia, we decided to investigate the progression of SBP instead.
We address the problem of missing phenotypes and impute missing values for the original sample of 939 individuals. Based on the imputed phenotype data, we conduct a genomewide analysis of the quantitative trait progression of SBP in time and perform an exhaustive search for SNPSNP interaction. We assess the validity of the approach in the genomewide setting and confirm our topranking findings via MonteCarlo simulation. Finally, we compare results obtained with the imputed phenotypes to the respective results obtained with the original phenotype data.
Methods
Imputation of missing data in longitudinal studies
In a longitudinal study repeated measurements and records related to a trait under investigation are collected at different time points for a fixed group of individuals. It is often not possible to collect complete data sets because of early dropouts, changed health conditions of subjects, or other reasons. The first step in the analysis of repeated measurements with missing values should be the verification, whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) [1]. An observation is said to be MCAR if the missingness is independent of all observed and unobserved assessments. A more relaxed assumption about the missing data mechanism is MAR, where missingness is independent of all unobserved values, although it may be dependent on the observed values. A process that is neither MCAR nor MAR is called MNAR [1–3]. Monotone missing data refers to data from a longitudinal study, where the subjects who left the study before it was finished never returned again.
Standard methods to treat missing data, for example, complete case analysis and simple imputation methods (lastobservedcarriedforward, mean, etc), are not appropriate for a valid inference without assumption that the data is MCAR, or when this assumption is violated. A proper analysis of missing data requires a knowledge or estimate of the mechanism that generated the missing data [4]. A test for MCAR versus MAR can be found in Little [5]. Unfortunately, one cannot determine whether missingness is MNAR or MAR based solely on the given data [4].
In analyzing the GAW18 data we were dealing with quantitative, nonmonotone data from the longitudinal study. For available phenotypic variables we introduced the following notations: AGE_{ i } for age, SBP_{ i } for SBP, MED_{ i } for intake of antihypertensive medication, and SMK_{ i } for smoking status at the initial measurement i = 0 and up to 3 followups (i = 1,2,3).
We analyzed the quantitative trait progression of SBP in time, ∆SBP/∆t, which was defined as the difference between 2 values of SBP recorded at the last and the first examinations attended by an individual, divided by the time ∆t in years elapsed between these 2 examinations.
We suspected that the data was not MCAR and tried to determine whether MAR or MNAR was more proper to assume. We considered a linear regression model, SBP_{0} = β_{0} + β_{1}x_{1} + β_{2}x_{2} + β_{3}x_{3}, where a variable x_{ i } equals 1 if the measure of SBP_{ i } is given, and 0 if the measure of SBP_{ i } is missing, β_{ i }∈ℝ, i = 0,1,2,3. Using this model to test the hypothesis β_{1} = β_{2} = β_{3} = 0 resulted in the pvalue 5.21 × 10^{−7}. This led us to conclude that the data is not MCAR, given that the presence or absence of the followup measures is influenced by SBP_{0}. However, when we included the covariate AGE_{0}, the pvalue became much higher (p = 0.07), which suggested that age might be a significant predictor of missingness and that the data is likely MAR.
To impute the missing values and create complete phenotype data for each subject, we applied a kind of simple imputation method [2, 3], which represents our modification of this wellknown technique. For our purposes, we imputed the missing values in AGE_{ i }, SBP_{ i }, MED_{ i }, SMK_{ i } variables, i = 0,1,2,3, without considering hypertension status or diastolic blood pressure. To start the imputation we extracted 187 completers, that is, individuals who had participated in all 4 examinations and for whom all measurements had been recorded. We used the information collected for completers to determine the settings of the association analysis, as well as to complete the missing phenotype values for noncompleters.
We suggest the following algorithm to impute the missing values in the GAW18 real phenotype data for the purposes of the analysis of ∆SBP/∆t:

1.
Determine the average calendar years of the 4 examinations (here: 1993, 1998, 2003, 2009).

2.
Impute the missing values of AGE_{ i } , i = 0,1,2,3, using the results of Step 1 and the birth year of every individual.

3.
Calculate the correlation coefficients α_{1} , α_{2} , α_{3} of ∆SBP on AGE_{0}, MED_{0}, SMK_{0}, respectively, and β of ∆SBP on SBP_{0} for completers.

4.
Given a completer C and a noncompleter N, define a function where wide hat denotes the normalized value of the corresponding variable for instance, and where I_{ (·) } equals 1 if the corresponding measurement for a noncompleter is given, or equals 0 otherwise.
$$\begin{array}{c}d\left(C,N\right):={\sum}_{i=0}^{3}[\left\frac{{\alpha}_{1}}{\beta}\right{I}_{AG{E}_{i}^{N}}{\left({\hat{AGE}}_{i}^{N}{\hat{AGE}}_{i}^{C}\right)}^{2}+{I}_{SB{P}_{i}^{N}}{\left({\hat{SBP}}_{i}^{N}{\hat{SBP}}_{i}^{C}\right)}^{2}+\\ \left\frac{{\alpha}_{2}}{\beta}\right{I}_{ME{D}_{i}^{N}}{\left({\hat{MED}}_{i}^{N}{\hat{MED}}_{i}^{C}\right)}^{2}+\left\frac{{\alpha}_{3}}{\beta}\right{I}_{SM{K}_{i}^{N}}\left({\hat{SMK}}_{i}^{N}{\hat{SMK}}_{i}^{C}{)}^{2}\right]\end{array}$$(1)$${\hat{AGE}}_{i}^{N}=\frac{AG{E}_{i}^{N}{\mu}_{i}^{N}}{{\sigma}_{i}^{N}},{\mu}_{i}^{N}={\sum}_{N}\frac{AG{E}_{i}^{N}}{N},{\sigma}_{i}^{N}=\sqrt{{{\sum}_{N}\left(AG{E}_{i}^{N}\right)}^{2}{\left({\mu}_{i}^{N}\right)}^{2}}$$(2) 
5.
For every fixed noncompleter determine the "closest" completer by finding the minimal value of the function d on the list of all completers.

6.
Impute the missing values of SBP_{ i }, MED_{ i } , SMK_{ i } , i = 0,1,2,3, for a given noncompleter by those from the "closest" completer.
Imputation was performed for 752 noncompleters. Table 1 provides an example of imputation.
Quality control (GAW18 real genotype data)
First, we filtered 142 individuals from the GAW18 data set who were indicated as being nonrelated. To check for unobserved relatedness among these individuals, we computed a pairwise identitybystate (IBS) matrix from the genotype data. For a pair of individuals, their IBSvalue was computed as the portion of alleles identical by state divided by the number of alleles genotyped in both individuals. The mean IBSvalue over all pairs of individuals was 0.74; its SD (standard deviation) was 0.0063. Seven pairs had IBSvalues higher than 0.84, which was more than 14 SD above the mean, indicating evidence of relatedness. For each of the 7 pairs, the individual with the lower genotyping rate was removed from the analysis. We also excluded 8 other individuals who had genotype missing rates exceeding 40%, leaving 127 individuals for analysis.
Finally, we removed 42,519 SNPs because of either low genotyping rate (<99%), deviations from HardyWeinberg equilibrium (p<1 × 10^{−6}), or low minor allele frequency (MAF<1%). That left 429,516 quality controlled SNPs for analysis, and the overall genotyping rate of the remaining data was 99.9%.
Statistical analysis
For the association analysis of ∆SBP/∆t reasonable covariates would be AGE_{0}*, SBP_{0}*, MED_{0}*, SMK_{0}*, ∆MED, ∆SMK, and SEX. Here, for every individual, the variables AGE_{0}*, SBP_{0}*, MED_{0}*, SMK_{0}*are defined to be equal to the first nonmissing AGE_{ i } , SBP_{ i }, MED_{ i } , SMK_{ i } values, respectively; i = 0,1,2,3; ∆MED and ∆SMK are defined analogously to ∆SBP. Note that when using imputed phenotype data, AGE_{0}*= AGE_{0} , SBP_{0}*=SBP_{0}, etc. From the plausible covariates listed above, we retained for the association analysis only those covariates that were significantly associated with ∆SBP/∆t in the sample of completers: AGE_{0}*, SBP_{0}*, MED_{0}*, and SEX.
For singlemarker analysis, we applied the standard linear regression test with 1 degree of freedom (DF) for quantitative traits and used the implementation in INTERSNP [6].
For 2marker analysis, we used a test for allelic interaction. We compared the sum of standard errors (SSE) of the model M^{A,I}= β_{0} + β_{1}x_{1} + β_{2}x_{2} + β_{1,2}x_{1}x_{2} against the model M ^{A} = β_{0} + β_{1}x_{1} + β_{2}x_{2}, where β_{0} is the intercept parameter x_{ i } is the number of copies of minor alleles of SNP i, i = 1,2; for a given individual, x_{1}x_{2} is the interaction term; and βs are the coefficients to be estimated. Let SSE^{A,I} and SSE^{A} be SSEs of the respective models, and let n be the number of individuals in the analysis. Then, the test statistic t = (SSE^{A} − SSE^{A,I})/SSE^{A} is Fdistributed with (n − 4, 1) DF and yields a test for allelic interaction. Analysis was performed by a parallelized version of INTERSNP, which uses the OpenMP API specification for parallel programming (http://openmp.org/wp). In the interaction analysis, we applied a stronger MAF criterion of 5% and analyzed 7.14 × 10^{10} SNP pairs.
When the sample size is small, low MAF can lead to a small number of "effective" analysis observations and the application of the Fdistribution might lead to inflated pvalues. To overcome this issue, we validated the topranking pvalues of the singlemarker analysis via MonteCarlo simulation. In each permutation replicate, the quantitative trait values were randomly distributed over the individuals of the sample. In total, 10^{8} permutation replicates were conducted and for a given SNP, its pointwise MonteCarlo pvalue was computed as k/10^{8}, where k is the number of permutation replicates with a test statistic t larger than that of the real data. For computation time reasons, it was not possible to validate the results of the interaction analysis via simulation, since at least 10^{13} replicates would have been needed.
Results
The analysis of ∆SBP/∆t yielded a genomewide inflation factor of λ = 0.979, indicating that, overall, our analysis can be judged as valid in the sense that we do not have evidence for inflated pvalues caused either by population stratification or by our phenotype imputation method. Table 2 lists the best results of the singlemarker analysis. Two SNPs, rs3093642 (p = 1.35 × 10^{−9}) and rs13157168 (p = 2.45 × 10^{−8}), met the criterion for genomewide significance. These SNPs have rather low MAF, and the results might be artifacts. However, MonteCarlo simulation with 10^{8} permutation replicates confirmed the findings; none of the replicates had a more extreme test statistic than the real data. Nevertheless, the top SNP rs3093642 should be treated with caution because its pvalue for deviation from HardyWeinberg equilibrium was rather close to the predefined exclusion criterion. The second SNP rs13157168 is an intronic variant in the RASA1 gene. Association of RASA1 with variable capillary and arteriovenous malformations was reported by Boon et al [7]. Note that the pvalues p_{ r } of the singlemarker analysis performed on SNPs on the real phenotypes are markedly higher than the corresponding pvalues p_{ i } for the imputed phenotypes. SNPs with an at least suggestive level of evidence for association, unfortunately, were not found among the top ranks.
Table 3 presents the top results of the interaction analysis. 2 SNPs from the table, rs4686358 and rs7987982, were previously reported in connection with potentially related phenotypes. The variant rs4686358 is located close to rs35964523 (≈130 kilobases [kb]), a SNP that was implicated in total cholesterol levels (p = 7.9 × 10^{−6}) [8]. SNP rs7987982 is an intronic variant of the COL4A1 gene. Variants in COL4A1 are significantly associated with brain small vessel disease, in which stiffness of blood vessels is affected [9].
Discussion
In longitudinal studies, repeated measurements related to a trait under investigation are performed subsequently within a particular time period. The attractiveness of longitudinal studies is the ability to capture the dynamics of traits. Investigating the progression of disease by using this additional dimension of time might reveal insights on underlying disease mechanisms. Given the nature and the running time of a longitudinal study, it is expected that some measurements might not be available, whereas others might be missed. Imputation of missing values helps to enrich data and to use maximum information, although it has own weaknesses and it is not selfevident that it leads to improved power.
In our setting, the imputation of missing values in the GAW18 phenotype data set helped to streamline the definition of the trait under investigation (∆SBP/∆t). In particular, the imputation reduced the impact of outlier SBP values, and made it possible to define the values of the trait in a more uniform time period. This, in turn, helped to reduce the potentially negative impact of short time periods on the values of the trait.
The method of choice to check the validity of imputation is to carry out a sensitivity analysis to investigate the dependence of the results on the particular imputation method. In our case, the sensitivity analysis could have been carried out by the multiple imputation technique. Unfortunately, this technique is timeconsuming and requires software implementation; consequently, it was not applied in our work. This is a potential limitation of our approach. An argument in favor of our imputation method is that the mean of SBP values within an examination is preserved by the imputation.
Conclusions
Our analysis strategy helped to identify some suggestive association findings. At this point, however, the results do not prove that the analysis using imputed phenotype data is superior to that without imputation. To prove this statement, it would be necessary to know whether some of the SNPs we have identified represent "true" associations. Those have to be investigated in studies with independent data. In summary, our work can be seen as a suggestion for longitudinal data analysis, which was quite reasonable in the sense that it did not show genomewide inflation or increased type I error in general.
References
 1.
Rubin DB: Inference and missing data. Biometrika. 1976, 63: 581592. 10.1093/biomet/63.3.581.
 2.
Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK: RECORD Study Group:Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes. 2008, 6: 5710.1186/14777525657.
 3.
Kenward MG, Lesaffre E, Molenberghs G: An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics. 1994, 50: 945953. 10.2307/2533434.
 4.
Ibrahim JG, Chu H, Chen MH: Missing data in clinical studies: issues and methods. J ClinOncol. 2012, 30: 32973303. 10.1200/JCO.2011.38.7589.
 5.
Little RJA: A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988, 83: 11981202. 10.1080/01621459.1988.10478722.
 6.
Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T: INTERSNP: genomewide interaction analysis guided by a priori information. Bioinformatics. 2009, 25: 32753281. 10.1093/bioinformatics/btp596.
 7.
Boon LM, Mulliken JB, Vikkula M: RASA1: variable phenotype with capillary and arteriovenous malformations. Curr Opin Genet Dev. 2005, 15: 265269. 10.1016/j.gde.2005.03.004.
 8.
Barber MJ, Mangravite LM, Hyde CL, Chasman DI, Smith JD, McCarty CA, Li X, Wilke RA, Rieder MJ, Williams PT, et al: Genomewide association of lipidlowering response to statins in combined study populations. PLoS One. 2010, 5: e976310.1371/journal.pone.0009763.
 9.
Tarasov KV, Sanna S, Scuteri A, Strait JB, Orrù M, Parsa A, Lin PI, Maschio A, Lai S, Piras MG, et al: COL4A1 is associated with arterial stiffness by genomewide association scan. Circ Cardiovasc Genet. 2009, 2: 151158. 10.1161/CIRCGENETICS.108.823245.
Acknowledgements
T. Vaitsiakhovich is supported by the "Deutsche Forschungsgemeinschaft," project BE3828/32. The GAW18 whole genome sequence data were provided by the T2DGENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
TV designed and performed the imputation of the missing phenotypes. TV and DD carried out the analysis of unrelated individuals. MA worked on the preprocessing of the data. CH provided support in using INTERSNP software. TB conceived of the study and suggested main lines of the analysis. AL coordinated the study, participated in its design and improvement. All authors read and approved the final manuscript.
Tatsiana Vaitsiakhovich, Dmitriy Drichel contributed equally to this work.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Vaitsiakhovich, T., Drichel, D., Angisch, M. et al. Analysis of the progression of systolic blood pressure using imputation of missing phenotype values. BMC Proc 8, S83 (2014). https://doi.org/10.1186/175365618S1S83
Published:
Keywords
 Systolic Blood Pressure
 Genetic Analysis Workshop
 Genotyping Rate
 GAW18 Data
 Initial Systolic Blood Pressure