- Open Access
Analysis of the progression of systolic blood pressure using imputation of missing phenotype values
BMC Proceedings volume 8, Article number: S83 (2014)
We present a genome-wide association study of a quantitative trait, "progression of systolic blood pressure in time," in which 142 unrelated individuals of the Genetic Analysis Workshop 18 real genotype data were analyzed. Information on systolic blood pressure and other phenotypic covariates was missing at certain time points for a considerable part of the sample. We observed that the dropout process causing missingness is not independent of the initial systolic blood pressure; that is, the data is not missing completely at random. However, after the adjustment for age, the impact of systolic blood pressure on dropouts was no longer significant. Therefore, we decided to impute missing phenotype values by using information from individuals with complete phenotypic data. Progression of systolic blood pressure (∆SBP/∆t) was defined based on the imputed phenotypes and analyzed in a genome-wide fashion. We also conducted an exhaustive genome-wide search for interaction between single-nucleotide polymorphisms (7.14 × 1010 tests) under an allelic model.
The suggested data imputation and the association analysis strategy proved to be valid in the sense that there was no evidence of genome-wide inflation or increased type I error in general. Furthermore, we detected 2 single-nucleotide polymorphisms (SNPs) that met the criterion for genome-wide significance (p≤5 × 10−8), which was also confirmed via Monte-Carlo simulation. In view of the rather small sample size, however, the results have to be followed-up in larger studies.
We present a genome-wide analysis, that is, an analysis comprising all odd-numbered chromosomes, of the Genetic Analysis Workshop 18 (GAW18) real data. The structure of the GAW18 data is rather complex (longitudinal study, family structure, missing measurements), making it difficult to address all dimensions of the data in a single analysis. Therefore, we restricted our analysis to 142 unrelated individuals available. Moreover, because systolic blood pressure (SBP) was analyzed extensively during the genome-wide association analysis era by large consortia, we decided to investigate the progression of SBP instead.
We address the problem of missing phenotypes and impute missing values for the original sample of 939 individuals. Based on the imputed phenotype data, we conduct a genome-wide analysis of the quantitative trait progression of SBP in time and perform an exhaustive search for SNP-SNP interaction. We assess the validity of the approach in the genome-wide setting and confirm our top-ranking findings via Monte-Carlo simulation. Finally, we compare results obtained with the imputed phenotypes to the respective results obtained with the original phenotype data.
Imputation of missing data in longitudinal studies
In a longitudinal study repeated measurements and records related to a trait under investigation are collected at different time points for a fixed group of individuals. It is often not possible to collect complete data sets because of early dropouts, changed health conditions of subjects, or other reasons. The first step in the analysis of repeated measurements with missing values should be the verification, whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) . An observation is said to be MCAR if the missingness is independent of all observed and unobserved assessments. A more relaxed assumption about the missing data mechanism is MAR, where missingness is independent of all unobserved values, although it may be dependent on the observed values. A process that is neither MCAR nor MAR is called MNAR [1–3]. Monotone missing data refers to data from a longitudinal study, where the subjects who left the study before it was finished never returned again.
Standard methods to treat missing data, for example, complete case analysis and simple imputation methods (last-observed-carried-forward, mean, etc), are not appropriate for a valid inference without assumption that the data is MCAR, or when this assumption is violated. A proper analysis of missing data requires a knowledge or estimate of the mechanism that generated the missing data . A test for MCAR versus MAR can be found in Little . Unfortunately, one cannot determine whether missingness is MNAR or MAR based solely on the given data .
In analyzing the GAW18 data we were dealing with quantitative, non-monotone data from the longitudinal study. For available phenotypic variables we introduced the following notations: AGE i for age, SBP i for SBP, MED i for intake of antihypertensive medication, and SMK i for smoking status at the initial measurement i = 0 and up to 3 follow-ups (i = 1,2,3).
We analyzed the quantitative trait progression of SBP in time, ∆SBP/∆t, which was defined as the difference between 2 values of SBP recorded at the last and the first examinations attended by an individual, divided by the time ∆t in years elapsed between these 2 examinations.
We suspected that the data was not MCAR and tried to determine whether MAR or MNAR was more proper to assume. We considered a linear regression model, SBP0 = β0 + β1x1 + β2x2 + β3x3, where a variable x i equals 1 if the measure of SBP i is given, and 0 if the measure of SBP i is missing, β i ∈ℝ, i = 0,1,2,3. Using this model to test the hypothesis β1 = β2 = β3 = 0 resulted in the p-value 5.21 × 10−7. This led us to conclude that the data is not MCAR, given that the presence or absence of the follow-up measures is influenced by SBP0. However, when we included the covariate AGE0, the p-value became much higher (p = 0.07), which suggested that age might be a significant predictor of missingness and that the data is likely MAR.
To impute the missing values and create complete phenotype data for each subject, we applied a kind of simple imputation method [2, 3], which represents our modification of this well-known technique. For our purposes, we imputed the missing values in AGE i , SBP i , MED i , SMK i variables, i = 0,1,2,3, without considering hypertension status or diastolic blood pressure. To start the imputation we extracted 187 completers, that is, individuals who had participated in all 4 examinations and for whom all measurements had been recorded. We used the information collected for completers to determine the settings of the association analysis, as well as to complete the missing phenotype values for non-completers.
We suggest the following algorithm to impute the missing values in the GAW18 real phenotype data for the purposes of the analysis of ∆SBP/∆t:
Determine the average calendar years of the 4 examinations (here: 1993, 1998, 2003, 2009).
Impute the missing values of AGE i , i = 0,1,2,3, using the results of Step 1 and the birth year of every individual.
Calculate the correlation coefficients α1 , α2 , α3 of ∆SBP on AGE0, MED0, SMK0, respectively, and β of ∆SBP on SBP0 for completers.
Given a completer C and a non-completer N, define a function where wide hat denotes the normalized value of the corresponding variable for instance, and where I (·) equals 1 if the corresponding measurement for a non-completer is given, or equals 0 otherwise.(1)(2)
For every fixed non-completer determine the "closest" completer by finding the minimal value of the function d on the list of all completers.
Impute the missing values of SBP i , MED i , SMK i , i = 0,1,2,3, for a given non-completer by those from the "closest" completer.
Imputation was performed for 752 non-completers. Table 1 provides an example of imputation.
Quality control (GAW18 real genotype data)
First, we filtered 142 individuals from the GAW18 data set who were indicated as being nonrelated. To check for unobserved relatedness among these individuals, we computed a pair-wise identity-by-state (IBS) matrix from the genotype data. For a pair of individuals, their IBS-value was computed as the portion of alleles identical by state divided by the number of alleles genotyped in both individuals. The mean IBS-value over all pairs of individuals was 0.74; its SD (standard deviation) was 0.0063. Seven pairs had IBS-values higher than 0.84, which was more than 14 SD above the mean, indicating evidence of relatedness. For each of the 7 pairs, the individual with the lower genotyping rate was removed from the analysis. We also excluded 8 other individuals who had genotype missing rates exceeding 40%, leaving 127 individuals for analysis.
Finally, we removed 42,519 SNPs because of either low genotyping rate (<99%), deviations from Hardy-Weinberg equilibrium (p<1 × 10−6), or low minor allele frequency (MAF<1%). That left 429,516 quality controlled SNPs for analysis, and the overall genotyping rate of the remaining data was 99.9%.
For the association analysis of ∆SBP/∆t reasonable covariates would be AGE0*, SBP0*, MED0*, SMK0*, ∆MED, ∆SMK, and SEX. Here, for every individual, the variables AGE0*, SBP0*, MED0*, SMK0*are defined to be equal to the first non-missing AGE i , SBP i , MED i , SMK i values, respectively; i = 0,1,2,3; ∆MED and ∆SMK are defined analogously to ∆SBP. Note that when using imputed phenotype data, AGE0*= AGE0 , SBP0*=SBP0, etc. From the plausible covariates listed above, we retained for the association analysis only those covariates that were significantly associated with ∆SBP/∆t in the sample of completers: AGE0*, SBP0*, MED0*, and SEX.
For single-marker analysis, we applied the standard linear regression test with 1 degree of freedom (DF) for quantitative traits and used the implementation in INTERSNP .
For 2-marker analysis, we used a test for allelic interaction. We compared the sum of standard errors (SSE) of the model MA,I= β0 + β1x1 + β2x2 + β1,2x1x2 against the model M A = β0 + β1x1 + β2x2, where β0 is the intercept parameter x i is the number of copies of minor alleles of SNP i, i = 1,2; for a given individual, x1x2 is the interaction term; and βs are the coefficients to be estimated. Let SSEA,I and SSEA be SSEs of the respective models, and let n be the number of individuals in the analysis. Then, the test statistic t = (SSEA − SSEA,I)/SSEA is F-distributed with (n − 4, 1) DF and yields a test for allelic interaction. Analysis was performed by a parallelized version of INTERSNP, which uses the OpenMP API specification for parallel programming (http://openmp.org/wp). In the interaction analysis, we applied a stronger MAF criterion of 5% and analyzed 7.14 × 1010 SNP pairs.
When the sample size is small, low MAF can lead to a small number of "effective" analysis observations and the application of the F-distribution might lead to inflated p-values. To overcome this issue, we validated the top-ranking p-values of the single-marker analysis via Monte-Carlo simulation. In each permutation replicate, the quantitative trait values were randomly distributed over the individuals of the sample. In total, 108 permutation replicates were conducted and for a given SNP, its point-wise Monte-Carlo p-value was computed as k/108, where k is the number of permutation replicates with a test statistic t larger than that of the real data. For computation time reasons, it was not possible to validate the results of the interaction analysis via simulation, since at least 1013 replicates would have been needed.
The analysis of ∆SBP/∆t yielded a genome-wide inflation factor of λ = 0.979, indicating that, overall, our analysis can be judged as valid in the sense that we do not have evidence for inflated p-values caused either by population stratification or by our phenotype imputation method. Table 2 lists the best results of the single-marker analysis. Two SNPs, rs3093642 (p = 1.35 × 10−9) and rs13157168 (p = 2.45 × 10−8), met the criterion for genome-wide significance. These SNPs have rather low MAF, and the results might be artifacts. However, Monte-Carlo simulation with 108 permutation replicates confirmed the findings; none of the replicates had a more extreme test statistic than the real data. Nevertheless, the top SNP rs3093642 should be treated with caution because its p-value for deviation from Hardy-Weinberg equilibrium was rather close to the predefined exclusion criterion. The second SNP rs13157168 is an intronic variant in the RASA1 gene. Association of RASA1 with variable capillary and arteriovenous malformations was reported by Boon et al . Note that the p-values p r of the single-marker analysis performed on SNPs on the real phenotypes are markedly higher than the corresponding p-values p i for the imputed phenotypes. SNPs with an at least suggestive level of evidence for association, unfortunately, were not found among the top ranks.
Table 3 presents the top results of the interaction analysis. 2 SNPs from the table, rs4686358 and rs7987982, were previously reported in connection with potentially related phenotypes. The variant rs4686358 is located close to rs35964523 (≈130 kilobases [kb]), a SNP that was implicated in total cholesterol levels (p = 7.9 × 10−6) . SNP rs7987982 is an intronic variant of the COL4A1 gene. Variants in COL4A1 are significantly associated with brain small vessel disease, in which stiffness of blood vessels is affected .
In longitudinal studies, repeated measurements related to a trait under investigation are performed subsequently within a particular time period. The attractiveness of longitudinal studies is the ability to capture the dynamics of traits. Investigating the progression of disease by using this additional dimension of time might reveal insights on underlying disease mechanisms. Given the nature and the running time of a longitudinal study, it is expected that some measurements might not be available, whereas others might be missed. Imputation of missing values helps to enrich data and to use maximum information, although it has own weaknesses and it is not self-evident that it leads to improved power.
In our setting, the imputation of missing values in the GAW18 phenotype data set helped to streamline the definition of the trait under investigation (∆SBP/∆t). In particular, the imputation reduced the impact of outlier SBP values, and made it possible to define the values of the trait in a more uniform time period. This, in turn, helped to reduce the potentially negative impact of short time periods on the values of the trait.
The method of choice to check the validity of imputation is to carry out a sensitivity analysis to investigate the dependence of the results on the particular imputation method. In our case, the sensitivity analysis could have been carried out by the multiple imputation technique. Unfortunately, this technique is time-consuming and requires software implementation; consequently, it was not applied in our work. This is a potential limitation of our approach. An argument in favor of our imputation method is that the mean of SBP values within an examination is preserved by the imputation.
Our analysis strategy helped to identify some suggestive association findings. At this point, however, the results do not prove that the analysis using imputed phenotype data is superior to that without imputation. To prove this statement, it would be necessary to know whether some of the SNPs we have identified represent "true" associations. Those have to be investigated in studies with independent data. In summary, our work can be seen as a suggestion for longitudinal data analysis, which was quite reasonable in the sense that it did not show genome-wide inflation or increased type I error in general.
Rubin DB: Inference and missing data. Biometrika. 1976, 63: 581-592. 10.1093/biomet/63.3.581.
Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK: RECORD Study Group:Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes. 2008, 6: 57-10.1186/1477-7525-6-57.
Kenward MG, Lesaffre E, Molenberghs G: An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics. 1994, 50: 945-953. 10.2307/2533434.
Ibrahim JG, Chu H, Chen MH: Missing data in clinical studies: issues and methods. J ClinOncol. 2012, 30: 3297-3303. 10.1200/JCO.2011.38.7589.
Little RJA: A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988, 83: 1198-1202. 10.1080/01621459.1988.10478722.
Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T: INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics. 2009, 25: 3275-3281. 10.1093/bioinformatics/btp596.
Boon LM, Mulliken JB, Vikkula M: RASA1: variable phenotype with capillary and arteriovenous malformations. Curr Opin Genet Dev. 2005, 15: 265-269. 10.1016/j.gde.2005.03.004.
Barber MJ, Mangravite LM, Hyde CL, Chasman DI, Smith JD, McCarty CA, Li X, Wilke RA, Rieder MJ, Williams PT, et al: Genome-wide association of lipid-lowering response to statins in combined study populations. PLoS One. 2010, 5: e9763-10.1371/journal.pone.0009763.
Tarasov KV, Sanna S, Scuteri A, Strait JB, Orrù M, Parsa A, Lin PI, Maschio A, Lai S, Piras MG, et al: COL4A1 is associated with arterial stiffness by genome-wide association scan. Circ Cardiovasc Genet. 2009, 2: 151-158. 10.1161/CIRCGENETICS.108.823245.
T. Vaitsiakhovich is supported by the "Deutsche Forschungsgemeinschaft," project BE3828/3-2. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
The authors declare that they have no competing interests.
TV designed and performed the imputation of the missing phenotypes. TV and DD carried out the analysis of unrelated individuals. MA worked on the pre-processing of the data. CH provided support in using INTERSNP software. TB conceived of the study and suggested main lines of the analysis. AL coordinated the study, participated in its design and improvement. All authors read and approved the final manuscript.
Tatsiana Vaitsiakhovich, Dmitriy Drichel contributed equally to this work.
About this article
Cite this article
Vaitsiakhovich, T., Drichel, D., Angisch, M. et al. Analysis of the progression of systolic blood pressure using imputation of missing phenotype values. BMC Proc 8, S83 (2014). https://doi.org/10.1186/1753-6561-8-S1-S83
- Systolic Blood Pressure
- Genetic Analysis Workshop
- Genotyping Rate
- GAW18 Data
- Initial Systolic Blood Pressure