- Open Access
Analysis of the progression of systolic blood pressure using imputation of missing phenotype values
- Tatsiana Vaitsiakhovich†1Email author,
- Dmitriy Drichel†2,
- Marina Angisch1,
- Tim Becker2, 1,
- Christine Herold2 and
- André Lacour2Email author
© Vaitsiakhovich et al.; licensee BioMed Central Ltd. 2014
- Published: 17 June 2014
We present a genome-wide association study of a quantitative trait, "progression of systolic blood pressure in time," in which 142 unrelated individuals of the Genetic Analysis Workshop 18 real genotype data were analyzed. Information on systolic blood pressure and other phenotypic covariates was missing at certain time points for a considerable part of the sample. We observed that the dropout process causing missingness is not independent of the initial systolic blood pressure; that is, the data is not missing completely at random. However, after the adjustment for age, the impact of systolic blood pressure on dropouts was no longer significant. Therefore, we decided to impute missing phenotype values by using information from individuals with complete phenotypic data. Progression of systolic blood pressure (∆SBP/∆t) was defined based on the imputed phenotypes and analyzed in a genome-wide fashion. We also conducted an exhaustive genome-wide search for interaction between single-nucleotide polymorphisms (7.14 × 1010 tests) under an allelic model.
The suggested data imputation and the association analysis strategy proved to be valid in the sense that there was no evidence of genome-wide inflation or increased type I error in general. Furthermore, we detected 2 single-nucleotide polymorphisms (SNPs) that met the criterion for genome-wide significance (p≤5 × 10−8), which was also confirmed via Monte-Carlo simulation. In view of the rather small sample size, however, the results have to be followed-up in larger studies.
- Systolic Blood Pressure
- Genetic Analysis Workshop
- Genotyping Rate
- GAW18 Data
- Initial Systolic Blood Pressure
We present a genome-wide analysis, that is, an analysis comprising all odd-numbered chromosomes, of the Genetic Analysis Workshop 18 (GAW18) real data. The structure of the GAW18 data is rather complex (longitudinal study, family structure, missing measurements), making it difficult to address all dimensions of the data in a single analysis. Therefore, we restricted our analysis to 142 unrelated individuals available. Moreover, because systolic blood pressure (SBP) was analyzed extensively during the genome-wide association analysis era by large consortia, we decided to investigate the progression of SBP instead.
We address the problem of missing phenotypes and impute missing values for the original sample of 939 individuals. Based on the imputed phenotype data, we conduct a genome-wide analysis of the quantitative trait progression of SBP in time and perform an exhaustive search for SNP-SNP interaction. We assess the validity of the approach in the genome-wide setting and confirm our top-ranking findings via Monte-Carlo simulation. Finally, we compare results obtained with the imputed phenotypes to the respective results obtained with the original phenotype data.
Imputation of missing data in longitudinal studies
In a longitudinal study repeated measurements and records related to a trait under investigation are collected at different time points for a fixed group of individuals. It is often not possible to collect complete data sets because of early dropouts, changed health conditions of subjects, or other reasons. The first step in the analysis of repeated measurements with missing values should be the verification, whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) . An observation is said to be MCAR if the missingness is independent of all observed and unobserved assessments. A more relaxed assumption about the missing data mechanism is MAR, where missingness is independent of all unobserved values, although it may be dependent on the observed values. A process that is neither MCAR nor MAR is called MNAR [1–3]. Monotone missing data refers to data from a longitudinal study, where the subjects who left the study before it was finished never returned again.
Standard methods to treat missing data, for example, complete case analysis and simple imputation methods (last-observed-carried-forward, mean, etc), are not appropriate for a valid inference without assumption that the data is MCAR, or when this assumption is violated. A proper analysis of missing data requires a knowledge or estimate of the mechanism that generated the missing data . A test for MCAR versus MAR can be found in Little . Unfortunately, one cannot determine whether missingness is MNAR or MAR based solely on the given data .
In analyzing the GAW18 data we were dealing with quantitative, non-monotone data from the longitudinal study. For available phenotypic variables we introduced the following notations: AGE i for age, SBP i for SBP, MED i for intake of antihypertensive medication, and SMK i for smoking status at the initial measurement i = 0 and up to 3 follow-ups (i = 1,2,3).
We analyzed the quantitative trait progression of SBP in time, ∆SBP/∆t, which was defined as the difference between 2 values of SBP recorded at the last and the first examinations attended by an individual, divided by the time ∆t in years elapsed between these 2 examinations.
We suspected that the data was not MCAR and tried to determine whether MAR or MNAR was more proper to assume. We considered a linear regression model, SBP0 = β0 + β1x1 + β2x2 + β3x3, where a variable x i equals 1 if the measure of SBP i is given, and 0 if the measure of SBP i is missing, β i ∈ℝ, i = 0,1,2,3. Using this model to test the hypothesis β1 = β2 = β3 = 0 resulted in the p-value 5.21 × 10−7. This led us to conclude that the data is not MCAR, given that the presence or absence of the follow-up measures is influenced by SBP0. However, when we included the covariate AGE0, the p-value became much higher (p = 0.07), which suggested that age might be a significant predictor of missingness and that the data is likely MAR.
To impute the missing values and create complete phenotype data for each subject, we applied a kind of simple imputation method [2, 3], which represents our modification of this well-known technique. For our purposes, we imputed the missing values in AGE i , SBP i , MED i , SMK i variables, i = 0,1,2,3, without considering hypertension status or diastolic blood pressure. To start the imputation we extracted 187 completers, that is, individuals who had participated in all 4 examinations and for whom all measurements had been recorded. We used the information collected for completers to determine the settings of the association analysis, as well as to complete the missing phenotype values for non-completers.
Determine the average calendar years of the 4 examinations (here: 1993, 1998, 2003, 2009).
Impute the missing values of AGE i , i = 0,1,2,3, using the results of Step 1 and the birth year of every individual.
Calculate the correlation coefficients α1 , α2 , α3 of ∆SBP on AGE0, MED0, SMK0, respectively, and β of ∆SBP on SBP0 for completers.
- 4.Given a completer C and a non-completer N, define a function where wide hat denotes the normalized value of the corresponding variable for instance, and where I (·) equals 1 if the corresponding measurement for a non-completer is given, or equals 0 otherwise.(1)(2)
For every fixed non-completer determine the "closest" completer by finding the minimal value of the function d on the list of all completers.
Impute the missing values of SBP i , MED i , SMK i , i = 0,1,2,3, for a given non-completer by those from the "closest" completer.
Imputation of the missing phenotype data
Quality control (GAW18 real genotype data)
First, we filtered 142 individuals from the GAW18 data set who were indicated as being nonrelated. To check for unobserved relatedness among these individuals, we computed a pair-wise identity-by-state (IBS) matrix from the genotype data. For a pair of individuals, their IBS-value was computed as the portion of alleles identical by state divided by the number of alleles genotyped in both individuals. The mean IBS-value over all pairs of individuals was 0.74; its SD (standard deviation) was 0.0063. Seven pairs had IBS-values higher than 0.84, which was more than 14 SD above the mean, indicating evidence of relatedness. For each of the 7 pairs, the individual with the lower genotyping rate was removed from the analysis. We also excluded 8 other individuals who had genotype missing rates exceeding 40%, leaving 127 individuals for analysis.
Finally, we removed 42,519 SNPs because of either low genotyping rate (<99%), deviations from Hardy-Weinberg equilibrium (p<1 × 10−6), or low minor allele frequency (MAF<1%). That left 429,516 quality controlled SNPs for analysis, and the overall genotyping rate of the remaining data was 99.9%.
For the association analysis of ∆SBP/∆t reasonable covariates would be AGE0*, SBP0*, MED0*, SMK0*, ∆MED, ∆SMK, and SEX. Here, for every individual, the variables AGE0*, SBP0*, MED0*, SMK0*are defined to be equal to the first non-missing AGE i , SBP i , MED i , SMK i values, respectively; i = 0,1,2,3; ∆MED and ∆SMK are defined analogously to ∆SBP. Note that when using imputed phenotype data, AGE0*= AGE0 , SBP0*=SBP0, etc. From the plausible covariates listed above, we retained for the association analysis only those covariates that were significantly associated with ∆SBP/∆t in the sample of completers: AGE0*, SBP0*, MED0*, and SEX.
For single-marker analysis, we applied the standard linear regression test with 1 degree of freedom (DF) for quantitative traits and used the implementation in INTERSNP .
For 2-marker analysis, we used a test for allelic interaction. We compared the sum of standard errors (SSE) of the model M A,I = β0 + β1x1 + β2x2 + β1,2x1x2 against the model M A = β0 + β1x1 + β2x2, where β0 is the intercept parameter x i is the number of copies of minor alleles of SNP i, i = 1,2; for a given individual, x1x2 is the interaction term; and βs are the coefficients to be estimated. Let SSE A,I and SSE A be SSEs of the respective models, and let n be the number of individuals in the analysis. Then, the test statistic t = (SSE A − SSE A,I )/SSE A is F-distributed with (n − 4, 1) DF and yields a test for allelic interaction. Analysis was performed by a parallelized version of INTERSNP, which uses the OpenMP API specification for parallel programming (http://openmp.org/wp). In the interaction analysis, we applied a stronger MAF criterion of 5% and analyzed 7.14 × 1010 SNP pairs.
When the sample size is small, low MAF can lead to a small number of "effective" analysis observations and the application of the F-distribution might lead to inflated p-values. To overcome this issue, we validated the top-ranking p-values of the single-marker analysis via Monte-Carlo simulation. In each permutation replicate, the quantitative trait values were randomly distributed over the individuals of the sample. In total, 108 permutation replicates were conducted and for a given SNP, its point-wise Monte-Carlo p-value was computed as k/108, where k is the number of permutation replicates with a test statistic t larger than that of the real data. For computation time reasons, it was not possible to validate the results of the interaction analysis via simulation, since at least 1013 replicates would have been needed.
Single-marker analysis of ∆SBP/∆t on unrelated individuals
β ± σ
p i MC
p r MC
2.68 ± 0.41
3.83 ± 0.64
1.54 ± 0.29
1.91 ± 0.37
2.17 ± 0.43
1.82 ± 0.37
1.69 ± 0.35
0.90 ± 0.19
1.52 ± 0.32
1.43 ± 0.31
2.07 ± 0.44
2.25 ± 0.48
2.67 ± 0.58
Two-marker interaction analysis of ∆SBP/∆t on unrelated individuals
β ± σ
p i INT
p r INT
3.27 ± 0.39
1.73 ± 0.21
2.97 ± 0.36
1.72 ± 0.21
4.06 ± 0.51
2.85 ± 0.37
2.51 ± 0.32
2.04 ± 0.27
3.41 ± 0.45
3.18 ± 0.42
2.91 ± 0.38
3.70 ± 0.49
3.92 ± 0.52
In longitudinal studies, repeated measurements related to a trait under investigation are performed subsequently within a particular time period. The attractiveness of longitudinal studies is the ability to capture the dynamics of traits. Investigating the progression of disease by using this additional dimension of time might reveal insights on underlying disease mechanisms. Given the nature and the running time of a longitudinal study, it is expected that some measurements might not be available, whereas others might be missed. Imputation of missing values helps to enrich data and to use maximum information, although it has own weaknesses and it is not self-evident that it leads to improved power.
In our setting, the imputation of missing values in the GAW18 phenotype data set helped to streamline the definition of the trait under investigation (∆SBP/∆t). In particular, the imputation reduced the impact of outlier SBP values, and made it possible to define the values of the trait in a more uniform time period. This, in turn, helped to reduce the potentially negative impact of short time periods on the values of the trait.
The method of choice to check the validity of imputation is to carry out a sensitivity analysis to investigate the dependence of the results on the particular imputation method. In our case, the sensitivity analysis could have been carried out by the multiple imputation technique. Unfortunately, this technique is time-consuming and requires software implementation; consequently, it was not applied in our work. This is a potential limitation of our approach. An argument in favor of our imputation method is that the mean of SBP values within an examination is preserved by the imputation.
Our analysis strategy helped to identify some suggestive association findings. At this point, however, the results do not prove that the analysis using imputed phenotype data is superior to that without imputation. To prove this statement, it would be necessary to know whether some of the SNPs we have identified represent "true" associations. Those have to be investigated in studies with independent data. In summary, our work can be seen as a suggestion for longitudinal data analysis, which was quite reasonable in the sense that it did not show genome-wide inflation or increased type I error in general.
T. Vaitsiakhovich is supported by the "Deutsche Forschungsgemeinschaft," project BE3828/3-2. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- Rubin DB: Inference and missing data. Biometrika. 1976, 63: 581-592. 10.1093/biomet/63.3.581.View ArticleGoogle Scholar
- Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK: RECORD Study Group:Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes. 2008, 6: 57-10.1186/1477-7525-6-57.PubMed CentralView ArticlePubMedGoogle Scholar
- Kenward MG, Lesaffre E, Molenberghs G: An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics. 1994, 50: 945-953. 10.2307/2533434.View ArticlePubMedGoogle Scholar
- Ibrahim JG, Chu H, Chen MH: Missing data in clinical studies: issues and methods. J ClinOncol. 2012, 30: 3297-3303. 10.1200/JCO.2011.38.7589.View ArticleGoogle Scholar
- Little RJA: A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988, 83: 1198-1202. 10.1080/01621459.1988.10478722.View ArticleGoogle Scholar
- Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T: INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics. 2009, 25: 3275-3281. 10.1093/bioinformatics/btp596.View ArticlePubMedGoogle Scholar
- Boon LM, Mulliken JB, Vikkula M: RASA1: variable phenotype with capillary and arteriovenous malformations. Curr Opin Genet Dev. 2005, 15: 265-269. 10.1016/j.gde.2005.03.004.View ArticlePubMedGoogle Scholar
- Barber MJ, Mangravite LM, Hyde CL, Chasman DI, Smith JD, McCarty CA, Li X, Wilke RA, Rieder MJ, Williams PT, et al: Genome-wide association of lipid-lowering response to statins in combined study populations. PLoS One. 2010, 5: e9763-10.1371/journal.pone.0009763.PubMed CentralView ArticlePubMedGoogle Scholar
- Tarasov KV, Sanna S, Scuteri A, Strait JB, Orrù M, Parsa A, Lin PI, Maschio A, Lai S, Piras MG, et al: COL4A1 is associated with arterial stiffness by genome-wide association scan. Circ Cardiovasc Genet. 2009, 2: 151-158. 10.1161/CIRCGENETICS.108.823245.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.