The Genetic Analysis Workshop 16 Problem 3: simulation of heritable longitudinal cardiovascular phenotypes based on actual genome-wide single-nucleotide polymorphisms in the Framingham Heart Study.

The Genetic Analysis Workshop (GAW) 16 Problem 3 comprises simulated phenotypes emulating the lipid domain and its contribution to cardiovascular disease risk. For each replication there were 6,476 subjects in families from the Framingham Heart Study (FHS), with their actual genotypes for Affymetrix 550 k single-nucleotide polymorphisms (SNPs) and simulated phenotypes. Phenotypes were simulated at three visits, 10 years apart. There are up to six "major" genes influencing variation in high- and low-density lipoprotein cholesterol (HDL, LDL) and triglycerides (TG), and 1,000 "polygenes" simulated for each trait. Some polygenes have pleiotropic effects. The locus-specific heritabilities of the major genes range from 0.1 to 1.0%, under additive, dominant, or overdominant modes of inheritance. The locus-specific effects of the polygenes range from 0.002 to 0.15%, with effect sizes selected from negative exponential distributions. All polygenes act independently and have additive effects. Individuals in the upper tail of the LDL distribution were designated as medicated; the proportion of medicated subjects increased across the three visits (2%, 5%, and 15%). Coronary artery calcification (CAC) was simulated using age, lipid levels, and CAC-specific polymorphisms. The risk of myocardial infarction before each visit was determined by CAC and its interactions with smoking and two genetic loci. Smoking was simulated to be commensurate with rates reported by the Centers for Disease Control. Two hundred replications were simulated.


Background
The Framingham Heart Study (FHS) is a rich platform for the study of cardiovascular disease and the application of novel, imaginative analytic strategies. For Genetic Analysis Workshop (GAW) 16, we took a semi-simulated approach, using actual genotypes from the 500 k Affymetrix platform and the 50 k candidate gene chip and building phenotypes on the observed genetic variation. Because blood lipid levels are a major risk factor in the development of cardiovascular disease [1], we modeled disease risk on the lipid pathway, including both genetic and environmental determinants. The FHS has reported that long-term averages of low-density lipoprotein (LDL), high-density lipoprotein (HDL), and triglyceride (TG) levels were highly heritable (0.66, 0.69, and 0.58, respectively) [2]. Several familial studies have also reported heritabilities of 0.50 for LDL, 0.54 for HDL, and 0.39 for TG [3]. Dyslipidemia, a fundamental component of the atherosclerotic process, is a medically correctable risk factor with established efficacious treatments for reducing the risk of coronary heart disease [4]. We therefore included in our simulation the use and effects of dyslipidemic medications, which have an important role in shaping lipid profiles. This simulation builds on the long tradition of previous simulations for Genetic Analysis Workshops [5,6].

Methods
The FHS pedigrees, distributed as GAW16 Problem 2, formed the basis of our simulation [7]. In total, there were 6,476 subjects who had genotypes and simulated phenotypes. After the simulations began, additional FHS subjects provided broad consent for data sharing; these additional subjects were not included in the simulations. So that analyses could be restricted to the subjects actually simulated, we provided a file that defined precisely which subjects were included and their relationships within families. The ~550 k measured single-nucleotide polymorphism (SNP) genotypes distributed for GAW16 Problem 2, from both the genome-wide scan and the additional candidate gene platform (GeneChip® Human Mapping 500 k Array Set (Nsp and Sty), and the 50 k Human Gene Focused Panel), comprised the genotypes for GAW16 Problem 3. Novel fictitious phenotypes were simulated for these subjects.
Although family members of the FHS attended various exams at different times, depending on the generation, we modeled our study as if all subjects were recruited at one time: we calculated family members' relative ages at one particular exam and then assigned a simulated age to everyone at three time points, 10 years apart. The mean age in years (range) for the simulation, by generation and visit, is shown in Table 1.
The simulation model is depicted in Figure 1. There are up to six "major" genes for the lipid phenotypes HDL, LDL, and TG, and 1,000 polygenes for each trait. Several polygenes have pleiotropic effects (i.e., they affect two or three traits simultaneously). The identity and effects of the major genes are documented in Table 2. The locus-specific heritabilities of the major genes range from 0.1 to 1.0% under additive (AA:AB:BB, 0:0.5:1), dominant (AA:AB:BB, 1:1:0), or overdominant (AA:AB:BB, 0:1:0; heterozygotes show a higher effect than the two homozygotes) modes of inheritance. Minor allele frequencies are at least 5%, with one exception (b4), for which the minor allele frequency was 1%. We simulated an overdominant effect (g1) because there appears to be evidence supporting this possibility and this mode of inheritance is rarely, if ever, modeled. The gene a4 is pleiotropic for HDL and TG and interacts with b5 in determining LDL (Figure 1). The interaction accounts for 0.7% of the trait variance, and b5 has no marginal effect on any phenotype. The locus-specific effects of the polygenes were on average an order of magnitude smaller, ranging from 0.002 to 0.15%, with effect sizes drawn from negative exponential distributions. All polygenes act independently and have additive effects. HDL, TG, and LDL share 40% of their polygenes, and HDL and TG share an additional 20%. The specific identities of the polygenes, their locations, and their generating effect sizes are provided in Additional Files 1, 2, and 3, corresponding to HDL, LDL, and TG, respectively. A group of 39 polygenes influencing HDL were clustered within 0.5 Mb on chromosome 11; otherwise, the polygenes for each trait are randomly distributed throughout the genome. The overall effect of each trait-specific polygenic component was scaled to achieve the target total trait heritabilities of 60%, 55%, and 40% for HDL, LDL, and TG, respectively.
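The three genotype codings quoted above can be illustrated with a short Python sketch. It is not the authors' simulation code: the function names are hypothetical, and the step that standardizes the genotype score and multiplies it by the square root of the locus-specific heritability is an assumption about how a locus could be made to explain a target fraction of a unit-variance trait.

```python
# Genotype scores indexed by the number of B alleles (0 = AA, 1 = AB, 2 = BB),
# copied from the AA:AB:BB codings given in the text.
MODE_SCORES = {
    "additive": (0.0, 0.5, 1.0),
    "dominant": (1.0, 1.0, 0.0),
    "overdominant": (0.0, 1.0, 0.0),
}


def genotype_score(n_b_alleles, mode):
    """Return the coded score for a genotype under the given mode."""
    return MODE_SCORES[mode][n_b_alleles]


def standardized_scores(genotypes, mode):
    """Center and scale the coded scores to mean 0 and variance 1."""
    scores = [genotype_score(g, mode) for g in genotypes]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    if var == 0:
        raise ValueError("locus is monomorphic under this coding")
    sd = var ** 0.5
    return [(s - mean) / sd for s in scores]


def locus_contribution(genotypes, mode, heritability):
    """Assumed scaling: standardized score times sqrt(locus heritability)."""
    root_h = heritability ** 0.5
    return [root_h * z for z in standardized_scores(genotypes, mode)]
```

With this scaling, a locus given a heritability of 0.01 contributes a component whose sample variance is 0.01 on the standardized trait scale.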
The remaining variance is uncorrelated among family members, with the exception of a simulated dietary effect (variable: diet) on TG levels that accounts for a correlation of 0.05 among family members, regardless of their coefficient of relationship. The phenotypes generated from this genetic model were scaled to the empirically derived means and variances. In the model for HDL, for example, μ(HDL | age 5-year interval, sex) represents the mean of HDL in FHS, given a 5-year age interval and sex; σ(HDL | age 5-year interval, sex) is the standard deviation of HDL in FHS, given a 5-year age interval and sex; h_a1 is the square root of the simulated heritability for the a1 SNP (as described in Table 2); a_a1 is a simulated effect that reflects in part the penetrance of the a1 SNP; sign is a random integer that takes the value -1 or +1, with the purpose of randomly changing the direction of the polygene contributions; a_poly represents an instance of each of the 1,000 SNP effects (k = 1 to 1,000) selected as polygenes for HDL; h_apoly is an instance of the square roots of the heritabilities for the 1,000 SNPs selected as polygenes for HDL; a_ε represents the environmental effect that contributes to HDL; and h_ε is the square root of the HDL variance explained by environmental causes.
Figure 1. Simulated phenotypes emulating the lipid domain (HDL, LDL, TG, and CHOL) and its contribution to cardiovascular disease risk (CAC and MI). Simulated major genes are symbolized with Greek letters. There are 1,000 polygenes for each trait (HDL, LDL, and TG), several of them with pleiotropic effects. Solid lines and arrows show causality/interaction (I); dashed lines show pharmacogenetic effects, which apply only to subjects treated with medication and whose response depended on the subjects' genotypes. Environmental factors such as diet, smoking, and medication were modeled in the simulation.
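The generating expression for HDL is not reproduced in this text. One form consistent with the symbol definitions given for HDL is sketched below; the additive combination of terms and the index j over the major genes are assumptions, not the authors' stated equation.

```latex
\[
\mathrm{HDL} \;=\; \mu(\mathrm{HDL}\mid \text{age}, \text{sex})
 \;+\; \sigma(\mathrm{HDL}\mid \text{age}, \text{sex})
 \left[ \sum_{j} h_{a_j}\, a_{a_j}
 \;+\; \sum_{k=1}^{1000} \mathrm{sign}_k\, h_{a\mathrm{poly},k}\, a_{\mathrm{poly},k}
 \;+\; h_{\varepsilon}\, a_{\varepsilon} \right]
\]
```

Under this reading, if each effect term has unit variance and the terms are independent, the squared h coefficients are the variance fractions and would sum to 1.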
As individuals progressed to the next visit 10 years later, their phenotypes were scaled by the appropriate age-sex means and variances, but there are no genes governing longitudinal trends per se. Instead, we simulated the complicating effects of medication. The simulated value for LDL at each visit for each subject was checked, and individuals in the upper tail of the distribution were simulated as medicated. The proportion of medicated subjects increased across visits, comprising 2%, 5%, and 15% of the subjects at Visits 1, 2, and 3, respectively. These proportions were estimated from the FHS data and reflect the secular increase in the proportion of individuals being treated for elevated cholesterol levels. The response to treatment is governed by two loci (δ1 and δ2) as pharmacogenetic processes.
In the model for CAC AI, ME is a joint genetic effect from an epistatic interaction between τ1 and τ2: the effect of τ1 is purely epistatic (i.e., τ1 displays only a minimal main effect), while τ2 displays an additional measurable additive main effect. PE is the joint effect from τ3 and τ4, a pair of purely epistatic SNPs, each with no main effect; Het is an effect from τ5, a SNP that displays heterosis (overdominance); and ε is the residual variation not explained by the factors mentioned above. The term ε, 300 times a random draw from a normal distribution with mean 0 and variance 1 (300 × N(0,1)), represents the sum of normal deviations from the mean of each of the modeled genetic effects and "noise" from unmeasured environmental and genetic effects. Because CAC cannot be negative, CAC AI was set to 0 if the generated value was negative. The models for the effects on CAC AI due to the ME and PE genotypes are illustrated in Tables 3 and 4. The minor allele frequency (MAF) for each of the four SNPs τ1-τ4 is ~0.5. SNP τ5, which determines the Het effect, has a MAF of 0.2.
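The two mechanical rules in this part of the Methods, flagging the upper tail of LDL as medicated and truncating CAC AI at zero after adding the 300 × N(0,1) residual, can be sketched in Python. The function names are hypothetical, the systematic part of CAC AI is passed in as a single number because its full expression is not given here, and selecting exactly the top-ranked fraction of subjects is an assumption about how the upper tail was defined.

```python
import random

# Fraction of subjects simulated as medicated at each visit (from the text).
MEDICATED_FRACTION = {1: 0.02, 2: 0.05, 3: 0.15}


def flag_medicated(ldl_values, visit):
    """Flag the subjects with the highest LDL values at a visit as medicated."""
    n = len(ldl_values)
    n_treated = round(MEDICATED_FRACTION[visit] * n)
    # Rank subject indices from highest to lowest LDL and take the top slice.
    ranked = sorted(range(n), key=lambda i: ldl_values[i], reverse=True)
    treated = set(ranked[:n_treated])
    return [i in treated for i in range(n)]


def cac_ai(systematic_part, rng):
    """Add the 300 x N(0,1) residual and truncate negative values at zero."""
    residual = 300.0 * rng.gauss(0.0, 1.0)
    return max(0.0, systematic_part + residual)


if __name__ == "__main__":
    rng = random.Random(16)
    ldl = [rng.gauss(130.0, 35.0) for _ in range(1000)]  # illustrative values
    print(sum(flag_medicated(ldl, 3)), "of", len(ldl), "flagged at Visit 3")
```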
SNP τ5 genotype 1/1 (common homozygote) increases CAC AI on average by 25, genotype 1/3 decreases CAC AI by 100, and genotype 3/3 increases CAC AI by 400. CAC is derived from CAC AI by using a piecewise linear age adjustment: subjects under 20 years have not developed measurable levels of CAC, CAC buildup is linear from the ages of 20 to 60, and for subjects older than 60, CAC = CAC AI. Table 5 lists estimates of the proportion of the variability of CAC attributable to each of the genetic factors, averaged over the 200 replicate datasets.
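The piecewise linear age adjustment can be written as a small Python function. This is a sketch under one assumption: between ages 20 and 60 the fraction of CAC AI that has accumulated rises linearly from 0 to 1, which is the simplest ramp that joins the two stated end conditions. The function name is hypothetical.

```python
def age_adjust_cac(cac_ai, age):
    """Derive CAC from CAC AI with a piecewise linear age adjustment."""
    if age < 20:
        # No measurable CAC before age 20.
        return 0.0
    if age > 60:
        # Past age 60 the full value applies.
        return float(cac_ai)
    # Assumed linear buildup between ages 20 and 60.
    return cac_ai * (age - 20) / 40.0
```

For example, under this assumption a subject aged 40 with a CAC AI of 400 would have a CAC of 200.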
Whether a subject smoked during the period before a visit influenced the risk of a myocardial infarction (MI). At the first visit, men had a 27% chance of being smokers and women had a 23% chance. Each smoker had an 8% chance of permanently quitting smoking before each subsequent visit. The resulting smoking rates are commensurate with rates reported by the Centers for Disease Control for 1998.
The risk of an MI before each visit is determined by CAC and its interactions with smoking and two genetic loci, 1 and 2. No MIs were fatal in our data. Smoking and 1 have an interactive effect on the risk of MI. The effect of smoking is to constrict blood vessels, thus increasing the risk that CAC will lead to an MI. The risk of MI for a smoker with the most common 1 genotype (3/3) is the same as that of an equivalent non-smoker whose CAC is 10% higher. The risk of MI for a smoker with either of the other 1 genotypes is the same as that of a non-smoker whose CAC is 40% higher. The 1 genotype has no effect on the risk of MI in non-smokers. Carrying the most common 2 genotype (3/3) has the same effect on the risk of MI as reducing CAC by 5%. The effect of any other genotype is the same as increasing CAC by 5%. The final model for MI risk is ...
... the 50 k SNPs in the Gene Focused Panel based on desired MAF, completeness of genotyping, and lack of linkage disequilibrium between the SNPs. The specific identities of the SNPs τ1-τ5, 1 and 2, and their chromosomes are listed in Table 6.
All the authors contributed equally.
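The smoking and MI-risk rules described in the Methods can be sketched in Python. The sketch stops short of the final MI risk model, which is not reproduced in this text: it only simulates smoking status across visits and computes a CAC value adjusted by the stated percentage equivalences. Combining the smoking and locus 2 adjustments multiplicatively is an assumption, and all names are hypothetical.

```python
import random

# Probability of smoking at the first visit, by sex (from the text).
P_SMOKE_VISIT1 = {"M": 0.27, "F": 0.23}
# Chance that a smoker quits permanently before each subsequent visit.
P_QUIT = 0.08


def simulate_smoking(sex, rng, n_visits=3):
    """Return smoking status at each visit; quitting is permanent."""
    smoking = rng.random() < P_SMOKE_VISIT1[sex]
    history = [smoking]
    for _ in range(n_visits - 1):
        if smoking and rng.random() < P_QUIT:
            smoking = False
        history.append(smoking)
    return history


def effective_cac(cac, smoker, locus1_genotype, locus2_genotype):
    """CAC adjusted by the equivalences stated for smoking and loci 1 and 2."""
    factor = 1.0
    if smoker:
        # Locus 1 matters only in smokers: +10% for 3/3, +40% otherwise.
        factor *= 1.10 if locus1_genotype == "3/3" else 1.40
    # Locus 2: the common 3/3 genotype acts like a 5% reduction in CAC,
    # any other genotype like a 5% increase.
    factor *= 0.95 if locus2_genotype == "3/3" else 1.05
    return cac * factor
```

Under the multiplicative assumption, a smoker with genotype 3/3 at both loci and a CAC of 100 would be treated like a non-smoker with a CAC of about 104.5.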