Are quantitative trait-dependent sampling designs cost-effective for analysis of rare and common variants?

Use of trait-dependent sampling designs in whole-genome association studies of sequence data can reduce total sequencing costs with modest losses of statistical efficiency. In a quantitative trait (QT) analysis of data from the Genetic Analysis Workshop 17 mini-exome for unrelated individuals in the Asian subpopulation, we investigate alternative designs that sequence only 50% of the entire cohort. In addition to a simple random sampling design, we consider extreme-phenotype designs that are of increasing interest in genetic association analysis of QTs, especially in studies concerned with the detection of rare genetic variants. We also evaluate a novel sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability. We take differential sampling of individuals with informative trait values into account by inverse probability weighting using standard survey methods which thus generalizes to the source population. In replicate 1 data, we applied the designs in association analysis of Q1 with both rare and common variants in the FLT1 gene, based on knowledge of the generating model. Using all 200 replicate data sets, we similarly analyzed Q1 and Q4 (which is known to be free of association with FLT1) to evaluate relative efficiency, type I error, and power. Simulation study results suggest that the QT-dependent selection designs generally yield greater than 50% relative efficiency compared to using the entire cohort, implying cost-effectiveness of 50% sample selection and worthwhile reduction of sequencing costs.


Background
We assume an existing cohort of unrelated individuals, well phenotyped for a quantitative trait (QT) that is approximately normally distributed. At present, the cost of whole-genome sequencing of an entire cohort to test for association remains prohibitive, and selection of a subset of informative individuals for sequencing is a reasonable strategy to reduce costs. In the absence of relevant genetic information, such as family history or single-nucleotide polymorphism (SNP) association identified through genome-wide association studies, one might hypothesize that individuals at the extremes of the QT distribution would be more informative for the detection of association with sequence variants. For example, individuals with high QT values are more likely to carry high-risk variants, whereas those with low values do not. In contrast, taking a simple random sample from the cohort will tend to select individuals from the central part of the trait distribution and can fail to sample individuals with low-frequency variants. Because ascertainment by the phenotype needs to be taken into account, simple selection from the tails of the trait distribution complicates inference. The resulting trait distribution is bimodal, violating the usual linear regression assumptions and can produce distorted effect estimates and type I errors greater or less than nominal. In addition, the resulting sample is not representative of the original cohort, making it more difficult to design validation studies in nonselected populations. Finally, if both extremely high and extremely low QT values are "pathological" then excluding "normal" individuals would reduce the ability to detect protective variants associated with favorable QT values. As an alternative, we propose a sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability of being selected. We account for differential sampling in the association analysis by inverse probability weighting (IPW) with standard survey methods for variance estimation and hypothesis testing. The purpose of our analyses of the Genetic Analysis Workshop 17 (GAW17) mini-exome data is to compare alternative sampling strategies with respect to relative efficiency. We also examine type I error and power of hypothesis testing.

Sampling designs
We consider the analysis of sequencing data for the entire cohort as the ideal against which to compare sampling designs in which only 50% of the cohort is sequenced. Four alternative sampling designs are summarized in Table 1. Extreme-phenotype alternative designs 3 and 4 are systematic and do not actually involve any random selection; they differ only in whether individuals with "normal" values are chosen for sequencing. Design 2 involves simple random sampling (SRS) and thus is expected to be unbiased compared to the entire cohort, but it has reduced power and is subject to sampling variation. Under design 5, which is similarly subject to sampling variation, we specify the sampling probability for each individual according to the value of their quantitative trait Y i . For individual i, we use the distance from the median of the trait distribution (Y med ) in the original cohort: In particular, we specify the IPW weight as: where the sampling probabilities under design 5 are: or: which we refer to as 5.1 and 5.2, respectively. These probabilities are scaled by the maximum distance from the median, with: where Y max and Y min are the maximum and minimum of the Y i , respectively. In design 5.2 extreme observations are selected more frequently.

Statistical models and hypothesis testing
For common variants, it is feasible to analyze one SNP at a time. For rare variants, however, methods that accumulate multiple SNP genotypes across a region are necessary to improve the power to detect association. We calculate a rare variant score, defined as the total count of rare alleles within a gene [1,2]. Association analyses of a QT under designs 1 and 2 are conducted using standard linear regression methods with the QT as the dependent variable. To account for unequal sampling probabilities under design 5, we apply IPW in the linear regression model parameter estimation and use the R function svyglm from the survey package to estimate appropriate variances and construct a test statistic [3]. However, because we select observations according to values of QT, in designs 3, 4, and 5 the validity of standard linear regression methods is questionable. We hypothesize that reversing the direction of the regression is a potentially more robust alternative. As described in other contexts, reversing the regression has the advantage of conditioning out the phenotype. When analyzing Application to GAW17 data We analyze quantitative traits Q1 and Q4 in replicates 1 to 200 of the GAW17 simulated unrelated mini-exome data [4]. With knowledge of the generating model for Q1, we chose the SNPs in the FLT1 gene to compare methods under a true alternative hypothesis of genetic association and use analysis of Q4 for comparisons under the null hypothesis. To reduce heterogeneity and potential for population stratification bias, we chose the Asian subpopulations for study: Chinese + Japanese (n = 321). We excluded SNPs in the FLT1 gene that were monomorphic within the Asian subpopulation. Of the 20 remaining SNPs, 18 were classified as rare (all minor allele frequencies [MAFs] were less than 3.0%); but according to the generating model, only 6 of these had functional variants. We calculated a rare variant score (the count of rare alleles) from the 18 rare SNPs. The total count of rare alleles per person ranged from 0 to 3 (with corresponding frequencies of 252, 54, 11, and 4). One SNP with a common functional variant, C13S523, was analyzed separately (MAF = 8.72% with no rare homozygotes).
QT-dependent sampling and QT analysis for Q1 and for Q4 were based on residuals from a linear regression on Age, Sex and Smoking status. We conducted an association analysis under each of the five sampling designs listed in Table 1 in all 200 replicates. Results are reported for replicate 1 alone to illustrate application of the methods in a single data set. We summarized distributions of regression coefficients (means and variances), standard error estimates (means), and test statistics (type I error and power) across replicates.

Results
Analysis of Q1 with FLT1 SNPs in replicate 1 Association of Q1 with FLT1 was detected at near genome-wide significance levels for both the common SNP and the rare variant score (Table 2) using the entire Asian subpopulation. The logistic regression test statistics are nearly always smaller than the linear regression test statistics, as expected, because logistic regression is less efficient for a normally distributed trait. Under linear regression analysis, association signals for all of the 50% designs were attenuated, with the most attenuation for designs 2 and 5 and the least attenuation for extreme-phenotype designs 3 and 4. However, the regression coefficients for designs 3 and 4 appear to be inflated. Under logistic regression analysis, design 5 had the least signal attenuation, and under Poisson regression it had less signal attenuation when extreme phenotypes were selected more frequently, as in design 5.2.

Evaluation of test statistics in replicates 1-200
We evaluated type I error at the nominal 5% level by means of association analysis of Q4 with the FLT1 rare variant score and common causal variant. With 200 replicates, the 95% confidence interval is roughly 0.05 ± 0.03. For the rare variant score, the linear regression type I error tended to be less than 5% (Table 3), except for design 5.2, in which the empirical standard deviation was larger than the mean standard error (SE), yielding an elevated type I error. In Poisson regression, type I To graphically convey power differences, we constructed pairwise comparison plots of design 1 test statistics for Q1 and the alternative designs in the 200 replicates. We also calculated mean values of the ratio of the design 1 test statistic over each of the test statistics for the alternative designs that we referred to as the mean ratio of test statistics (MRT). For the rare variant score (Figure 1), loss of power under design 2 is readily apparent in both linear and Poisson regressions. The scatter observed in designs 2, 5.1, and 5.2 can be explained partly by the additional variation associated with the random sampling component. The test statistics for extreme-phenotype designs 3 and 4 were strongly correlated with those for design 1, with high  MRT values. Under Poisson regression, MRT values for designs 5.1, and 5.2 were similarly high. For the common variant (data not shown), discrepancies with design 1 test statistics were generally similar to those for the rare variant analysis, with equivalent performance for all designs under logistic regression but substantial loss of power.

Evaluation of estimation efficiency using replicates 1-200
Under the null analysis of the rare variant score for Q4, all estimates were close to being unbiased (Table 3). For Q1, however, when extreme-phenotype values were oversampled and this was ignored (under designs 3 and 4), the mean coefficients in the linear regression were 50% larger than those under design 1 (Figure 2). In contrast, the mean coefficients for Poisson regression were similar under designs 1, 3 and 4. Notably, when we incorporated the sampling probability of each individual according to the value of their QT, as in sampling design 5, the mean coefficient was attenuated in the linear regression but close in value to that for the entire sample in Poisson regression. Similar findings were obtained for common variant linear and logistic regressions (data not shown).
To compare designs in terms of precision of estimation, we evaluated relative efficiency (RE), defined as 100 times the ratio of the empirical variance of design 1 to that of the alternative design (Table 4). This measure does not depend on SE estimation but may be affected by differences in the expected value of the regression coefficient, so we considered the null case of Q4, in which estimates were approximately unbiased, and the alternative case of Q1. In comparison to analysis of the entire cohort, we observed only 40-50% RE under SRS design 2. Under extreme-phenotype designs 3 and 4, RE was low for linear regression but high for logistic and Poisson regression. In contrast, linear regression REs for Q1 and Q4 were highest for sampling design 5.1, with an RE of 88% for Q4. However, when extreme observations were selected with high probability, as in design 5.2, RE was high for logistic and Poisson regression and comparable to that of designs 3 and 4.

Discussion
As expected, use of a 50% SRS design is not a costeffective approach for reducing sequencing costs: The empirical relative efficiency of estimation was less than 50% in the GAW17 simulations we examined. However, designs 3-5, in which RE generally exceeded 50% when conditioning on the QT and could be substantially higher, appeared to offer a net benefit. The more extreme IPW sampling design 5.2, which is similar to designs 3 and 4, behaves similarly, at least for the regression analyses that condition on the phenotype. Yilmaz and Bull BMC Proceedings 2011, 5(Suppl 9):S111 http://www.biomedcentral.com/1753-6561/5/S9/S111 However, the simple IPW linear regression analysis that we applied here may be less than optimal in finite samples under the alternative hypothesis. Although RE results for design 5 are encouraging, for hypothesis testing the source of parameter underestimation under the alternative hypothesis needs to be addressed and SE estimation improved. Evaluations reported here are limited, and it is quite possible that sampling and/or weighting schemes could be improved.
In our extreme-phenotype implementation of designs 3 and 4, we used the observed QT values, in contrast to more common case-control approaches. Because the linear regression assumption of normal residuals is grossly violated, the observed exaggeration of regression coefficients is not surprising. The lack of elevated type I error may be ascribed to the use of symmetric tail selection, but for rare variant analysis there is a clear loss of efficiency. In the GAW17 simulation, in which rare and common variant effects were generated under an additive model, differences between designs 3 and 4 were trivial. In other settings, design 4 may be more robust.
Although it has not been commonly applied, Poisson regression of allele counts in the rare variant score appears to be useful as a general approach for singlegene analysis in many designs, provided that the tendency toward elevated type I error is resolved. For design 5, Poisson regression avoids the complications of weighting, because the model conditions on the QT value.

Conclusions
There is established literature on optimal design for continuous outcomes in experimental settings and survey samples, and sampling has been well exploited in the context of time-to-event outcomes in epidemiological designs, such as the case-cohort study [5]. In comparison to using the entire cohort, extreme sampling incurs some loss of efficiency and power to detect association, but it is far more efficient than simple random sampling of the same number of individuals. Although our comparisons in the GAW17 data do not conform to a wholly realistic setting in terms of sample size and volume of sequencing data, they do suggest that QTdependent sampling can be quite effective in reducing sequencing costs with developing prospects for wholegenome sequencing.