Using founders only and using all pedigree members may lead to different association results
As seen in Additional file 1, a significant association for a SNP-eQT pair using founders only may lose its significance when all pedigree members are used (e.g., the DDX17-rs243404 pair). This observation is somewhat surprising, because there are several reasons to believe the opposite. First, extra samples in the 194-sample data set are offspring of those in the 56-sample data set, so if there is relatedness in their eQTs, it will only reinforce the association signal. Second, with a larger sample size in the 194-sample data set, we would expect a smaller p-value, instead of a larger, insignificant one. For the 31 SNP-eQT pairs listed in Additional file 1, 25 pairs' p-values are increased in the all-pedigree-member data set (using the naive approach, assuming all samples are independent), despite a tripling of the sample size.
Checking robustness of result by t-tests using dominant and recessive models
Because the linear regression used here implies an additive model, we also carry out two t-tests (dominant and recessive models), each grouping samples with the heterozygous genotype together with those having one of the two homozygous genotypes. For founders, the only SNP-eQT pair with a p-value smaller than 4.4 × 10⁻⁶ (0.01/2263) is CSTB-rs157334 (p-value = 1.4 × 10⁻⁶). When all pedigree members are used, CSTB-rs157334 is still the only pair that is significant at this level (p-value = 1.6 × 10⁻¹⁰).
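The threshold and the two genotype groupings can be illustrated with a minimal sketch. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and uses Welch's unequal-variance t statistic; both the coding and the choice of t-test variant are assumptions, as is every function name below.

```python
from math import sqrt
from statistics import mean, variance

N_SNPS = 2263                 # number of SNPs tested per eQT
ALPHA = 0.01                  # family-wise level before correction
threshold = ALPHA / N_SNPS    # Bonferroni-style cutoff, about 4.4e-6


def recode(genotypes, model):
    """Collapse 0/1/2 allele counts into two groups (hypothetical helper).

    dominant : heterozygotes grouped with minor homozygotes (1, 2 vs. 0)
    recessive: heterozygotes grouped with major homozygotes (2 vs. 0, 1)
    """
    if model == "dominant":
        return [1 if g >= 1 else 0 for g in genotypes]
    if model == "recessive":
        return [1 if g == 2 else 0 for g in genotypes]
    raise ValueError(f"unknown model: {model}")


def welch_t(expression, groups):
    """Welch t statistic comparing group 1 against group 0."""
    a = [y for y, g in zip(expression, groups) if g == 0]
    b = [y for y, g in zip(expression, groups) if g == 1]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


# Toy data, invented for illustration only.
genotypes = [0, 0, 0, 1, 1, 2]
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
t_dominant = welch_t(expression, recode(genotypes, "dominant"))
```

The resulting t statistic would then be converted to a p-value and compared with `threshold`.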
Correcting for sample relatedness by random mixed models
When the MM1 mixed model is applied to the 194-sample data, only two SNP-eQT pairs remain significant at the 4.4 × 10⁻⁶ level: CSTB-rs157334 and HSD17B12-rs1334334. When the MM2 mixed model is applied, only CSTB-rs157334 exhibits a p-value close to that level (5.3 × 10⁻⁶). Interestingly, this is the only cis-acting pair among those listed in Additional file 1. If an eQT has a stronger dependence on a SNP genotype and a lower variation from pedigree to pedigree, we expect the significant SNP-eQT association to survive the application of a mixed model, at least with the random effect on the intercept. However, some SNP-eQT pairs lose significance in both MM1 and MM2. The MM2 model describes a situation in which not only the eQT varies among pedigrees, but its degree of dependence on SNP genotype also changes from pedigree to pedigree. MM2 is a less realistic model than MM1.
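The difference between the two models can be written out under the reading suggested by the description above: MM1 carries a pedigree-level random intercept, and MM2 adds a pedigree-level random slope on genotype. The notation below is an assumption made for illustration; the exact specification is the one given in the Methods section.

```latex
\begin{align*}
\text{MM1:}\quad y_{ij} &= (a + u_i) + b\,g_{ij} + \varepsilon_{ij},\\
\text{MM2:}\quad y_{ij} &= (a + u_i) + (b + v_i)\,g_{ij} + \varepsilon_{ij},
\end{align*}
```

Here $y_{ij}$ is the expression level of member $j$ in pedigree $i$, $g_{ij}$ is that member's genotype, $a$ and $b$ are the fixed intercept and slope, $u_i$ and $v_i$ are pedigree-specific random deviations, and $\varepsilon_{ij}$ is the residual. In both models the association test concerns the fixed slope $b$.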
Age/generation effect on eQT
To investigate why the all-pedigree-member data set may exhibit a different regression result from the founders-only data set, we examined whether the eQT changes from generation to generation. This is a simplified version of examining the age effect, as generation 1 (founders) is older than the second and third generations. Under the GC regression model described in the Methods section, three p-values are listed in Additional file 1: these are for testing zero coefficients for the genotype (b), generation 2 (c₂), and generation 3 (c₃). Note that in this regression, pedigree member information is discarded.
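A minimal sketch of such a regression follows. It assumes the GC model regresses expression on genotype plus indicator variables for generations 2 and 3, with founders as the baseline, which is the form implied by the coefficients b, c2, and c3 (the authoritative definition is in the Methods section). The fit uses plain least squares through the normal equations; the function names and toy data are hypothetical, and the p-value calculation is omitted.

```python
def design_row(genotype, generation):
    """Row [1, g, I(gen == 2), I(gen == 3)]; generation 1 is the baseline."""
    return [1.0, float(genotype), float(generation == 2), float(generation == 3)]


def ols(X, y):
    """Least-squares coefficients from (X'X) beta = X'y via Gauss-Jordan."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * z for x, z in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Noise-free toy data: every genotype observed in every generation.
samples = [(g, gen) for gen in (1, 2, 3) for g in (0, 1, 2)]
true_coef = [5.0, 0.5, -1.0, 2.0]          # a, b, c2, c3 (invented values)
X = [design_row(g, gen) for g, gen in samples]
y = [sum(c * x for c, x in zip(true_coef, row)) for row in X]
a_hat, b_hat, c2_hat, c3_hat = ols(X, y)
```

On noise-free data the fit recovers the generating coefficients, which shows how a generation shift (c2, c3) is separated from the genotype slope (b).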
Five SNP-eQT pairs in Additional file 1 show significant genotype association at the 4.4 × 10⁻⁶ level after accounting for the generation effect: the two pairs mentioned above as well as RPS26-rs720428, CTSH-rs1021639, and GSTM1-rs1039337. On the other hand, significant association with the generation variable has been observed for these eQTs: DDX17 (4.8 × 10⁻¹⁵) on generation 2; PTPN22 (2.8 × 10⁻²⁰), CGI-96 (8.4 × 10⁻¹⁵), HLA-DRB1 (1.7 × 10⁻⁸), ZNF85 (3.5 × 10⁻⁸), IL16 (2.2 × 10⁻⁶), and CSTB (6.6 × 10⁻⁶) on generation 3. Interestingly, even though CSTB has a significant association with generation 3, it has an even stronger association with the genotype of rs157334.
A very simple check on whether founders tend to have different expression levels from non-founders is to calculate their percentile value with respect to other samples in the pedigree. For example, in a 14-member pedigree, each member has a percentile value ranging from 1/14 to 1, in increments of 1/14, determined by the rank of their expression level for a particular eQT. For 28 eQTs (the probe sets that represent the "consensus sequence" of DDX17 and HLA-DRB1 are selected, whereas those that are "example sequences" are discarded), the average, standard deviation, and median of the founders' percentile values are listed in Table 1. It is clear that for some eQTs, founders are indeed a biased subset of the pedigree. Most of the examples mentioned above as having a significant generation dependence in the regression model also show up in Additional file 1 as having higher or lower averaged founder percentiles: DDX17 (68%), PTPN22 (32%), CGI-96 (76%), HLA-DRB1 (36%), ZNF85 (35%), and CSTB (67%).
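The percentile calculation can be sketched as follows for a single pedigree and a single eQT. The member identifiers, expression values, and choice of founders are invented for illustration, and ties are not handled specially (an assumption, since the text does not say how ties were treated).

```python
from statistics import mean


def pedigree_percentiles(expression):
    """Map member id -> rank / n within one pedigree (rank 1 = lowest value)."""
    n = len(expression)
    ordered = sorted(expression, key=expression.get)
    return {member: rank / n for rank, member in enumerate(ordered, start=1)}


# Hypothetical 14-member pedigree; values are made up.
values = [7.1, 8.4, 6.2, 6.9, 7.5, 6.0, 6.6, 7.8, 6.4, 7.0, 6.8, 7.3, 6.1, 6.7]
expression = {f"m{i}": v for i, v in enumerate(values, start=1)}
founders = ["m1", "m2"]       # assumed founders of this toy pedigree

pct = pedigree_percentiles(expression)
founder_mean = mean(pct[m] for m in founders)
```

A founder mean well above or below 0.5, averaged over pedigrees, indicates that founders sit systematically high or low in the within-pedigree ranking for that eQT.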
To view more directly the relative location of founders' expression with respect to other pedigree members, Figure 1 shows the box-and-whisker plot of the 28 eQTs, with the founders marked by crosses.
Yet another simple check for the founder effect on expression is a two-way ANOVA with pedigree as one factor and founder/non-founder status as the second factor. Our results show that 1) pedigree-dependence of the expression is significant for most eQTs (with the exception of four to five eQTs); 2) founder-dependence of expression is significant for nine eQTs at the p-value = 0.001 level (CSTB, RPS26, DDX17, CGI-96, TM7SF3, IL16, ZNF85, PTPN22, and HLA-DRB1), consistent with the result from the percentile value calculation; and 3) for the founder-pedigree interaction term, four eQTs are significant at the p-value = 0.001 level.
One-stage versus two-stage analysis
Although it does not apply to the GAW data, in a practical setting, genome-wide genotyping information may be available only for the first stage of a study. In this situation, one may select SNPs that show promising association signals and type only these SNPs for the rest of the pedigree (second stage) in order to save cost. Van Steen et al. also suggest that a two-stage design helps to ease the multiple testing problem [14]. If whole-genome genotyping information is available for all pedigree members, one should use all samples in the analysis, while correcting for the sample correlation by appropriate procedures (such as the mixed model procedure discussed here).
To test where a two-stage design would lead us, we imagine a hypothetical situation in which we do not have the genotyping information for non-founders. We would then first exhaustively perform all possible genotype-expression linear regression analyses for the 56 founders. There are more than 8 million possible SNP-eQT pairs, leading to an equal number of p-values. From these results, the minimum p-value over all 2263 SNPs can be recorded for each eQT. There are 47 eQTs that have a minimum p-value lower than 4.4 × 10⁻⁶. Now we assume that in the second stage these selected SNPs are typed for non-founders. For each corresponding eQT-SNP pair, another linear regression analysis on all 194 members can be carried out, as well as a linear regression analysis with the generation variables as covariates. If we do that, among these 47 eQT-SNP pairs, only two remain significant at the 4.4 × 10⁻⁶ level for the 194-sample data set: CSTB-rs157334 and HSD17B12-rs1334334. Interestingly, these two pairs are the only overlap between the 47-pair set and the 31-pair set listed in Additional file 1. Furthermore, the two pairs are also the only ones showing significant association with genotype after the generation effect has been removed.
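The selection logic of this hypothetical two-stage design can be sketched as below. The sketch takes precomputed p-values as input rather than running the regressions; the eQT and SNP labels, the p-values, and the function names are all invented for illustration.

```python
THRESHOLD = 0.01 / 2263   # about 4.4e-6, the level used throughout


def stage_one(founder_pvalues, threshold):
    """For each eQT keep the SNP with the minimum founder p-value,
    provided that minimum falls below the threshold."""
    selected = {}
    for eqt, by_snp in founder_pvalues.items():
        best_snp = min(by_snp, key=by_snp.get)
        if by_snp[best_snp] < threshold:
            selected[eqt] = best_snp
    return selected


def stage_two(selected, all_member_pvalues, threshold):
    """Retest only the selected pairs on the full sample."""
    return [
        (eqt, snp)
        for eqt, snp in selected.items()
        if all_member_pvalues[(eqt, snp)] < threshold
    ]


# Toy inputs: founder p-values for every pair, full-sample p-values
# only for the pairs carried into the second stage.
founder_pvalues = {
    "eqt_A": {"snp_1": 1e-7, "snp_2": 0.3},
    "eqt_B": {"snp_1": 0.02, "snp_2": 2e-6},
    "eqt_C": {"snp_1": 0.5, "snp_2": 0.01},
}
all_member_pvalues = {("eqt_A", "snp_1"): 3e-9, ("eqt_B", "snp_2"): 0.04}

selected = stage_one(founder_pvalues, THRESHOLD)
confirmed = stage_two(selected, all_member_pvalues, THRESHOLD)
```

In this toy run two eQTs pass the first stage and one pair survives the second, mirroring on a small scale the reduction from 47 pairs to 2 described above.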
Because the eQTs listed in Table 1 of Cheung et al. [2] (and Additional file 1 here) are selected based on a larger data set and, more importantly, extra information (e.g., linkage signal), they stand a better chance of being true positives. The fact that only two eQTs retain a significant association with a SNP when tested using all pedigree members cautions against the practice of relying on a subset of the data rather than using all available data.