A combined SNP data set with both of the 1500 ASP families and the 2000 unrelated control subjects is analyzed for each replicate. Because methods such as ours are widely used on SNP data, we analyze the SNP data only (9187 SNPs). Due to computational constraints, we only test the mixed model with estimated kinship (below) using a random subset of 50 ASP families and 200 controls. The Matvec package [14] was used in the analysis of combined data sets. The R package [15] and the glmmPQL library was used for generalized linear mixed models with estimated kinship. For the purposes of illustration, we focus on results for Replicate 1. We then present results across all 100 replicates in order to assess power.
In Figure 1 we show the results for both simple logistic regression and mixed-model analysis on chromosomes 6 and 11, respectively. In each case, the phenotype is RA (rheumatoid arthritis) affection status. We used the SNP genotypes only. The former does not explicitly allow for the relatedness between individuals, whereas the latter does. However, we see that for this data both analyses find strong signals in the correct locations. Indeed, the results of the two methods are extremely similar, due to the lack of population structure in the simulated data. Note that analysis of all regions other than chromosomes 6 (Loci C and D), 11 (Locus F), and 18 (Locus E) found no signal (results not shown.) The horizontal line corresponds to a p-value of 1 × 10-4, chosen because an examination of chromosomes with no functional loci rarely contained p-values below this number (see discussion of significance below). The results from models using estimated kinship give the same pattern.
These plots suggest that cryptic relatedness is not a problem for these data. To further examine this, in Figure 2, we look at the cumulative distribution function [cdf] of p-values across all null SNPs (i.e., SNPs on all chromosomes with no functional loci), for Replicate 1. This is similar to the idea in Papanicolaou et al. [16]. Compared to situations in which phenotype and genome-wide genotype are heavily correlated, where the cdf lies above the diagonal (e.g., Zhao et al. [8]), here the p-values are distributed close to the expected way. In fact they lie slightly below the diagonal, indicating under-dispersion. Rather than correcting the under-dispersion of the p-values, the mixed-model seems to exaggerate that tendency.
All replicates show the same pattern for the p-value distributions. Positive correlation between loci (i.e., linkage disequilibrium) can lead to under- or over-dispersion of p-values. We tested this possibility by thinning the data by choosing 1/10 of the SNPs. Now, the SNPs are no longer correlated (more than 95% of r2 values between neighboring SNPs are less than 0.02). The dispersion pattern remains the same. Thus, correlation between loci is not the cause. The apparent tendency of a mixed-model analysis to exaggerate the under-dispersion has, to our knowledge, not been documented previously, and deserves further investigation to discover whether it is a consequence of the particular simulation scheme used for this data. Nonetheless, it is reassuring that signals are found on Loci C, D, E, and F despite this. Indeed, while mixed models are known to help when cryptic relatedness is present, our results indicate that they do not hurt a great deal when it is not.
To examine overall power, we analyzed all 100 replicates for the cases discussed above. We find that our method consistently finds the functional Loci C, D, E, and F in these regions. A formal power study would require a less heuristic method of assessing significance than an arbitrary (albeit supported by empirical results) definition of p < 10-4. One scheme is to proceed via a permutation-based test in which replicate data sets are created, their phenotypes permuted, and the distribution of smallest p-value is observed. Computational requirements prohibit the use of such a scheme here. In any case, the use of large numbers of genome-wide SNPs to get the null distribution and then derive an empirical p-value cut-off (10-4 above), is probably sufficient. All 100 replicates have p-value less than 10-4 at the SNP nearest to the functional Loci C, D, E, and F for the mixed-model analysis.
Finally, we assess the accuracy with which kinship is estimated using a pool of genome-wide markers. We find that results are good for the GAW15 simulated data. For kinship values within families that are supposed to be 0.25, estimates vary from 0.2 to 0.35. For values supposed to be 0, estimates vary from 0 to 0.05. Unfortunately, the simulated scheme appears not to have included any ancestry relationships between control individuals, or across case families. In reality, all individuals are related to some degree, although the significance of that relatedness will vary with individuals and with the analysis being performed. While our results show that kinship can be estimated in this study, a more comprehensive exploration of this issue, involving data with a more complex relatedness structure, would be of some interest.