The algorithm is motivated by the following observations. For high-density SNP markers (e.g., 300 K or 500 K SNP arrays), nearby SNPs are likely to be in linkage disequilibrium (LD). In a two-stage design, a liberal significance level α (such as 0.05 without the Bonferroni correction) is usually used in stage one to reduce the chance that true signals are filtered out. On average, Mα SNPs will be selected for stage two, where M is the total number of markers in stage one (300 K or 500 K); at α = 0.05 this amounts to roughly 15,000 to 25,000 SNPs. However, most of the Mα SNPs are false positives with respect to the disease under study. Furthermore, if a SNP shows a moderate association with the disease and has been selected in stage one, nearby SNPs that are in high LD with it are highly likely to be selected for stage two as well. In other words, many of the Mα SNPs may be in high LD with one another. Therefore, I propose to apply a clustering algorithm to all SNPs selected in stage one to explore the dependence among the Mα SNPs.
More specifically, the Mα SNPs are first ranked according to their significance levels. Starting from the SNP with the highest rank (smallest p-value), all SNPs that are highly correlated with it (pairwise LD D' larger than a predefined threshold) and that lie within a certain physical distance of it (a second parameter) are grouped into a cluster. The cluster is represented by the SNP with the highest rank. The process continues in decreasing order of rank over the SNPs that have not yet been clustered, until all SNPs have been processed. At the end, the algorithm returns a set of clusters, each represented by the highest-ranked SNP within it. A SNP can only be grouped with a nearby representative (as defined by the distance threshold), which guards against spurious LD that can arise by chance between two distant SNPs. SNPs in a cluster are not necessarily consecutive.
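The clustering procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function name `cluster_snps`, the tuple layout `(snp_id, position_bp, p_value)`, and the `ld` callback returning a pairwise D' value are all hypothetical, and ties in p-value are broken by input order.

```python
def cluster_snps(snps, ld, ld_threshold=0.8, max_distance=100_000):
    """Greedy LD-based clustering of SNPs selected in stage one.

    snps: list of (snp_id, position_bp, p_value) tuples (assumed layout).
    ld:   callable (snp_id_a, snp_id_b) -> pairwise D' (assumed interface).
    Returns a dict mapping each representative SNP to its cluster members.
    """
    # Rank SNPs by significance: smallest p-value first.
    ranked = sorted(snps, key=lambda s: s[2])
    assigned = set()
    clusters = {}
    for rep_id, rep_pos, _ in ranked:
        if rep_id in assigned:
            continue  # already absorbed into a higher-ranked cluster
        assigned.add(rep_id)
        members = [rep_id]
        for snp_id, pos, _ in ranked:
            if snp_id in assigned:
                continue
            # A SNP joins only if it is physically close to the
            # representative AND in high LD with it.
            close = abs(pos - rep_pos) <= max_distance
            if close and ld(rep_id, snp_id) > ld_threshold:
                members.append(snp_id)
                assigned.add(snp_id)
        clusters[rep_id] = members
    return clusters
```

Because membership is decided only against the representative, a cluster may skip over intermediate SNPs, which matches the remark that clustered SNPs need not be consecutive. The nested loop is quadratic in the number of selected SNPs; sorting by position and scanning only a window around each representative would be a natural refinement.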
Clearly, the above clustering algorithm can reduce the number of SNPs to be considered in stage two, and its effectiveness depends on the correlations among SNPs as well as on the two parameters (the LD threshold and the distance threshold). Joint analysis assuming heterogeneity is adopted in this study because it has higher power than replication-based analysis and requires fewer assumptions. A proper significance level has to be derived for such an analysis. In general, suppose a liberal significance level α with critical value c1 is used in stage one. Let X1 denote the χ2 test statistic based on the samples in stage one. Only markers with X1 > c1 are considered further in stage two. For a marker genotyped in stage two, let X2 denote the test statistic based on the samples from stage two. Under the null hypothesis of no association, X1 and X2 are independent and each follows a χ2 distribution with 1 degree of freedom. For the joint analysis, the statistic X is the sum of X1 and X2. Notice that X and X1 are not independent, even under the null hypothesis. Let f(x) and F(x) denote the probability density function and the cumulative distribution function of the χ2 distribution with 1 degree of freedom. The significance level of X at a value c can then be calculated numerically based on the following formula:
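As an illustration of how such a significance level could be evaluated numerically, the sketch below assumes the quantity of interest is the joint null probability P(X1 > c1, X1 + X2 > c). For c > c1, independence of X1 and X2 gives P(X1 > c1, X1 + X2 > c) = ∫ from c1 to c of f(x)[1 − F(c − x)] dx + [1 − F(c)], and dividing by α = 1 − F(c1) gives the probability conditional on passing stage one. This expression is derived here from the definitions above and is not necessarily the exact form used in this study; the function names and the choice of Simpson's rule are likewise assumptions.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability 1 - F(x) of a chi-square variable with 1 df."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

def joint_tail(c, c1, n=4000):
    """Approximate P(X1 > c1, X1 + X2 > c) for independent chi-square(1)
    variables X1 and X2, using Simpson's rule with n (even) intervals.

    The substitution X1 = Z**2, with Z standard normal, turns the integral
    over x into an integral over z and avoids the 1/sqrt(x) factor in f(x).
    """
    if c <= c1:
        # Any marker passing stage one already exceeds c.
        return chi2_sf_1df(c1)
    a, b = math.sqrt(max(c1, 0.0)), math.sqrt(c)

    def g(z):
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        # Density of |Z| at z times P(X2 > c - z**2).
        return 2.0 * phi * chi2_sf_1df(c - z * z)

    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    # Add the region X1 > c, where X1 + X2 > c holds automatically.
    return total * h / 3.0 + chi2_sf_1df(c)
```

A useful sanity check is the case c1 = 0, where no filtering takes place and the result should match the tail of a χ2 distribution with 2 degrees of freedom, exp(−c/2). A critical value c for a target overall level could then be found by bisection on `joint_tail`, since the function decreases monotonically in c.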
I applied the above clustering algorithm within a two-stage design using the joint analysis on the simulated data sets of Problem 3. All analyses were carried out with knowledge of the true disease gene locations. I first tested the algorithm on the dense SNP set on chromosome 6, which contains the HLA-DRB1 locus and Locus D. The total number of SNPs is 17,820, with an average inter-marker interval of 10 kbp, which corresponds to a 300 K array. As a comparison, I also applied the algorithm to the SNP data on chromosome 18, which mimic a 10 K SNP chip set. SNP data on chromosome 1 were used to evaluate the type I error.
I first constructed data sets for a case-control study with a two-stage design. For each data set, one affected child was randomly chosen as a case subject from each nuclear family with an affected sib pair, and one child was selected as a control subject from each normal family. Therefore, all cases and controls are independent. Because some alleles around the HLA-DRB1 locus have very strong effects on disease status, only a very small fraction of the available subjects (1500 cases and 2000 controls) were randomly selected for testing. Let n denote the total number of subjects tested in stage one and stage two together, with an equal number of cases and controls. For chromosome 6, n took the values 100, 200, and 300. Let f denote the fraction of subjects tested in stage one; f took the values 0.3, 0.4, and 0.5 in this experiment. I assumed that only nf subjects were genotyped for all m SNPs in stage one. The Pearson χ2 statistic was used to select a subset of k SNPs for stage two based on a significance level of 0.05 without adjustment. The clustering algorithm was then applied to the k SNPs with an LD threshold of D' = 0.8 and a distance threshold of 100 kbp for chromosome 6. For each parameter combination, 100 independent replicates were randomly sampled from the original data sets.
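The stage-one selection step can be illustrated with a small sketch. It assumes each SNP is summarized as a 2×2 table of counts (for example, allele counts in cases and controls), which yields a 1-degree-of-freedom test; the text does not state how the tables were formed, so the table layout, the function names, and the SNP labels in the usage note are hypothetical.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (rows: cases, controls; columns: the two alleles; assumed layout)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table (an empty row or column): no evidence
    return n * (a * d - b * c) ** 2 / denom

def p_value_1df(x):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0

def select_for_stage_two(tables, alpha=0.05):
    """Return the SNP ids whose unadjusted stage-one p-value is below alpha.

    tables: dict mapping snp_id -> (a, b, c, d) counts (assumed layout).
    """
    return [snp for snp, t in tables.items()
            if p_value_1df(pearson_chi2_2x2(*t)) < alpha]
```

For instance, with `tables = {"snp_x": (10, 20, 20, 10), "snp_y": (15, 15, 15, 15)}`, only `snp_x` would be passed on to stage two, since the second table shows no difference between cases and controls.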
I investigated and compared the power, costs, significance levels, and prediction errors (the distances from the predicted locations to the true gene location) of three methods: the one-stage design using all data, the two-stage design without clustering, and the two-stage design with clustering. For chromosome 18 and chromosome 1, a different set of parameters was used (e.g., n = 750, 1000, 1250, and a distance threshold for clustering of 5 Mbp), because the total number of markers on each of these chromosomes is much smaller than the number of SNPs on chromosome 6 and the effect of Locus E on chromosome 18 is much smaller than that of the HLA-DRB1 locus.