Suppose we have data for N SNPs on M individuals. Let L(i) denote the location (in kb) of SNP i. Let gk, idenote the genotype data for individual k at SNP i, where g
is defined as the number of copies of the minor allele at this locus. (Our methods also generalize naturally to microsatellite data.) We proceed from left to right along each chromosome, assessing similarity between pairs of genotypes around each SNP. Suppose we are considering a pair of individuals j1, j2 in a region centered around SNP I, on chromosome C. Intuitively speaking, at each location I, our algorithm calculates a similarity measure equal to the maximum possible haplotype shared length around I. More formally, we define a function f
(j1, j2) as: 2 if gj1, i= gj2, i; 1 if |gj1, i-gj2, i| = 1; and 0 if |gj1, i-gj2, i| = 2. f
(j1, j2) simply counts whether both, one, or neither of the pairs of haplotypes for j1, and j2 can match at SNP i. We wish to stop recording shared lengths on a given haplotype as soon as a mismatch is found, so we further define F
(j1, j2) as
For each pair of individuals j1, j2 we define the similarity around I as S
(j1, j2), where
C(L) [C(R)] denotes the left- [right-]most SNP on chromosome C. We explored a range of other haplotype similarity measures, with similar results (not shown).
We make several observations about this procedure. First, it is very quick to compute. Second, at each location, it calculates a similarity S
(·,·) based upon phasing each pair of genotypes such that the possible haplotype sharing between those two genotypes is maximized, and this is performed independently across all pairs of individuals. This is an approximation, in that our measure of identical-by-state shared length must be greater than or equal to that which is the case for the (unobserved) true haplotypes. In principle, we might calculate the assignment of phase that maximizes the total similarity score across all pairs of individuals, but this would be highly computational intensive, and completely intractable for data sets of the dimensions that are now becoming typical in genome-wide studies. Third, we are appealing to the intuition that failure to explicitly allow for dependencies across individuals when assigning haplotype phase will not introduce too much noise into the analysis, but will allow us to efficiently exploit cladistic analysis techniques that are known to perform well in haplotype-based contexts. We now give further details of our analysis.
At each SNP I, we form an M × M distance matrix using the reciprocals of the similarity measures S
(·,·). We then apply a hierarchical agglomerative clustering algorithm to construct a tree relating all individuals in the sample. Next, we use an iterative scheme to break this tree into a number of clusters (clades). Suppose that the current iteration has the tree broken into c clusters. At each step we explore all possible points at which the tree can be further broken. A p-value for each potential breakpoint is calculated by considering the clusters that exist after breaking the tree at this point. We define a factor variable, W, that takes a single value for all individuals within each cluster, but whose value differs across clusters. We then apply a Kruskal-Wallis non-parametric analysis of variance to both the new topology, with c + 1 clusters, and the old topology, with c clusters, to see whether W better explains the phenotype of interest. Assuming at least one potential breakpoint results in a p-value that is smaller than that which was obtained from the representation with c clusters, we accept the breakpoint with the smallest such p-value. We iterate this process until no decrease in p-value is obtained. We then record the p-value obtained at this locus and repeat the scheme at every other locus. The intuition here is that if we are at a location near a mutation that influences the phenotype, it is likely that we will be able to break the tree into a number of pieces that explain the phenotype (with each cluster corresponding, perhaps, to a unique occurrence of that mutation).
This procedure results in a p-value for each SNP. It remains to assess the significance of the results. We expect markers near a trait locus will have small p-values. A traditional way to determine a significance cut-off for these p-values is to use a permutation scheme. For example, for a data set of interest, we create 100 new data sets in which we maintain the genotype data but permute phenotypes across individuals. We then analyze these data sets using our method and record the value of the lowest p-value observed in each case. A variety of other permutation tests are possible, depending on the hypothesis one wishes to test. It is common practice to follow a procedure such as the one we use here in order to attest the significance of the smallest p-value observed in the region under consideration (in our case, a chromosome, although one could perform a genome-wide permutation test if desired). We use the set of lowest observed p-values to define a significance cut-off for the original, observed data. Such a scheme is a computationally intensive method to employ. While the permutation test for a single replicate can be completed in about 12 hours, and is this highly tractable for real applications, use of the permutation test across all 100 replicates takes around 50 days per chromosome. Thus, we present permutation results only for chromosome 18 (the case in which significance of results is most in doubt – see below) . For other analyses, we focus solely on a presentation of results showing the mean distance between the trait locus and the SNP with the smallest p-value (see Results and Discussion).
A further complication is that of computational efficiency. While our method, which is implemented in C code, is able to analyze samples of a few hundred individuals in around 10 minutes, it would prove impossible to analyze the full set of cases and controls, across all 100 replicates in the time available. Instead, we chose to implement the following scheme. For each analysis of a given chromosome, for a particular phenotype of interest, we construct 10 data sets consisting of 100 'cases' and 100 'controls' (sampling without replacement). The definition of 'case' and 'control' depends on the phenotype of interest, and is given in the Results. We analyze each of these 10 data sets, record the p-value for each locus in each analysis, and report the average p-value across the 10 analyses as the final 'score' for that locus.
Due to space requirements, the issue of missing values is not directly examined in this paper. However, it would be relatively straightforward to extend our methods using schemes that impute missing values. Alternatively, we could adjust the haplotype similarity measure f
(j1, j2) so that it includes a score, perhaps based upon allele frequencies at the locus, for genotypes with missing values at a given locus.