Two-step multiple-marker strategy
We propose a flexible two-step multiple-marker analytical approach for genome-wide association studies. In the first step, the local score method is applied to case-control data to detect and rank candidate regions across the genome; it serves as a screening tool. In the second step, these candidate regions are tested for association with the studied phenotype in a sample of family data using FBAT-LC [5], and the resulting p-values are corrected for multiple testing. The two steps are independent of each other, and each can be modified according to the type of data collected. We chose to make full use of the Problem 3 data, which included both case-control and family data; this guided the choice of test statistics suited to these data.
Step 1: Detecting candidate regions
The local score method used the Pearson chi-square statistic applied to the case-control genotypic contingency table for each marker to produce a sequence of scores [7].
Let $X = (X_i)_{i=1,\dots,n}$ be a sequence of real random variables. In our context, $X$ represents the sequence of association statistics attached to the markers along the genome. The statistic

$$H = \max_{1 \le a \le b \le n} \sum_{i=a}^{b} X_i$$

defines the local score assigned to $X$. In practice, it is the value of the region with the maximal sum of scores $X_i$. Consequently, the variables $X_i$ must be negative on average; otherwise the best region would easily span the entire sequence. This definition is restricted to the highest-scoring region. The next high-scoring regions are potentially interesting as well, because the data set may contain more than one trait locus (TL). We define the kth best region as the region attaining the local score of the initial sequence once the preceding k - 1 best regions have been excluded. In this case, H(1) > ... > H(k) are the scores of the k first and distinct highest-scoring genomic regions. The advantage over single-marker strategies lies in the ability of this statistic to identify a set of candidate genomic regions that may contain genes involved in the disease.
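As a small illustration of the definition, consider a hypothetical sequence of six scores (values chosen for the example, not taken from the data) whose mean is negative. The best region is the pair of adjacent positive scores, and the second-best region is found among the positions that remain once that pair is excluded:

```latex
X = (-2,\; 2,\; 3,\; -4,\; 1,\; -3), \qquad
H(1) = \max_{1 \le a \le b \le 6} \sum_{i=a}^{b} X_i = X_2 + X_3 = 5, \qquad
H(2) = X_5 = 1 .
```

Extending the first region to include the fifth score would lower its sum to 2, which is why the two regions are reported separately.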
The algorithm of the local score approach comprises the following three procedures: i) producing the initial sequence $X$: we assign to each marker a statistic of association ($X_i$) corresponding, in our case, to the Pearson chi-square test of case-control marker genotype frequencies. This strategy requires $X$ to be negative on average, which does not hold for positive statistics such as the Pearson chi-square, so a constant δ must be subtracted from the whole signal $X$. In this study, δ is the value of the statistic $X_i$ at the classical 5% significance level, and we let $X' = X - \delta$; ii) identifying the highest-scoring region: a simple approach to obtain the local score from $X'$ consists of comparing the value of $\sum_{i=a}^{b} X'_i$ for all possible regions [a; b], excluding regions that span different chromosomes; iii) identifying the next high-scoring regions with an iterative algorithm: find the highest-scoring region, remove it from $X'$, and apply the algorithm again until no positive local score remains in the sequence. At the end, the number of tests has been reduced from M markers to N candidate genomic regions ranked according to their local scores.
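The three procedures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, the linear scan (Kadane's algorithm) replaces the exhaustive comparison of all regions [a; b], "removing" a region is interpreted as splitting the sequence so that later regions cannot bridge the gap, and δ = 5.99 assumes a chi-square statistic with 2 degrees of freedom (three genotype classes).

```python
from typing import List, Optional, Sequence, Tuple

Region = Tuple[int, int, float]  # (first marker index, last marker index, local score)


def best_region(scores: Sequence[float], lo: int, hi: int) -> Optional[Region]:
    """Return the maximal-sum region within scores[lo:hi], or None if no sum is positive."""
    best: Optional[Region] = None
    run_sum, run_start = 0.0, lo
    for i in range(lo, hi):
        if run_sum <= 0:  # a non-positive prefix can never help: restart here
            run_sum, run_start = 0.0, i
        run_sum += scores[i]
        if run_sum > 0 and (best is None or run_sum > best[2]):
            best = (run_start, i, run_sum)
    return best


def local_score_regions(chi2: Sequence[float], chrom: Sequence[int],
                        delta: float = 5.99) -> List[Region]:
    """Rank disjoint candidate regions by local score (assumed reading of steps i-iii)."""
    # i) shift the positive statistics so that they are negative on average
    shifted = [x - delta for x in chi2]

    # One segment per chromosome, so that no region spans two chromosomes
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(chrom) + 1):
        if i == len(chrom) or chrom[i] != chrom[start]:
            segments.append((start, i))
            start = i

    regions: List[Region] = []
    while True:
        # ii) highest-scoring region over all remaining segments
        found = [best_region(shifted, lo, hi) for lo, hi in segments]
        candidates = [r for r in found if r is not None]
        if not candidates:
            break  # no positive local score left
        a, b, h = max(candidates, key=lambda r: r[2])
        regions.append((a, b, h))

        # iii) remove the region by splitting the segment that contains it
        remaining: List[Tuple[int, int]] = []
        for lo, hi in segments:
            if lo <= a and b < hi:
                if lo < a:
                    remaining.append((lo, a))
                if b + 1 < hi:
                    remaining.append((b + 1, hi))
            else:
                remaining.append((lo, hi))
        segments = remaining
    return regions
```

For example, with the hypothetical chi-square values `[1.0, 8.0, 9.0, 2.0, 7.0, 1.0, 10.0]` on chromosomes `[1, 1, 1, 1, 1, 2, 2]`, the sketch returns the marker index ranges (1, 2), (6, 6) and (4, 4), in decreasing order of local score. Splitting rather than concatenating the flanks keeps every reported region contiguous in the original marker order.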
Step 2: Testing candidate regions for association
The new FBAT extension proposed by Xu et al. [5] was used to analyze, in the family data, the regions selected in Step 1. This method allows multiple markers to be tested simultaneously without haplotype reconstruction, and provides significance levels. In brief, the FBAT-LC test is based on a linear combination of single-marker FBAT test statistics using data-driven weights, where the marker weights are derived from the "conditional mean model" [8]. The FBAT test for each bi-allelic marker is carried out for only one allele. Under an additive model, this test does not depend on the selected allele. Finally, for the p-values obtained for all candidate regions, different corrections for multiple testing were compared: no correction, Benjamini and Hochberg correction, and Bonferroni correction. A region was considered significant if the corrected p-value was less than 5%.
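The two corrections compared here can be written as adjusted p-values, which are then compared with the 5% threshold. The sketch below is a generic illustration of the Bonferroni and Benjamini and Hochberg (step-up) adjustments, with hypothetical function names; it does not reproduce the FBAT-LC test itself, whose region-level p-values are taken as given.

```python
from typing import List, Sequence


def bonferroni(pvals: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Benjamini and Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by increasing p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(adjusted: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag regions whose corrected p-value is below the threshold."""
    return [p < alpha for p in adjusted]
```

With the hypothetical region p-values `[0.001, 0.02, 0.03, 0.5]`, Bonferroni retains only the first region at the 5% level, whereas the Benjamini and Hochberg adjustment retains the first three, which illustrates why the choice of correction affects the sensitivity and false-discovery rate reported later.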
Performance of the multiple-marker two-step strategy and comparison with the single-marker approach
We assessed the ability of our strategy to reveal regions containing the trait loci by comparing the results obtained from the analysis of all Problem 3 case-control and family data replicates with the answers that were provided. Because the local score was applied to case-control data in Step 1 and FBAT-LC to affected sib-pair (ASP) data in Step 2, we formed 50 replicates of association-study data sets, each consisting of two independent samples: one replicate of case-control data and one replicate of family data. Each case-control data replicate included 1500 cases (one case drawn at random from each ASP) and 2000 controls genotyped for the 9187 SNPs. Each family data replicate included 1500 ASPs genotyped for all SNPs belonging to the candidate regions selected in Step 1.
To evaluate the performance of our strategy, we first identified, in each replicate, the true-positive and true-negative regions among those selected in Step 1, a region being defined as positive if it contained at least one of the two flanking markers of any hidden trait locus. We then derived the following three quantities: 1) sensitivity, the proportion of true-positive regions that were correctly identified by the FBAT-LC test; 2) specificity, the proportion of true-negative regions that were correctly identified by the FBAT-LC test; 3) the false-discovery rate (FDR), the proportion of false positives among the results declared significant. The mean and standard deviation of each of these three quantities were computed over the 50 replicates of family data. The estimates were compared according to the correction applied to the FBAT-LC p-values. The average proportion of trait loci detected by our two-step approach over the 50 replicates of association-study data sets was also derived.
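The three indicators can be computed per replicate from two Boolean vectors, one marking which selected regions are truly positive and one marking which were declared significant. The following is a minimal sketch with hypothetical function names. Two conventions in it are assumptions rather than choices documented in the text: the FDR is set to 0 when nothing is declared significant, and a replicate in which an indicator is undefined (for example, no true-positive region was selected) is skipped when averaging.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence, Tuple


def region_metrics(truth: Sequence[bool],
                   declared: Sequence[bool]
                   ) -> Tuple[Optional[float], Optional[float], float]:
    """Return (sensitivity, specificity, FDR) for one replicate."""
    pairs = list(zip(truth, declared))
    tp = sum(1 for t, d in pairs if t and d)          # positive region, significant
    fn = sum(1 for t, d in pairs if t and not d)      # positive region, missed
    fp = sum(1 for t, d in pairs if not t and d)      # negative region, significant
    tn = sum(1 for t, d in pairs if not t and not d)  # negative region, not significant
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    fdr = fp / (tp + fp) if tp + fp else 0.0          # assumed convention: 0 if no discovery
    return sensitivity, specificity, fdr


def summarize(values: Sequence[Optional[float]]
              ) -> Tuple[Optional[float], Optional[float]]:
    """Mean and sample standard deviation over replicates, skipping undefined values."""
    kept: List[float] = [v for v in values if v is not None]
    if not kept:
        return None, None
    sd = stdev(kept) if len(kept) > 1 else 0.0
    return mean(kept), sd
```

Calling `region_metrics` once per replicate and passing each column of results to `summarize` gives the average estimate and standard deviation for one correction method; repeating this for uncorrected, Benjamini and Hochberg, and Bonferroni p-values yields the comparison described above.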
We then conducted a two-stage single-marker analysis to compare with our multiple-marker strategy. All 9187 genotyped SNPs were ranked according to the p-values of the Pearson chi-square test applied to the case-control genotypic contingency table. The M markers with the smallest p-values were selected for analysis in Step 2, where M was the average number of markers belonging to the regions selected by the local score method over the 50 replicates. In Step 2, a single-marker FBAT was applied to each of the M selected markers, and the p-values were either left uncorrected or corrected using the Benjamini and Hochberg or the Bonferroni correction. To be comparable with the above definition of a true-positive region, the true-positive markers among the M selected markers were those flanking each trait locus. Estimates of the same performance indicators, as defined above, were derived over the 50 replicates of association-study data sets.
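The marker selection for this single-marker comparison can be sketched as follows. The function names are hypothetical, regions are assumed to be given as inclusive (first, last) marker index pairs, and rounding the average to the nearest integer is an assumption, since the text does not say how a non-integer average was handled.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def choose_m(regions_per_replicate: Sequence[Sequence[Tuple[int, int]]]) -> int:
    """Average number of markers covered by the selected regions over the replicates."""
    counts = [sum(last - first + 1 for first, last in regions)
              for regions in regions_per_replicate]
    return round(mean(counts))  # rounding is an assumed convention


def top_m_markers(pvals: Sequence[float], m: int) -> List[int]:
    """Indices of the m markers with the smallest case-control p-values."""
    ranked = sorted(range(len(pvals)), key=lambda i: pvals[i])
    return ranked[:m]
```

Matching M to the size of the local score selection keeps the number of markers carried into Step 2 comparable between the two strategies, so that differences in performance are not driven by the amount of follow-up testing.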