Inference of disease associations with unmeasured genetic variants by combining results from genome-wide association studies with linkage disequilibrium patterns in a reference data set.

Results from whole-genome association studies of many common diseases are now available. Increasingly, these are being incorporated into meta-analyses to increase the power to detect weak associations with measured single-nucleotide polymorphisms (SNPs). Imputation of genotypes at unmeasured loci has been widely applied using patterns of linkage disequilibrium (LD) observed in the HapMap panels, but there is a need for alternative methods that can utilize the pooled effect estimates from meta-analyses and explore possible associations with SNPs and haplotypes that are not included in HapMap.By a weighted average technique, we show that association results for common SNPs in an observed data set can be scaled and combined to infer the effect of a genetic variant that has been measured only in an independent reference data set. We show that the ratio p(R-1)/[1 + p(R-1)], where R is the relative risk associated with a measured or unmeasured allele of frequency p, is appropriately scaled by 1/D' and weighted in proportion to r2, both common measures of LD being derived from the reference data set.We illustrate this computationally simple method by combining the results of a genome-wide association screen from the North American Rheumatoid Arthritis Consortium with LD measures from the British 1958 Birth Cohort, and explore the validity of underlying assumptions about the generalizability of LD from one population to another, and from healthy subjects to subjects with clinical disease.

By a weighted average technique, we show that association results for common SNPs in an observed data set can be scaled and combined to infer the effect of a genetic variant that has been measured only in an independent reference data set. We show that the ratio p(R-1)/[1 + p(R-1)], where R is the relative risk associated with a measured or unmeasured allele of frequency p, is appropriately scaled by 1/D' and weighted in proportion to r 2 , both common measures of LD being derived from the reference data set.
We illustrate this computationally simple method by combining the results of a genome-wide association screen from the North American Rheumatoid Arthritis Consortium with LD measures from the British 1958 Birth Cohort, and explore the validity of underlying assumptions about the generalizability of LD from one population to another, and from healthy subjects to subjects with clinical disease.

Background
The HLA allele DRB1*04 has been shown to be more strongly associated with rheumatoid arthritis than nearby tagging single-nucleotide polymorphisms (SNPs) [1]. We propose here a method of inferring the effects of an unmeasured genetic variant (such as DRB1*04) using linkage disequilibrium (LD) measures from an independent reference data set (the British 1958 Page 1 of 6 (page number not for citation purposes)

Open Access
Birth Cohort) to scale and weight the associations of rheumatoid arthritis with tagSNPs in the North American Rheumatoid Arthritis Consortium (NARAC) data set, supplied as Problem 1 for the Genetic Analysis Workshop 16.

Methods
The British 1958 Birth Cohort (B58C) compromises all infants in England, Wales, and Scotland born in one week in 1958. During a follow-up in 2002 to 2004 [2], a cell-line-backed DNA collection was established as a nationally representative reference set for genetic casecontrol studies. Field protocols and consent forms were approved by the South East England Multi-Centre Research Ethics Committee. Genome-wide data from the Illumina HumanHap550 Beadarray on 1430 members of the B58C was deposited by the Wellcome Trust Sanger Institute [3]. In addition, data on HLA typing using Dynal technologies were deposited by the Diabetes & Inflammation Laboratory, Cambridge [4]. Further details about these deposits and the B58C DNA Collection are published online [5].
The NARAC data set [6], provided as Problem 1 for the Genetic Analysis Workshop 16, consists of individuallevel genotype data for 550,000 SNPs tested on the Illumina HumanHap550 Bead Array linked to case/ control status for rheumatoid arthritis as well as the HLA alleles at the DRB1 locus. We derived a numerical score to represent the number of DRB1*04 alleles for each individual in both the NARAC and B58C data sets. Association tests between the rheumatoid arthritis case/ control status and each tagSNP in the MHC region (chromosome 6; 27-33 Mb [7]) were generated using Stata™ 9.2.
We tested whether combining the LD patterns from the B58C and the association patterns from NARAC provide an unbiased estimate the effect of the DRB1*04 allele, using only neighboring tagSNPs (± 300 kb) that are common to both data sets. This application has the advantage that the DRB1*04 allele (our target variant) has been measured directly in both data sets, permitting validity checks between our inferred relative risk estimate and the observed effect in the NARAC case-control study.
We define Φ obs = q(R obs -1)/[1 + q(R obs -1)], where R obs is the relative risk of disease associated with each copy of a measured allele of frequency q. The expression for Φ is similar to that for the population-attributable risk fraction (PARF) in epidemiological studies. However, in this genetic application, because each individual has two chromosomes, the PARF of a variant is Φ(2-Φ). In the Appendix we show that when the tagSNP is more common than the target variant, Φ obs /D' is an unbiased estimator of Φ true , where Φ true is the equivalent parameter at the unmeasured target locus and D' is the conventional measure of LD between the measured tagSNP and the target variant in an undiseased population. We also show (see Appendix) that, for any given target variant and study design, the variance of Φ obs for each tagSNP under the null hypothesis is inversely proportional to the LD measure r 2 , relating the measured tagSNP to the unmeasured target variant in the undiseased population.
We calculated a weighted average of the values of Φ infer = Φ obs /D' across all tagSNPs in the selected 600-kb region, and derived an inferred relative risk from this pooled estimate of Φ. In this paper we explore the effect of different assumptions about the generalizability of LD measures and selection of tagSNPs upon this point estimate. An empirical variance of this pooled estimate under the null hypothesis was derived from multiple random permutations of the B58C data set.

Results
Individually linked HLA-DRB1 diplotypes and Illumina HumanHap550 genotypes were available for 1217 members of B58C and for 1187 controls and 799 cases from NARAC. The frequency of DRB1*04 was 21.1% in the B58C reference set, 16.5% in the NARAC controls and 52.2% in the NARAC cases. Counting chromosomes among cases and controls, the observed odds ratio of rheumatoid arthritis per copy of the DRB1*04 variant was 5.54 (95%CI 4.78-6.41) corresponding to a Φ of 0.4280.
All subsequent data relates to 156 SNPs common to both data sets where the minor allele frequency (MAF) of the tag is greater than that of DRB1*04, also less than 44% in the reference data set and |D| > 0.01, where D is the covariance measure of LD. These restrictions are imposed to remove SNPs where the effect allele could be inconsistent between the B58C data and the NARAC controls data.
For general application, the method needs to apply measures of LD derived from a reference data set to the observed data set where the target variant has not been measured. Figure 1 shows good correspondence for all measures of LD between B58C and NARAC controls. Our inference method assumes that the s measure of LD (see Appendix for definition) is constant between cases and controls. In this example, even where there is a strong association between the target variant and disease, this assumption appears valid, as shown in Figure 2. In most situations, the association of the target variant with tagSNPs in the observed case-control data would be unknown and estimated from the reference data set (e.g., B58C). Using D' and r 2 from the B58C data, the pooled estimate of Φ is 0.4530, which is very close to the value of 0.4280 derived directly in the NARAC case-control data.
The variance of the pooled estimate of Φ is greater than the sum of the weights, because many of the neighboring tagSNPs are intercorrelated. The precision of this pooled estimate was estimated by performing 1000 random permutations of the B58C data set to derive the variance of the pooled estimate under the null hypothesis. These results are summarized in Table 1. Inclusion of all 156 SNPs generated a pooled estimate of similar magnitude and precision to that from the three most closely associated SNPs. On the other hand, if no closely associated tagSNPs had been available, results with similar magnitude and only slight loss of precision could be obtained from the majority of SNPs with r 2 < 0.3. Indeed, in this example, from a region of high LD (D' > 0.5), useful information could be derived even from tagSNPs with r 2 < 0.1.

Conclusion
Comparing B58C to the NARAC controls, there was close correspondence in LD patterns. Within the NARAC data, however, the LD measures r 2 and t in the region of interest were markedly different in cases and controls, reflecting the strong association between the target variant (DRB1*04) and rheumatoid arthritis. When compared with imputation approaches, the inference method has the advantage of working only with the LD pattern among controls, although it does assume generalizability of the sensitivity parameter s from controls to cases. This latter assumption was supported in the NARAC data. The inference approach that we describe can, with modification, be applied to continuous outcomes and offers an alternative to imputation methods that will be particularly attractive for target variants that are not part of standard panels such as HapMap, and for exploring further genome-wide sets of association measures that have been derived from metaanalysis of a particular disease or quantitative trait.   Within a total population, the LD between each tagSNP and an unmeasured target variant may be represented by Table 2, based on relative counts of chromosomes. The parameter s represents the proportion of the target variant that also has the tagSNP variant (akin to sensitivity in screening terminology). Another measure, equivalent to the positive predictive value, is t (= sp/q), the proportion of chromosomes with the tagSNP allele that also have the target variant. We seek to infer a value for the true relative risk (R) from the measured relative risk (R obs ) associated with each tagSNP.

List of abbreviations used
We define a transformation of R, as follows: We show below that Φ obs /D' is an unbiased estimator of Φ true . In We infer values of Φ infer = Φ obs /D' for each tagSNP and derive a weighted average, with weights inversely proportional to their variance under the null hypothesis. An approximate variance for Φ infer can be derived by the Delta method by considering Φ obs as a function of p, q, D' and b = ln(R obs ), as follows, noting that: Inverse-variance weights proportional to the LD measure r 2 are therefore appropriate because p (the minor allele frequency for the target variant) is the same for all tag SNPs and c and N are fixed by the study design.