Identifying regions of disease-related variants in admixed populations with the summation partition approach

We propose a new method for identifying disease-related regions of single nucleotide variants in recently admixed populations. We use principal component analysis to derive both global and local ancestry information. We then use the summation partition approach to search for disease-related regions based on both rare variants and the local ancestral information of each region. We demonstrate this method using individuals with high systolic blood pressure from a sample of unrelated Mexican American subjects provided in the 19th Genetic Analysis Workshop.


Background
Genome-wide association studies commonly use admixture mapping to search for disease-related regions in the genome of recently admixed populations. Admixture mapping refers to methods that trace the ancestral origin of genetic loci and then determine whether the ancestry providing that loci is in disequilibrium with the disease. These methods assume disease-related regions of the genome occur at different rates depending on whether they were inherited from one of the ancestral populations.
Hypertension is a serious disease, affecting 1 in 3 adult Americans and leading to over sixty thousand deaths in 2009 [1]. Yet, hypertension does not afflict Americans equally across race and ethnic lines. African Americans are twice as likely to have high blood pressure or to be taking antihypertensive medication than are Mexican Americans (~40 % vs.~20 %) according to the Center for Disease Control [2]. The non-Hispanic white population is in the middle, with roughly 30 % having hypertension or being on antihypertensive medication.
Research also suggests individuals have a genetic predisposition to hypertension [3], and its disparate incidence across race and ethnic lines appears to make it a strong candidate for admixture mapping. While several methods exist for deriving this ancestral information, principal component analysis (PCA) has proven to be a simple, yet powerful, tool for reducing the variation of high-dimensional data and extracting ancestry related information from admixed samples [4,5]. We use the loading scores from principal components to derive both global and local ancestral information for this reason.
Apart from admixture mapping, new research finds that rare variants have an important role in explaining complex diseases such as high blood pressure. We presume these rare variants contain additional diseaserelated information not present in common variants of the same region. The summation partition approach (SPA) [6] has been used successfully to collapse small groups of rare variants and investigate each group's association with a disease. The advantage of SPA is that it can be more powerful than other collapsing techniques, especially when working with rare variants in relatively small data sets [6].
In this paper, we propose a new method, local ancestrysummation partition approach (LA-SPA), which combines the power of admixture mapping in leveraging local ancestry structures with the power of SPA in leveraging the information in rare variants. We demonstrate the LA-SPA method using individuals with high systolic blood pressure from the 19th Genetic Analysis Workshop (GAW19) Mexican American unrelated sample data set.

Methods
We explore our LA-SPA method using GAW19 genomewide association data from 1851 Mexican Americans [7]. Our data set contains 428,574 single-nucleotide variants (SNVs) from odd-numbered chromosomes. Missing data was relatively uncommon and therefore imputed by sampling uniformly over the remaining non-missing observations. Phenotype data on systolic blood pressure, diastolic blood pressure, year of examination, age, gender, and medication usage was also available. Medication status took 3 values depending on whether the individual was on hypertension medication, not on hypertension medication, or whether such a status was unknown. Systolic blood pressure is the dependent variable of this analysis.
Our method has 4 stages. In this section, we walk through each stage as we performed it on the GAW19 data.

Stage 1: adjust for covariates and global population structure
We first obtain the residuals from regressing systolic blood pressure on age, gender, medication status, and the loading score of the first principal component of all 428,574 SNVs. These variables explain roughly 25 % of the variation of systolic blood pressure. The loading score represents global ancestry, and we only use the first principal component in this stage as additional components were deemed unnecessary (see Discussion below for more details). We interpret the residuals as containing information on systolic blood pressure in excess of confounding global variables. We treat the residuals as our new quantitative phenotype and denote it as Y.
Step 2: divide variants into regions We group consecutive SNVs into regions of 500. There were 862 regions in total. We then stratified the SNVs in each region into either "common" or "rare" SNV groups based on their minor allele frequency (MAF >0.05 or MAF <0.05). Roughly 80 % of SNVs in each region are rare variants.
Step 3: calculate the local ancestry-summation partition approach statistic We perform PCA on the group of common variants in each segment separately and recorded the values of the loading scores of the first 3 principal components. Local ancestry was estimated by performing the k-means algorithm (k = 3) on these components. We interpret the result as corresponding to the 3 possible ancestral origins of each region: white, African, and Native American. We then test the association between the adjusted phenotype from stage 1 and the rare SNV group of each segment by the local ancestral origin of the region using a variation of SPA. The following is a brief explanation of the SPA algorithm [6] and our variation.
Consider a region with K rare variants. The marginal SPA test statistic is defined as: where n i is the total number of i th rare variants in all subjects, Y i À is the averaged phenotype for subjects having at least 1 rare variant at the i th SNV position and Y is the sample average. The value of I 1 reflects the strength of the association between the group of rare variants and the residuals from stage 1.
To jointly model local ancestral information and rare variants, we propose an LA-SPA test statistic, I A , defined as: where n i,j is the total number of rare variants observed in subjects from j th local ancestry and Y i;j À is the mean phenotype of individuals with rare variants i in the j th local ancestral cluster, and Y is defined as before. I A is a modification of I 1 , similar to the I 2 of Fan [6]. It reflects the strength of the association between phenotypes and a region of rare variants, partitioned by the local ancestry of the region.

Step 4: calculate p values by permutation
Permutation is used to evaluate the significance of each test statistic. For I A , 2 p values are generated reflecting permutation within and between ancestral origins. We refer to them as the global p value, which reflects information between common and rare variants, and the local p value, which reflects information within the rare variants grouped by the common variants. We used 10,000 permutations.

Results
The number of rare variants and the 3 p values from the test statistic of each group of variants were retained after performing SPA twice (first for I A and then I 1 ; I 1 was calculated for comparison purposes). P values are plotted in Fig. 1 and regions with relatively low I A p values are listed in Table 1. We believe these regions contain additional, disease-related information and warrant more in-depth analysis despite that they are not significant when adjusting for multiple comparisons. This includes regions on chromosome 3 and 19, which, to our knowledge, have not yet been linked to hypertension or diabetes. Regions with significant I A test statistics but not significant I 1 test statistics are also interesting since those regions became significant only after local ancestral information was taken into account. This includes regions on Chromosomes 9, 11, 17 and 19, which are the last four regions listed in Table 1. In the future, better results can be obtained by increasing the sample size so that the number of variants in each region can be reduced and the local ancestral information of each region can be calculated more accurately.

Discussion
There are several points related to this approach worth mentioning. First, when performing PCA on all SNVs to measure global ancestry, we initially retained all of the loading scores of the leading principal components (PCs). To decide how many PCs to use, we assessed the stability of the loading scores by bootstrapping. We randomly sampled 10,000 SNVs with replacement 1000  times, and computed the scores based on each bootstrapped SNV set. We found that the loading scores of the first PC were highly correlated across different bootstrap runs. In contrast, the loading scores corresponding to other PCs were highly variable. We concluded that the first PC from genome-wide collections of SNVs consistently represents the global ancestry information. Second, the decision to divide variants into regions of 500 was made after recognizing the tradeoff between genome resolution and the power to detect local ancestral information. Because local ancestry information is inferred from common variants, using regions of 500 SNVs yields an average of 100 common variants in each region that can be used to reliably infer local ancestral information. To judge the sensitivity of our method to this choice, we also performed the analysis with regions of 250 and 750 SNVs, which yield an average of 50 and 150 common variants respectively to infer local ancestral information. Of the 8 regions identified with regions of size 500, 3 regions on chromosomes 1, 7, and 17 were identified with regions of size 250 and 750. However, 3 regions containing cytogenetic locations associated with hypertension were identified only using regions of size 500 and not size 250 or 750. We believe that regions of size 500 are ideal as in regions of size 250 there is insufficient ancestry information and in regions of size 750 there is a loss in genomic resolution.
Finally, the choice of MAF = 0.05 as the cutoff for determining whether a variant was rare was somewhat arbitrary. We thought that 0.05 was low enough that the information contained within the rare variants would not be overwhelmed by the local ancestral information of the region. Considering multiple MAF cutoffs and observing whether the significance of the region changes would strengthen our method.

Conclusions
The LA-SPA is a novel approach to variant selection that combines the power of admixture mapping with the power of SPA. When the subjects of a study are known to have come from an admixed population, exploiting ancestral structure can boost the signal of disease-related rare variants and allow for a more effective screening of the genome.