Statistical methods
Based on the linear regression model of Cokerham and Zeng [2] (also see Cordell [3]), we propose a linear mixed model for detecting epistatic interactions for quantitative traits using family-based data:
y = μ + a1x1 + d1z1 + a2x2 + d2z2 + i
aa
x1x2 + i
ad
x1z2 + i
da
z1x2 + i
dd
z1z2 + Wβ + v + ε.
This model assumes diallelic marker loci and that y is a normally distributed quantitative gene expression phenotype from related individuals, W is a vector of fixed covariates such as sex effects, β is the corresponding vector of coefficients, v is the random polygenic effect within a family, the vector of polygenic effects in each family follows multi-normal distribution N(0, 2A) where A is the kinship matrix and is the variance associated with vectors of polygenic effects, a
i
and d
i
are the additive and dominant effects, and x
i
and z
i
are dummy variables related to the genotypes at the locus i. For example, for a diallelic locus, we might set x
i
= 1 and z
i
= -0.5 for genotype BB, x
i
= 0 and z
i
= 0.5 for genotype Bb, and x
i
= -1 and z
i
= -0.5 for genotype bb, respectively. i
aa
, i
ad
, i
da
, and i
dd
are additive-additive, additive-dominant, and dominant-dominant interaction effects between the two loci, respectively, corresponding to epistatic interaction effects, and ε is the residual error, following normal distribution N(0, ). Significant interaction effects imply presence of epistasis.
To detect epistasis, for each gene expression phenotype, we ran Model (1) in SOLAR for each pair of single-nucleotide polymorphisms (SNPs) in the selected candidate regions (see Description of the data set for more details). The number of tests for each gene expression phenotype ranges from 6 to 820, depending on the marker density and size of the candidate regions selected for the epistasis search. For each gene expression phenotype, individual p-values were adjusted using false-discovery rate (FDR) under the general dependency assumption [10] within each phenotype. FDR-adjusted p-values equal to or less than 0.05 (FDR ≤ 0.05) are considered to be significant.
Simulation study
We simulated a data set based on the pedigree structure from CEPH family data, which has 14 three-generation families of 194 individuals. We considered two unlinked diallelic markers in our analysis with allele frequency of 0.5. Marker genotypes for the grandparents were generated assuming Hardy-Weinberg equilibrium at each locus. Genotypes for parents and children were simulated conditional on their parental genotypes following Mendel's law. As an example we evaluated the power (true negative rate) and type I error (false-positive rate) of the proposed method in identifying additive-additive epistatic effect i
aa
. Phenotypes of each individual was generated based on the Model (1), where a1 = 0.2, d1 = 0.01, a2 = 0.2, d2 = 0.01, i
ad
= i
da
= i
dd
= 0, β = 0.1 (the vector W only contains sex), . When evaluating power and type I error we set i
aa
= 0.7 and i
aa
= 0, respectively. These values were chosen based on the estimated values from analysis of selected 27 traits in the CEPH family data. We plotted receiver operating characteristic (ROC) curve by calculating specificity and sensitivity as we varied the nominal threshold for determining the significant epistasis, where:
1 - specificity = (false positive)/(true negative + false positive)
sensitivity = (true positive)/(true positive + false negative).
Description of the dataset
The CEPH family data provided by GAW15 includes 3554 Affymetrix® gene expressions measured for 194 individuals from 14 three-generation CEPH families. In addition, 2882 autosomal and X-linked SNPs were typed for these individuals.
The software package pedStat as distributed in Merlin version 1.0.1 [11] was used to check for Mendelian inconsistence, genotyping proportions, and heterozygosity of SNPs. SNP markers with minor allele frequencies less than 1% (equivalent to heterozygousity < 1.98%), markers with greater than 30% missing genotypes, and markers with only two of the three possible genotypes were removed from analysis, which left 2436 SNPs for our analysis.
We limited our analysis to the 27 gene expression phenotypes with the strongest linkage evidence of cis effects (Table 1 from Cheung et al [12]). Their Table 1 listed one to two peak markers for each phenotype, where a peak marker is the SNP with the most significant finding in the genome-wide association analysis (GWA) for this phenotype. Fourteen of the 27 gene expression phenotypes exhibited cis regulation (with cis peak markers) by the GWA analysis. For the phenotype PPAT there are two peak markers that point to both cis and trans regulation for this gene. For the remaining 12 phenotypes, the peak markers are trans markers. In our study, for each of the 27 phenotypes we selected a 15-Mb candidate region centered on the target gene location. If a trans peak marker was identified in the GWA analysis, we also selected an additional 15-Mb candidate region centred on that marker. We analyzed the epistatic effects for all possible combinations of the SNPs within the candidate regions.