Simulated data without population stratification
The RA data set was simulated according to familial patterns and other environmental effects. Each of the 100 replicates has 1500 nuclear families, each consisting of one affected sibling pair (ASP) and their parents, and 2000 unrelated unaffected individuals as controls. Markers include 730 microsatellite markers, 9187 evenly distributed SNPs on 22 autosomal chromosomes, and 17,820 dense SNPs on chromosome 6. In the analysis, we used the first 200 families and the first 200 of the 2000 controls. To include unrelated cases in the analysis, we randomly picked one of the two affected siblings from each of the next 200 families. Our final data set includes 200 families, 200 unrelated cases, and 200 controls. Among the 200 selected families, there were 56 families with a single affected parent and two families with both parents affected.
In the most general setting, we form one group of all affected individuals, consisting of affected siblings, affected parents, and unrelated cases, and compare it with a group of all unaffected individuals, consisting of unaffected siblings, unaffected parents, and unrelated controls. Depending on the number of affected parents, there are three possible groupings for a family with $r$ affected siblings with genotypes $x_1, \ldots, x_r$; $s$ unaffected siblings with genotypes $y_1, \ldots, y_s$; and parents with genotypes $x_m$ and $x_f$. Here, genotypes $x$ and $y$ denote the number of copies of a particular allele whose frequency is $p$. Suppose the data contain $l$ families with both parents unaffected, $m$ families with one affected parent (say the mother), $n$ families with both parents affected, and additionally $u$ unrelated cases with genotypes $w_i$, $i = 1, \ldots, u$, and $v$ controls with genotypes $z_i$, $i = 1, \ldots, v$. The allele frequencies of the two groups are given by:
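As a concrete reading of the grouping described here, the following minimal Python sketch pools allele counts within a group and tallies the group sizes implied by l, m, n, r, s, u, and v. It assumes plain unweighted allele counting, which may differ from the authors' exact estimator; the function names `allele_frequency` and `group_sizes` are hypothetical.

```python
def allele_frequency(genotypes):
    """Pooled allele frequency of one group.

    genotypes: allele counts (0, 1, or 2) for every individual in the group.
    Assumption: simple unweighted allele counting over a non-empty group.
    """
    genotypes = list(genotypes)
    return sum(genotypes) / (2 * len(genotypes))


def group_sizes(l, m, n, r, s, u, v):
    """Number of affected and unaffected individuals.

    l, m, n: families with zero, one, and two affected parents.
    r, s:    affected and unaffected siblings per family.
    u, v:    unrelated cases and unrelated controls.
    """
    families = l + m + n
    # Affected: all affected siblings, one parent in each of the m families,
    # both parents in each of the n families, plus the unrelated cases.
    n_affected = r * families + m + 2 * n + u
    # Unaffected: all unaffected siblings, both parents in each of the l
    # families, one parent in each of the m families, plus the controls.
    n_unaffected = s * families + 2 * l + m + v
    return n_affected, n_unaffected
```

With r = 2, s = 0, l = 140, m = 56, n = 2, and u = v = 200, `group_sizes` gives 656 affected and 536 unaffected individuals.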
We then use a normal test statistic $Z = (p_a - p_u)/\sqrt{\mathrm{Var}(p_a - p_u)}$, which is a generalization of Risch and Teng's result [2]. In particular, $\mathrm{Var}(p_a - p_u) = \mathrm{Var}(p_a) + \mathrm{Var}(p_u) - 2\,\mathrm{Cov}(p_a, p_u)$. Assuming Hardy-Weinberg equilibrium, each term is given below:
Here $p$ is the estimated average allele frequency of all subjects in the data. For our final data, $r = 2$, $s = 0$, $l = 140$, $m = 56$, $n = 2$, and $u = v = 200$.
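To illustrate how relatedness enters the variance of the allele frequency difference, here is a minimal Python sketch. It is not the authors' derivation: it assumes the generic null covariance Cov(g_i, g_j) = 4 φ_ij p(1 - p) between allele counts of two individuals with kinship coefficient φ_ij under Hardy-Weinberg equilibrium, unrelated and non-inbred parents, unweighted allele counting in each group, and independence between families and unrelated subjects. The function names are hypothetical.

```python
import math


def relationship(i, j):
    """Twice the kinship coefficient within one nuclear family.

    Indices 0 and 1 are the mother and father (assumed unrelated and
    non-inbred); indices 2 and above are their children.
    """
    if i == j:
        return 1.0
    if {i, j} == {0, 1}:
        return 0.0
    return 0.5  # parent-offspring or full siblings


def null_variance(p, l, m, n, r, s, u, v):
    """Var(p_a - p_u) under the null; both groups must be non-empty."""
    families = l + m + n
    n_aff = r * families + m + 2 * n + u
    n_unaff = s * families + 2 * l + m + v
    ca = 1.0 / (2 * n_aff)      # weight of an affected individual
    cu = -1.0 / (2 * n_unaff)   # weight of an unaffected individual
    sibs = [ca] * r + [cu] * s
    total = 0.0
    # Parent weights for families with zero, one, and two affected parents.
    for count, parents in ((l, [cu, cu]), (m, [ca, cu]), (n, [ca, ca])):
        c = parents + sibs
        size = len(c)
        quad = sum(c[i] * c[j] * relationship(i, j)
                   for i in range(size) for j in range(size))
        total += count * quad
    # Unrelated cases and controls contribute only their own variance.
    total += u * ca ** 2 + v * cu ** 2
    return 2 * p * (1 - p) * total


def z_statistic(p_a, p_u, variance):
    """Normal test statistic for the allele frequency difference."""
    return (p_a - p_u) / math.sqrt(variance)
```

The quadratic form expands into the three pieces Var(p_a), Var(p_u), and -2Cov(p_a, p_u): the covariance arises only from relatives who fall in opposite groups, such as affected siblings and their unaffected parents.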
In the presence of population stratification
When population stratification is present, we suggest adjusting the genotype data using principal components before the above procedures are applied. Because the RA data were simulated without a population stratification effect, we give only a brief outline of the method here. The rationale is that allele frequency differences should follow a consistent pattern across the genome, and that this pattern is summarized by principal components to which many markers contribute. We sketch the procedure below; details may be found in Price et al. [3]. First, pick the founders from each family together with all unrelated cases and controls. Denote the genotype at the $i$th locus for the $j$th individual by $g_{ij}$, $i = 1, \ldots, M$ and $j = 1, \ldots, N$. Let $\mu_i$ be the sample mean for the $i$th locus and $X = (x_{ij})$ the matrix normalized by subtracting $\mu_i$ from each row and dividing by a locus-specific normalizing factor. Second, compute the estimated covariance matrix of all markers, and list the $k$ largest eigenvalues $\lambda_1, \ldots, \lambda_k$ with corresponding eigenvectors $v_1, \ldots, v_k$. The $l$th eigenvector $v_l = (v_{l1}, \ldots, v_{lM})$ gives the $l$th principal component as the projection of the normalized genotypes onto $v_l$. Finally, regress the genotypes on these principal components and use the residuals as the adjusted genotypes.
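The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the row scaling uses the factor from Price et al., sqrt(p_i(1 - p_i)) with p_i = (1 + Σ_j g_ij)/(2 + 2N); the marker covariance is taken as X Xᵀ/N; and the adjustment is an ordinary least-squares regression that removes only the principal-component part, so each locus keeps its mean. The function name `pca_adjust` is hypothetical.

```python
import numpy as np


def pca_adjust(G, k):
    """Adjust genotypes for the top k principal components.

    G: (M, N) array of allele counts (0, 1, 2); rows are loci, columns are
       the founders and unrelated subjects.
    Returns (adjusted, scores) with shapes (M, N) and (k, N).
    """
    G = np.asarray(G, dtype=float)
    M, N = G.shape

    # Step 1: center each locus and scale it (assumed Price-style factor).
    mu = G.mean(axis=1, keepdims=True)
    p = (1.0 + G.sum(axis=1, keepdims=True)) / (2.0 + 2.0 * N)
    X = (G - mu) / np.sqrt(p * (1.0 - p))

    # Step 2: M x M marker covariance and its k leading eigenvectors.
    # For many markers this matrix is large; this is only a sketch.
    C = X @ X.T / N
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    V = eigvecs[:, top]                       # (M, k)

    # Principal component scores: projection of each individual onto v_l.
    scores = V.T @ X                          # (k, N)

    # Step 3: regress every locus on the scores (with an intercept) and
    # subtract the fitted principal-component part.
    design = np.column_stack([np.ones(N), scores.T])        # (N, k + 1)
    coef, *_ = np.linalg.lstsq(design, G.T, rcond=None)     # (k + 1, M)
    adjusted = G.T - scores.T @ coef[1:]                    # (N, M)
    return adjusted.T, scores
```

The adjusted genotypes are uncorrelated with the retained components by construction, so they can be passed to the allele frequency comparison in place of the raw counts.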